DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

DeepSeek-Prover-V1.5：利用证明助手反馈进行强化学习与蒙特卡洛树搜索

📄 arXiv: 2408.08152📅 2024-08-15PDF

翻译进度84 / 84 段 (100%)

中文摘要

DeepSeek-Prover-V1.5 利用证明助手（Lean 4）的反馈信号进行强化学习和蒙特卡洛树搜索（MCTS），在形式化数学证明任务上取得重大突破。该模型能够自动探索证明策略空间，通过反馈信号不断优化证明路径。在 ProofNet 和 MinF2F 等基准上达到领先水平。

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

【摘要】DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search - 本文介绍了DeepSeek-Prover-V1.5的架构、训练方法和实验结果。

原文: Huajian Xin* Z.Z. Ren Junxiao Song Zhihong Shao Wanjia Zhao Haocheng Wang Bo Liu Liyue Zhang Xuan Lu Qiushi Du Wenjun Gao Qihao Zhu Dejian Yang Zhibin Gou Z.F. Wu Fuli Luo Chong Ruan DeepSeek-AI https://github.com/deepseek-ai/DeepSeek-Prover-V1.5 Abstract We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced form...

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

验证系统的能力。即使GPT-4等先进模型在复杂形式化证明方面仍面临困难，凸显了编码和数学的复杂性。形式化定理证明模型不仅需要理解数学概念，还需要掌握形式化语言。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: of the verification system. Even advanced models like GPT-4 (OpenAI, 2023 ) struggle with complex formal proofs, underscoring the intricate nature of both the coding and the mathematics involved. A formal theorem proving model must not only grasp the syntax and semantics of formal systems like the Lean theorem prover but also align abstract mathematical reasoning with precise formal representation. Language models in formal theorem proving typically employ two strategies: proof-step generation (Polu and Sutskever, 2020 ; Jiang et al., 2022a ; Lample et al., 2022 ; Yang et al., 2023 ; Wu et al....

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

考虑到整个证明生成的复杂性和计算效率，我们在DeepSeek-Prover-V1.5中开发了一种统一方法。该方法通过截断和恢复机制，结合了证明步骤生成和整个证明生成的优势。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: plicity and computational efficiency of whole-proof generation, we have developed a unified approach in DeepSeek-Prover-V1.5. This method combines the strengths of both proof-step and whole-proof generation techniques through a truncate-and-resume mechanism. The process begins with standard whole-proof generation, where the language model completes the proof code following the theorem statement prefix. The Lean prover then verifies this code. If the proof is correct and complete, the procedure terminates. If an error is detected, the code is truncated at the first error message, and any subseq...

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

主要目标和完成后续证明步骤（主要目标）。在强化学习阶段，给定不完整的定理证明和来自Lean证明器的真实策略状态，我们展开微调模型生成多个证明。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: bjective) and complete the subsequent proof steps (main objective). In the reinforcement learning stage, given an incomplete theorem proof and ground-truth tactic state from the Lean prover, we roll out the fine-tuned model to generate multiple proof candidates, which are then verified by the Lean prover. The verification results for these candidates are used as binary (0-1) rewards to further optimize the model and enhance its alignment with the formal specifications of the verification system. For model inference, we offer two alternatives: single-pass sampling and Monte-Carlo tree search. 1...

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

强化学习：我们采用GRPO算法对监督微调模型执行来自证明助手反馈的强化学习（RLPAF）。Lean证明器的验证结果作为奖励监督，增强模型证明能力。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: t Learning : We employ the GRPO algorithm (Shao et al., 2024 ) to perform reinforcement learning from proof assistant feedback (RLPAF) on the supervised fine-tuned model. Verification results from the Lean prover serve as reward supervision, enhancing the model’s alignment with the formal specifications of the verification system. • Monte-Carlo Tree Search : We advance the tree search method in formal theorem proving by introducing a novel abstraction and a corresponding search algorithm. Our truncate-and-resume mechanism acts as a state-action abstraction, seamlessly integrating the tree sear...

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

这涉及在包含代码和自然语言数学内容的高质量数据集上进行训练。我们特别关注证明助手中广泛使用的形式化语言，如Lean、Isabelle和Metamath。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 2024 ) . This refinement involved training on high-quality datasets that include both code and natural language mathematical content. We specifically focused on formal languages widely used in proof assistants, such as Lean, Isabelle, and Metamath. We designate this improved model as DeepSeek-Prover-V1.5-Base. 2.2 Supervised Fine-tuning In this section, we explore the methodology and processes involved in the supervised fine-tuning (SFT) of DeepSeek-Prover-V1.5. Specifically, we augment the proof dataset from DeepSeek-Prover-V1 by adding detailed explanatory comments. This enhancement aims to ...

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

蒙特卡洛树搜索的截断和恢复机制数据。最终的证明数据集由9,645k序列组成。思维增强证明生成：在DeepSeek-Prover-V1中，我们发现了一个显著差距。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: se data for the truncate-and-resume mechanism for Monte-Carlo Tree Search (details in Section 3.1 ). The resulting proof dataset consists of 9,645k sequences. Thought-augmented Proof Generation. In DeepSeek-Prover-V1, we identified a significant gap between problem-solving strategies in natural language and theorem proving in Lean. In natural language, models generate detailed deduction steps to construct proofs, whereas in Lean, they often rely on a sequence of high-level tactic calls to brute-force solutions. These high-level tactics, while effective, obscure their internal workings and outc...

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

蒙特卡洛树搜索的截断和恢复机制，我们需要从模型生成的代码中提取策略信息。我们从Lean库增强了Lean REPL（读取-求值-打印循环）的数据提取工具。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: cate-and-resume mechanism for Monte-Carlo Tree Search, we needed to extract tactic information from the code generated by the model. We enhanced the Lean REPL (Read-Eval-Print Loop; Leanprover Community, 2023 ) with data extraction tools from the LeanDojo (Yang et al., 2023 ) project. This allowed us to extract tactic information in triples, which include the position of each tactic, as well as the tactic states before and after its application. This information helps us identify the specific tactic code that triggers verification errors (used in the expansion step for tree search, see Section...

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

作为训练提示。我们选择DeepSeek-Prover-V1.5-SFT在多次尝试中生成正确证明具有中等成功率的定理。这确保模型有改进空间，同时仍能接收正面反馈。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: et as training prompts. We select theorems for which DeepSeek-Prover-V1.5-SFT has a moderate success rate in generating correct proofs upon multiple attempts. This ensures that the model has room for improvement while still being able to receive positive feedback. After filtering, we retain approximately 4.5k unique theorem statements. Each theorem is prefixed with both CoT and non-CoT guiding prompts to enhance the model’s proof generation capabilities in both modes. Rewards. When training LLMs via RL, a trained reward model typically provides feedback signals. In contrast, formal theorem pro...

1 Introduction

1 引言大型语言模型的最近进展显著影响了人工智能中的数学推理和定理证明。尽管在自然语言领域取得了显著进展，语言模型仍面临重大挑战。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Recent advancements in large language models have significantly influenced mathematical reasoning and theorem proving in artificial intelligence. Despite notable progress in natural language domains, language models still encounter substantial challenges in formal theorem proving, e.g. using Lean (Moura and Ullrich, 2021 ) and Isabelle (Paulson, 1994 ) , which requires rigorous derivations satisfying formal specifications of the verification system. Even advanced models like GPT-4 (OpenAI, 2023 ) struggle with complex formal proofs, underscoring the intricate nature of both the coding and the ...

1 Introduction

证明状态。这种顺序性质引入了复合错误的风险，其中单个误解可能导致与有效证明路径的重大偏差。更具体地说，自回归模型可能在早期步骤中引入错误。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: roof state. This sequential nature introduces the risk of compounding errors (Ross et al., 2011 ) , where a single misinterpretation can lead to significant deviations from a valid proof path. More specifically, the auto-regressive model may have incorrect believes on intermediate tactic states when generating long proofs. To seamlessly integrate intermediate tactic states in proof-step generation while maintaining the simplicity and computational efficiency of whole-proof generation, we have developed a unified approach in DeepSeek-Prover-V1.5. This method combines the strengths of both proof...

1 Introduction

利用证明助手反馈并生成多样化解决方案候选。图2：整体框架。DeepSeek-Prover-V1.5通过预训练、监督微调和强化学习进行训练。在监督微调期间。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: utilize the proof assistant feedback and generate diverse solution candidates. Figure 2: Overall Framework. DeepSeek-Prover-V1.5 is trained through pre-training, supervised fine-tuning, and reinforcement learning. During supervised fine-tuning, the pre-trained model receives an incomplete theorem proof ending with a tactic state comment keyword. The model is trained to predict the content of this tactic state (auxiliary objective) and complete the subsequent proof steps (main objective). In the reinforcement learning stage, given an incomplete theorem proof and ground-truth tactic state from t...

1 Introduction

我们使用DeepSeek-Coder V2 236B注释自然语言思维链评论与Lean 4代码，将形式化定理证明与自然语言推理对齐。第二，我们插入中间策略状态信息。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: we use DeepSeek-Coder V2 236B (Zhu et al., 2024 ) to annotate natural language chain-of-thought comments alongside Lean 4 code, aligning formal theorem proving with natural language reasoning. Second, we insert intermediate tactic state information within the Lean 4 proof code, enabling our model to leverage compiler feedback effectively. The resulting dataset is then used to fine-tune the pre-trained model. • Reinforcement Learning : We employ the GRPO algorithm (Shao et al., 2024 ) to perform reinforcement learning from proof assistant feedback (RLPAF) on the supervised fine-tuned model. Ver...

1 Introduction

验证集上25.4%，测试集上25.3%。树搜索技术的整合进一步提高了这些结果，在验证集上达到25.4%，测试集上达到25.3%的新最先进通过率。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: % on the validation set and 23.7% on the test set. The integration of tree search techniques further enhanced these results, achieving new state-of-the-art pass rates of 25.4% on the validation set and 25.3% on the test set.

1.1 Contributions

1.1 贡献我们提出了一个全面的框架来开发基于语言模型的形式化数学证明器，整合了几个关键组件：大规模数学预训练、形式化数学语料库构建和增强、在线强化学习。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: We present a comprehensive framework for developing a language model-based formal mathematics prover, integrating several key components: large-scale mathematical pre-training, formal mathematics corpus construction and augmentation, online reinforcement learning from proof assistant feedback, and a tree search methodology for long-term planning in theorem proving. The pre-trained model, supervised fine-tuned model, and reinforcement learning model, along with the code for the Monte-Carlo tree search algorithm, are publicly available for further research and application. • Pre-Training : We en...

1.1 Contributions

整个证明生成框架。我们提出了RMaxTS，一种创新的蒙特卡洛树搜索算法，利用RMax策略解决稀疏奖励证明搜索问题中的探索挑战。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: he whole-proof generation framework. We present RMaxTS, an innovative Monte-Carlo tree search algorithm that leverages the RMax (Brafman and Tennenholtz, 2002 ) strategy to tackle exploration challenges in sparse-reward proof search problems. By assigning intrinsic rewards, this algorithm encourages the prover agent to generate diverse planning paths, thereby fostering extensive exploration of the proof space.

1.2 Summary of Evaluations and Metrics

1.2 评估和指标总结 • miniF2F：在单次通过整个证明生成设置中，DeepSeek-Prover-V1.5在miniF2F测试集上达到了60.2%的通过率，比DeepSeek-Prover-V1的50.0%提高了10.2个百分点。

原文: • miniF2F : In the single-pass whole-proof generation setting, DeepSeek-Prover-V1.5 achieved a pass rate of 60.2% on the test set of miniF2F, marking a significant improvement of absolute 10.2 percentage points over DeepSeek-Prover-V1’s 50.0%. Incorporating tree search techniques further elevated the pass rate to a new state-of-the-art 63.5% . • ProofNet : DeepSeek-Prover-V1.5 also demonstrated strong performance in the single-pass whole-proof generation setting for ProofNet, with pass rates of 21.6% on the validation set and 23.7% on the test set. The integration of tree search techniques fur...

2 Model Training

2 模型训练 2.1 预训练为提高语言模型生成形式化证明和通过数学语言推理的能力，我们进一步预训练基础模型。这涉及在高质量数据集上进行训练。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 2.1 Pre-training To enhance our language model’s proficiency in generating formal proofs and reasoning through mathematical language, we further pre-train our base model (Shao et al., 2024 ) . This refinement involved training on high-quality datasets that include both code and natural language mathematical content. We specifically focused on formal languages widely used in proof assistants, such as Lean, Isabelle, and Metamath. We designate this improved model as DeepSeek-Prover-V1.5-Base. 2.2 Supervised Fine-tuning In this section, we explore the methodology and processes involved in the sup...

2 Model Training

条件证明数据。在每次迭代之间，我们使用DeepSeek-Coder V2 236B注释证明代码前的思维过程作为评论。最后，我们为蒙特卡洛树搜索定制这些数据。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: ditional proof data. Between each iteration, we use DeepSeek-Coder V2 236B (Zhu et al., 2024 ) to annotate the thought process before the proof code as comments. Finally, we tailor these data for the truncate-and-resume mechanism for Monte-Carlo Tree Search (details in Section 3.1 ). The resulting proof dataset consists of 9,645k sequences. Thought-augmented Proof Generation. In DeepSeek-Prover-V1, we identified a significant gap between problem-solving strategies in natural language and theorem proving in Lean. In natural language, models generate detailed deduction steps to construct proofs,...

2 Model Training

非CoT模式的证明代码补全。两种模式的输入和输出示例可在附录A中找到。策略状态信息提示增强：为实现蒙特卡洛树搜索的截断和恢复机制。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: non-CoT mode for proof code completion. Examples of input and output in both modes can be found in Appendix A . Prompt Augmentation with Tactic State Information. To implement the truncate-and-resume mechanism for Monte-Carlo Tree Search, we needed to extract tactic information from the code generated by the model. We enhanced the Lean REPL (Read-Eval-Print Loop; Leanprover Community, 2023 ) with data extraction tools from the LeanDojo (Yang et al., 2023 ) project. This allowed us to extract tactic information in triples, which include the position of each tactic, as well as the tactic states ...

2 Model Training

4证明器。此RL过程的具体细节如下。提示：在强化学习阶段，我们使用监督微调数据集中的一部分定理陈述作为训练提示。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 4 prover. The specifics of this RL process are detailed below. Prompts. In the reinforcement learning stage, we use a subset of theorem statements from the supervised fine-tuning dataset as training prompts. We select theorems for which DeepSeek-Prover-V1.5-SFT has a moderate success rate in generating correct proofs upon multiple attempts. This ensures that the model has room for improvement while still being able to receive positive feedback. After filtering, we retain approximately 4.5k unique theorem statements. Each theorem is prefixed with both CoT and non-CoT guiding prompts to enhance ...

2 Model Training

训练设置：我们基于SFT模型进行RL训练，该模型既作为初始模型也作为施加KL散度惩罚的参考模型。我们使用5e-6的恒定学习率。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: ning Setting. We conduct RL training based on the SFT model, which serves as both the initial model and the reference model for imposing the Kullback-Leibler (KL) divergence penalty. We use a constant learning rate of 5e-6, and the KL penalty coefficient is set to 0.02. For each theorem, we sample a group of 32 candidate proofs, with maximum length set to 2,048. The training batch size is configured to 512. 2.4 Evaluation Benchmarks. We evaluate theorem-proving performance on the following benchmarks to compare model capabilities after each training stage: • MiniF2F (Zheng et al., 2022 ) focus...

2 Model Training

51.6% ± 0.5% 18.2% ± 0.5% 图3：不同训练阶段模型能力比较。CoT和non-CoT指使用两种引导提示的评估。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: rcent 51.6 percent 0.5 51.6\%\pm 0.5\% 18.2 % ± 0.5 % plus-or-minus percent 18.2 percent 0.5 18.2\%\pm 0.5\% Figure 3: Comparison of model capabilities at different training stages. "CoT" and "non-CoT" refer to evaluations using two guiding prompts. The shaded region represents the range of standard deviations around the mean values. The notation μ ± σ plus-or-minus 𝜇 𝜎 \mu\pm\sigma indicates the average accuracy μ 𝜇 \mu and the standard deviation σ 𝜎 \sigma . Prompting Configurations. For each proof attempt of DeepSeek-Prover-V1.5-Base, we independently sample three proof demonstrations from ...

2 Model Training

表通过率，使用3-shot提示解决了miniF2F基准测试测试集上近三分之一的问题。监督微调阶段产生的DeepSeek-Prover-V1.5-SFT显著优于基础模型。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: table pass rate, solving nearly one-third of the problems on the test set of the miniF2F benchmark using 3-shot prompting. The supervised fine-tuning stage, resulting in DeepSeek-Prover-V1.5-SFT, significantly outperforms the base model, with Pass@128 accuracy increasing by approximately two-thirds on miniF2F and doubling on ProofNet. The subsequent reinforcement learning stage further enhances the model’s performance, improving Pass@ K 𝐾 K accuracy across all values of K 𝐾 K . In contrast to findings in natural language mathematics, such as those reported in DeepSeekMath (Shao et al., 2024 ) ...

2.1 Pre-training

2.1 预训练为提高语言模型生成形式化证明和通过数学语言推理的能力，我们进一步预训练基础模型。这涉及在包含代码和数学内容的高质量数据集上进行训练。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: To enhance our language model’s proficiency in generating formal proofs and reasoning through mathematical language, we further pre-train our base model (Shao et al., 2024 ) . This refinement involved training on high-quality datasets that include both code and natural language mathematical content. We specifically focused on formal languages widely used in proof assistants, such as Lean, Isabelle, and Metamath. We designate this improved model as DeepSeek-Prover-V1.5-Base.

2.2 Supervised Fine-tuning

2.2 监督微调在本节中，我们探讨DeepSeek-Prover-V1.5监督微调（SFT）的方法和过程。具体来说，我们通过添加详细解释性评论来增强来自DeepSeek-Prover-V1的证明数据集。

原文: In this section, we explore the methodology and processes involved in the supervised fine-tuning (SFT) of DeepSeek-Prover-V1.5. Specifically, we augment the proof dataset from DeepSeek-Prover-V1 by adding detailed explanatory comments. This enhancement aims to improve the alignment between natural language descriptions and Lean 4 code, thereby facilitating better formal mathematical reasoning. Additionally, we incorporate intermediate tactic state information as an auxiliary prediction task to support the truncate-and-resume mechanism used in the Monte-Carlo Tree Search process. We refer to th...

2.2 Supervised Fine-tuning

自然语言中，模型生成详细推理步骤来构建证明，而在Lean中，它们通常依赖一系列高级策略调用来暴力求解。这些高级策略虽然有效，但掩盖了其内部工作原理。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: tural language, models generate detailed deduction steps to construct proofs, whereas in Lean, they often rely on a sequence of high-level tactic calls to brute-force solutions. These high-level tactics, while effective, obscure their internal workings and outcomes, hindering the model’s ability to resolve complex proof goals with structured mathematical reasoning. To address this issue, we develop an approach that incorporates natural language reasoning before generating theorem proof code. Similar to Lean-STaR (Lin et al., 2024 ) , which performs isolated chain-of-thought reasoning (Wei et a...

2.2 Supervised Fine-tuning

例，包括每个策略的位置，以及应用前后的策略状态。这些信息帮助我们识别触发验证错误的特定策略代码（用于树搜索的扩展步骤）。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: les, which include the position of each tactic, as well as the tactic states before and after its application. This information helps us identify the specific tactic code that triggers verification errors (used in the expansion step for tree search, see Section 3.2 ). For each tactic in a generated valid formal proof, we insert the tactic state returned by the verifier as a comment "/- tactic state: … -/". During training, we use all tokens following "/- tactic state: " as responses to calculate the supervised fine-tuning loss, while the tokens before this comment is used as prompts and do not...

2.3 Reinforcement Learning from Proof Assistant Feedback

2.3 来自证明助手反馈的强化学习强化学习（RL）已被证明能有效增强监督微调语言模型的数学推理能力。为进一步推进DeepSeek-Prover-V1.5-SFT，我们整合了强化学习。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Reinforcement learning (RL) has been proven effective in enhancing the mathematical reasoning capabilities of supervised fine-tuned language models (Shao et al., 2024 ) . To further advance DeepSeek-Prover-V1.5-SFT, we incorporate a reinforcement learning phase, resulting in the model DeepSeek-Prover-V1.5-RL. This phase leverages RL to enhance performance based on verification feedback from the Lean 4 prover. The specifics of this RL process are detailed below. Prompts. In the reinforcement learning stage, we use a subset of theorem statements from the supervised fine-tuning dataset as trainin...

2.3 Reinforcement Learning from Proof Assistant Feedback

条件评论家模型。具体来说，GRPO为每个定理提示采样一组候选证明，并根据组内输出的相对奖励优化模型。我们的提示选择策略设计为可能同时包含成功和失败的证明。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: ional critic model. Specifically, GRPO samples a group of candidate proofs for each theorem prompt and optimizes the model based on the relative rewards of the outputs within the group. Our prompt selection strategy is designed to likely include both correct and incorrect proofs among the candidates, aligning well with the group-relative nature of GRPO and thereby enhancing the training process. Training Setting. We conduct RL training based on the SFT model, which serves as both the initial model and the reference model for imposing the Kullback-Leibler (KL) divergence penalty. We use a const...

2.4 Evaluation

2.4 评估基准测试：我们在以下基准测试上评估定理证明性能，以比较每个训练阶段后的模型能力： • MiniF2F专注于高中水平练习和竞赛题的形式化问题解决技能。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Benchmarks. We evaluate theorem-proving performance on the following benchmarks to compare model capabilities after each training stage: • MiniF2F (Zheng et al., 2022 ) focuses on formal problem-solving skills for high-school level exercises and competitions, such as AMC, AIME, and IMO, with an emphasis on algebra and number theory. The benchmark includes 244 validation and 244 test problems, originally in Lean 3 and manually converted to Lean 4.9.0, based on the version provided by Yang ( 2023 ) . • ProofNet (Azerbayev et al., 2023 ) evaluates formal theorem-proving capabilities at the underg...

2.4 Evaluation

和标准差σ。提示配置：对于DeepSeek-Prover-V1.5-Base的每次证明尝试，我们独立从验证集采样三个证明演示来构建少样本提示。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: nd the standard deviation σ 𝜎 \sigma . Prompting Configurations. For each proof attempt of DeepSeek-Prover-V1.5-Base, we independently sample three proof demonstrations from the validation set to construct the few-shot prompts. For the miniF2F benchmark, we use human-written proofs from Yang ( 2023 ) , while for the ProofNet benchmark, we use correct proofs generated by DeepSeek-Prover-V1.5-RL as few-shot demonstrations. For DeepSeek-Prover-V1.5-SFT and DeepSeek-Prover-V1.5-RL, we employ two types of guiding prompts: one that encourages chain-of-thought (CoT) reasoning before each proof step, ...

2.4 Evaluation

所有K值的Proving Pass@K准确性。与自然语言数学中的发现（如DeepSeekMath报告）相反，其中强化学习主要从ToP中提升正确响应。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: roving Pass@ K 𝐾 K accuracy across all values of K 𝐾 K . In contrast to findings in natural language mathematics, such as those reported in DeepSeekMath (Shao et al., 2024 ) , where reinforcement learning primarily boosts the correct response from TopK, we observe a genuine enhancement of fundamental capabilities in formal theorem proving. This improvement is evident not only with a small sample budget but also remains stable as the sample budget increases. This conclusion is further supported by later Monte-Carlo Tree Search experiments with larger sample budgets, as discussed in Section 4.2 ...

3 Exploration-oriented Monte-Carlo Tree Search

3 面向探索的蒙特卡洛树搜索 3.1 策略级树抽象为在整个证明生成设置中实现树搜索方法，我们引入了证明树抽象来定义定制的状态和动作空间，利用截断和恢复机制。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 3.1 Tactic-level Tree Abstraction To implement the tree search method in the whole-proof generation setting, we introduce a proof tree abstraction to define the tailored state and action space, leveraging a truncate-and-resume mechanism. Roughly following the paradigm of Yao et al. ( 2023 ) , we begin by decomposing an incomplete proof into a sequence of tree nodes that correspond to individual proof steps, and then we utilize the partial content stored in these tree nodes to continue the proof generation process. Figure 4 illustrates the process of constructing a proof search tree from whole-...

3 Exploration-oriented Monte-Carlo Tree Search

微调模型已训练为识别和利用此格式，使用增强有策略状态评论的不完整代码来指导后续证明生成。图4：MCTS扩展步骤中的截断和恢复机制。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: he fine-tuned model (see Section 2.2 ) has been trained to recognize and utilize this format, using the incomplete code augmented with tactic state comments to guide subsequent proof generation. Figure 4: Truncate-and-Resume Mechanism in the Expansion Step of MCTS. (a) After selecting a node, we trace its corresponding incomplete proof code prefix, which includes the file header, initial statement, and successfully applied tactics from the ancestor nodes. (b) The language model then generates the subsequent proof based on this prefix along with a comment block containing the current tactic sta...

3 Exploration-oriented Monte-Carlo Tree Search

探索和开发之间的平衡。树节点s的树策略通过选择从有效操作集中最大化价值的动作来计算。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: een exploration and exploitation (Kocsis and Szepesvári, 2006 ) . The tree policy at a tree node s 𝑠 s is computed by selecting the action that maximizes the value from the set of valid operations: T r e e P o l i c y ( s ) = arg ⁡ max a ∈ C h i l d r e n ( s ) ∪ { ⊘ } Q U C B ( s , a ) , 𝑇 𝑟 𝑒 𝑒 𝑃 𝑜 𝑙 𝑖 𝑐 𝑦 𝑠 subscript 𝑎 𝐶 ℎ 𝑖 𝑙 𝑑 𝑟 𝑒 𝑛 𝑠 ⊘ subscript 𝑄 𝑈 𝐶 𝐵 𝑠 𝑎 \displaystyle TreePolicy(s)=\mathop{\arg\max}_{a\in Children(s)\cup\{\oslash\}}Q_{UCB}(s,a), (1) where the action a 𝑎 a can be either moving to a child node, denoted by a ∈ C h i l d ...

3 Exploration-oriented Monte-Carlo Tree Search

来自先前试验的信息。UCB(s,a)表示由上置信界计算的探索奖励，随着状态-动作对(s,a)的重复执行而减小。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: from previous trials. U C B ( s , a ) 𝑈 𝐶 𝐵 𝑠 𝑎 UCB(s,a) denotes the exploration bonus computed by upper confidence bounds (UCB; Auer, 2002 ) , which diminishes with the repeated execution of the state-action pair ( s , a ) 𝑠 𝑎 (s,a) . More specifically, Q U C B ( s , a ) subscript 𝑄 𝑈 𝐶 𝐵 𝑠 𝑎 Q_{UCB}(s,a) stands for an optimistic estimation of Q ( s , a ) 𝑄 𝑠 𝑎 Q(s,a) and can serve as an upper bound with high probability. We defer the discussion of detailed settings of node values and UCB bonus to Section 3.3 . Expansion. The next step is invoking the proof generation model to e...

3 Exploration-oriented Monte-Carlo Tree Search

从根节点到扩展节点的轨迹，即更新与公式(1)中所述树策略相关的值。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: n trajectory from the root to the expanded node, i.e. , updating the values associated with the tree policy stated in Eq. ( 1 ). Let τ = { ( r o o t , s ( 1 ) ) , ( s ( 1 ) , s ( 2 ) ) , ( s ( 2 ) , s ( 3 ) ) , … , ( s ( | τ | − 1 ) = s t , ⊘ ) } 𝜏 𝑟 𝑜 𝑜 𝑡 superscript 𝑠 1 superscript 𝑠 1 superscript 𝑠 2 superscript 𝑠 2 superscript 𝑠 3 … superscript 𝑠 𝜏 1 subscript 𝑠 𝑡 ⊘ \tau=\{(root,s^{(1)}),(s^{(1)},s^{(2)}),(s^{(2)},s^{(3)}),\dots,(s^{(|\tau|-1)}=s_{t},\oslash)\} denote the selection trajectory of t 𝑡 t -th iteration that ends with s t subscript 𝑠 𝑡 s_{t} as the expanding node. We upda...

3 Exploration-oriented Monte-Carlo Tree Search

关于交互环境的信息。在本节中，我们提出了我们的内在奖励驱动探索算法，应用于树搜索的RMax。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: rmation about the interactive environment (Bellemare et al., 2016 ; Houthooft et al., 2016 ; Pathak et al., 2017 ; Burda et al., 2019 ) . In this section, we present our intrinsic-reward-driven exploration algorithm, RMax applied to Tree Search (RMaxTS), to incorporate reward-free exploration in the proof search problem. RMax applied to MCTS. We adopt RMax (Brafman and Tennenholtz, 2002 ) , a classical exploration mechanism, to construct intrinsic rewards for Monte-Carlo tree search. The core idea of RMax is to explore a broad coverage of the state space. The agent awards itself a maximal amou...

3 Exploration-oriented Monte-Carlo Tree Search

Q_{UCB1}(s,a) = W(s,a)/N(s,a) + sqrt(2*ln(sum_{a'}N(s,a'))/N(s,a)) DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: ( s , a ) subscript 𝑄 𝑈 𝐶 𝐵 1 𝑠 𝑎 \displaystyle Q_{UCB1}(s,a) = W ( s , a ) N ( s , a ) + 2 ln ∑ a ′ N ( s , a ′ ) N ( s , a ) , absent 𝑊 𝑠 𝑎 𝑁 𝑠 𝑎 2 subscript superscript 𝑎 ′ 𝑁 𝑠 superscript 𝑎 ′ 𝑁 𝑠 𝑎 \displaystyle=\frac{W(s,a)}{N(s,a)}+\sqrt{\frac{2\ln\sum_{a^{\prime}}N(s,a^{\prime})}{N(s,a)}}, (4) W ( s , a ) 𝑊 𝑠 𝑎 \displaystyle W(s,a) = ∑ τ ∈ Γ ( s , a ) R ( τ ) , absent subscript 𝜏 Γ 𝑠 𝑎 𝑅 𝜏 \displaystyle=\textstyle{\sum_{\tau\in\Gamma(s,a)}}\leavevmode\nobreak\ R(\tau), (5) N ( s , a ) 𝑁 𝑠 𝑎 \displaystyle N(s,a) = | Γ ( s , a ) | , absent Γ 𝑠 𝑎 \displaystyle=\left|\...

3 Exploration-oriented Monte-Carlo Tree Search

W_γ(s,a) = sum_{t=1}^{N(s,a)} γ^{N(s,a)-t} R(τ_t) DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: mma}(s,a)}{N_{\gamma}(s,a)}+\sqrt{\frac{2\ln\sum_{a^{\prime}}N_{\gamma}(s,a^{\prime})}{N_{\gamma}(s,a)}}, (7) W γ ( s , a ) subscript 𝑊 𝛾 𝑠 𝑎 \displaystyle W_{\gamma}(s,a) = ∑ t = 1 N ( s , a ) γ N ( s , a ) − t R ( τ t ) , absent superscript subscript 𝑡 1 𝑁 𝑠 𝑎 superscript 𝛾 𝑁 𝑠 𝑎 𝑡 𝑅 subscript 𝜏 𝑡 \displaystyle=\textstyle{\sum_{t=1}^{N(s,a)}}\leavevmode\nobreak\ \gamma^{N(s,a)-t}R(\tau_{t}), (8) N γ ( s , a ) subscript 𝑁 𝛾 𝑠 𝑎 \displaystyle N_{\gamma}(s,a) = ∑ t = 0 N ( s , a ) − 1 γ t , absent superscript subscript 𝑡 0 𝑁 𝑠 𝑎 1 superscript 𝛾 𝑡 \displaystyle=\textstyle{\sum_{t=0...

3 Exploration-oriented Monte-Carlo Tree Search

和Lean验证。每个线程工作迭代执行树搜索循环，通过选择候选节点进行扩展，调用语言模型生成证明，使用Lean证明器验证生成的证明，并执行回溯。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: d Lean verification. Each thread worker iteratively performs the tree search loop by selecting a candidate node for expansion, invoking the language model to generate the proof, verifying the generated proof with the Lean prover, and performing backpropagation. • Virtual Loss: To encourage diverse node selection among concurrent thread workers, we assign a virtual reward R ( τ ) = 0 𝑅 𝜏 0 R(\tau)=0 for ongoing iterations. This involves backpropagating a reward of 0 0 temporarily and updating it to the true reward upon completion. This strategy promotes exploration of different nodes for expa...

3 Exploration-oriented Monte-Carlo Tree Search

一种新颖的混合方法。它从整个证明生成开始，类似于单次通过方法，但通过实现复杂的截断和恢复机制来扩展。此过程涉及将生成的证明截断到其成功部分。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: ng a novel hybrid approach. It starts with whole-proof generation, similar to the single-pass approach, but extends this by implementing a sophisticated truncate-and-resume mechanism. This process involves truncating the generated proof to its successful initial segment, parsing this segment into individual tactics, and resuming the tree search from this point. This iterative process effectively implements a Monte-Carlo Tree Search, seamlessly integrating single-pass whole-proof generation with multi-pass proof-step generation. Consequently, we can train a single model with nearly identical ob...

3.1 Tactic-level Tree Abstraction

3.1 策略级树抽象为在整个证明生成设置中实现树搜索方法，我们引入了证明树抽象来定义定制的状态和动作空间，利用截断和恢复机制。大致遵循Yao等人的范式。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: To implement the tree search method in the whole-proof generation setting, we introduce a proof tree abstraction to define the tailored state and action space, leveraging a truncate-and-resume mechanism. Roughly following the paradigm of Yao et al. ( 2023 ) , we begin by decomposing an incomplete proof into a sequence of tree nodes that correspond to individual proof steps, and then we utilize the partial content stored in these tree nodes to continue the proof generation process. Figure 4 illustrates the process of constructing a proof search tree from whole-proof generation. Truncate: Proof ...

3.1 Tactic-level Tree Abstraction

.2）已训练为识别和利用此格式，使用增强有策略状态评论的不完整代码来指导后续证明生成。图4：MCTS扩展步骤中的截断和恢复机制。(a)选择后。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: .2 ) has been trained to recognize and utilize this format, using the incomplete code augmented with tactic state comments to guide subsequent proof generation. Figure 4: Truncate-and-Resume Mechanism in the Expansion Step of MCTS. (a) After selecting a node, we trace its corresponding incomplete proof code prefix, which includes the file header, initial statement, and successfully applied tactics from the ancestor nodes. (b) The language model then generates the subsequent proof based on this prefix along with a comment block containing the current tactic state. (c) The combined proof code (p...

3.2 Interactive Theorem Proving via Monte-Carlo Tree Search

3.2 通过蒙特卡洛树搜索的交互式定理证明我们的证明搜索树使用标准蒙特卡洛树搜索（MCTS）范式开发，迭代应用四个步骤：选择、扩展、模拟和回溯。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Our proof search tree is developed using the standard Monte-Carlo Tree Search (MCTS) paradigm (MCTS; Coulom, 2006 ; Browne et al., 2012 ) , which iteratively applies four steps: Selection , Expansion , Simulation , and Backpropagation . We integrate the Simulation step into Expansion because our whole-proof generation model inherently performs a rollout from the expanded node. The detailed design of the algorithm workflow is as follows. Selection. The selection step, a.k.a. the tree policy, starts from the root node and traverses downward to identify a promising node for expansion. The objecti...

3.2 Interactive Theorem Proving via Monte-Carlo Tree Search

forall a属于Children(s)并集{⊘}, Q_{UCB}(s,a) = Q(s,a)【开发】+ UCB(s,a)【探索】 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: a ) for-all 𝑎 𝐶 ℎ 𝑖 𝑙 𝑑 𝑟 𝑒 𝑛 𝑠 ⊘ subscript 𝑄 𝑈 𝐶 𝐵 𝑠 𝑎 \displaystyle\forall a\in Children(s)\cup\{\oslash\},\quad Q_{UCB}(s,a) = Q ( s , a ) ⏟ Exploitation + U C B ( s , a ) ⏟ Exploration , absent limit-from subscript ⏟ 𝑄 𝑠 𝑎 Exploitation subscript ⏟ 𝑈 𝐶 𝐵 𝑠 𝑎 Exploration \displaystyle=\underbrace{Q(s,a)}_{\text{Exploitation}}+\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \underbrace{UCB(s,a)}_{\text{Exploration}}, (2) where Q ( s , a ) 𝑄 𝑠 𝑎 Q(s,a) denotes a sample-based estimation of action values derived from the selection history, functioning as the exploitation...

3.2 Interactive Theorem Proving via Monte-Carlo Tree Search

合并到搜索树中（见图4）。重要的是要注意，因为我们使用整个证明生成设置——输出是由一系列策略组成的整个证明，而不仅仅是下一个策略——我们的扩展过程。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: e merged into the search tree (see Figure 4 ). It is important to note that, because we use the whole-proof generation setting—where the output is an entire proof consisting of a sequence of tactics, rather than just the next tactic—our expansion procedure may insert a path of tree nodes into the search tree during each iteration. This differs from the conventional MCTS designed for competitive games, which typically expands only one layer of children nodes per iteration (Silver et al., 2016 , 2018 ; Schrittwieser et al., 2020 ) . Backpropagation. The final phase of each tree search iteration ...

3.3 Intrinsic Rewards for Monte-Carlo Tree Search

3.3 蒙特卡洛树搜索的内在奖励在形式化定理证明的搜索问题中，外在奖励极其稀疏，即搜索代理仅在证明完全解决时获得非零奖励。更具体地说，证明搜索过程形成树结构。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: In the search problem of formal theorem proving, the extrinsic rewards are extremely sparse, i.e. , the search agent only obtains non-zero rewards when the proof is completely solved. More specifically, the proof search process forms a tree structure with only a narrow set of leaves delivering non-zero rewards, which matches a famous hard-exploration case (Krishnamurthy et al., 2016 ) in the literature of statistical reinforcement learning. To promote exploration in sparse-reward sequential decision making, one classical paradigm is constructing intrinsic rewards (Schmidhuber, 2010 ) that enco...

3.3 Intrinsic Rewards for Monte-Carlo Tree Search

au)=I[至少一个新节点被添加到搜索树]，其中τ表示需要回溯奖励分配的最新选择轨迹。此探索策略优先考虑探索。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: au)=\mathbb{I}\left[\text{at least one new node is added to the search tree}\right], (3) where τ 𝜏 \tau denotes the most recent selection trajectory that requires a reward assignment for backpropagation. This exploration strategy prioritizes the expansion of nodes where the prover model generates tactics that lead to a diverse range of tactic states. As multiple Lean codes can result in the same transition of intermediate states, this heuristics can potentially reduce redundant generation and improve sample efficiency. UCB for Non-stationary Rewards. The common setting of UCB exploration bonus...

3.3 Intrinsic Rewards for Monte-Carlo Tree Search

随着搜索树通过复杂探索扩展，发现具有未见策略状态的新节点变得确实更难。为解决非平稳性，我们考虑了折扣上置信界（DUCB）。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: t becomes definitely harder to discover new nodes with unseen tactic states as the search tree expands through sophisticated exploration. To tackle the non-stationarity, we consider discounted upper confidence bounds (DUCB; Garivier and Moulines, 2011 ) , which uses a discount factor γ ∈ ( 0 , 1 ) 𝛾 0 1 \gamma\in(0,1) to smoothly drop those outdated feedback records: Q D U C B ( s , a ) subscript 𝑄 𝐷 𝑈 𝐶 𝐵 𝑠 𝑎 \displaystyle Q_{DUCB}(s,a) = W γ ( s , a ) N γ ( s , a ) + 2 ln ∑ a ′ N γ ( s , a ′ ) N γ ( s , a ) , absent subscript 𝑊 𝛾 𝑠 𝑎 subscript 𝑁 𝛾 𝑠 𝑎 2 subscript superscr...

3.4 Parallelization of Monte-Carlo Tree Search

3.4 蒙特卡洛树搜索的并行化为提高蒙特卡洛树搜索（MCTS）的效率，我们实施了Chaslot等人描述的几种既定并行化技术。 • 根并行化：我们每个节点部署256个MCTS运行器。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: To enhance the efficiency of Monte-Carlo Tree Search (MCTS), we implement several established parallelization techniques as described by Chaslot et al. ( 2008 ) . • Root Parallelization: We deploy 256 MCTS runners per node, with one language model per GPU and a batch size of 512 for proof generation. The Lean prover is invoked through REPL and executed on a cluster with thousands of CPU cores, where each proof verification task is handled by an individual process, created and terminated in a sandbox. Both proof generation by language models and verification by Lean provers are handled asynchro...

3.5 Comparison with Existing Methods

3.5 与现有方法的比较在本节中，我们将提出的证明树搜索方法与现有方法进行比较。当前在形式化数学证明搜索中使用语言模型的方法。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: In this section, we compare our proposed proof tree search method, which introduces a novel truncate-and-resume mechanism for whole-proof generation, with existing approaches. Current methods for using language models in formal mathematics proof search generally fall into two main strategies: • Multi-pass proof-step generation : This strategy breaks down the proving process into multiple episodes of tactic generation and verification, typically following a tree search pattern. It involves generating and verifying one tactic at a time, repeating the process for the next tactic until no proof go...

3.5 Comparison with Existing Methods

这种统一方法在两种设置中都实现了优越性能。通过结合现有方法的优势并引入创新技术，我们的方法为形式化数学证明搜索提供了更多样化和有效的解决方案。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: this unified approach achieves superior performance in both settings. By combining the strengths of existing methods and introducing innovative techniques, our method offers a more versatile and effective solution for formal mathematics proof search, potentially paving the way for future advancements in this field.

4 Experimental Results

4 实验结果在本节中，我们使用两个不同基准测试评估DeepSeek-Prover-V1.5的定理证明能力：miniF2F，涵盖高中水平练习和竞赛问题，以及ProofNet，涉及本科水平。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: In this section, we evaluate the theorem-proving capabilities of DeepSeek-Prover-V1.5 using two distinct benchmarks: miniF2F, which encompasses high-school level exercises and competition problems, and ProofNet, which pertains to undergraduate-level theorems. We present the results for both complete proof generation and Monte-Carlo tree search methodologies, utilizing the same trained model and inference configuration as Section 2.4 to ensure consistency. 4.1 Main Results We present a comparative analysis of DeepSeek-Prover-V1.5 against previous state-of-the-art language models, highlighting i...

4 Experimental Results

也展示了出色的性能。指标：我们使用pass@K准确性指标比较DeepSeek-Prover-V1.5与最先进模型的性能，该指标评估模型生成正确证明的能力。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: (Wu et al., 2024 ) , also demonstrate outstanding performance. Metric. We compare the performance of DeepSeek-Prover-V1.5 with state-of-the-art models using the pass@ K 𝐾 K accuracy metric, which evaluates the model’s ability to generate a correct proof within K 𝐾 K attempts. We display the sample budget K 𝐾 K according to the the following rules to align the computation budget across different generation schemes. • For single-pass sampling methods, we define the sample budget K 𝐾 K as the total number of proofs generated, with large values of K 𝐾 K factorized for the ease of comparison to tre...

4 Experimental Results

生成采样以实现54.9%的通过率，超过了之前InternLM2-StepProver的最先进结果54.5%（执行64乘以3200树搜索）。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: generation samplings to achieve a pass rate of 54.9%, surpassing the previous state-of-the-art result of InternLM2-StepProver, which performs 64 × 3200 64 3200 64\times 3200 tree searches to achieve 54.5%. Method Sample budget miniF2F-test Single-pass Whole-Proof Generation Methods TheoremLlama [ 49 ] 128 33.6 % percent 33.6 33.6\% DeepSeek-Prover-V1 [ 53 ] 128 128 128 46.1 % ± 0.5 % plus-or-minus percent 46.1 percent 0.5 46.1\%\pm 0.5\% 16 × 4096 16 4096 16\times 4096 50.0 % percent 50.0 50.0\% DeepSeek-Prover-V1.5-Base 128 29.7 % ± 0.5 % plus-or-minus percent 29.7 percent 0.5 29.7\%\pm 0.5\%...

4 Experimental Results

1乘以32乘以100: 25.8% ReProver: 26.5% LLMStep: 27.9% GPT-f: 64乘以8乘以512 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 5 ] 1 × 32 × 100 1 32 100 1\times 32\times 100 25.8 % percent 25.8 25.8\% ReProver [ 55 ] - 26.5 % percent 26.5 26.5\% LLMStep [ 51 ] 1 × 32 × 100 1 32 100 1\times 32\times 100 27.9 % percent 27.9 27.9\% GPT-f [ 35 ] 64 × 8 × 512 64 8 512 64\times 8\times 512 36.6 % percent 36.6 36.6\% Hypertree Proof Search [ 23 ] 64 × 5000 64 5000 64\times 5000 41.0 % percent 41.0 41.0\% Lean-STaR [ 26 ] 64 × 1 × 50 64 1 50 64\times 1\times 50 46.3 % percent 46.3 46.3\% InternLM2-Math-7B [ 58 ] 1 × 32 × 100 1 32 100 1\times 32\times 100 30.3 % percent 30.3 30.3\% InternLM2-Math-Plus-7B [ 58 ] 1 × 32 × 100 1 ...

4 Experimental Results

使用两种引导提示的混合策略性能（详见第4.2节）。 ProofNet结果：表2展示了ProofNet数据集上各种定理证明方法的比较分析。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: rmance using a mixture strategy with two guiding prompts (see Section 4.2 for details). Results on ProofNet. Table 2 presents a comparative analysis of various theorem-proving methods on the ProofNet dataset. DeepSeek-Prover-V1.5-RL achieved pass rates of 22.6% and 25.3% for the overall ProofNet dataset in the single-pass whole-proof generation setting and with the enhancement of RMaxTS, respectively. These results surpass the existing state-of-the-art methods, ReProver (13.8%) and InternLM2-StepProver (18.1%). When the number of whole-proof generation attempts is restricted to 3200, DeepSeek-...

4 Experimental Results

3200: 21.4%±0.3%, 22.0%±0.5%, 21.7%±0.4% 4乘以6400: 21.6% DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 3200 21.4 % ± 0.3 % plus-or-minus percent 21.4 percent 0.3 21.4\%\pm 0.3\% 22.0 % ± 0.5 % plus-or-minus percent 22.0 percent 0.5 22.0\%\pm 0.5\% 21.7 % ± 0.4 % plus-or-minus percent 21.7 percent 0.4 21.7\%\pm 0.4\% 4 × 6400 4 6400 4\times 6400 21.6 % percent 21.6 21.6\% 23.7 % percent 23.7 23.7\% 22.6 % percent 22.6 22.6\% Tree Search Methods ReProver [ 55 ] - - - 13.8 % percent 13.8 13.8\% InternLM2-StepProver [ 52 ] 1 × 32 × 100 1 32 100 1\times 32\times 100 - - 18.1 % percent 18.1 18.1\% DeepSeek-Prover-V1.5-SFT + RMaxTS 1 × 3200 1 3200 1\times 3200 22.2 % ± 0.7 % plus-or-minus percent 22.2...

4 Experimental Results

唯一使用大样本预算的版本。比较结果以两列形式呈现在表3中。DeepSeek-Prover-V1.5-RL在所有生成设置中 consistently 优于SFT模型。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: only version using a large sample budget. The comparison results are presented as two columns in Table 3 . DeepSeek-Prover-V1.5-RL consistently outperforms the SFT model across all generation settings, regardless of whether the chain-of-thought strategy is applied. The results also indicate that the improvements gained from conducting online RL is orthogonal to those achieved through RMaxTS, which can be further combined to boost the performance. By integrating both CoT prompting and RMaxTS, DeepSeek-Prover-V1.5-RL achieves a pass rate of 62.7 % percent 62.7 62.7\% on miniF2F-test. This perfor...

4 Experimental Results

模型，在miniF2F-test上达到63.5%的通过率。在附录B中，我们展示了说明两种生成模式不同优势的示例问题。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: el, achieving a pass rate of 63.5 % percent 63.5 63.5\% on miniF2F-test. In Appendix B , we present example problems that illustrate the different advantages of the two generation modes. Prompt mode Sample budget DeepSeek-Prover-V1.5 SFT RL non-CoT 4 × 6400 4 6400 4\times 6400 54.7 % ± 0.4 % plus-or-minus percent 54.7 percent 0.4 54.7\%\pm 0.4\% 56.5 % ± 0.5 % plus-or-minus percent 56.5 percent 0.5 56.5\%\pm 0.5\% 16 × 6400 16 6400 16\times 6400 56.1 % percent 56.1 56.1\% 57.4 % percent 57.4 57.4\% Single-Pass CoT 4 × 6400 4 6400 4\times 6400 55.8 % ± 0.7 % plus-or-minus percent 55.8 percent 0...

4 Experimental Results

3.1% (16+16)乘以6400: 60.2%, 63.5% 表3：大规模消融研究，调查几种算法设计在模型训练中的有效性。结果在miniF2F-test上评估。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 3.1\% ( 16 + 16 ) × 6400 16 16 6400 (16+16)\times 6400 60.2 % percent 60.2 60.2\% 63.5% Table 3: A large-scale ablation study to investigate the effectiveness of several algorithmic designs on model training. The results are evaluated on the miniF2F-test dataset. 4.3 Ablation Studies on RMaxTS Intrinsic Rewards and Discounted UCB. We investigate the effectiveness of two core components of RMaxTS, i.e. , the intrinsic rewards defined in Eq. ( 3 ) and the discounted upper confidence bound stated in Eq. ( 7 ). We start with a baseline implementing the standard UCT algorithm (Kocsis and Szepesvári...

4 Experimental Results

iF2F-test 单次通过生成 4乘以6400: 58.4%±0.5%, 16乘以6400: 60.2% UCT 4乘以6400: 58.2%±0.3% DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: iF2F-test Single-Pass Generation 4 × 6400 4 6400 4\times 6400 58.4 % ± 0.5 % plus-or-minus percent 58.4 percent 0.5 58.4\%\pm 0.5\% 16 × 6400 16 6400 16\times 6400 60.2 % percent 60.2 60.2\% UCT 4 × 6400 4 6400 4\times 6400 58.2 % ± 0.3 % plus-or-minus percent 58.2 percent 0.3 58.2\%\pm 0.3\% (without R intrinsic subscript 𝑅 intrinsic R_{\text{intrinsic}} ) 16 × 6400 16 6400 16\times 6400 61.1 % percent 61.1 61.1\% RMaxTS 4 × 6400 4 6400 4\times 6400 58.6 % ± 0.3 % plus-or-minus percent 58.6 percent 0.3 58.6\%\pm 0.3\% (DUCB → → \rightarrow UCB1) 16 × 6400 16 6400 16\times 6400 60.7 % percent ...

4 Experimental Results

在缺乏策略状态信息的情况下表现适中，特别是在处理需要大量样本的困难问题时。这突显了编译器信息整合是树搜索算法的 essential 组件。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: mes moderate in the absence of tactic state information, especially when tackling hard problems that require a large amount of samples. This highlights that the integration of compiler information is an essential component of the tree search algorithm, enhancing its overall effectiveness and sample efficiency.

4.1 Main Results

4.1 主要结果我们展示了DeepSeek-Prover-V1.5与之前最先进语言模型的比较分析，突出其性能和进步。 • 通用模型：GPT-3.5和GPT-4是先进的生成式AI模型。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: We present a comparative analysis of DeepSeek-Prover-V1.5 against previous state-of-the-art language models, highlighting its performance and advancements. • General-purpose Models: GPT-3.5 and GPT-4 (OpenAI, 2023 ) are advanced generative AI models developed by OpenAI, known for their effectiveness across diverse tasks, including code generation. Despite not being specifically designed for theorem proving, their extensive parameter scales provide significant capabilities. The evaluation of these models in formal theorem proving is facilitated by COPRA (Thakur et al., 2023 ) , an in-context le...

4.1 Main Results

预算K作为生成的证明总数，K的大值因式分解以便于与树搜索方法比较。 • 对于最佳优先搜索方法，遵循Azerbayev等人的表示法，我们呈现K=N乘以S。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: udget K 𝐾 K as the total number of proofs generated, with large values of K 𝐾 K factorized for the ease of comparison to tree search methods. • For best-first-search methods, following the notation of Azerbayev et al. ( 2024 ) , we present K = N × S × T 𝐾 𝑁 𝑆 𝑇 K=N\times S\times T where N 𝑁 N denotes the number of best-first-search attempts, S 𝑆 S denotes the number of tactics generated for each expansion, and T 𝑇 T denotes the number of expansion iterations. • For tree search methods, e.g. , RMaxTS and HTPS (Lample et al., 2022 ) , we present K = N × T 𝐾 𝑁 𝑇 K=N\times T where N 𝑁 N denotes th...

4.1 Main Results

0%: 50.0% DeepSeek-Prover-V1.5-Base: 128->29.7%±0.5%, 3200->39.2%, 6400->42.2% DeepSeek-Prover-V1.5-SFT: 32->48.2%±0.6%

原文: 0 % percent 50.0 50.0\% DeepSeek-Prover-V1.5-Base 128 29.7 % ± 0.5 % plus-or-minus percent 29.7 percent 0.5 29.7\%\pm 0.5\% 3200 39.2 % percent 39.2 39.2\% 6400 42.2 % percent 42.2 42.2\% DeepSeek-Prover-V1.5-SFT 32 48.2 % ± 0.6 % plus-or-minus percent 48.2 percent 0.6 48.2\%\pm 0.6\% 64 49.6 % ± 0.7 % plus-or-minus percent 49.6 percent 0.7 49.6\%\pm 0.7\% 128 128 128 50.4 % ± 0.4 % plus-or-minus percent 50.4 percent 0.4 50.4\%\pm 0.4\% 3200 3200 3200 53.3 % ± 0.5 % plus-or-minus percent 53.3 percent 0.5 53.3\%\pm 0.5\% 4 × 6400 4 6400 4\times 6400 55.8 % ± 0.7 % plus-or-minus percent 55.8 per...

4.1 Main Results

B: 1乘以32乘以100: 30.3% InternLM2-Math-Plus-7B: 1乘以32乘以100: 43.4% InternLM2-StepProver: 1乘以32乘以100 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: B [ 58 ] 1 × 32 × 100 1 32 100 1\times 32\times 100 30.3 % percent 30.3 30.3\% InternLM2-Math-Plus-7B [ 58 ] 1 × 32 × 100 1 32 100 1\times 32\times 100 43.4 % percent 43.4 43.4\% InternLM2-StepProver [ 52 ] 1 × 32 × 100 1 32 100 1\times 32\times 100 48.8 % percent 48.8 48.8\% 64 × 32 × 100 64 32 100 64\times 32\times 100 54.5 % percent 54.5 54.5\% DeepSeek-Prover-V1.5-SFT + RMaxTS 1 × 3200 1 3200 1\times 3200 53.5 % ± 0.4 % plus-or-minus percent 53.5 percent 0.4 53.5\%\pm 0.4\% 4 × 6400 4 6400 4\times 6400 56.3 % ± 0.3 % plus-or-minus percent 56.3 percent 0.3 56.3\%\pm 0.3\% 16 × 6400 16 6400 ...

4.1 Main Results

3.8%）和InternLM2-StepProver（18.1%）。当整个证明生成尝试次数限制为3200时，DeepSeek-Prover-V1.5也证明了21.7%的定理，比之前的最先进InternLM2-S提高了3.6%。

原文: 3.8%) and InternLM2-StepProver (18.1%). When the number of whole-proof generation attempts is restricted to 3200, DeepSeek-Prover-V1.5 also proved 21.7% of the theorems, demonstrating a 3.6% improvement over the previous state-of-the-art, InternLM2-StepProver. Method Sample budget ProofNet valid ‡ test all Single-pass Whole-Proof Generation Methods DeepSeek-Prover-V1.5-Base 128 128 128 6.6 % ± 0.9 % plus-or-minus percent 6.6 percent 0.9 6.6\%\pm 0.9\% 9.7 % ± 0.7 % plus-or-minus percent 9.7 percent 0.7 9.7\%\pm 0.7\% 7.5 % ± 0.7 % plus-or-minus percent 7.5 percent 0.7 7.5\%\pm 0.7\% 3200 3200 ...

4.1 Main Results

18.1% DeepSeek-Prover-V1.5-SFT + RMaxTS: 1乘以3200: 22.2%±0.7%, 21.6%±0.2%, 21.9%±0.4% DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: percent 18.1 18.1\% DeepSeek-Prover-V1.5-SFT + RMaxTS 1 × 3200 1 3200 1\times 3200 22.2 % ± 0.7 % plus-or-minus percent 22.2 percent 0.7 22.2\%\pm 0.7\% 21.6 % ± 0.2 % plus-or-minus percent 21.6 percent 0.2 21.6\%\pm 0.2\% 21.9 % ± 0.4 % plus-or-minus percent 21.9 percent 0.4 21.9\%\pm 0.4\% 4 × 6400 4 6400 4\times 6400 23.8 % percent 23.8 23.8\% 25.8% 24.8 % percent 24.8 24.8\% DeepSeek-Prover-V1.5-RL + RMaxTS 1 × 3200 1 3200 1\times 3200 22.0 % ± 0.3 % plus-or-minus percent 22.0 percent 0.3 22.0\%\pm 0.3\% 21.5 % ± 0.8 % plus-or-minus percent 21.5 percent 0.8 21.5\%\pm 0.8\% 21.8 % ± 0.4 % p...

4.2 Re-Examining the Effectiveness of Training Strategies on Large-scale Sampling

4.2 重新审视训练策略在大规模采样设置中的有效性我们重新访问了几个训练模块在大规模采样设置中的效果，关注单次通过整个证明生成和蒙特卡洛树搜索。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: We revisit the effects of several training modules in n a large-scale sampling setting, focusing on both single-pass whole-proof generation and Monte-Carlo tree search. The results demonstration that the observations and findings presented in Section 2.4 generalize to sampling scenarios with a large sample size. General Enhancement of Reinforcement Learning. To support the claim that online reinforcement learning from verification feedback generally enhances the model capabilities, we compare our final model to the SFT-only version using a large sample budget. The comparison results are presen...

4.2 Re-Examining the Effectiveness of Training Strategies on Large-scale Sampling

在数学思维中处于active状态，而在non-CoT模式中，模型可以高效使用Lean高级策略来解决Lean自动化机制内可解决的计算问题。为利用这些优势，我们考虑了混合策略。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: active in mathematical thinking, while in the non-CoT mode, the model can efficiently use Lean high-level tactics to solve computational problems that can be addressed within Lean’s automation mechanisms. To leverage these advantages, we consider a mixture strategy, denoted by non-CoT & CoT in Table 3 , allocates half of sample budget to the CoT mode and the remains to the non-CoT mode. This simple combination of two guiding prompts shows great promise in further bootstrapping the performance of our proof completion model, achieving a pass rate of 63.5 % percent 63.5 63.5\% on miniF2F-test. In...

4.2 Re-Examining the Effectiveness of Training Strategies on Large-scale Sampling

times 6400: 56.3%±0.3%, 59.6%±0.6% 16乘以6400: 59.0%, 62.7% non-CoT & CoT DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: times 6400 56.3 % ± 0.3 % plus-or-minus percent 56.3 percent 0.3 56.3\%\pm 0.3\% 59.6 % ± 0.6 % plus-or-minus percent 59.6 percent 0.6 59.6\%\pm 0.6\% 16 × 6400 16 6400 16\times 6400 59.0 % percent 59.0 59.0\% 62.7 % percent 62.7 62.7\% non-CoT & CoT ( 2 + 2 ) × 6400 2 2 6400 (2+2)\times 6400 56.1 % ± 0.8 % plus-or-minus percent 56.1 percent 0.8 56.1\%\pm 0.8\% 60.0 % ± 0.8 % plus-or-minus percent 60.0 percent 0.8 60.0\%\pm 0.8\% ( 8 + 8 ) × 6400 8 8 6400 (8+8)\times 6400 59.0 % percent 59.0 59.0\% 63.1 % percent 63.1 63.1\% ( 16 + 16 ) × 6400 16 16 6400 (16+16)\times 6400 60.2 % percent 60.2 ...

4.3 Ablation Studies on RMaxTS

4.3 RMaxTS消融研究内在奖励和折扣UCB：我们调查了RMaxTS两个核心组件的有效性，即公式(3)中定义的内在奖励和公式(7)中陈述的折扣上置信界。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Intrinsic Rewards and Discounted UCB. We investigate the effectiveness of two core components of RMaxTS, i.e. , the intrinsic rewards defined in Eq. ( 3 ) and the discounted upper confidence bound stated in Eq. ( 7 ). We start with a baseline implementing the standard UCT algorithm (Kocsis and Szepesvári, 2006 ) without intrinsic rewards, in which the exploration is driven exclusively by the UCB bonus. Note that, since no non-zero rewards are provided for this baseline, all variants of the UCB formula become equivalent, as node selection is determined solely by visitation counts. The experimen...

4.3 Ablation Studies on RMaxTS

没有R_intrinsic: 16乘以6400: 61.1% RMaxTS: 4乘以6400: 58.6%±0.3% (DUCB-> DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: without R intrinsic subscript 𝑅 intrinsic R_{\text{intrinsic}} ) 16 × 6400 16 6400 16\times 6400 61.1 % percent 61.1 61.1\% RMaxTS 4 × 6400 4 6400 4\times 6400 58.6 % ± 0.3 % plus-or-minus percent 58.6 percent 0.3 58.6\%\pm 0.3\% (DUCB → → \rightarrow UCB1) 16 × 6400 16 6400 16\times 6400 60.7 % percent 60.7 60.7\% RMaxTS 4 × 6400 4 6400 4\times 6400 58.4 % ± 0.3 % plus-or-minus percent 58.4 percent 0.3 58.4\%\pm 0.3\% (without tactic state) 16 × 6400 16 6400 16\times 6400 61.1 % percent 61.1 61.1\% RMaxTS 4 × 6400 4 6400 4\times 6400 59.6 % ± 0.6 % plus-or-minus percent 59.6 percent 0.6 59.6\...

4.3 Ablation Studies on RMaxTS

采样效率。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: mple efficiency.

5 Conclusion, Limitation, and Future Work

5 结论、局限性和未来工作我们提出了DeepSeek-Prover-V1.5，一个拥有70亿参数的语言模型，在Lean 4形式化定理证明方面超越了所有开源模型。DeepSeek-Prover-V1.5初始化为DeepSeek-Prover-V1.5-Base。

原文: We present DeepSeek-Prover-V1.5, a language model with 7 billion parameters that outperforms all open-source models in formal theorem proving in Lean 4. DeepSeek-Prover-V1.5 is initialized with DeepSeek-Prover-V1.5-Base, which extends the pre-training of DeepSeekMath-Base 7B using a specialized corpus for formal mathematical reasoning. Supervised fine-tuning is conducted on a comprehensive Lean 4 code completion dataset, encompassing a wide range of formal theorems from various mathematical domains. Subsequently, we employ GRPO to enhance whole-proof generation through online reinforcement lea...

5 Conclusion, Limitation, and Future Work

反馈到逐步价值差异。开发评估长规划路径并提供指导奖励的评论家模型提出了一个关键且具有挑战性的问题。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: eedback into step-wise value differences (Arjona-Medina et al., 2019 ) . Developing critic models for assessing long planning paths and providing guidance rewards presents a crucial and challenging problem (Ng and Russell, 2000 ; Sorg et al., 2010 ) that warrants further investigation. Finally, recent work has progressed beyond proving individual theorems to addressing real-world theory proving within complex, multi-theorem Lean files (Hu et al., 2024 ) . This shift is a natural extension of our whole-proof generation approach. Our observations indicate that the current model already possesses...

Appendix A Illustrative Examples of Non-CoT and CoT Prompting for Proof Completion

附录A 非CoT和CoT提示的示例在本节中，我们展示了非CoT和CoT提示的示例以及生成的证明，针对aime_1983_p9，来自miniF2F基准测试的问题。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: In this section, we present examples of non-CoT and CoT prompting and the generated proofs for 𝚊𝚒𝚖𝚎 _ 𝟷𝟿𝟾𝟹 _ 𝚙𝟿 𝚊𝚒𝚖𝚎 _ 1983 _ 𝚙𝟿 \mathtt{aime\_1983\_p9} , a problem from the miniF2F benchmark (Zheng et al., 2022 ) . For clarity, inline LaTeX expressions are utilized. Firstly, here is an example of whole-proof generation in non-CoT prompting mode: Input : ⬇ Complete the following Lean 4 code : “‘ lean4 import Mathlib import Aesop set_option maxHeartbeats 0 open BigOperators Real Nat Topology Rat /– Show that $ \ frac {9 x ^2\ sin ^2 x + 4}{ x \ sin x } \ geq 12 $ for $0 < x < \ pi$ .-/ ...

Appendix A Illustrative Examples of Non-CoT and CoT Prompting for Proof Completion

利用sin x和x在给定范围内为正的事实来建立必要的不等式。最后，我们应用这些结果得出结论，最小值确实是12。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: se the fact that $ \ sin x$ and $x$ are positive in the given range to establish the necessary inequalities . Finally , we apply these results to conclude that the minimum value is indeed 12. -/ – We start by ensuring that the product $x * \ sin x$ is positive in the given range . have h ₁ : 0 < x * Real . sin x := by apply mul_pos – $x$ is positive in the range $ (0, \ pi ) $ . exact h ₀.1 – $ \ sin x$ is positive in the range $ (0, \ pi ) $ . exact Real . sin_pos_of_pos_of_lt_pi h ₀.1 h ₀.2 – Using the division property of inequalities , we rewrite the expression . rw [ le_div_iff h ₁] /- ta...

Appendix B Example Solutions to MiniF2F-test Problems

附录B MiniF2F-test问题示例解决方案为说明non-CoT模式和CoT模式提示的不同优势和互补性，我们提供了non-CoT模式成功而CoT模式失败的示例，反之亦然。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: To illustrate the different advantages and complementary of non-CoT mode and CoT mode of prompting, we provide examples where non-CoT mode succeeds while CoT mode fails, and vice versa. B.1 Problems that non-CoT outperforms CoT ⬇ theorem induction_pord1p1on2powklt5on2 ( n : ℕ) ( h ₀ : 0 < n ) : ( $ \ Pi$ k in Finset . Icc 1 n , 1 + (1 : ℝ) / 2 ^ k ) < 5 / 2 := by rw [ Finset . Icc ] simp_all [ Nat . succ_le_iff , Nat . one_le_iff_ne_zero ] have h ₁ : 0 < 2 ^ k := by apply pow_pos <;> norm_num norm_num have h ₂ : (2 : ℝ) ≠ 0 := by norm_num field_simp rw [ div_lt_div_iff ] ring_nf norm_cast nlin...

B.1 Problems that non-CoT outperforms CoT

定理induction_pord1p1on2powklt5on2的示例证明代码。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: ⬇ theorem induction_pord1p1on2powklt5on2 ( n : ℕ) ( h ₀ : 0 < n ) : ( $ \ Pi$ k in Finset . Icc 1 n , 1 + (1 : ℝ) / 2 ^ k ) < 5 / 2 := by rw [ Finset . Icc ] simp_all [ Nat . succ_le_iff , Nat . one_le_iff_ne_zero ] have h ₁ : 0 < 2 ^ k := by apply pow_pos <;> norm_num norm_num have h ₂ : (2 : ℝ) ≠ 0 := by norm_num field_simp rw [ div_lt_div_iff ] ring_nf norm_cast nlinarith all_goals norm_cast all_goals linarith ⬇ theorem imo_1960_p2 ( x : ℝ) ( h ₀ : 0 ≤ 1 + 2 * x ) ( h ₁ : (1 - Real . sqrt (1 + 2 * x )) ^ 2 ≠ 0) ( h ₂ : 4 * x ^ 2 / (1 - Real . sqrt (1 + 2 * x )) ^ 2 < 2 * x + 9) : -(1 / 2) ≤...

B.2 Problems that CoT outperforms non-CoT

B.2 CoT优于non-CoT的问题定理mathd_algebra_459的示例，涉及有理数方程组求解。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: ⬇ theorem mathd_algebra_459 ( a b c d : ℚ) ( h ₀ : 3 * a = b + c + d ) ( h ₁ : 4 * b = a + c + d ) ( h ₂ : 2 * c = a + b + d ) ( h ₃ : 8 * a + 10 * b + 6 * c = 24) : ↑ d . den + d . num = 28 := by /- Given the system of equations involving the times that Anna ( a ), Bertram ( b ), Carli ( c ), and David ( d ) can hold their breaths : 1. $ 3 a = b + c + d $ 2. $ 4 b = a + c + d $ 3. $ 2 c = a + b + d $ 4. $ 8 a + 10 b + 6 c = 24 $ We need to find the sum of the numerator and the denominator of the fraction representing David ’ s time in minutes , simplified to its lowest terms , and show that i...

← 返回首页详细解读