DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback
for Reinforcement Learning and Monte-Carlo Tree Search
章节标题:DeepSeek-Prover-V1.5:利用证明助手反馈进行强化学习与蒙特卡洛树搜索 Huajian Xin* Z.Z. Ren Junxiao Song Zhihong Shao Wanjia Zhao Haocheng Wang Bo Liu Liyue Zhang Xuan Lu Qiushi Du Wenjun Gao Qihao Zhu Dejian Yang Zhibin Gou Z.F. Wu Fuli Luo Chong Ruan DeepSeek-AI https://github.com/deepseek-ai/DeepSeek-Prover-V1.5 摘要 我们介绍了 DeepSeek-Prover-V1.5,这是一个专为 Lean 4 定理证明设计的开源语言模型,该模型通过优化训练与推理过程,对 DeepSeek-Prover-V1 进行了增强。模型基于 DeepSeekMath-Base 进行预训练,并针对形式化数学语言进行了专项优化,随后采用增强形式的……进行监督微调。
[原文]Huajian Xin* Z.Z. Ren Junxiao Song Zhihong Shao Wanjia Zhao Haocheng Wang Bo Liu Liyue Zhang Xuan Lu Qiushi Du Wenjun Gao Qihao Zhu Dejian Yang Zhibin Gou Z.F. Wu Fuli Luo Chong Ruan DeepSeek-AI https://github.com/deepseek-ai/DeepSeek-Prover-V1.5 Abstract We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced form...
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback
for Reinforcement Learning and Monte-Carlo Tree Search
[原文]of the verification system. Even advanced models like GPT-4 (OpenAI, 2023 ) struggle with complex formal proofs, underscoring the intricate nature of both the coding and the mathematics involved. A formal theorem proving model must not only grasp the syntax and semantics of formal systems like the Lean theorem prover but also align abstract mathematical reasoning with precise formal representation. Language models in formal theorem proving typically employ two strategies: proof-step generation (Polu and Sutskever, 2020 ; Jiang et al., 2022a ; Lample et al., 2022 ; Yang et al., 2023 ; Wu et al....
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback
for Reinforcement Learning and Monte-Carlo Tree Search
[原文]plicity and computational efficiency of whole-proof generation, we have developed a unified approach in DeepSeek-Prover-V1.5. This method combines the strengths of both proof-step and whole-proof generation techniques through a truncate-and-resume mechanism. The process begins with standard whole-proof generation, where the language model completes the proof code following the theorem statement prefix. The Lean prover then verifies this code. If the proof is correct and complete, the procedure terminates. If an error is detected, the code is truncated at the first error message, and any subseq...
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback
for Reinforcement Learning and Monte-Carlo Tree Search
[原文]bjective) and complete the subsequent proof steps (main objective). In the reinforcement learning stage, given an incomplete theorem proof and ground-truth tactic state from the Lean prover, we roll out the fine-tuned model to generate multiple proof candidates, which are then verified by the Lean prover. The verification results for these candidates are used as binary (0-1) rewards to further optimize the model and enhance its alignment with the formal specifications of the verification system. For model inference, we offer two alternatives: single-pass sampling and Monte-Carlo tree search. 1...
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback
for Reinforcement Learning and Monte-Carlo Tree Search
[原文]t Learning : We employ the GRPO algorithm (Shao et al., 2024 ) to perform reinforcement learning from proof assistant feedback (RLPAF) on the supervised fine-tuned model. Verification results from the Lean prover serve as reward supervision, enhancing the model’s alignment with the formal specifications of the verification system. • Monte-Carlo Tree Search : We advance the tree search method in formal theorem proving by introducing a novel abstraction and a corresponding search algorithm. Our truncate-and-resume mechanism acts as a state-action abstraction, seamlessly integrating the tree sear...
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback
for Reinforcement Learning and Monte-Carlo Tree Search
[原文]2024 ) . This refinement involved training on high-quality datasets that include both code and natural language mathematical content. We specifically focused on formal languages widely used in proof assistants, such as Lean, Isabelle, and Metamath. We designate this improved model as DeepSeek-Prover-V1.5-Base. 2.2 Supervised Fine-tuning In this section, we explore the methodology and processes involved in the supervised fine-tuning (SFT) of DeepSeek-Prover-V1.5. Specifically, we augment the proof dataset from DeepSeek-Prover-V1 by adding detailed explanatory comments. This enhancement aims to ...
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback
for Reinforcement Learning and Monte-Carlo Tree Search
[原文]se data for the truncate-and-resume mechanism for Monte-Carlo Tree Search (details in Section 3.1 ). The resulting proof dataset consists of 9,645k sequences. Thought-augmented Proof Generation. In DeepSeek-Prover-V1, we identified a significant gap between problem-solving strategies in natural language and theorem proving in Lean. In natural language, models generate detailed deduction steps to construct proofs, whereas in Lean, they often rely on a sequence of high-level tactic calls to brute-force solutions. These high-level tactics, while effective, obscure their internal workings and outc...
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback
for Reinforcement Learning and Monte-Carlo Tree Search
[原文]cate-and-resume mechanism for Monte-Carlo Tree Search, we needed to extract tactic information from the code generated by the model. We enhanced the Lean REPL (Read-Eval-Print Loop; Leanprover Community, 2023 ) with data extraction tools from the LeanDojo (Yang et al., 2023 ) project. This allowed us to extract tactic information in triples, which include the position of each tactic, as well as the tactic states before and after its application. This information helps us identify the specific tactic code that triggers verification errors (used in the expansion step for tree search, see Section...
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback
for Reinforcement Learning and Monte-Carlo Tree Search
[原文]et as training prompts. We select theorems for which DeepSeek-Prover-V1.5-SFT has a moderate success rate in generating correct proofs upon multiple attempts. This ensures that the model has room for improvement while still being able to receive positive feedback. After filtering, we retain approximately 4.5k unique theorem statements. Each theorem is prefixed with both CoT and non-CoT guiding prompts to enhance the model’s proof generation capabilities in both modes. Rewards. When training LLMs via RL, a trained reward model typically provides feedback signals. In contrast, formal theorem pro...
[原文]Recent advancements in large language models have significantly influenced mathematical reasoning and theorem proving in artificial intelligence. Despite notable progress in natural language domains, language models still encounter substantial challenges in formal theorem proving, e.g. using Lean (Moura and Ullrich, 2021 ) and Isabelle (Paulson, 1994 ) , which requires rigorous derivations satisfying formal specifications of the verification system. Even advanced models like GPT-4 (OpenAI, 2023 ) struggle with complex formal proofs, underscoring the intricate nature of both the coding and the ...
[原文]roof state. This sequential nature introduces the risk of compounding errors (Ross et al., 2011 ) , where a single misinterpretation can lead to significant deviations from a valid proof path. More specifically, the auto-regressive model may have incorrect believes on intermediate tactic states when generating long proofs. To seamlessly integrate intermediate tactic states in proof-step generation while maintaining the simplicity and computational efficiency of whole-proof generation, we have developed a unified approach in DeepSeek-Prover-V1.5. This method combines the strengths of both proof...
[原文]utilize the proof assistant feedback and generate diverse solution candidates. Figure 2: Overall Framework. DeepSeek-Prover-V1.5 is trained through pre-training, supervised fine-tuning, and reinforcement learning. During supervised fine-tuning, the pre-trained model receives an incomplete theorem proof ending with a tactic state comment keyword. The model is trained to predict the content of this tactic state (auxiliary objective) and complete the subsequent proof steps (main objective). In the reinforcement learning stage, given an incomplete theorem proof and ground-truth tactic state from t...
[原文]we use DeepSeek-Coder V2 236B (Zhu et al., 2024 ) to annotate natural language chain-of-thought comments alongside Lean 4 code, aligning formal theorem proving with natural language reasoning. Second, we insert intermediate tactic state information within the Lean 4 proof code, enabling our model to leverage compiler feedback effectively. The resulting dataset is then used to fine-tune the pre-trained model. • Reinforcement Learning : We employ the GRPO algorithm (Shao et al., 2024 ) to perform reinforcement learning from proof assistant feedback (RLPAF) on the supervised fine-tuned model. Ver...
[原文]% on the validation set and 23.7% on the test set. The integration of tree search techniques further enhanced these results, achieving new state-of-the-art pass rates of 25.4% on the validation set and 25.3% on the test set.
[原文]We present a comprehensive framework for developing a language model-based formal mathematics prover, integrating several key components: large-scale mathematical pre-training, formal mathematics corpus construction and augmentation, online reinforcement learning from proof assistant feedback, and a tree search methodology for long-term planning in theorem proving. The pre-trained model, supervised fine-tuned model, and reinforcement learning model, along with the code for the Monte-Carlo tree search algorithm, are publicly available for further research and application. • Pre-Training : We en...
[原文]he whole-proof generation framework. We present RMaxTS, an innovative Monte-Carlo tree search algorithm that leverages the RMax (Brafman and Tennenholtz, 2002 ) strategy to tackle exploration challenges in sparse-reward proof search problems. By assigning intrinsic rewards, this algorithm encourages the prover agent to generate diverse planning paths, thereby fostering extensive exploration of the proof space.
[原文]• miniF2F : In the single-pass whole-proof generation setting, DeepSeek-Prover-V1.5 achieved a pass rate of 60.2% on the test set of miniF2F, marking a significant improvement of absolute 10.2 percentage points over DeepSeek-Prover-V1’s 50.0%. Incorporating tree search techniques further elevated the pass rate to a new state-of-the-art 63.5% . • ProofNet : DeepSeek-Prover-V1.5 also demonstrated strong performance in the single-pass whole-proof generation setting for ProofNet, with pass rates of 21.6% on the validation set and 23.7% on the test set. The integration of tree search techniques fur...
[原文]2.1 Pre-training To enhance our language model’s proficiency in generating formal proofs and reasoning through mathematical language, we further pre-train our base model (Shao et al., 2024 ) . This refinement involved training on high-quality datasets that include both code and natural language mathematical content. We specifically focused on formal languages widely used in proof assistants, such as Lean, Isabelle, and Metamath. We designate this improved model as DeepSeek-Prover-V1.5-Base. 2.2 Supervised Fine-tuning In this section, we explore the methodology and processes involved in the sup...
[原文]ditional proof data. Between each iteration, we use DeepSeek-Coder V2 236B (Zhu et al., 2024 ) to annotate the thought process before the proof code as comments. Finally, we tailor these data for the truncate-and-resume mechanism for Monte-Carlo Tree Search (details in Section 3.1 ). The resulting proof dataset consists of 9,645k sequences. Thought-augmented Proof Generation. In DeepSeek-Prover-V1, we identified a significant gap between problem-solving strategies in natural language and theorem proving in Lean. In natural language, models generate detailed deduction steps to construct proofs,...
[原文]non-CoT mode for proof code completion. Examples of input and output in both modes can be found in Appendix A . Prompt Augmentation with Tactic State Information. To implement the truncate-and-resume mechanism for Monte-Carlo Tree Search, we needed to extract tactic information from the code generated by the model. We enhanced the Lean REPL (Read-Eval-Print Loop; Leanprover Community, 2023 ) with data extraction tools from the LeanDojo (Yang et al., 2023 ) project. This allowed us to extract tactic information in triples, which include the position of each tactic, as well as the tactic states ...
[原文]4 prover. The specifics of this RL process are detailed below. Prompts. In the reinforcement learning stage, we use a subset of theorem statements from the supervised fine-tuning dataset as training prompts. We select theorems for which DeepSeek-Prover-V1.5-SFT has a moderate success rate in generating correct proofs upon multiple attempts. This ensures that the model has room for improvement while still being able to receive positive feedback. After filtering, we retain approximately 4.5k unique theorem statements. Each theorem is prefixed with both CoT and non-CoT guiding prompts to enhance ...
[原文]ning Setting. We conduct RL training based on the SFT model, which serves as both the initial model and the reference model for imposing the Kullback-Leibler (KL) divergence penalty. We use a constant learning rate of 5e-6, and the KL penalty coefficient is set to 0.02. For each theorem, we sample a group of 32 candidate proofs, with maximum length set to 2,048. The training batch size is configured to 512. 2.4 Evaluation Benchmarks. We evaluate theorem-proving performance on the following benchmarks to compare model capabilities after each training stage: • MiniF2F (Zheng et al., 2022 ) focus...
[原文]table pass rate, solving nearly one-third of the problems on the test set of the miniF2F benchmark using 3-shot prompting. The supervised fine-tuning stage, resulting in DeepSeek-Prover-V1.5-SFT, significantly outperforms the base model, with Pass@128 accuracy increasing by approximately two-thirds on miniF2F and doubling on ProofNet. The subsequent reinforcement learning stage further enhances the model’s performance, improving Pass@ K 𝐾 K accuracy across all values of K 𝐾 K . In contrast to findings in natural language mathematics, such as those reported in DeepSeekMath (Shao et al., 2024 ) ...
[原文]To enhance our language model’s proficiency in generating formal proofs and reasoning through mathematical language, we further pre-train our base model (Shao et al., 2024 ) . This refinement involved training on high-quality datasets that include both code and natural language mathematical content. We specifically focused on formal languages widely used in proof assistants, such as Lean, Isabelle, and Metamath. We designate this improved model as DeepSeek-Prover-V1.5-Base.
[原文]In this section, we explore the methodology and processes involved in the supervised fine-tuning (SFT) of DeepSeek-Prover-V1.5. Specifically, we augment the proof dataset from DeepSeek-Prover-V1 by adding detailed explanatory comments. This enhancement aims to improve the alignment between natural language descriptions and Lean 4 code, thereby facilitating better formal mathematical reasoning. Additionally, we incorporate intermediate tactic state information as an auxiliary prediction task to support the truncate-and-resume mechanism used in the Monte-Carlo Tree Search process. We refer to th...
[原文]tural language, models generate detailed deduction steps to construct proofs, whereas in Lean, they often rely on a sequence of high-level tactic calls to brute-force solutions. These high-level tactics, while effective, obscure their internal workings and outcomes, hindering the model’s ability to resolve complex proof goals with structured mathematical reasoning. To address this issue, we develop an approach that incorporates natural language reasoning before generating theorem proof code. Similar to Lean-STaR (Lin et al., 2024 ) , which performs isolated chain-of-thought reasoning (Wei et a...
[原文]les, which include the position of each tactic, as well as the tactic states before and after its application. This information helps us identify the specific tactic code that triggers verification errors (used in the expansion step for tree search, see Section 3.2 ). For each tactic in a generated valid formal proof, we insert the tactic state returned by the verifier as a comment "/- tactic state: … -/". During training, we use all tokens following "/- tactic state: " as responses to calculate the supervised fine-tuning loss, while the tokens before this comment is used as prompts and do not...
2.3 Reinforcement Learning from Proof Assistant Feedback
[原文]Reinforcement learning (RL) has been proven effective in enhancing the mathematical reasoning capabilities of supervised fine-tuned language models (Shao et al., 2024 ) . To further advance DeepSeek-Prover-V1.5-SFT, we incorporate a reinforcement learning phase, resulting in the model DeepSeek-Prover-V1.5-RL. This phase leverages RL to enhance performance based on verification feedback from the Lean 4 prover. The specifics of this RL process are detailed below. Prompts. In the reinforcement learning stage, we use a subset of theorem statements from the supervised fine-tuning dataset as trainin...
2.3 Reinforcement Learning from Proof Assistant Feedback
[原文]ional critic model. Specifically, GRPO samples a group of candidate proofs for each theorem prompt and optimizes the model based on the relative rewards of the outputs within the group. Our prompt selection strategy is designed to likely include both correct and incorrect proofs among the candidates, aligning well with the group-relative nature of GRPO and thereby enhancing the training process. Training Setting. We conduct RL training based on the SFT model, which serves as both the initial model and the reference model for imposing the Kullback-Leibler (KL) divergence penalty. We use a const...
[原文]Benchmarks. We evaluate theorem-proving performance on the following benchmarks to compare model capabilities after each training stage: • MiniF2F (Zheng et al., 2022 ) focuses on formal problem-solving skills for high-school level exercises and competitions, such as AMC, AIME, and IMO, with an emphasis on algebra and number theory. The benchmark includes 244 validation and 244 test problems, originally in Lean 3 and manually converted to Lean 4.9.0, based on the version provided by Yang ( 2023 ) . • ProofNet (Azerbayev et al., 2023 ) evaluates formal theorem-proving capabilities at the underg...
[原文]nd the standard deviation σ 𝜎 \sigma . Prompting Configurations. For each proof attempt of DeepSeek-Prover-V1.5-Base, we independently sample three proof demonstrations from the validation set to construct the few-shot prompts. For the miniF2F benchmark, we use human-written proofs from Yang ( 2023 ) , while for the ProofNet benchmark, we use correct proofs generated by DeepSeek-Prover-V1.5-RL as few-shot demonstrations. For DeepSeek-Prover-V1.5-SFT and DeepSeek-Prover-V1.5-RL, we employ two types of guiding prompts: one that encourages chain-of-thought (CoT) reasoning before each proof step, ...
[原文]roving Pass@ K 𝐾 K accuracy across all values of K 𝐾 K . In contrast to findings in natural language mathematics, such as those reported in DeepSeekMath (Shao et al., 2024 ) , where reinforcement learning primarily boosts the correct response from TopK, we observe a genuine enhancement of fundamental capabilities in formal theorem proving. This improvement is evident not only with a small sample budget but also remains stable as the sample budget increases. This conclusion is further supported by later Monte-Carlo Tree Search experiments with larger sample budgets, as discussed in Section 4.2 ...
[原文]3.1 Tactic-level Tree Abstraction To implement the tree search method in the whole-proof generation setting, we introduce a proof tree abstraction to define the tailored state and action space, leveraging a truncate-and-resume mechanism. Roughly following the paradigm of Yao et al. ( 2023 ) , we begin by decomposing an incomplete proof into a sequence of tree nodes that correspond to individual proof steps, and then we utilize the partial content stored in these tree nodes to continue the proof generation process. Figure 4 illustrates the process of constructing a proof search tree from whole-...
[原文]he fine-tuned model (see Section 2.2 ) has been trained to recognize and utilize this format, using the incomplete code augmented with tactic state comments to guide subsequent proof generation. Figure 4: Truncate-and-Resume Mechanism in the Expansion Step of MCTS. (a) After selecting a node, we trace its corresponding incomplete proof code prefix, which includes the file header, initial statement, and successfully applied tactics from the ancestor nodes. (b) The language model then generates the subsequent proof based on this prefix along with a comment block containing the current tactic sta...
3 Exploration-oriented Monte-Carlo Tree Search
章节标题:3 面向探索的蒙特卡洛树搜索 ……探索与利用之间(Kocsis and Szepesvári, 2006)。树节点 s 处的树策略通过从有效操作集合中选择使价值最大化的动作来计算:\displaystyle TreePolicy(s)=\mathop{\arg\max}_{a\in Children(s)\cup\{\oslash\}}Q_{UCB}(s,a), (1) 其中动作 a 可以是移动到子节点,表示为 a \in Children……
[原文]een exploration and exploitation (Kocsis and Szepesvári, 2006 ) . The tree policy at a tree node s 𝑠 s is computed by selecting the action that maximizes the value from the set of valid operations: T r e e P o l i c y ( s ) = arg max a ∈ C h i l d r e n ( s ) ∪ { ⊘ } Q U C B ( s , a ) , 𝑇 𝑟 𝑒 𝑒 𝑃 𝑜 𝑙 𝑖 𝑐 𝑦 𝑠 subscript 𝑎 𝐶 ℎ 𝑖 𝑙 𝑑 𝑟 𝑒 𝑛 𝑠 ⊘ subscript 𝑄 𝑈 𝐶 𝐵 𝑠 𝑎 \displaystyle TreePolicy(s)=\mathop{\arg\max}_{a\in Children(s)\cup\{\oslash\}}Q_{UCB}(s,a), (1) where the action a 𝑎 a can be either moving to a child node, denoted by a ∈ C h i l d ...
[原文]from previous trials. U C B ( s , a ) 𝑈 𝐶 𝐵 𝑠 𝑎 UCB(s,a) denotes the exploration bonus computed by upper confidence bounds (UCB; Auer, 2002 ) , which diminishes with the repeated execution of the state-action pair ( s , a ) 𝑠 𝑎 (s,a) . More specifically, Q U C B ( s , a ) subscript 𝑄 𝑈 𝐶 𝐵 𝑠 𝑎 Q_{UCB}(s,a) stands for an optimistic estimation of Q ( s , a ) 𝑄 𝑠 𝑎 Q(s,a) and can serve as an upper bound with high probability. We defer the discussion of detailed settings of node values and UCB bonus to Section 3.3 . Expansion. The next step is invoking the proof generation model to e...
3 Exploration-oriented Monte-Carlo Tree Search
章节标题:3 面向探索的蒙特卡洛树搜索 n 轨迹从根节点到扩展节点,即,更新与公式 ( 1 ) 中所述的树策略相关的值。令 τ = { ( r o o t , s ( 1 ) ) , ( s ( 1 ) , s ( 2 ) ) , ( s ( 2 ) , s ( 3 ) ) , … , ( s ( | τ | − 1 ) = s t , ⊘ ) } 𝜏 𝑟 𝑜 𝑜 𝑡 superscript 𝑠 1 superscript 𝑠 1 superscript 𝑠 2 superscript 𝑠 2 superscript 𝑠 3 … superscript 𝑠 𝜏 1 subscript 𝑠 𝑡 ⊘ \tau=\{(root,s^{(1)}),(s^{(1)},s^{(2)}),(s^{(2)},s^{(3)}),\dots,(s^{(|\tau|-1)}=s_{t},\oslash)\} 表示第 t 𝑡 t 次迭代的选取轨迹,该轨迹以 s t subscript 𝑠 𝑡 s_{t} 作为扩展节点结束。我们更新...
[原文]n trajectory from the root to the expanded node, i.e. , updating the values associated with the tree policy stated in Eq. ( 1 ). Let τ = { ( r o o t , s ( 1 ) ) , ( s ( 1 ) , s ( 2 ) ) , ( s ( 2 ) , s ( 3 ) ) , … , ( s ( | τ | − 1 ) = s t , ⊘ ) } 𝜏 𝑟 𝑜 𝑜 𝑡 superscript 𝑠 1 superscript 𝑠 1 superscript 𝑠 2 superscript 𝑠 2 superscript 𝑠 3 … superscript 𝑠 𝜏 1 subscript 𝑠 𝑡 ⊘ \tau=\{(root,s^{(1)}),(s^{(1)},s^{(2)}),(s^{(2)},s^{(3)}),\dots,(s^{(|\tau|-1)}=s_{t},\oslash)\} denote the selection trajectory of t 𝑡 t -th iteration that ends with s t subscript 𝑠 𝑡 s_{t} as the expanding node. We upda...
[原文]rmation about the interactive environment (Bellemare et al., 2016 ; Houthooft et al., 2016 ; Pathak et al., 2017 ; Burda et al., 2019 ) . In this section, we present our intrinsic-reward-driven exploration algorithm, RMax applied to Tree Search (RMaxTS), to incorporate reward-free exploration in the proof search problem. RMax applied to MCTS. We adopt RMax (Brafman and Tennenholtz, 2002 ) , a classical exploration mechanism, to construct intrinsic rewards for Monte-Carlo tree search. The core idea of RMax is to explore a broad coverage of the state space. The agent awards itself a maximal amou...
[原文]d Lean verification. Each thread worker iteratively performs the tree search loop by selecting a candidate node for expansion, invoking the language model to generate the proof, verifying the generated proof with the Lean prover, and performing backpropagation. • Virtual Loss: To encourage diverse node selection among concurrent thread workers, we assign a virtual reward R ( τ ) = 0 𝑅 𝜏 0 R(\tau)=0 for ongoing iterations. This involves backpropagating a reward of 0 0 temporarily and updating it to the true reward upon completion. This strategy promotes exploration of different nodes for expa...
[原文]ng a novel hybrid approach. It starts with whole-proof generation, similar to the single-pass approach, but extends this by implementing a sophisticated truncate-and-resume mechanism. This process involves truncating the generated proof to its successful initial segment, parsing this segment into individual tactics, and resuming the tree search from this point. This iterative process effectively implements a Monte-Carlo Tree Search, seamlessly integrating single-pass whole-proof generation with multi-pass proof-step generation. Consequently, we can train a single model with nearly identical ob...
[原文]To implement the tree search method in the whole-proof generation setting, we introduce a proof tree abstraction to define the tailored state and action space, leveraging a truncate-and-resume mechanism. Roughly following the paradigm of Yao et al. ( 2023 ) , we begin by decomposing an incomplete proof into a sequence of tree nodes that correspond to individual proof steps, and then we utilize the partial content stored in these tree nodes to continue the proof generation process. Figure 4 illustrates the process of constructing a proof search tree from whole-proof generation. Truncate: Proof ...
[原文].2 ) has been trained to recognize and utilize this format, using the incomplete code augmented with tactic state comments to guide subsequent proof generation. Figure 4: Truncate-and-Resume Mechanism in the Expansion Step of MCTS. (a) After selecting a node, we trace its corresponding incomplete proof code prefix, which includes the file header, initial statement, and successfully applied tactics from the ancestor nodes. (b) The language model then generates the subsequent proof based on this prefix along with a comment block containing the current tactic state. (c) The combined proof code (p...
3.2 Interactive Theorem Proving via Monte-Carlo Tree Search
[原文]Our proof search tree is developed using the standard Monte-Carlo Tree Search (MCTS) paradigm (MCTS; Coulom, 2006 ; Browne et al., 2012 ) , which iteratively applies four steps: Selection , Expansion , Simulation , and Backpropagation . We integrate the Simulation step into Expansion because our whole-proof generation model inherently performs a rollout from the expanded node. The detailed design of the algorithm workflow is as follows. Selection. The selection step, a.k.a. the tree policy, starts from the root node and traverses downward to identify a promising node for expansion. The objecti...
3.2 Interactive Theorem Proving via Monte-Carlo Tree Search
[原文]e merged into the search tree (see Figure 4 ). It is important to note that, because we use the whole-proof generation setting—where the output is an entire proof consisting of a sequence of tactics, rather than just the next tactic—our expansion procedure may insert a path of tree nodes into the search tree during each iteration. This differs from the conventional MCTS designed for competitive games, which typically expands only one layer of children nodes per iteration (Silver et al., 2016 , 2018 ; Schrittwieser et al., 2020 ) . Backpropagation. The final phase of each tree search iteration ...
[原文]In the search problem of formal theorem proving, the extrinsic rewards are extremely sparse, i.e. , the search agent only obtains non-zero rewards when the proof is completely solved. More specifically, the proof search process forms a tree structure with only a narrow set of leaves delivering non-zero rewards, which matches a famous hard-exploration case (Krishnamurthy et al., 2016 ) in the literature of statistical reinforcement learning. To promote exploration in sparse-reward sequential decision making, one classical paradigm is constructing intrinsic rewards (Schmidhuber, 2010 ) that enco...
[原文]au)=\mathbb{I}\left[\text{at least one new node is added to the search tree}\right], (3) where τ 𝜏 \tau denotes the most recent selection trajectory that requires a reward assignment for backpropagation. This exploration strategy prioritizes the expansion of nodes where the prover model generates tactics that lead to a diverse range of tactic states. As multiple Lean codes can result in the same transition of intermediate states, this heuristics can potentially reduce redundant generation and improve sample efficiency. UCB for Non-stationary Rewards. The common setting of UCB exploration bonus...
[原文]t becomes definitely harder to discover new nodes with unseen tactic states as the search tree expands through sophisticated exploration. To tackle the non-stationarity, we consider discounted upper confidence bounds (DUCB; Garivier and Moulines, 2011 ) , which uses a discount factor γ ∈ ( 0 , 1 ) 𝛾 0 1 \gamma\in(0,1) to smoothly drop those outdated feedback records: Q D U C B ( s , a ) subscript 𝑄 𝐷 𝑈 𝐶 𝐵 𝑠 𝑎 \displaystyle Q_{DUCB}(s,a) = W γ ( s , a ) N γ ( s , a ) + 2 ln ∑ a ′ N γ ( s , a ′ ) N γ ( s , a ) , absent subscript 𝑊 𝛾 𝑠 𝑎 subscript 𝑁 𝛾 𝑠 𝑎 2 subscript superscr...
[原文]To enhance the efficiency of Monte-Carlo Tree Search (MCTS), we implement several established parallelization techniques as described by Chaslot et al. ( 2008 ) . • Root Parallelization: We deploy 256 MCTS runners per node, with one language model per GPU and a batch size of 512 for proof generation. The Lean prover is invoked through REPL and executed on a cluster with thousands of CPU cores, where each proof verification task is handled by an individual process, created and terminated in a sandbox. Both proof generation by language models and verification by Lean provers are handled asynchro...
[原文]In this section, we compare our proposed proof tree search method, which introduces a novel truncate-and-resume mechanism for whole-proof generation, with existing approaches. Current methods for using language models in formal mathematics proof search generally fall into two main strategies: • Multi-pass proof-step generation : This strategy breaks down the proving process into multiple episodes of tactic generation and verification, typically following a tree search pattern. It involves generating and verifying one tactic at a time, repeating the process for the next tactic until no proof go...
[原文]this unified approach achieves superior performance in both settings. By combining the strengths of existing methods and introducing innovative techniques, our method offers a more versatile and effective solution for formal mathematics proof search, potentially paving the way for future advancements in this field.
[原文]In this section, we evaluate the theorem-proving capabilities of DeepSeek-Prover-V1.5 using two distinct benchmarks: miniF2F, which encompasses high-school level exercises and competition problems, and ProofNet, which pertains to undergraduate-level theorems. We present the results for both complete proof generation and Monte-Carlo tree search methodologies, utilizing the same trained model and inference configuration as Section 2.4 to ensure consistency. 4.1 Main Results We present a comparative analysis of DeepSeek-Prover-V1.5 against previous state-of-the-art language models, highlighting i...
[原文](Wu et al., 2024 ) , also demonstrate outstanding performance. Metric. We compare the performance of DeepSeek-Prover-V1.5 with state-of-the-art models using the pass@ K 𝐾 K accuracy metric, which evaluates the model’s ability to generate a correct proof within K 𝐾 K attempts. We display the sample budget K 𝐾 K according to the the following rules to align the computation budget across different generation schemes. • For single-pass sampling methods, we define the sample budget K 𝐾 K as the total number of proofs generated, with large values of K 𝐾 K factorized for the ease of comparison to tre...
[原文]rmance using a mixture strategy with two guiding prompts (see Section 4.2 for details). Results on ProofNet. Table 2 presents a comparative analysis of various theorem-proving methods on the ProofNet dataset. DeepSeek-Prover-V1.5-RL achieved pass rates of 22.6% and 25.3% for the overall ProofNet dataset in the single-pass whole-proof generation setting and with the enhancement of RMaxTS, respectively. These results surpass the existing state-of-the-art methods, ReProver (13.8%) and InternLM2-StepProver (18.1%). When the number of whole-proof generation attempts is restricted to 3200, DeepSeek-...
[原文]only version using a large sample budget. The comparison results are presented as two columns in Table 3 . DeepSeek-Prover-V1.5-RL consistently outperforms the SFT model across all generation settings, regardless of whether the chain-of-thought strategy is applied. The results also indicate that the improvements gained from conducting online RL is orthogonal to those achieved through RMaxTS, which can be further combined to boost the performance. By integrating both CoT prompting and RMaxTS, DeepSeek-Prover-V1.5-RL achieves a pass rate of 62.7 % percent 62.7 62.7\% on miniF2F-test. This perfor...
[原文]mes moderate in the absence of tactic state information, especially when tackling hard problems that require a large amount of samples. This highlights that the integration of compiler information is an essential component of the tree search algorithm, enhancing its overall effectiveness and sample efficiency.
[原文]We present a comparative analysis of DeepSeek-Prover-V1.5 against previous state-of-the-art language models, highlighting its performance and advancements. • General-purpose Models: GPT-3.5 and GPT-4 (OpenAI, 2023 ) are advanced generative AI models developed by OpenAI, known for their effectiveness across diverse tasks, including code generation. Despite not being specifically designed for theorem proving, their extensive parameter scales provide significant capabilities. The evaluation of these models in formal theorem proving is facilitated by COPRA (Thakur et al., 2023 ) , an in-context le...
[原文]udget K 𝐾 K as the total number of proofs generated, with large values of K 𝐾 K factorized for the ease of comparison to tree search methods. • For best-first-search methods, following the notation of Azerbayev et al. ( 2024 ) , we present K = N × S × T 𝐾 𝑁 𝑆 𝑇 K=N\times S\times T where N 𝑁 N denotes the number of best-first-search attempts, S 𝑆 S denotes the number of tactics generated for each expansion, and T 𝑇 T denotes the number of expansion iterations. • For tree search methods, e.g. , RMaxTS and HTPS (Lample et al., 2022 ) , we present K = N × T 𝐾 𝑁 𝑇 K=N\times T where N 𝑁 N denotes th...
[原文]We revisit the effects of several training modules in n a large-scale sampling setting, focusing on both single-pass whole-proof generation and Monte-Carlo tree search. The results demonstration that the observations and findings presented in Section 2.4 generalize to sampling scenarios with a large sample size. General Enhancement of Reinforcement Learning. To support the claim that online reinforcement learning from verification feedback generally enhances the model capabilities, we compare our final model to the SFT-only version using a large sample budget. The comparison results are presen...
4.2 Re-Examining the Effectiveness of Training Strategies on Large-scale Sampling
[原文]active in mathematical thinking, while in the non-CoT mode, the model can efficiently use Lean high-level tactics to solve computational problems that can be addressed within Lean’s automation mechanisms. To leverage these advantages, we consider a mixture strategy, denoted by non-CoT & CoT in Table 3 , allocates half of sample budget to the CoT mode and the remains to the non-CoT mode. This simple combination of two guiding prompts shows great promise in further bootstrapping the performance of our proof completion model, achieving a pass rate of 63.5 % percent 63.5 63.5\% on miniF2F-test. In...
4.2 Re-Examining the Effectiveness of Training Strategies on Large-scale Sampling
[原文]Intrinsic Rewards and Discounted UCB. We investigate the effectiveness of two core components of RMaxTS, i.e. , the intrinsic rewards defined in Eq. ( 3 ) and the discounted upper confidence bound stated in Eq. ( 7 ). We start with a baseline implementing the standard UCT algorithm (Kocsis and Szepesvári, 2006 ) without intrinsic rewards, in which the exploration is driven exclusively by the UCB bonus. Note that, since no non-zero rewards are provided for this baseline, all variants of the UCB formula become equivalent, as node selection is determined solely by visitation counts. The experimen...
[原文]We present DeepSeek-Prover-V1.5, a language model with 7 billion parameters that outperforms all open-source models in formal theorem proving in Lean 4. DeepSeek-Prover-V1.5 is initialized with DeepSeek-Prover-V1.5-Base, which extends the pre-training of DeepSeekMath-Base 7B using a specialized corpus for formal mathematical reasoning. Supervised fine-tuning is conducted on a comprehensive Lean 4 code completion dataset, encompassing a wide range of formal theorems from various mathematical domains. Subsequently, we employ GRPO to enhance whole-proof generation through online reinforcement lea...
[原文]eedback into step-wise value differences (Arjona-Medina et al., 2019 ) . Developing critic models for assessing long planning paths and providing guidance rewards presents a crucial and challenging problem (Ng and Russell, 2000 ; Sorg et al., 2010 ) that warrants further investigation. Finally, recent work has progressed beyond proving individual theorems to addressing real-world theory proving within complex, multi-theorem Lean files (Hu et al., 2024 ) . This shift is a natural extension of our whole-proof generation approach. Our observations indicate that the current model already possesses...
Appendix A Illustrative Examples of Non-CoT and CoT Prompting for Proof Completion
[原文]In this section, we present examples of non-CoT and CoT prompting and the generated proofs for 𝚊𝚒𝚖𝚎 _ 𝟷𝟿𝟾𝟹 _ 𝚙𝟿 𝚊𝚒𝚖𝚎 _ 1983 _ 𝚙𝟿 \mathtt{aime\_1983\_p9} , a problem from the miniF2F benchmark (Zheng et al., 2022 ) . For clarity, inline LaTeX expressions are utilized. Firstly, here is an example of whole-proof generation in non-CoT prompting mode: Input : ⬇ Complete the following Lean 4 code : “‘ lean4 import Mathlib import Aesop set_option maxHeartbeats 0 open BigOperators Real Nat Topology Rat /– Show that $ \ frac {9 x ^2\ sin ^2 x + 4}{ x \ sin x } \ geq 12 $ for $0
Appendix A Illustrative Examples of Non-CoT and CoT Prompting for Proof Completion
[原文]se the fact that $ \ sin x$ and $x$ are positive in the given range to establish the necessary inequalities . Finally , we apply these results to conclude that the minimum value is indeed 12. -/ – We start by ensuring that the product $x * \ sin x$ is positive in the given range . have h ₁ : 0
Appendix B Example Solutions to MiniF2F-test Problems
[原文]To illustrate the different advantages and complementary of non-CoT mode and CoT mode of prompting, we provide examples where non-CoT mode succeeds while CoT mode fails, and vice versa. B.1 Problems that non-CoT outperforms CoT ⬇ theorem induction_pord1p1on2powklt5on2 ( n : ℕ) ( h ₀ : 0 norm_num norm_num have h ₂ : (2 : ℝ) ≠ 0 := by norm_num field_simp rw [ div_lt_div_iff ] ring_nf norm_cast nlin...
[原文]⬇ theorem mathd_algebra_459 ( a b c d : ℚ) ( h ₀ : 3 * a = b + c + d ) ( h ₁ : 4 * b = a + c + d ) ( h ₂ : 2 * c = a + b + d ) ( h ₃ : 8 * a + 10 * b + 6 * c = 24) : ↑ d . den + d . num = 28 := by /- Given the system of equations involving the times that Anna ( a ), Bertram ( b ), Carli ( c ), and David ( d ) can hold their breaths : 1. $ 3 a = b + c + d $ 2. $ 4 b = a + c + d $ 3. $ 2 c = a + b + d $ 4. $ 8 a + 10 b + 6 c = 24 $ We need to find the sum of the numerator and the denominator of the fraction representing David ’ s time in minutes , simplified to its lowest terms , and show that i...