- IMO-ProofBench: /tmp/imo-proofbench.png
- Competitions: /tmp/competitions.png
NOTE: No DeepSeek blog post was found (deepseek.com/blog is JS-rendered and not
accessible via headless browser). Searching arXiv with the user's original query
pattern returned nothing, but arXiv:2511.22570 was found via
"all:DeepSeekMath-V2".
================================================================================
FULL PAPER TEXT (extracted from arXiv PDF 2511.22570v1)
================================================================================
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning...
scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute. While much work remains, these results suggest that self-verifiable mathematical
reasoning is a feasible research direction that may help develop more capable mathematical AI
systems.
The conventional approach to reinforcement learning (RL) for mathematical reasoning involves rewarding large language models (LLMs) based on whether their predicted final answers to quantitative reasoning problems match ground-truth answers (Guo et al., 2025). This methodology suffices to allow frontier LLMs to saturate mathematical competitions that primarily evaluate final answers, such as AIME and HMMT. However, this reward mechanism has two fundamental limitations. First, it serves as an unreliable proxy for reasoning correctness: a model can arrive at the correct answer through flawed logic or fortunate errors. Second, it is inapplicable to theorem proving tasks, where problems may not require a numerical final answer and rigorous derivation is the primary objective.
Consequently, LLMs trained on quantitative reasoning problems with only such final-answer rewards still frequently produce mathematically invalid or logically inconsistent natural-language proofs. Moreover, this training approach does not naturally develop the models' ability to verify proof validity: they exhibit high false-positive rates, often judging proofs to be valid even when they contain obvious logical flaws.
The gap between generation and verification (the generation-verification gap) in natural-language theorem proving hinders further improvement. To address this, we propose developing proof verification capabilities in LLMs. Our approach is motivated by several key observations:
• Humans can identify issues in proofs even without reference solutions, a crucial ability when tackling open problems.
• A proof is more likely to be valid when no issues can be identified despite scaled verification efforts.
• The effort required to identify valid issues can serve as a proxy for proof quality, which can in turn be used to optimize proof generation.
We believe that LLMs can be trained to identify proof issues without reference solutions. Such a verifier would enable an iterative improvement cycle: (1) using verification feedback to optimize proof generation; (2) scaling verification compute to automatically label new hard-to-verify proofs, thereby creating training data to improve the verifier itself; and (3) using the enhanced verifier to further optimize proof generation. Moreover, a reliable proof verifier enables us to teach proof generators to evaluate proofs as the verifier does. This allows a proof generator to iteratively refine its proofs until it can no longer identify or resolve any issues. In essence, we make the model explicitly aware of its reward function and enable it to maximize this reward through deliberate reasoning rather than blind trial-and-error.
Built on DeepSeek-V3.2-Exp-Base (DeepSeek-AI, 2025), we developed DeepSeekMath-V2, a large language model optimized for natural-language theorem proving that demonstrates self-verifiable mathematical reasoning. Our model can assess and iteratively improve its own proofs, achieving gold-level performance in premier high-school mathematics competitions including IMO 2025 and CMO 2024. On the Putnam 2024 undergraduate competition, it scored 118/120, exceeding the highest score of 90 obtained by human participants.
2. Method
2.1. Proof Verification
2.1.1. Training a Verifier to Identify Issues and Score Proofs
We developed high-level rubrics I_v for proof evaluation (see Appendix A.2) with the goal of training a verifier to evaluate proofs according to these rubrics, mirroring mathematical experts' assessment process. Specifically, given a problem X and a proof Y, the verifier π_φ(·| X, Y, I_v) is designed to produce a proof analysis that first summarizes identified issues (if any) and then assigns a score based on three levels: 1 for complete and rigorous proofs with all logical steps clearly justified; 0.5 for proofs with sound overall logic but minor errors or omitted details; and 0 for fundamentally flawed proofs containing fatal logical errors or critical gaps.
Curating Cold Start RL Data. We constructed our initial training data through the following process:
1. We crawled problems from Art of Problem Solving (AoPS) contests, prioritizing math olympiads, team selection tests, and post-2010 problems explicitly requiring proofs, totaling 17,503 problems. This problem set is denoted as D_p.
2. We generated candidate proofs using a variant of DeepSeek-V3.2-Exp-Thinking. As this model was not optimized for theorem proving and tended to produce concise but error-prone outputs, we prompted it to iteratively refine its proofs over multiple rounds to improve comprehensiveness and rigor.
3. We randomly sampled proofs across diverse problem types (e.g., algebra and number theory) and had mathematical experts score each proof according to the evaluation rubrics described above.
This process yielded an initial RL dataset D_v = {(X_i, Y_i, s_i)}, where each item consists of a problem X_i, a proof Y_i, and an overall proof score s_i ∈ {0, 0.5, 1}.
RL Objective. Building on a version of DeepSeek-V3.2-Exp-SFT which was supervised fine-tuned on reasoning data related to mathematics and code, we trained the model with reinforcement learning to produce proof analyses using two reward components:
• Format reward R_format: An indicator function that requires the model to generate both a summary of identified issues and a proof score. It checks whether the final response contains the key phrase "Here is my evaluation of the solution:" and a score inside \boxed{} following "Based on my evaluation, the final overall score should be:".
• Score reward R_score: A reward based on the proximity between the predicted score s'_i and the annotated score s_i:
R_score(s'_i, s_i) = 1 - |s'_i - s_i| (1)
The RL objective for training the verifier is:
max_{π_φ} E_{(X_i, Y_i, s_i) ~ D_v, (V'_i, s'_i) ~ π_φ(·| X_i, Y_i)} [R_format(V'_i) · R_score(s'_i, s_i)] (2)
where V'_i denotes the verifier's final response and s'_i is the proof score extracted from it.
The approach described in Section 2.1.1 trains proof verification through RL to align predicted proof scores with expert annotations, but provides no direct supervision on the identified issues themselves. This creates a critical vulnerability: when evaluating flawed proofs (where s_i < 1) during training, the verifier can receive full reward by predicting the correct score while fabricating non-existent issues, which undermines its trustworthiness.
To address this problem, we introduce meta-verification: a secondary evaluation process that assesses whether the issues identified by the verifier indeed exist and whether these issues logically justify the predicted proof score according to the evaluation rubrics I_v. The complete meta-verification rubrics I_mv are given in Appendix A.3.
We trained a dedicated meta-verifier using RL to perform this assessment. By incorporating the meta-verifier's feedback into verifier training, we can improve the faithfulness of the verifier's issue identification.
Meta-Verifier Training Process
meta-verifier produces a summary of issues found in the analysis itself, followed by a
quality score measuring how accurate and justified the verifier's analysis is. The RL
objective follows the same structure as the verifier training, with format and score rewards.
Using the trained meta-verifier π_η, we enhanced the verifier training by integrating meta-verification feedback into the reward function:
R_V = R_format · R_score · R_meta (3)
where R_meta is the quality score from the meta-verifier.
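The verifier reward R_V = R_format · R_score · R_meta can be illustrated with a minimal Python sketch. The two key phrases and the score rule 1 - |s' - s| come from the text; the function names, the regular expression used to read the boxed score, and the default r_meta = 1.0 (which gives the reward without meta-verification) are assumptions made for illustration, not the authors' implementation.

```python
import re

# Key phrases quoted in the text for the format check.
EVAL_PHRASE = "Here is my evaluation of the solution:"
SCORE_PHRASE = "Based on my evaluation, the final overall score should be:"


def extract_score(response: str):
    """Return the boxed score that follows SCORE_PHRASE, or None if absent.

    The regex is a hypothetical parser; it accepts only 0, 0.5, or 1.
    """
    _, found, tail = response.partition(SCORE_PHRASE)
    if not found:
        return None
    match = re.search(r"\\boxed\{\s*(0\.5|0|1)\s*\}", tail)
    return float(match.group(1)) if match else None


def r_format(response: str) -> float:
    """Indicator: 1.0 if both the evaluation phrase and a boxed score appear."""
    has_summary = EVAL_PHRASE in response
    has_score = extract_score(response) is not None
    return 1.0 if has_summary and has_score else 0.0


def r_score(predicted: float, annotated: float) -> float:
    """Score reward: 1 - |s' - s|."""
    return 1.0 - abs(predicted - annotated)


def verifier_reward(response: str, annotated: float, r_meta: float = 1.0) -> float:
    """R_V = R_format * R_score * R_meta; r_meta is the meta-verifier's quality score."""
    if r_format(response) == 0.0:
        return 0.0
    return r_score(extract_score(response), annotated) * r_meta
```

Because the three terms are multiplied, a response that predicts the right score but receives a low quality score from the meta-verifier still earns a low reward, which is the behaviour the meta-verification step is meant to enforce.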
We trained the enhanced verifier on both the verification dataset D_v and the meta-verification
da...
Y receives score R_Y = s, and the self-analysis Z receives a meta-verification score R_meta(Z) = ms.
The reward function combines these assessments:
R = R_format(Y, Z) · (α · R_Y + β · R_Z) (5)
R_Z = R_score(s', s) · R_meta(Z) (6)
where R_format(Y, Z) verifies that both the proof and self-analysis follow the specified format, and
R_score(s', s) rewards accurate self-assessment. We set α = 0.76 and β = 0.24. This reward structure
creates the following incentives:
• Faithful acknowledgment of errors is rewarded over false claims of correctness.
• The highest rewards come from producing correct pro...
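To make the incentives of equations (5) and (6) concrete, here is a small Python sketch. The weights α = 0.76 and β = 0.24 and the formulas are taken from the text; the function and argument names are hypothetical, and the score rule 1 - |s' - s| is assumed to be the same R_score used for the verifier.

```python
ALPHA = 0.76  # weight on the proof score R_Y, as stated in the text
BETA = 0.24   # weight on the self-analysis reward R_Z, as stated in the text


def r_score(self_score: float, proof_score: float) -> float:
    """Assumed to match the verifier's score reward: 1 - |s' - s|."""
    return 1.0 - abs(self_score - proof_score)


def generator_reward(format_ok: bool, proof_score: float,
                     self_score: float, meta_score: float) -> float:
    """R = R_format(Y, Z) * (alpha * R_Y + beta * R_Z), with R_Z = R_score * R_meta.

    proof_score -- s, the verifier's score for proof Y
    self_score  -- s', the score the generator assigns to its own proof
    meta_score  -- ms, the meta-verification score of the self-analysis Z
    """
    r_format = 1.0 if format_ok else 0.0
    r_y = proof_score
    r_z = r_score(self_score, proof_score) * meta_score
    return r_format * (ALPHA * r_y + BETA * r_z)


# Three illustrative cases (inputs are made up for the example):
correct_and_honest = generator_reward(True, 1.0, 1.0, 1.0)  # correct proof, accurate self-check
flawed_but_honest = generator_reward(True, 0.0, 0.0, 1.0)   # flawed proof, error acknowledged
flawed_and_overclaimed = generator_reward(True, 0.0, 1.0, 1.0)  # flawed proof, claimed correct
```

Under these inputs the honest acknowledgment of a flawed proof keeps the β-weighted term, while a false claim of correctness earns nothing, and a correct proof with an accurate self-assessment scores highest.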
2. Reviewing the verifier's identified issues is exactly meta-verification, which is easier than identifying issues from scratch. Meta-verification is also more sample-efficient for LLMs to master.
Building on these observations, we developed the following automated labeling process:
2. For analyses reporting issues (scores 0 or 0.5), generate m meta-verification assessments to validate the identified problems. An analysis is deemed valid if the majority of meta-assessments confirm its findings.
3. For each proof, we examine analyses that assign the lowest score. If at least k such analyses are deemed valid, the proof is labeled with that lowest score. If no legitimate issues are identified across all verification attempts, the proof is labeled with 1. Otherwise, the proof is discarded or routed to human experts for labeling.
In our last two training iterations, this fully automated pipeline replaced human annotation entirely. Quality checks confirmed that the automated labels aligned well with expert judgments.
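The decision rule in steps 2 and 3 of the labeling process can be sketched in Python as follows. The Analysis container, the function names, and the reading of "no legitimate issues" as "no issue-reporting analysis passed the majority vote" are assumptions for illustration; the values of m and k are not given in this excerpt, so they are left as inputs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Analysis:
    """One verification analysis of a proof (hypothetical container)."""
    score: float  # 0, 0.5, or 1 as assigned by the verifier
    # m meta-verification verdicts; True means the meta-assessment confirms
    # the reported issues. Left empty for analyses that report no issues.
    meta_confirms: List[bool] = field(default_factory=list)


def is_valid(analysis: Analysis) -> bool:
    """Step 2: valid if a strict majority of meta-assessments confirm it."""
    votes = analysis.meta_confirms
    return sum(votes) > len(votes) / 2


def auto_label(analyses: List[Analysis], k: int) -> Optional[float]:
    """Step 3: return the proof label, or None to discard or route to experts."""
    lowest = min(a.score for a in analyses)
    if lowest < 1:
        valid_lowest = [a for a in analyses if a.score == lowest and is_valid(a)]
        if len(valid_lowest) >= k:
            return lowest
    # Assumed reading: "no legitimate issues" means no issue-reporting
    # analysis survived meta-verification.
    any_valid_issue = any(a.score < 1 and is_valid(a) for a in analyses)
    if not any_valid_issue:
        return 1.0
    return None
```

Returning None for the ambiguous case keeps uncertain proofs out of the training set rather than guessing a label, which matches the "discarded or routed to human experts" branch of step 3.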
3.1. Training Settings
We employed Group Relative Policy Optimization (GRPO) (Shao et al., 2024) for reinforcement
learning, iteratively optimizing proof verification and generation capabilities as described in
Section 2. In each iteration, we first optimized proof verification. The proof generator was then
initialized from the verifier checkpoint and optimized for proof generation. Starting from the
second iteration, the proof verifier was initialized with a checkpoint that consolidated both
verification and generation capabilities from the previous iteration through rejection fine-tuning.
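For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value function. The Python sketch below covers only that group-relative normalization step; it omits the clipped policy-gradient objective and the KL term, and the use of the population standard deviation and the eps constant are assumptions rather than details from this paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    better than the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled proof analyses for one problem, with made-up rewards.
advantages = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.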
3....
Reasoning models (OpenAI, 2024; Guo et al., 2025) have saturated quantitative reasoning
benchmarks like AIME and HMMT within one year. This rapid advancement is partly attributed
to the well-defined evaluation criterion: if we care only about final answers, then quantitative
reasoning is easy to verify. However, this final answer metric is inapplicable to theorem proving,
which often requires no numerical answers but demands rigorous step-by-step derivation.
Informal mathematical proofs have long been considered hard to verify automatically, lacking
reliable approaches to assess proof correctn...
5. Conclusion
We presented DeepSeekMath-V2, a model capable of both generating and verifying mathematical proofs. By training models to identify issues in their own reasoning and incentivizing
them to address these issues before finalizing outputs, we move beyond the limitations of final-answer-based rewards toward self-verifiable mathematical reasoning. Our iterative training
process, alternating between improving verification capabilities and using these to enhance
generation, creates a sustainable cycle where each component drives the other forward. Our
key technical contributions include: (1) training ...
Title: DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Authors: Zhihong Shao, Yuxiang Luo, Chengda Lu, Z.Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, Xiaokang Zhang
Submitted: 27 Nov 2025
Subjects: cs.AI (primary), cs.CL
DOI: 10.48550/arXiv.2511.22570
PDF URL: https://arxiv.org/pdf/2511.22570.pdf