[原文]We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with supe-
rior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as
follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mecha-
nism that substantially reduces computational complexity while preserving model performance
in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing
a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2
performs comparably to GPT-5. Notably, our high...
1. Introduction
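The release of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024a) marked a pivotal moment in the evolution of Large Language Models (LLMs), catalyzing a substantial leap in overall performance across verifiable domains. Since this milestone, the capabilities of LLMs have advanced rapidly. However, a distinct divergence between open-source and closed-source models has emerged in recent months. While the open-source community (MiniMax, 2025; MoonShot, 2025; Qwen, 2025; ZhiPu-AI, 2025) continues to make strides, the performance of closed-source proprietary models (Anthropic, 2025b; DeepMind, 2025a; OpenAI, 2025) has improved at a significantly faster rate. As a result, the performance gap between closed-source and open-source models is widening rather than converging, with proprietary systems demonstrating increasingly superior capabilities on complex tasks.
Through our analysis, we identify three critical deficiencies that limit open-source models on complex tasks. First, architecturally, the predominant reliance on vanilla attention (Vaswani et al., 2017) severely constrains efficiency on long sequences. This inefficiency is an obstacle to scalable deployment and also limits the effectiveness of post-training. Second, regarding resource allocation, open-source models invest insufficient compute in the post-training phase, which limits their performance on hard tasks. Finally, in AI agent applications, open-source models show a marked lag in generalization and instruction following compared with their proprietary counterparts (EvalSys, 2025; Li et al., 2025; Luo et al., 2025), which hinders their effectiveness in real deployment.
To address these limitations, we first introduce DSA, an efficient attention mechanism designed to substantially reduce computational complexity. This architecture removes the efficiency bottleneck while preserving model performance in long-context scenarios. Second, we develop a stable and scalable reinforcement learning (RL) protocol that allows compute to be scaled substantially during post-training. Notably, this framework allocates a post-training computational budget exceeding 10% of the pre-training cost, unlocking advanced capabilities. Third, we propose a novel training pipeline to foster generalizable reasoning in tool-use scenarios.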
[原文]The release of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024a) marked a pivotal moment
in the evolution of Large Language Models (LLMs), catalyzing a substantial leap in overall
performance across the verifiable fields. Since this milestone, the capabilities of LLMs have
advanced rapidly. However, a distinct divergence has emerged in the past months. While
the open-source community (MiniMax, 2025; MoonShot, 2025; Qwen, 2025; ZhiPu-AI, 2025)
continues to make strides, the performance trajectory of closed-source proprietary models
(Anthropic, 2025b; DeepMind, 2025a; OpenAI, 2025) has acceler...
[原文]First, we implement a cold-start phase utilizing the DeepSeek-V3 (DeepSeek-AI, 2024) methodology to unify reasoning and tool-use within single trajectories. Subsequently, we advance to large-scale agentic task synthesis, where we generate over 1,800 distinct environments and 85,000 complex prompts. This extensive synthesized data drives the RL process, significantly enhancing the model's generalization and instruction-following capability in the agent context.
DeepSeek-V3.2 performs similarly to Kimi-k2-thinking and GPT-5 across multiple reasoning benchmarks. Furthermore, DeepSeek...
[原文]DeepSeek-V3.2 uses exactly the same architecture as DeepSeek-V3.2-Exp. Compared with
DeepSeek-V3.1-Terminus, the last version of DeepSeek-V3.1, the only architectural modification
of DeepSeek-V3.2 is the introduction of DeepSeek Sparse Attention (DSA) through continued
training.
Prototype of DSA.
The prototype of DSA primarily consists of two components: a lightning
indexer and a fine-grained token selection mechanism.
The lightning indexer computes the index score $I_{t,s}$ between the query token $\mathbf{h}_t \in \mathbb{R}^d$ and a preceding token $\mathbf{h}_s \in \mathbb{R}^d$, determining which tokens are selected by the query token:
𝐼𝑡,𝑠=
𝐻𝐼...
[原文]Starting from a base checkpoint of DeepSeek-V3.1-Terminus, whose context length has been ex-
tended to 128K, we perform continued pre-training followed by post-training to create DeepSeek-
V3.2. The continued pre-training of DeepSeek-V3.2 consists of two training stages. For both stages,
the distribution of training data is totally aligned with the 128K long context extension data
used for DeepSeek-V3.1-Terminus.
1 We illustrate the difference between the MQA and MHA modes of MLA in Appendix A.
2 https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp/tree/main/inference
[原文]The training signal of the indexer comes only from $\mathcal{L}^{I}$, while the main model is optimized only according to the language modeling loss. In this sparse training stage, we use a learning rate of $7.3 \times 10^{-6}$ and select 2048 key-value tokens for each query token. We train both the main model and the indexer for 15000 steps, with each step consisting of 480 sequences of 128K tokens, resulting in a total of 943.7B tokens.
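The stated token total can be checked with a line of arithmetic. The sketch below assumes that "128K" means 128 × 1024 = 131,072 tokens per sequence; the text does not spell this out, but the reading reproduces the reported figure. The function and constant names are illustrative only.

```python
# Assumption: "128K" is read as 128 * 1024 tokens per sequence.
STEPS = 15_000
SEQS_PER_STEP = 480
TOKENS_PER_SEQ = 128 * 1024


def total_training_tokens(steps: int, seqs_per_step: int, tokens_per_seq: int) -> int:
    """Total number of tokens consumed across all training steps."""
    return steps * seqs_per_step * tokens_per_seq


total = total_training_tokens(STEPS, SEQS_PER_STEP, TOKENS_PER_SEQ)
print(f"{total / 1e9:.1f}B tokens")  # 943.7B tokens
```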
[原文]Standard Benchmark
In September 2025, we evaluated DeepSeek-V3.2-Exp on a suite of benchmarks covering diverse capabilities and found its performance similar to that of DeepSeek-V3.1-Terminus. While DeepSeek-V3.2-Exp significantly improves computational efficiency on long sequences, we do not observe substantial performance degradation compared with DeepSeek-V3.1-Terminus on either short- or long-context tasks.
Human Preference
Given that direct human preference assessments are inherently suscep-
tible to bias, we employ ChatbotArena as an indirect evaluation framework to approxima...
[原文]DSA reduces the core attention complexity of the main model from $O(L^2)$ to $O(Lk)$, where $k$ ($\ll L$) is the number of selected tokens. Although the lightning indexer still has a complexity of $O(L^2)$, it requires much less computation than MLA in DeepSeek-V3.1-Terminus.
Combined with our optimized implementation, DSA achieves a significant end-to-end speedup
in long-context scenarios. Figure 3 presents how token costs of DeepSeek-V3.1-Terminus and
DeepSeek-V3.2 vary with the token position in the sequence. These costs are estimated from
benchmarking the actual service deployed on H800 GPUs, a...
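To make the $O(L^2)$ versus $O(Lk)$ contrast concrete, the following minimal sketch counts the query-key pairs scored by the core attention under causal masking. It is a back-of-the-envelope model, not the deployed cost model: it ignores the lightning indexer's own cost and all constant factors, and the function names are hypothetical. The values of `L` and `k` reuse the 128K context and the 2048 selected tokens mentioned earlier.

```python
def dense_pairs(L: int) -> int:
    """Pairs scored by causal dense attention: the t-th token attends to t tokens."""
    return L * (L + 1) // 2


def sparse_pairs(L: int, k: int) -> int:
    """Pairs scored when each query attends to at most k selected tokens."""
    if L <= k:
        return dense_pairs(L)
    # The first k queries see fewer than k tokens; every later query sees exactly k.
    return dense_pairs(k) + (L - k) * k


L, k = 128 * 1024, 2048
ratio = dense_pairs(L) / sparse_pairs(L, k)
print(f"dense/sparse pair ratio at L={L}, k={k}: {ratio:.1f}")
```

The ratio grows roughly linearly in `L` once `L` is much larger than `k`, which is why the savings are concentrated in long-context scenarios.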
[原文]After continued pre-training, we perform post-training to create the final DeepSeek-V3.2. The
post-training of DeepSeek-V3.2 also employs sparse attention in the same way as the sparse
continued pre-training stage. For DeepSeek-V3.2, we maintain the same post-training pipeline
as in DeepSeek-V3.2-Exp, which includes specialist distillation and mixed RL training.
Specialist Distillation
For each task, we initially develop a specialized model dedicated
exclusively to that particular domain, with all specialist models being fine-tuned from the same
3https://artificialanalysis.ai/evaluations/artif...
[原文]For reasoning and agent tasks, we employ rule-based outcome reward, length penalty, and language consistency reward. For general tasks, we employ a generative reward model where each prompt has its own rubrics for evaluation.
DeepSeek-V3.2 and DeepSeek-V3.2-Speciale
DeepSeek-V3.2 integrates reasoning, agent, and
human alignment data distilled from specialists, undergoing thousands of steps of continued RL
training to reach the final checkpoints. To investigate the potential of extended thinking, we also
developed an experimental variant, DeepSeek-V3.2-Speciale. This model was trained exclusivel...
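[原文]We first review the objective of GRPO. GRPO optimizes the policy model $\pi_\theta$ by maximizing the following objective on a group of responses $\{o_1, \cdots, o_G\}$ sampled from the old policy $\pi_{\mathrm{old}}$ given each question $q$:
$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\mathrm{old}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \mathrm{clip}\!\left( r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta(o_{i,t}) \,\|\, \pi_{\mathrm{ref}}(o_{i,t}) \right) \right], \tag{5}
$$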
where
𝑟𝑖,𝑡(𝜃) = 𝜋𝜃(𝑜𝑖,𝑡|𝑞, 𝑜𝑖,
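The structure of the objective in Eq. (5) can be illustrated with a small, dependency-free sketch. Several points are assumptions rather than statements from the text: the definition of $r_{i,t}(\theta)$ is cut off here, so the code uses the usual importance ratio between the current and old policy computed from per-token log-probabilities; the advantages $\hat{A}_{i,t}$ and the per-token KL estimates are taken as inputs because their estimators are not given in this excerpt; and all function and argument names are hypothetical.

```python
import math
from typing import List, Optional

Tokens = List[float]


def clipped_token_term(ratio: float, advantage: float, eps: float) -> float:
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def grpo_surrogate(
    logp_new: List[Tokens],
    logp_old: List[Tokens],
    advantages: List[Tokens],
    eps: float = 0.2,
    beta: float = 0.0,
    kl: Optional[List[Tokens]] = None,
) -> float:
    """Monte Carlo estimate of the bracketed term in Eq. (5) for one question.

    Outer lists index the G responses; inner lists index tokens.
    Assumption: ratio = exp(logp_new - logp_old).
    """
    total = 0.0
    for i, (new_i, old_i, adv_i) in enumerate(zip(logp_new, logp_old, advantages)):
        seq_sum = 0.0
        for t, (lp_new, lp_old, adv) in enumerate(zip(new_i, old_i, adv_i)):
            term = clipped_token_term(math.exp(lp_new - lp_old), adv, eps)
            if kl is not None:
                term -= beta * kl[i][t]  # caller-supplied per-token KL estimate
            seq_sum += term
        total += seq_sum / len(new_i)  # 1 / |o_i| normalization
    return total / len(logp_new)  # 1 / G normalization
```

With identical old and new log-probabilities every ratio equals 1, so the estimate reduces to the group mean of the per-response mean advantages.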
[原文]Off-Policy Sequence Masking
To improve the efficiency of RL systems, we typically generate a large batch of rollout data, which is subsequently split into multiple mini-batches for several gradient update steps. This practice inherently introduces off-policy behavior. Additionally, inference frameworks used for efficient data generation are often highly optimized, which may differ in implementation details from training frameworks. Such training-inference inconsistency further exacerbates the degree of off-policyness. To stabilize training and improve tolerance for
off-policy updates, we mas...
[原文]To mitigate this, we preserve the expert routing paths used during sampling in the inference framework and enforce the same routing paths during training, ensuring that identical expert parameters are optimized. This Keep Routing operation was found crucial for RL training stability of MoE models, and has been adopted in our RL training pipeline since DeepSeek-V3-0324.
Keep Sampling Mask
Top-p and top-k sampling are widely used sampling strategies to
enhance the quality of responses generated by LLMs. Employing these strategies in RL training
is also advantageous, as it avoids sampling extremel...
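For readers unfamiliar with these sampling strategies, the sketch below shows one common way to compute which tokens remain sampleable under top-k and top-p truncation, and how the surviving probabilities are renormalized. It illustrates only the generic truncation step, not the training-side procedure described in this section; the function names and the tie-breaking rule (earlier index wins) are assumptions.

```python
from typing import List, Optional


def sampling_mask(
    probs: List[float], top_k: Optional[int] = None, top_p: Optional[float] = None
) -> List[bool]:
    """Return True for tokens that survive top-k and/or top-p truncation.

    Top-p keeps the smallest set of highest-probability tokens whose
    cumulative probability reaches top_p.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = [False] * len(probs)
    cumulative = 0.0
    for rank, idx in enumerate(order):
        if top_k is not None and rank >= top_k:
            break
        if top_p is not None and cumulative >= top_p:
            break
        keep[idx] = True
        cumulative += probs[idx]
    return keep


def renormalize(probs: List[float], keep: List[bool]) -> List[float]:
    """Zero out masked tokens and rescale the rest to sum to one."""
    mass = sum(p for p, k in zip(probs, keep) if k)
    return [p / mass if k else 0.0 for p, k in zip(probs, keep)]
```

Because truncation changes the support of the distribution, the set of kept tokens is exactly the information that would need to be recorded if the same truncated distribution is to be reproduced later.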
[原文]DeepSeek-R1 has demonstrated that incorporating a thinking process can significantly enhance
a model’s ability to solve complex problems. Building on this insight, we aim to integrate
thinking capabilities into tool-calling scenarios.
We observed that replicating DeepSeek-R1’s strategy—discarding reasoning content upon the
arrival of the second round of messages—results in significant token inefficiency. This approach
forces the model to redundantly re-reason through the entire problem for each subsequent
tool call. To mitigate this, we developed a context management strictly tailored for tool...
3.2.2. Cold-Start
[原文]Given the availability of reasoning data (non-agentic) and non-reasoning agentic data, a straight-
forward strategy for integrating these two capabilities is through carefully designed prompting.
We posit that the model possesses sufficient ability to accurately follow explicit instructions,
thereby enabling the seamless incorporation of tool execution within the reasoning process.
To demonstrate the operation of the cold-start mechanism, we selectively sample the training
data as shown in Appendix Tables 6–8. It is important to note that distinct task prompts are
associated with different ...
Given the availability of reasoning data (non-agentic) and non-reasoning agentic data, a straightforward strategy for integrating these two capabilities is carefully designed prompting. We posit that the model is capable enough to follow explicit instructions accurately, which allows tool execution to be incorporated seamlessly within the reasoning process. To demonstrate how the cold-start mechanism operates, we selectively sample the training data as shown in Appendix Tables 6–8. Note that distinct task prompts are associated with different system prompts. Tables 6–8 present an illustrative example for a competitive programming prompt. Table 6 presents an example of our reasoning data, whose system prompt explicitly asks the model to reason before giving the final answer and uses a special tag to label the reasoning path. Table 7 shows the prompt of non-reasoning agentic data, whose system prompt contains guidance on tool calls. Table 8 presents the system prompt we designed to instruct the model to incorporate multiple tool calls within its reasoning process. In this manner, although the reasoning in tool-use patterns may lack robustness, the model is occasionally able to generate the desired trajectories, thereby providing a basis for subsequent reinforcement learning stages.
[原文]A diverse set of RL tasks is crucial for enhancing model robustness. For tasks such as search,
code engineering, and code interpretation, we employ real-world tools, including actual web
search APIs, coding tools, and Jupyter Notebooks. While these RL environments are real, the
prompts employed are either extracted from Internet sources or synthetically generated, rather
than obtained from actual user interactions. For other tasks, the environment and prompts are
both synthetically constructed. The agent tasks we used are described in Table 1.
Table 1 | The description of different agent tasks...
[原文]This dataset was rigorously filtered using heuristic rules and LLM-based judgments to ensure high quality, requiring that each entry contain a reasonable issue description, a correlated gold patch, and a test patch for validation. An automated environment-setup agent, powered by DeepSeek-V3.2, was employed
to build executable environments for these pairs. This agent handles package installation, de-
pendency resolution, and test execution. Test results are output in the standard JUnit format,
ensuring consistent parsing across programming languages and test frameworks. An environ-
ment is deeme...
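As an illustration of why a common report format simplifies parsing, the sketch below reads a JUnit-style XML report with Python's standard library and maps each test case to a status. It assumes the widely used layout in which `<testcase>` elements carry `classname` and `name` attributes and contain a `<failure>`, `<error>`, or `<skipped>` child when the test did not pass. The function name and the status labels are hypothetical, and this is not the pipeline's actual parser.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def parse_junit(xml_text: str) -> Dict[str, str]:
    """Map 'classname::name' to 'passed', 'failed', or 'skipped'."""
    root = ET.fromstring(xml_text)
    results: Dict[str, str] = {}
    # iter() finds <testcase> under either a <testsuite> or a <testsuites> root.
    for case in root.iter("testcase"):
        key = f'{case.get("classname", "")}::{case.get("name", "")}'
        if case.find("failure") is not None or case.find("error") is not None:
            results[key] = "failed"
        elif case.find("skipped") is not None:
            results[key] = "skipped"
        else:
            results[key] = "passed"
    return results


report = """
<testsuite name="demo" tests="3">
  <testcase classname="pkg.TestA" name="test_ok"/>
  <testcase classname="pkg.TestA" name="test_bad"><failure message="boom"/></testcase>
  <testcase classname="pkg.TestA" name="test_skip"><skipped/></testcase>
</testsuite>
"""
statuses = parse_junit(report)
```

Comparing two such dictionaries, one produced before and one after applying a patch, is enough to tell which tests changed status regardless of the language or test framework that emitted the report.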
3. To create tasks that are both challenging and automatically verifiable, the agent initially
[原文]proposes a simple task based on the current database, along with its solution and verifica-
tion functions implemented in Python. The solution function is restricted to invoking tool
functions or performing logical computations, and cannot call other functions or directly
access the database, ensuring the task can only be solved through the tool interface. Addi-
tionally, the results produced by the solution function must be validated by the verification
function. If the solution is not validated, the agent will modify the solution or verification
functions until the solution’s output passes t...
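The division of labor between tool functions, a solution function, and a verification function can be sketched as follows. Everything in this example is hypothetical: the in-memory database, the tool names `list_hotels` and `get_price`, and the task itself are invented purely to illustrate the constraint that the solution may only call tools and perform logical computation.

```python
# Hypothetical sandbox data; the solution function must not read this directly.
_DATABASE = {"Hangzhou": {"Lakeside Inn": 420, "Grand Plaza": 760}}


# --- Tool interface (hypothetical) -------------------------------------
def list_hotels(city: str) -> list:
    """Tool: names of the hotels known for a city."""
    return sorted(_DATABASE.get(city, {}))


def get_price(city: str, hotel: str) -> int:
    """Tool: nightly price of a hotel in CNY."""
    return _DATABASE[city][hotel]


# --- Task: "Find the cheapest hotel in Hangzhou." -----------------------
def solution() -> str:
    """Uses only tool calls plus logical computation."""
    hotels = list_hotels("Hangzhou")
    return min(hotels, key=lambda h: get_price("Hangzhou", h))


def verify(answer: str) -> bool:
    """Checks a submitted answer against the ground truth in the sandbox."""
    prices = _DATABASE["Hangzhou"]
    return answer in prices and prices[answer] == min(prices.values())


assert verify(solution())
```

If the final assertion failed, either the solution or the verifier would be revised until they agree, mirroring the repair step described above.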
[原文]If the hotel on day 2 is in the mid-to-high range (500-800 CNY), then I have a bit more flexibility - I just need to make sure at least one of my restaurant choices is rated 4.0 or higher, and the attraction ticket should be below 180 CNY. For more affordable hotels (200-500 CNY range), I only need to ensure that at least one restaurant has a rating of 3.2 or above. Can you help me put together this itinerary?
Submit Result Format
[
{ "time": "2025-10-01", "city": "cite_name", "hotel": "hotel_name", "afternoon_restaurant": "restau-
rant_name", "afternoon_attraction": "attraction_name", "evening...
[原文]We evaluate models on MMLU-Pro (Wang et al., 2024), GPQA Diamond (Rein et al., 2023), Humanity's Last Exam (HLE) Text-only (Phan et al., 2025), LiveCodeBench (2024.08-2025.04), Codeforces, Aider-Polyglot, AIME 2025, HMMT Feb 2025, HMMT Nov 2025 (Balunović et al., 2025), IMOAnswerBench (Luong et al., 2025), Terminal Bench 2.0, SWE-Verified (OpenAI, 2024b), SWE Multilingual (Yang et al., 2025), BrowseComp (Wei et al., 2025), BrowseCompZh (Zhou et al., 2025), τ²-bench (Barres et al., 2025), MCP-Universe (Luo et al., 2025), MCP-Mark (EvalSys, 2025),
and Tool-Decathlon (Li et al., 2025). Tool-use...
[原文]We also evaluated DeepSeek-V3.2 with Terminus in non-thinking mode, yielding a score of 39.3. For SWE-bench Verified, the primary score was obtained using our internal framework. Robustness tests across other settings, including the Claude Code and RooCode frameworks as well as non-thinking mode, produced consistent results, ranging from 72 to 74.
For the search agent evaluation, we assess our models using a standard commercial search
API. Since DeepSeek-V3.2 supports a maximum context length of only 128K, approximately
20%+ of the test cases exceed this limit. To address this, we employ a conte...
[原文]In this section, we perform ablation experiments to study the effect of synthetic agentic tasks.
We focus on two questions. First, are synthetic tasks sufficiently challenging for reinforcement
learning? Second, how well do these synthetic tasks generalize, i.e., can they transfer to different
downstream tasks or real-world environments?
To address the first question, we randomly sample 50 instances from the general synthesized
agentic tasks and evaluate both the model used for synthesis and frontier closed-source LLMs.
As shown in Table 5, DeepSeek-V3.2-Exp attains an accuracy of only 12%, wh...
4.4. Context Management of Search Agent
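Even with extended context windows such as 128k, agentic workflows, particularly in search-based scenarios, frequently hit the maximum length limit, which truncates the reasoning process prematurely. This bottleneck prevents the potential of test-time compute from being fully realized. To address this, we introduce context management, which applies simple strategies to extend the token budget at test time once token usage exceeds 80% of the context window length. These strategies include (1) Summary, which summarizes the overflowed trajectory and re-initiates the rollout; (2) Discard-75%, which discards the first 75% of the tool call history in the trajectory to free up space; and (3) Discard-all, which resets the context by discarding all previous tool call history (similar to the new context tool (Anthropic, 2025a)). For comparison, we also implement a parallel scaling baseline, Parallel-fewest-step, which samples N independent trajectories and selects the trajectory with the fewest steps.
We evaluate these strategies on the BrowseComp benchmark (Wei et al., 2025). As shown in Figure 6, under varying compute budgets, context management yields significant performance gains by allowing the model to scale up test-time compute and providing more room for additional execution steps. For example, Summary extends the average number of steps to 364 and reaches a score of 60.2. However, its overall efficiency is relatively low. Despite its simplicity, Discard-all performs well in both efficiency and scalability, reaching a score of 67.6, comparable to parallel scaling while using significantly fewer steps.
In summary, test-time compute can be scaled either serially through context management or in parallel, and both effectively extend the model's problem-solving capacity. However, different strategies differ in efficiency and scalability. It is therefore crucial to account for actual compute cost when benchmarking model performance. Meanwhile, finding the optimal combination of serial and parallel scaling to maximize both efficiency and scalability remains a crucial direction for future work.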
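The trigger condition and the two discard strategies described in this section are simple enough to sketch directly. The code below is an illustration under stated assumptions: the tool call history is modeled as a plain list, "the first 75%" is rounded down to a whole number of entries, and all function names are hypothetical. The Summary strategy is omitted because it requires a model call.

```python
from typing import List, Sequence


def over_budget(tokens_used: int, window: int, threshold: float = 0.8) -> bool:
    """True once token usage exceeds the given fraction of the context window."""
    return tokens_used > threshold * window


def discard_75(tool_history: List[str]) -> List[str]:
    """Drop the first 75% of tool call entries (rounded down), keep the rest."""
    cut = (len(tool_history) * 3) // 4
    return tool_history[cut:]


def discard_all(tool_history: List[str]) -> List[str]:
    """Reset the context by dropping every previous tool call entry."""
    return []


def parallel_fewest_step(trajectories: Sequence[Sequence[str]]) -> Sequence[str]:
    """Baseline: among N independent trajectories, pick the one with the fewest steps."""
    return min(trajectories, key=len)


history = [f"call_{i}" for i in range(8)]
if over_budget(tokens_used=110_000, window=128_000):
    history = discard_75(history)
print(history)  # ['call_6', 'call_7']
```

The serial strategies trade history for room to continue a single rollout, whereas the parallel baseline spends the extra compute on independent rollouts, which is the efficiency contrast the evaluation examines.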
[原文]Even with extended context windows such as 128k, agentic workflows, particularly in search-
based scenarios, frequently encounter maximum length limitations that prematurely truncate
the reasoning process. This bottleneck inhibits the full realization of test-time compute potential.
To address this, we introduce context management, which employs simple strategies to extend token budgets at test time when the token usage exceeds 80% of the context window length. These
strategies include (1) Summary, which summarizes the overflowed trajectory and re-initiates
the rollout; (2) Discard-75%, which disca...
[原文]In this work, we introduced DeepSeek-V3.2, a framework that effectively bridges the gap between computational efficiency and advanced reasoning capabilities. Using DSA, we addressed critical computation complexity without sacrificing long-context performance. By increasing the computational budget, DeepSeek-V3.2 achieves performance comparable to GPT-5 on reasoning benchmarks. Finally, the integration of our large-scale agentic task synthesis pipeline significantly enhances tool-use proficiency, unlocking new possibilities for robust and generalizable AI agents with open LLMs. Furthermore, ...
References
Anthropic. System card: Claude opus 4.5, 2025a. URL https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf.
Anthropic. Introducing claude sonnet 4.5, 2025b. URL https://www.anthropic.com/news/claude-sonnet-4-5l.
M. Balunović, J. Dekoninck, I. Petrov, N. Jovanović, and M. Vechev. Matharena: Evaluating llms on uncontaminated math competitions. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmark, 2025.
V. Barres, H. Dong, S. Ray, X. Si, and K. Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment, 202……
URL https://moonshotai.github.io/Kimi-K2/thinking.html.
OpenAI. Learning to reason with llms, 2024a. URL https://openai.com/index/learning-to-reason-with-llms/.
OpenAI. Introducing SWE-bench verified we're releasing a human-validated subset of swe-bench that more, 2024b. URL https://openai.com/index/introducing-swe-bench-verified/.
OpenAI. Introducing gpt-5, 2025. URL https://openai.com/index/introducing-gpt-5/.
L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity's last exam. arXiv preprint arXiv:2501.14249, 2025.
Qwen. Qwen3 tech……
2019. URL http://arxiv.org/abs/1911.02150.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. pages 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024. URL https://doi.org/10.48550/arXiv.2406.01574.
J. Wei, Z. Sun, S. Papay, S. McKi……
[原文]Figure 7 | Illustration of the MHA and MQA modes of MLA. For DeepSeek-V3.1-Terminus, the MHA mode is used for training and prefilling, while the MQA mode is used for decoding.
Figure 7 illustrates two aspects of MLA (the MHA and MQA modes) as well as the transformation between them.
B. Cold Start Template
Table 6 | An example of the reasoning data system prompt. The system prompt requires the
model to output the reasoning process in the tag . Reasoning
System
Prompt
You are an expert Python programmer. You will be given a question (problem
specification) and will generate a...
[原文]Once
you have the answer, stop reasoning and present your solution using Markdown
and LaTeX.
- Do NOT invoke any tools in your presented final solution steps.
- To improve efficiency and accuracy, you should prefer code execution over
language-based reasoning whenever possible. Keep your reasoning succinct; let
the code do the heavy lifting.
## Tools
You have access to the following tools:
{TOOL-DESCRIPTIONS}
Important: ALWAYS adhere to this exact format for tool use:
{TOOLCALL-FORMAT}
Prompt
Given a linked list, swap every two adjacent nodes and return its head ... Agent
Response
with
Thinking...
[原文]We generated 32 candidate solutions per problem and applied the identical filtering criteria to select submissions.
In the IMO and CMO tasks, we employ a generate-verify-refine loop. The model iteratively improves its solution until it achieves a perfect self-evaluation or hits the maximum revision cap, identical to the process in Shao et al. (2025).
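The control flow of such a generate-verify-refine loop can be sketched as follows. This is a minimal illustration, not the procedure of Shao et al. (2025): the three callables stand in for model calls, and the score scale (with 1.0 meaning a perfect self-evaluation) and all names are assumptions.

```python
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")


def generate_verify_refine(
    generate: Callable[[], S],
    verify: Callable[[S], Tuple[float, str]],
    refine: Callable[[S, str], S],
    max_revisions: int,
    perfect_score: float = 1.0,
) -> Tuple[S, float, int]:
    """Refine a solution until self-evaluation is perfect or the cap is hit.

    Returns the final solution, its last self-evaluation score, and the
    number of revisions performed.
    """
    solution = generate()
    revisions = 0
    while True:
        score, feedback = verify(solution)
        if score >= perfect_score or revisions >= max_revisions:
            return solution, score, revisions
        solution = refine(solution, feedback)
        revisions += 1


# Toy stand-ins: the "solution" is an integer that is perfect once it reaches 3.
result = generate_verify_refine(
    generate=lambda: 0,
    verify=lambda x: (min(x / 3, 1.0), "increase"),
    refine=lambda x, feedback: x + 1,
    max_revisions=5,
)
print(result)  # (3, 1.0, 3)
```

Every returned solution has been scored at least once, so the caller can always tell whether the loop stopped because of a perfect self-evaluation or because the revision cap was reached.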
E. Author List
Research & Engineering: Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang,
Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang
Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan*, Damai Dai, Daya ...