DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

DeepSeek-V3.2：拓展开源大语言模型前沿

📄 arXiv: 2512.02556📅 2025-12-02PDF

翻译进度65 / 65 段 (100%)

中文摘要

DeepSeek-V3.2 引入 DeepSeek Sparse Attention（DSA）稀疏注意力机制和大规模强化学习框架，在推理和 Agent 能力上实现大幅超越。DSA 通过动态选择关键 token 进行注意力计算，在保持精度的同时显著降低计算复杂度。结合改进的 MoE 路由策略，V3.2 在多项基准测试中刷新开源模型记录。

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

【摘要】DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models - 本文介绍了DeepSeek-V3.2的架构、训练方法和实验结果。

原文: DeepSeek-AI research@deepseek.com Abstract We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA) : We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework : By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 perfor...

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

（DeepSeek-V3.2: Pushing the Frontier of Open Large - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: source proprietary models ( gpt-5 ; sonnet-4.5 ; comanici2025gemini ) has accelerated at a significantly steeper rate. Consequently, rather than converging, the performance gap between closed-source and open-source models appears to be widening, with proprietary systems demonstrating increasingly superior capabilities in complex tasks. Through our analysis, we identify three critical deficiencies that limit the capability of open-source models in complex tasks. First, architecturally, the predominant reliance on vanilla attention ( vaswani2017attention ) mechanisms severely constrains efficien...

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

原文: bsequently, we advance to large-scale agentic task synthesis, where we generate over 1,800 distinct environments and 85,000 complex prompts. This extensive synthesized data drives the RL process, significantly enhancing the model’s generalization and instruction-following capability in the agent context. DeepSeek-V3.2 achieves similar performance with Kimi-k2-thinking and GPT-5 across multiple reasoning benchmarks. Furthermore, DeepSeek-V3.2 significantly advances the agentic capabilities of open models, demonstrating exceptional proficiency on the long-tail agent tasks introduced in mcpmark ;...

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

原文: U ( 𝐪 t , j I ⋅ 𝐤 s I ) , I_{t,s}=\sum_{j=1}^{H^{I}}w_{t,j}^{I}\cdot\text{ReLU}\left(\mathbf{q}^{I}_{t,j}\cdot\mathbf{k}^{I}_{s}\right), (1) where H I H^{I} denotes the number of indexer heads; 𝐪 t , j I ∈ ℝ d I \mathbf{q}^{I}_{t,j}\in\mathbb{R}^{d^{I}} and w t , j I ∈ ℝ w_{t,j}^{I}\in\mathbb{R} are derived from the query token 𝐡 t \mathbf{h}_{t} ; and 𝐤 s I ∈ ℝ d I \mathbf{k}^{I}_{s}\in\mathbb{R}^{d^{I}} is derived from the preceding token 𝐡 s \mathbf{h}_{s} . We choose ReLU as the activation function for throughput consideration. Given that the lightning indexer has a small number of heads...

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

原文: alue entry of MLA) will be shared across all query heads of the query token. The DSA architecture based on MLA is illustrated in Figure 2 . We also provide an open-source implementation of DeepSeek-V3.2 2 2 2 https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp/tree/main/inference to specify the details unambiguously. 2.1.1 Continued Pre-Training Starting from a base checkpoint of DeepSeek-V3.1-Terminus, whose context length has been extended to 128K, we perform continued pre-training followed by post-training to create DeepSeek-V3.2. The continued pre-training of DeepSeek-V3.2 consists of two...

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

原文: tion distribution, but considering only the selected token set 𝒮 t = { s | I t , s ∈ Top-k ( I t , : ) } \mathcal{S}_{t}=\quantity{s\,\middle|\,I_{t,s}\in\text{Top-k}\quantity(I_{t,:})} : ℒ I = ∑ t 𝔻 KL ( p t , 𝒮 t ∥ Softmax ( I t , 𝒮 t ) ) . \mathcal{L}^{I}=\sum_{t}\mathbb{D}_{\mathrm{KL}}\left(p_{t,\mathcal{S}_{t}}\,\middle\|\,\text{Softmax}\quantity(I_{t,\mathcal{S}_{t}})\right). (4) It is worth noting that we detach the indexer input from the computational graph for separate optimization. The training signal of the indexer is from only ℒ I \mathcal{L}^{I} , while the optimization of the ma...

1 Introduction

【引言】DeepSeek-V3.2的研究背景、动机和主要贡献。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: The release of reasoning models ( o1 ; deepseekr1 ) marked a pivotal moment in the evolution of Large Language Models (LLMs), catalyzing a substantial leap in overall performance across the verifiable fields. Since this milestone, the capabilities of LLMs have advanced rapidly. However, a distinct divergence has emerged in the past months. While the open-source community ( yang2025qwen3technicalreport ; zeng2025glm ; MiniMax-M2 ; k2-thinking ) continues to make strides, the performance trajectory of closed-source proprietary models ( gpt-5 ; sonnet-4.5 ; comanici2025gemini ) has accelerated at...

1 Introduction

原文: p a stable and scalable RL protocol that allows for significant computational expansion during the post-training phase. Notably, this framework allocates a post-training computational budget exceeding 10% of the pre-training cost, unlocking advanced capabilities. Thirdly, we propose a novel pipeline to foster generalizable reasoning in tool-use scenarios. First, we implement a cold-start phase utilizing the DeepSeek-V3 ( deepseekv3 ) methodology to unify reasoning and tool-use within single trajectories. Subsequently, we advance to large-scale agentic task synthesis, where we generate over 1,8...

2 DeepSeek-V3.2 Architecture

【架构】DeepSeek-V3.2的模型架构设计和技术细节。 DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 2.1 DeepSeek Sparse Attention DeepSeek-V3.2 uses exactly the same architecture as DeepSeek-V3.2-Exp. Compared with DeepSeek-V3.1-Terminus, the last version of DeepSeek-V3.1, the only architectural modification of DeepSeek-V3.2 is the introduction of DeepSeek Sparse Attention (DSA) through continued training. Prototype of DSA. The prototype of DSA primarily consists of two components: a lightning indexer and a fine-grained token selection mechanism. The lightning indexer computes the index score I t , s I_{t,s} between the query token 𝐡 t ∈ ℝ d \mathbf{h}_{t}\in\mathbb{R}^{d} and a preceding to...

2 DeepSeek-V3.2 Architecture

原文: c}_{s}\,\middle|\,I_{t,s}\in\text{Top-k}\quantity(I_{t,:})}). (2) Figure 2: Attention architecture of DeepSeek-V3.2, where DSA is instantiated under MLA. The green part illustrates how DSA selects the top-k key-value entries according to the indexer. Instantiate DSA Under MLA. For the consideration of continued training from DeepSeek-V3.1-Terminus, we instantiate DSA based on MLA ( deepseekV2 ) for DeepSeek-V3.2. At the kernel level, each key-value entry must be shared across multiple queries for computational efficiency ( yuan-etal-2025-native ) . Therefore, we implement DSA based on the MQA ...

2 DeepSeek-V3.2 Architecture

原文: \in\mathbb{R}^{t} . Based on p t , : p_{t,:} , we set a KL-divergence loss as the training objective of the indexer: ℒ I = ∑ t 𝔻 KL ( p t , : ∥ Softmax ( I t , : ) ) . \mathcal{L}^{I}=\sum_{t}\mathbb{D}_{\mathrm{KL}}\left(p_{t,:}\,\middle\|\,\text{Softmax}\quantity({I}_{t,:})\right). (3) For warm-up, we use a learning rate of 10 − 3 10^{-3} . We train the indexer for only 1000 steps, with each step consisting of 16 sequences of 128K tokens, resulting in a total of 2.1B tokens. Sparse Training Stage. Following indexer warm-up, we introduce the fine-grained token selection mechanism and optimize...

2 DeepSeek-V3.2 Architecture

原文: efficiency on long sequences, we do not observe substantial performance degradation compared with DeepSeek-V3.1-Terminus, on both short- and long-context tasks. Human Preference Given that direct human preference assessments are inherently susceptible to bias, we employ ChatbotArena as an indirect evaluation framework to approximate user preferences for the newly developed base models. Both DeepSeek‑V3.1‑Terminus and DeepSeek‑V3.2‑Exp share an identical post‑training strategy, and their Elo scores, obtained from evaluations conducted on 10 November 2025, are closely matched. These results sugg...

2 DeepSeek-V3.2 Architecture

原文: e token position in the sequence. These costs are estimated from benchmarking the actual service deployed on H800 GPUs, at a rental price of 2 USD per GPU hour. Note that for short-sequence prefilling, we specially implement a masked MHA mode to simulate DSA, which can achieve higher efficiency under short-context conditions. (a) Prefilling (b) Decoding Figure 3: Inference costs of DeepSeek-V3.1-Terminus and DeepSeek-V3.2 on H800 clusters.

2.1 DeepSeek Sparse Attention

（2.1 DeepSeek Sparse Attention - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: DeepSeek-V3.2 uses exactly the same architecture as DeepSeek-V3.2-Exp. Compared with DeepSeek-V3.1-Terminus, the last version of DeepSeek-V3.1, the only architectural modification of DeepSeek-V3.2 is the introduction of DeepSeek Sparse Attention (DSA) through continued training. Prototype of DSA. The prototype of DSA primarily consists of two components: a lightning indexer and a fine-grained token selection mechanism. The lightning indexer computes the index score I t , s I_{t,s} between the query token 𝐡 t ∈ ℝ d \mathbf{h}_{t}\in\mathbb{R}^{d} and a preceding token 𝐡 s ∈ ℝ d \mathbf{h}_{s}\i...

2.1 DeepSeek Sparse Attention

原文: ext{Top-k}\quantity(I_{t,:})}). (2) Figure 2: Attention architecture of DeepSeek-V3.2, where DSA is instantiated under MLA. The green part illustrates how DSA selects the top-k key-value entries according to the indexer. Instantiate DSA Under MLA. For the consideration of continued training from DeepSeek-V3.1-Terminus, we instantiate DSA based on MLA ( deepseekV2 ) for DeepSeek-V3.2. At the kernel level, each key-value entry must be shared across multiple queries for computational efficiency ( yuan-etal-2025-native ) . Therefore, we implement DSA based on the MQA ( MQA ) mode of MLA 1 1 1 We i...

2.1 DeepSeek Sparse Attention

原文: t , : p_{t,:} , we set a KL-divergence loss as the training objective of the indexer: ℒ I = ∑ t 𝔻 KL ( p t , : ∥ Softmax ( I t , : ) ) . \mathcal{L}^{I}=\sum_{t}\mathbb{D}_{\mathrm{KL}}\left(p_{t,:}\,\middle\|\,\text{Softmax}\quantity({I}_{t,:})\right). (3) For warm-up, we use a learning rate of 10 − 3 10^{-3} . We train the indexer for only 1000 steps, with each step consisting of 16 sequences of 128K tokens, resulting in a total of 2.1B tokens. Sparse Training Stage. Following indexer warm-up, we introduce the fine-grained token selection mechanism and optimize all model parameters to adapt ...

2.2 Parity Evaluation

（2.2 Parity Evaluation - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Standard Benchmark In September 2025, we evaluate DeepSeek-V3.2-Exp on a suite of benchmarks, which focus on diverse capabilities, and compare it with DeepSeek-V3.1-Terminus showing similar performance. While DeepSeek V3.2 Exp significantly improves computational efficiency on long sequences, we do not observe substantial performance degradation compared with DeepSeek-V3.1-Terminus, on both short- and long-context tasks. Human Preference Given that direct human preference assessments are inherently susceptible to bias, we employ ChatbotArena as an indirect evaluation framework to approximate u...

2.3 Inference Costs

（2.3 Inference Costs - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: DSA reduces the core attention complexity of the main model from ( L 2 ) \order{L^{2}} to ( L k ) \order{Lk} , where k k ( ≪ L \ll L ) is the number of selected tokens. Although the lightning indexer still has a complexity of ( L 2 ) \order{L^{2}} , it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus. Combined with our optimized implementation, DSA achieves a significant end-to-end speedup in long-context scenarios. Figure 3 presents how token costs of DeepSeek-V3.1-Terminus and DeepSeek-V3.2 vary with the token position in the sequence. These costs are estimated fr...

3 Post-Training

（3 Post-Training - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: After continued pre-training, we perform post-training to create the final DeepSeek-V3.2. The post-training of DeepSeek-V3.2 also employs sparse attention in the same way as the sparse continued pre-training stage. For DeepSeek-V3.2, we maintain the same post-training pipeline as in DeepSeek-V3.2-Exp, which includes specialist distillation and mixed RL training. Specialist Distillation For each task, we initially develop a specialized model dedicated exclusively to that particular domain, with all specialist models being fine-tuned from the same pre-trained DeepSeek-V3.2 base checkpoint. In ad...

3 Post-Training

原文: reward, length penalty, and language consistency reward. For general tasks, we employ a generative reward model where each prompt has its own rubrics for evaluation. DeepSeek-V3.2 and DeepSeek-V3.2-Speciale DeepSeek-V3.2 integrates reasoning, agent, and human alignment data distilled from specialists, undergoing thousands of steps of continued RL training to reach the final checkpoints. To investigate the potential of extended thinking, we also developed an experimental variant, DeepSeek-V3.2-Speciale. This model was trained exclusively on reasoning data with a reduced length penalty during RL...

3 Post-Training

原文: o i , t | q , o i , < t ) r_{i,t}(\theta)=\frac{\pi_{\theta}(o_{i,t}|q,o_{i,

3 Post-Training

原文: result of this adjustment, the gradient of this KL estimator becomes unbiased, which eliminates systematic estimation errors, thereby facilitating stable convergence. This contrasts sharply with the original K3 estimator, particularly when the sampled tokens have substantially lower probabilities under the current policy than the reference policy, i.e., π θ ≪ π ref \pi_{\theta}\ll\pi_{\mathrm{ref}} . In such cases, the gradient of the K3 estimator assigns disproportionately large, unbounded weights to maximize the likelihood of these tokens, resulting in noisy gradient updates that accumulate ...

3 Post-Training

原文: (Q),\{o_{i}\}_{i=1}^{G}\sim\pi_{\mathrm{old}}(\cdot|q)}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|} min ( r i , t ( θ ) A ^ i , t , clip ( r i , t ( θ ) , 1 − ε , 1 + ε ) A ^ i , t ) M i , t − β 𝔻 KL ( π θ ( o i , t ) ∥ π ref ( o i , t ) ) ] , \displaystyle\min\left(r_{i,t}(\theta)\hat{A}_{i,t},\text{clip}\left(r_{i,t}(\theta),1-\varepsilon,1+\varepsilon\right)\hat{A}_{i,t}\right)M_{i,t}-\beta\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}(o_{i,t})\,\middle\|\,\pi_{\mathrm{ref}}(o_{i,t})\right)\Bigg], (8) where M i , t = { 0 A ^ i , t < 0 , 1 | o i | ∑ t = 1 | o i | log ⁡ ...

3 Post-Training

原文: al inputs. Such inconsistency induces abrupt shifts in the active parameter subspace, which destabilizes optimization and exacerbates off-policy issues. To mitigate this, we preserve the expert routing paths used during sampling in the inference framework and enforce the same routing paths during training, ensuring that identical expert parameters are optimized. This Keep Routing operation was found crucial for RL training stability of MoE models, and has been adopted in our RL training pipeline since DeepSeek-V3-0324. Keep Sampling Mask Top-p and top-k sampling are widely used sampling strate...

3 Post-Training

原文: e this, we developed a context management strictly tailored for tool-calling scenarios as shown in Fig 4 : • Historical reasoning content is discarded only when a new user message is introduced to the conversation. If only tool-related messages (e.g., tool outputs) are appended, the reasoning content is retained throughout the interaction. • When reasoning traces are removed, the history of tool calls and their results remains preserved in the context. Notably, certain agent frameworks, such as Roo Code or Terminus, simulate tool interactions via user messages. These frameworks may not fully b...

3 Post-Training

原文: orporate multiple tool calls within its reasoning process. In this manner, although the reasoning in tool‑use patterns may lack robustness, the model is occasionally able to generate the desired trajectories, thereby providing a basis for subsequent reinforcement learning stages. 3.2.3 Large-Scale Agentic Tasks A diverse set of RL tasks is crucial for enhancing model robustness. For tasks such as search, code engineering, and code interpretation, we employ real-world tools, including actual web search APIs, coding tools, and Jupyter Notebooks. While these RL environments are real, the prompts ...

3 Post-Training

原文: se data spans multiple languages, domains, and difficulty levels. To complement these verifiable samples and better reflect real-world usage, we also augment the dataset with filtered instances from our existing helpful RL datasets, for which the search tool provides measurable benefits. We then develop detailed evaluation rubrics across multiple quality dimensions and employ a generative reward model to score responses based on these rubrics. This hybrid approach enables optimization for both factual reliability and practical helpfulness. Code Agent We constructed large-scale, executable envi...

3 Post-Training

原文: ode execution capabilities to arrive at a solution. General Agent To scale up agent environments and tasks in RL, we employ an automatic environment-synthesis agent that synthesizes 1,827 task-oriented environments. These tasks are hard to solve but easy to verify. The synthesis workflow primarily consists of environment and toolset construction, task synthesis, and solution generation. Specifically, the workflow proceeds as follows. 1. Given a task category (e.g., planning a travel itinerary) and a sandbox equipped with a bash and a search tool, the agent first uses these tools to generate or...

3 Post-Training

原文: nces with non-zero pass@100, resulting in 1,827 environments and their corresponding tasks (4,417 in total). A synthetic trip-planning example is illustrated below. This example highlights that, while searching the large combinatorial space for a trip plan that satisfies all constraints is challenging, checking whether a given candidate solution satisfies these constraints is relatively straightforward. An Example of Synthesized Task: Trip Planning I’m planning a three-day trip starting from Hangzhou, and I need help creating an itinerary from October 1st to October 3rd, 2025. A few important ...

3 Post-Training

原文: name", "afternoon_attraction": "attraction_name", "evening_restaurant": "restaurant_name" }, { "time": "2025-10-03", "city": "cite_name", "hotel": "hotel_name", "afternoon_restaurant": "restaurant_name", "afternoon_attraction": "attraction_name", "evening_restaurant": "restaurant_name" } ] Tool Set for Trip Planning Function Name Description get_all_attractions_by_city(city) Get all attractions for given city. get_all_cities() Get all cities from the database. get_all_hotels_by_city(city) Get all hotels for given city. get_all_restaurants_by_city(city) Get all restaurants for given city. get_c...

3.1 Scaling GRPO

（3.1 Scaling GRPO - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: We first review the objective of GRPO. GRPO optimizes the policy model π θ \pi_{\theta} by maximizing the following objective on a group of responses { o 1 , ⋯ , o G } \{o_{1},\cdots,o_{G}\} sampled from the old policy π old \pi_{\mathrm{old}} given each question q q : 𝒥 GRPO ( θ ) = \displaystyle\mathcal{J}_{\mathrm{GRPO}}(\theta)=\kern 5.0pt 𝔼 q ∼ P ( Q ) , { o i } i = 1 G ∼ π old ( ⋅ | q ) [ 1 G ∑ i = 1 G 1 | o i | ∑ t = 1 | o i | \displaystyle\mathbb{E}_{q\sim P(Q),\{o_{i}\}_{i=1}^{G}\sim\pi_{\mathrm{old}}(\cdot|q)}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|} min ...

3.1 Scaling GRPO

原文: e RL scaling, directly building on the GRPO algorithm. Unbiased KL Estimate Given o i , t o_{i,t} is sampled from the old policy π old ( ⋅ | q , o i , < t ) \pi_{\mathrm{old}}(\cdot|q,o_{i,

3.1 Scaling GRPO

原文: ence Masking To improve the efficiency of RL systems, we typically generate a large batch of rollout data, which is subsequently split into multiple mini-batches for several gradient update steps. This practice inherently introduces off-policy behavior. Additionally, inference frameworks used for efficient data generation are often highly optimized, which may differ in implementation details from training frameworks. Such training-inference inconsistency further exacerbates the degree of off-policyness. To stabilize training and improve tolerance for off-policy updates, we mask negative sequen...

3.1 Scaling GRPO

原文: controls the threshold of policy divergence. Note that π old \pi_{\mathrm{old}} here denotes the sampling probability directly returned by the inference framework, thus the KL divergence between the old and current policy accounts for both sources of off-policyness mentioned above. It is also worth noting that we only mask sequences with negative advantages. Intuitively, the model benefits the most by learning from its own mistakes, whereas highly off-policy negative samples can be detrimental, potentially misleading or destabilizing the optimization process. We empirically observe that this O...

3.1 Scaling GRPO

原文: ld}} and π θ \pi_{\theta} , which violates the principles of importance sampling and destabilizes training. To address this, we preserve the truncation masks during sampling from π old \pi_{\mathrm{old}} and apply them to π θ \pi_{\theta} during training, ensuring both policies share identical action subspaces. Empirically, we find that combining top-p sampling with the Keep Sampling Mask strategy effectively preserves language consistency during RL training.

3.2 Thinking in Tool-Use

（3.2 Thinking in Tool-Use - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 3.2.1 Thinking Context Management DeepSeek-R1 has demonstrated that incorporating a thinking process can significantly enhance a model’s ability to solve complex problems. Building on this insight, we aim to integrate thinking capabilities into tool-calling scenarios. We observed that replicating DeepSeek-R1’s strategy—discarding reasoning content upon the arrival of the second round of messages—results in significant token inefficiency. This approach forces the model to redundantly re-reason through the entire problem for each subsequent tool call. To mitigate this, we developed a context man...

3.2 Thinking in Tool-Use

原文: tinct task prompts are associated with different system prompts. Tables 8 – 8 present an illustrative example corresponding to a competitive programming prompt. Table 8 presents an example of our reasoning data, which uses a system prompt to explicitly asks the model to do reasoning before the final answer and uses a special tag to label the reasoning path. Table 8 shows the prompt of non-reasoning agentic data, where the system prompt contains the guidance of toolcall. Table 8 presents the system prompt we designed to instruct the model to incorporate multiple tool calls withi...

3.2 Thinking in Tool-Use

原文: question-construction agent then explores each entity using search tools with configurable depth and breadth parameters, consolidating the discovered information into question-answer pairs. Multiple answer-generation agents with heterogeneous configurations (different checkpoints, system prompts, etc.) produce diverse candidate responses for each proposed QA pair. A verification agent with search capabilities validates all answers through multiple passes, retaining only samples where the ground-truth is correct and all candidates are verifiably incorrect. These data spans multiple languages, d...

3.2 Thinking in Tool-Use

原文: the issue is fixed) and a zero count of pass-to-fail (P2F) test cases (indicating no regressions). Using this pipeline, we successfully built tens of thousands of reproducible issue resolution environments spanning multiple programming languages, including Python, Java, JavaScript, TypeScript, C, C++, Go, and PHP. Code Interpreter Agent We utilize Jupyter Notebook as a code interpreter to address complex reasoning tasks. To facilitate this, we curate a diverse set of problems spanning mathematics, logic, and data science, each requiring the model to leverage code execution capabilities to arri...

3.2 Thinking in Tool-Use

原文: the solution’s output passes the verification. The agent then iteratively increases the difficulty of the task and updates the corresponding solution and verification functions. During this iterative process, if the current toolset is not sufficient to solve the task, the agent will augment the toolset. Following this workflow, we obtain thousands of ⟨ environment , tools , task , verifier ⟩ \textlangle\text{environment},\text{tools},\text{task},\text{verifier}\textrangle tuples. We then perform RL on this dataset using DeepSeek-V3.2 and retain only instances with non-zero pass@100, result...

3.2 Thinking in Tool-Use

原文: d 4.0 or higher, and the attraction ticket should be below 180 CNY. For more affordable hotels (200-500 CNY range), I only need to ensure that at least one restaurant has a rating of 3.2 or above. Can you help me put together this itinerary? Submit Result Format [ { "time": "2025-10-01", "city": "cite_name", "hotel": "hotel_name", "afternoon_restaurant": "restaurant_name", "afternoon_attraction": "attraction_name", "evening_restaurant": "restaurant_name" }, { "time": "2025-10-02", "city": "cite_name", "hotel": "hotel_name", "afternoon_restaurant": "restaurant_name", "afternoon_attraction": "at...

4 Evaluation

（4 Evaluation - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 4.1 Main Results We evaluate models on MMLU-Pro ( mmlu_pro ) , GPQA Diamond ( gpqa ) , Human Last Exam (HLE) Text-only ( hle ) , LiveCodeBench (2024.08-2025.04), Codeforces, Aider-Polyglot, AIME 2025, HMMT Feb 2025, HMMT Nov 2025 ( balunovic2025matharena ) , IMOAnswerBench ( luong-etal-2025-towards ) , Terminal Bench 2.0, SWE-Verified ( swe_verified ) , SWE Multilingual ( yang2025swesmith ) , BrowseComp ( wei2025browsecomp ) , BrowseCompZh ( zhou2025browsecomp ) , τ 2 \tau^{2} -bench ( tau2 ) , MCP-Universe ( mcpuniverse ) , MCP-Mark ( mcpmark ) , and Tool-Decathlon ( li2025tool ) . Tool-use b...

4 Evaluation

原文: eCodeBench (Pass@1-COT) 64.0 84.5 90.7 82.6 83.0 83.3 Codeforces (Rating) 1480 2537 2708 - - 2386 Math AIME 2025 (Pass@1) 87.0 94.6 95.0 94.5 78.3 93.1 HMMT Feb 2025 (Pass@1) 79.2 88.3 97.5 89.4 - 92.5 HMMT Nov 2025 (Pass@1) 81.7 89.2 93.3 89.2 - 90.2 IMOAnswerBench (Pass@1) - 76.0 83.3 78.6 - 78.3 Code Agent Terminal Bench 2.0 (Acc) 42.8 35.2 54.2 35.7 30.0 46.4 SWE Verified (Resolved) 77.2 74.9 76.2 71.3 69.4 73.1 SWE Multilingual (Resolved) 68.0 55.3 - 61.1 56.5 70.2 Search Agent BrowseComp (Pass@1) 24.1 54.9 - -/60.2* 44.0 51.4/67.6 * BrowseCompZh (Pass@1) 42.4 63.0 - 62.3 48.5 65.0 HLE (P...

4 Evaluation

原文: management strategy for the ’thinking mode’ is currently incompatible with Terminus; consequently, the reported score of 46.4 was achieved using the Claude Code framework. We also evaluated DeepSeek-V3.2 with Terminus in non-thinking mode, yielding a score of 39.3. For SWE-bench Verified, the primary score was obtained using our internal framework. Robustness tests across other settings—including the Claude Code and RooCode frameworks, as well as non-thinking mode—produced consistent results, ranging from 72 to 74. For the search agent evaluation, we assess our models using a standard commerci...

4 Evaluation

原文: t still significantly outperforms existing open models. Notably, since the environments and toolsets employed in these benchmarks were not encountered during RL training, the observed improvements demonstrate DeepSeek-V3.2’s capacity to generalize its reasoning strategies to out-of-domain agentic scenarios. The evaluation of non-thinking model in the agent scenario is shown in Appendix Table 9 . 4.2 Results of DeepSeek-V3.2-Speciale Table 3: Benchmark performance and efficiency of reasoning models. For each benchmark, cells show accuracy and output token count (in thousands). The highest accur...

4 Evaluation

原文: MO) 5 5 5 We evaluated the English version of CMO 2025. The IMO 2025 and CMO 2025 problems, together with the inference code, can be found at: https://github.com/deepseek-ai/DeepSeek-Math-V2 . . Detailed evaluation protocols are provided in Appendix D . However, the token efficiency of DeepSeek-V3.2-Speciale remains significantly inferior to that of Gemini-3.0-Pro. To mitigate deployment costs and latency, we imposed stricter token constraints during the training of the official DeepSeek-V3.2, aiming to optimize the trade-off between performance and cost. We believe that token efficiency remai...

4 Evaluation

原文: e 5: Accuracy of general synthesized tasks on different models. Pass@K DeepSeek-v3.2-Exp Sonnet-4.5 Gemini-3.0 Pro GPT-5-Thinking 1 12% 34% 51% 62% 2 18% 47% 65% 75% 4 26% 62% 74% 82% To investigate whether RL on synthetic data can generalize to different tasks or real-world environments, we apply RL to the SFT checkpoint of DeepSeek-V3.2 (denoted DeepSeek-V3.2-SFT). To exclude the effects of long CoT and other RL data, we conduct RL only on synthetic agentic tasks in non-thinking mode. We then compare the model with DeepSeek-V3.2-SFT and DeepSeek-V3.2-Exp, where DeepSeek-V3.2-Exp is trained w...

4 Evaluation

原文: or comparison, we also implement a parallel scaling baseline, Parallel-fewest-step , which samples N independent trajectories and selects the trajectory with the fewest steps. We evaluate these strategies on the BrowseComp benchmark ( wei2025browsecomp ) . As illustrated in Figure 6 , under varying compute budgets, context management leads to significant performance gains by allowing the model to scale up test-time compute, providing more space to perform additional execution steps. For example, Summary extends the average steps to 364, achieving a performance improvement of up to 60.2. Howeve...

4.1 Main Results

（4.1 Main Results - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: We evaluate models on MMLU-Pro ( mmlu_pro ) , GPQA Diamond ( gpqa ) , Human Last Exam (HLE) Text-only ( hle ) , LiveCodeBench (2024.08-2025.04), Codeforces, Aider-Polyglot, AIME 2025, HMMT Feb 2025, HMMT Nov 2025 ( balunovic2025matharena ) , IMOAnswerBench ( luong-etal-2025-towards ) , Terminal Bench 2.0, SWE-Verified ( swe_verified ) , SWE Multilingual ( yang2025swesmith ) , BrowseComp ( wei2025browsecomp ) , BrowseCompZh ( zhou2025browsecomp ) , τ 2 \tau^{2} -bench ( tau2 ) , MCP-Universe ( mcpuniverse ) , MCP-Mark ( mcpmark ) , and Tool-Decathlon ( li2025tool ) . Tool-use benchmarks are eva...

4.1 Main Results

原文: 1-COT) 64.0 84.5 90.7 82.6 83.0 83.3 Codeforces (Rating) 1480 2537 2708 - - 2386 Math AIME 2025 (Pass@1) 87.0 94.6 95.0 94.5 78.3 93.1 HMMT Feb 2025 (Pass@1) 79.2 88.3 97.5 89.4 - 92.5 HMMT Nov 2025 (Pass@1) 81.7 89.2 93.3 89.2 - 90.2 IMOAnswerBench (Pass@1) - 76.0 83.3 78.6 - 78.3 Code Agent Terminal Bench 2.0 (Acc) 42.8 35.2 54.2 35.7 30.0 46.4 SWE Verified (Resolved) 77.2 74.9 76.2 71.3 69.4 73.1 SWE Multilingual (Resolved) 68.0 55.3 - 61.1 56.5 70.2 Search Agent BrowseComp (Pass@1) 24.1 54.9 - -/60.2* 44.0 51.4/67.6 * BrowseCompZh (Pass@1) 42.4 63.0 - 62.3 48.5 65.0 HLE (Pass@1) 32.0 35.2 ...

4.1 Main Results

原文: gy for the ’thinking mode’ is currently incompatible with Terminus; consequently, the reported score of 46.4 was achieved using the Claude Code framework. We also evaluated DeepSeek-V3.2 with Terminus in non-thinking mode, yielding a score of 39.3. For SWE-bench Verified, the primary score was obtained using our internal framework. Robustness tests across other settings—including the Claude Code and RooCode frameworks, as well as non-thinking mode—produced consistent results, ranging from 72 to 74. For the search agent evaluation, we assess our models using a standard commercial search API. Si...

4.1 Main Results

原文: ntly outperforms existing open models. Notably, since the environments and toolsets employed in these benchmarks were not encountered during RL training, the observed improvements demonstrate DeepSeek-V3.2’s capacity to generalize its reasoning strategies to out-of-domain agentic scenarios. The evaluation of non-thinking model in the agent scenario is shown in Appendix Table 9 .

4.2 Results of DeepSeek-V3.2-Speciale

（4.2 Results of DeepSeek-V3.2-Speciale - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Table 3: Benchmark performance and efficiency of reasoning models. For each benchmark, cells show accuracy and output token count (in thousands). The highest accuracy per benchmark is in bold; the second-highest is underlined. Benchmark GPT-5 Gemini-3.0 Kimi-K2 DeepSeek-V3.2 DeepSeek-V3.2 High Pro Thinking Thinking Speciale AIME 2025 (Pass@1) 94.6 (13k) 95.0 (15k) 94.5 (24k) 93.1 (16k) 96.0 (23k) HMMT Feb 2025 (Pass@1) 88.3 (16k) 97.5 (16k) 89.4 (31k) 92.5 (19k) 99.2 (27k) HMMT Nov 2025 (Pass@1) 89.2 (20k) 93.3 (15k) 89.2 (29k) 90.2 (18k) 94.4 (25k) IMOAnswerBench (Pass@1) 76.0 (31k) 83.3 (18k...

4.2 Results of DeepSeek-V3.2-Speciale

原文: onstraints during the training of the official DeepSeek-V3.2, aiming to optimize the trade-off between performance and cost. We believe that token efficiency remains a critical area for future investigation. Table 4: Performance of DeepSeek-V3.2-Speciale in top-tier mathematics and coding competitions. For ICPC WF 2025, we report the number of submissions for each successfully solved problem. DeepSeek-V3.2-Speciale ranked 2nd in ICPC WF 2025 and 10th in IOI 2025. Competition P1 P2 P3 P4 P5 P6 Overall Medal IMO 2025 7 7 7 7 7 0 35/42 Gold CMO 2025 18 18 9 21 18 18 102/126 Gold IOI 2025 100 82 7...

4.3 Synthesis Agentic Tasks

（4.3 Synthesis Agentic Tasks - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: In this section, we perform ablation experiments to study the effect of synthetic agentic tasks. We focus on two questions. First, are synthetic tasks sufficiently challenging for reinforcement learning? Second, how well do these synthetic tasks generalize, i.e., can they transfer to different downstream tasks or real-world environments? To address the first question, we randomly sample 50 instances from the general synthesized agentic tasks and evaluate both the model used for synthesis and frontier closed-source LLMs. As shown in Table 5 , DeepSeek-V3.2-Exp attains an accuracy of only 12%, w...

4.4 Context Management of Search Agent

（4.4 Context Management of Search Agent - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Figure 6: Accuracy of Browsecomp with different test-time compute expansion strategies. Even with extended context windows such as 128k, agentic workflows, particularly in search-based scenarios, frequently encounter maximum length limitations that prematurely truncate the reasoning process. This bottleneck inhibits the full realization of test-time compute potential. To address this, we introduce context management employing simple strategies to extend token budgets at test time， when the token usage exceeds 80% of the context window length. These strategies include (1) Summary , which summar...

4.4 Context Management of Search Agent

原文: for actual compute costs when benchmarking model performance. Meanwhile, finding the optimal combination of serial and parallel scaling to maximize both efficiency and scalability remains a crucial direction for future work.

5 Conclusion, Limitation, and Future Work

（5 Conclusion, Limitation, and Future Wor - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: In this work, we introduced DeepSeek-V3.2, a framework that effectively bridges the gap between computational efficiency and advanced reasoning capabilities. Using DSA, we addressed critical computation complexity without sacrificing long-context performance. By increasing computational budget, DeepSeek-V3.2 achieves comparable performance with GPT-5 on reasoning benchmarks. Finally, the integration of our large-scale agentic task synthesis pipeline significantly enhances tool-use proficiency, unlocking new possibilities for robust and generalizable AI agents with open LLM. Furthermore, our hi...

Appendix A MHA and MQA Modes of MLA

（Appendix A MHA and MQA Modes of MLA - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: (a) MHA mode of MLA. (b) MQA mode of MLA. Figure 7: Illustration of the MHA and MQA modes of MLA. For DeepSeek-V3.1-Terminus, the MHA mode is used for training and prefilling, while the MQA mode is used for decoding. Figure 7 illustrates two aspects of MLA – the MHA and MQA modes – as well as the transformation between them.

Appendix B Cold Start Template

（Appendix B Cold Start Template - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Table 6: An example of the reasoning data system prompt. The system prompt requires the model to output the reasoning process in the tag . Reasoning System Prompt You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. Please first reason before giving the final answer. The reasoning process enclosed within . The final answer is output after the tag. Prompt Given a linked list, swap every two adjacent nodes and return its head...

Appendix B Cold Start Template

原文: u have the answer, stop reasoning and present your solution using Markdown and LaTeX. - Do NOT invoke any tools in your presented final solution steps. - To improve efficiency and accuracy, you should prefer code execution over language-based reasoning whenever possible. Keep your reasoning succinct; let the code do the heavy lifting. ## Tools You have access to the following tools: {TOOL-DESCRIPTIONS} Important: ALWAYS adhere to this exact format for tool use: {TOOLCALL-FORMAT} Prompt Given a linked list, swap every two adjacent nodes and return its head … Agent Response with Thinking ...

Appendix C Non-thinking DeepSeek-V3.2 Agentic Evaluation

（Appendix C Non-thinking DeepSeek-V3.2 A - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Table 9: Comparison between DeepSeek-V3.2 non-thinking and thinking modes. The terminal bench scores are evaluated with the Claude Code framework in the table. Non-thinking score of Terminal Bench 2.0 with Terminus framework is 39.3. Benchmark (Metric) non-thinking thinking Code Agent Terminal Bench 2.0 (Acc) 37.1 46.4 SWE Verified (Resolved) 72.1 73.1 SWE Multilingual (Resolved) 68.9 70.2 ToolUse τ 2 \tau^{2} -bench (Pass@1) 77.2 80.3 MCP-Universe (Success Rate) 38.6 45.9 MCP-Mark (Pass@1) 26.5 38.0 Tool-Decathlon (Pass@1) 25.6 35.2 The performance of non-thinking mode is slightly worse than ...

Appendix D Evaluation Method of IOI, ICPC World Final, IMO, and CMO

（Appendix D Evaluation Method of IOI, ICP - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: For all competitions, the model’s maximum generation length is set to 128k. No tools or internet access are used, and testing strictly adheres to the contest’s time and attempt limits. For the IOI evaluation, we designed our submission strategy in accordance with the official competition rules, which permit up to 50 submissions per problem and score each submission based on the maximum points achieved across all subtasks. Specifically, we first sampled 500 candidate solutions for each problem, then applied a multi-stage filtering pipeline. In the initial stage, we eliminated invalid submission...

Appendix E Author List

（Appendix E Author List - 详见原文） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: Research & Engineering : Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan*, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou*, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang...

Appendix E Author List

原文: ao, Yisong Wang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuduan Wang, Yue Gong*, Yuhan Wu, Yuheng Zou, Yukun Li, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z.F. Wu, Z.Z. Ren, Zehua Zhao, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhiyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Zizheng Pan, Zon...

← 返回首页详细解读