In handling the above OPD objective, prior works usually simplify the full-vocabulary KL loss into a token-level KL estimate at each token position, and reuse the RL framework by replacing sg[log π_{E_i}(y_t | x, y ...
5.2. RL and OPD Infrastructures
Our post-training infrastructure is built upon the scalable framework developed for DeepSeek-V3.2. Specifically, we integrate the same distributed training stack described in Section 3.5 and the rollout engine introduced earlier for efficient auto-regressive sampling. Building on this foundation, we introduce the following principal enhancements in the present work. These designs enable efficient execution of ultra-long-context RL and OPD merging tasks involving over ten distinct teacher models, thereby substantially accelerating the iteration cycle for model releases.
5.2.1. FP4 Quantization Integration
We apply FP4 (MXFP4) quantization to accelerate both rollouts and all inference-only forward passes, including those of teacher and reference models, thereby reducing memory traffic and sampling latency. As detailed in Section 3.4, we directly use native FP4 weights during the rollout and inference phases. For training steps, FP4 quantization is simulated via a lossless FP4-to-FP8 dequantization step, allowing seamless reuse of the existing FP8 mixed-precision framework with FP32 master weights and requiring no modification to the backward pipeline.
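One way to see why an FP4-to-FP8 step can be lossless is to enumerate the FP4 element values and check that each one is exactly representable in FP8. The sketch below assumes the element encoding is E2M1 (1 sign, 2 exponent, 1 mantissa bit, bias 1) and the FP8 target is E4M3 with a maximum normal value of 448; both are assumptions about the formats, not details taken from this section, and the per-block power-of-two scale of MXFP4 is assumed to be carried separately. The function names are illustrative.

```python
import math

def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (assumed layout: sign, 2 exponent bits, 1 mantissa bit)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: the only values are 0 and 0.5.
        return sign * man * 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

def fits_e4m3(x: float) -> bool:
    """True if x is zero or exactly representable as a normal E4M3 value."""
    if x == 0.0:
        return True
    m, _ = math.frexp(abs(x))              # abs(x) = m * 2**e with 0.5 <= m < 1
    four_bits = (m * 16.0).is_integer()    # 1 implicit + 3 stored mantissa bits
    in_range = 2.0 ** -6 <= abs(x) <= 448.0
    return four_bits and in_range

# All 16 FP4 codes, and whether every decoded value survives the cast unchanged.
E2M1_VALUES = [decode_e2m1(c) for c in range(16)]
ALL_FIT = all(fits_e4m3(v) for v in E2M1_VALUES)
```

Under these assumptions every element value fits, so the dequantized weights carry exactly the information of the FP4 weights and the FP8 training path sees no extra rounding.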
5.2.2. Efficient Teacher Scheduling for Full-Vocabulary OPD
Our framework supports full-vocabulary On-Policy Distillation (OPD) with an effectively unbounded number of teachers, each potentially comprising trillions of parameters. To enable this, all teacher weights are offloaded to centralized distributed storage and loaded on demand during the teacher forward pass, with ZeRO-like parameter sharding to alleviate both I/O and DRAM pressure. Furthermore, naively materializing logits for a vocabulary size |V| > 100k across all teachers is prohibitive, even when spooled to disk. We address this by caching only the last-layer teacher hidden states in a centralized buffer during the forward pass. At training time, these cached states are retrieved and passed through the corresponding prediction head module to reconstruct the full logits on the fly. This design adds negligible recomputation overhead while entirely avoiding the memory burden of explicitly materializing logits. To reduce the GPU memory footprint of the teacher prediction heads, we sort training samples by teacher index during data dispatching. This ensures that each distinct teacher head is loaded only once per mini-batch and that at most one teacher head resides in device memory at any time. All loading and offloading of parameters and hidden states runs asynchronously in the background without blocking computation on the critical path. Finally, the exact KL divergence between teacher and student logits is computed by a specialized TileLang kernel, which both accelerates the computation and reduces dynamic memory allocation.
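The scheduling idea above can be sketched in a few lines of plain Python. This is a toy illustration, not the TileLang kernel or the actual framework: `load_head` is a hypothetical stand-in for fetching one teacher head from storage, vectors are Python lists, and the KL direction (student relative to teacher) is an assumption, since this section only says that the exact KL between teacher and student logits is computed.

```python
import math
from itertools import groupby

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def project(hidden, head):
    """Rebuild logits from a cached hidden state; head is a |V| x d weight matrix."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]

def full_vocab_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher) over the whole vocabulary at one position."""
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(ls, lt))

def opd_losses(samples, load_head):
    """samples: dicts with 'teacher', 'hidden' (cached state), 'student_logits'.

    Sorting by teacher index means each head is loaded once and only one
    head is held at a time. Returns per-sample losses (in sorted order)
    and the sequence of head loads.
    """
    losses, loads = [], []
    ordered = sorted(samples, key=lambda s: s["teacher"])
    for teacher_id, group in groupby(ordered, key=lambda s: s["teacher"]):
        head = load_head(teacher_id)   # replaces the previously resident head
        loads.append(teacher_id)
        for s in group:
            teacher_logits = project(s["hidden"], head)
            losses.append(full_vocab_kl(s["student_logits"], teacher_logits))
    return losses, loads
```

Only the hidden state of size d is stored per position; the |V|-sized logit vector exists just long enough to compute the loss, which is the memory saving the paragraph describes.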
5.2.3. Preemptible and Fault-Tolerant Rollout Service
To maximize GPU resource utilization while enabling rapid hardware provisioning for high-priority tasks, our GPU cluster employs a cluster-wide preemptive task scheduler, where any running task may be preempted at any time. In addition, hardware failures are prevalent in large-scale GPU clusters. To this end, we implement a preemptible and fault-tolerant LLM generation service for RL/OPD rollout.
Specifically, we implement a token-granular Write-Ahead Log (WAL) for each generation request. Whenever a new token is generated for a request, we immediately append it to that request's WAL. During preemption, we pause the inference engine and save the KV cache of unfinished requests. On resumption, we use the persisted WAL and the saved KV cache to continue decoding. Even if a fatal hardware error occurs, we can rerun the prefill phase on the tokens persisted in the WAL to reconstruct the KV cache.
Importantly, regenerating unfinished requests from scratch is mathematically incorrect because it introduces length bias. Shorter responses are more likely to survive an interruption, so regenerating from scratch makes the model more likely to produce shorter sequences every time an interruption occurs. If the inference stack is batch-invariant and deterministic, this correctness issue can also be resolved by fixing the seed of the sampler's pseudorandom number generator and regenerating. However, that approach still pays the cost of rerunning the decoding phase and is far less efficient than our token-granular WAL approach.
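A minimal sketch of a token-granular WAL follows, under stated assumptions: the on-disk layout (one JSON value per line), the `TokenWAL` class, and the `next_token` callback are all hypothetical and chosen for illustration; KV-cache saving is omitted, so resumption here corresponds to the "rerun prefill from persisted tokens" path.

```python
import json
import os
import tempfile

class TokenWAL:
    """Append-only per-request token log (hypothetical layout: one JSON line per token)."""

    def __init__(self, path):
        self.path = path

    def append(self, token_id):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(token_id) + "\n")
            f.flush()
            os.fsync(f.fileno())   # make the token durable before moving on

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

def generate(prompt, wal, next_token, max_new_tokens, stop_after=None):
    """Continue decoding from the WAL instead of regenerating from scratch.

    next_token(prompt, tokens) stands in for prefill plus one decode step.
    stop_after simulates a preemption after that many new tokens.
    """
    tokens = wal.replay()          # tokens persisted before any interruption
    steps = 0
    while len(tokens) < max_new_tokens:
        if stop_after is not None and steps == stop_after:
            return tokens, False   # interrupted; the WAL already holds `tokens`
        tok = next_token(prompt, tokens)
        wal.append(tok)            # log first, then extend the in-memory state
        tokens.append(tok)
        steps += 1
    return tokens, True
```

Because the resumed call starts from the logged prefix, an interrupted request keeps the tokens it already sampled, so interruptions do not change the length distribution of finished responses.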
5.2.4. Scaling RL Framework for Million-Token Context
We introduce targeted optimizations for efficient RL and OPD on million-token sequences. During the rollout phase, we adopt a preemptible and fault-tolerant rollout service, detailed in Section 5.2.3. For the inference and training phases, we decompose the rollout data format into lightweight metadata and heavy per-token fields. During data dispatching, the metadata for the entire rollout data can be loaded to perform global shuffling and packing layout computation. Heavy per-token fields are loaded via a shared-memory data loader to eliminate intra-node data redundancy and are released immediately after being consumed at mini-batch granularity, which greatly reduces both CPU and GPU memory pressure. The number of on-device mini-batches is determined dynamically according to the workload, enabling an efficient trade-off between computational throughput and I/O overlap.
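The metadata/per-token split can be illustrated with a small sketch. It assumes the metadata is just a `(sample_id, num_tokens)` pair and uses a shuffled first-fit packing heuristic; the actual metadata schema, packing algorithm, and shared-memory loader are not described in this section, so `plan_layout`, `iter_minibatches`, and `load_tokens` are hypothetical names.

```python
import random

def plan_layout(metadata, capacity, seed=0):
    """Plan packs from lightweight metadata only.

    metadata: list of (sample_id, num_tokens). Returns lists of sample ids
    whose total token count fits within `capacity` (shuffled first-fit).
    """
    order = list(metadata)
    random.Random(seed).shuffle(order)     # global shuffle without touching token data
    packs, used_tokens = [], []
    for sample_id, n in order:
        if n > capacity:
            raise ValueError(f"sample {sample_id} exceeds pack capacity")
        for i, used in enumerate(used_tokens):
            if used + n <= capacity:
                packs[i].append(sample_id)
                used_tokens[i] += n
                break
        else:
            packs.append([sample_id])
            used_tokens.append(n)
    return packs

def iter_minibatches(packs, load_tokens):
    """Load heavy per-token fields lazily, one pack at a time.

    Each yielded dict can be dropped by the consumer before the next pack
    is loaded, so peak memory is bounded by the largest pack.
    """
    for pack in packs:
        yield {sid: load_tokens(sid) for sid in pack}
```

The point of the split is visible in the signatures: layout planning never calls `load_tokens`, so shuffling and packing over the whole rollout set costs memory proportional to the number of samples rather than the number of tokens.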
5.2.5. Sandbox Infrastructure for Agentic AI
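To meet the diverse execution demands of agentic AI during post-training and evaluation, we build a production-grade sandbox platform, DeepSeek Elastic Compute (DSec). DSec comprises three Rust components, the API gateway (Apiserver), the per-host agent (Edge), and the cluster monitor (Watcher), that are interconnected by a custom RPC protocol and scale horizontally atop the 3FS distributed filesystem (DeepSeek-AI, 2025). In production, a single DSec cluster manages hundreds of thousands of concurrent sandbox instances.
The design of DSec is motivated by four observations: (1) agentic workloads are highly heterogeneous, ranging from lightweight function calls to full software-engineering pipelines, with differing operating-system and security requirements; (2) environment images are numerous and large, yet must load quickly and support iterative customization; (3) high-density deployment demands efficient CPU and memory utilization; and (4) sandbox lifecycles must be coordinated with GPU training schedules, including preemption and checkpoint-based resumption. Based on these observations, we describe the four core designs of DSec in turn.
Four Execution Substrates Behind One Unified Interface. DSec exposes a unified Python SDK (libdsec) that abstracts four execution substrates. Function Call dispatches stateless requests to a pool of pre-warmed containers, eliminating cold-start overhead. Container is fully Docker-compatible and uses the on-demand loading of EROFS (Gao et al., 2019) for efficient image assembly. microVM, built on Firecracker (Agache et al., 2020), provides VM-level isolation for security-sensitive, high-density deployments. fullVM, built on QEMU (Bellard, 2005), supports arbitrary guest operating systems. All four substrates share a unified API covering command execution, file transfer, and TTY access, so switching between them requires only a parameter change.
Fast Image Loading via Layered Storage. DSec reconciles fast startup with a large and growing library of environment images through layered, on-demand loading. For containers, base images and filesystem commits are stored as 3FS-backed read-only EROFS layers and mounted directly as overlay lower directories. At mount time, file metadata is kept on local disk so that it is immediately available, while data blocks are fetched from 3FS on demand when requested.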
For microVMs, DSec uses the overlaybd (Li et al., 2020) disk format: the read-only base layer resides on 3FS for cross-instance sharing, while writes go to a local copy-on-write layer. Such snapshots are chainable, facilitating efficient versioning and millisecond-scale resumption.
Density Optimizations Under Massive Concurrency. To accommodate hundreds of thousands of sandboxes per cluster, DSec tackles two resource bottlenecks. First, it mitigates duplicate page-cache footprints in virtualized environments and applies memory reclamation to enable safe overcommitment. Second, it alleviates spinlock contention in the container runtime, which lowers per-sandbox CPU overhead and significantly raises the deployment density per host.
Trajectory Logging and Preemption-Safe Resumption. DSec maintains a globally ordered trajectory log for each sandbox that durably records every command invocation and its result. The trajectory log serves three purposes: (1) client fast-forwarding: when a training task is preempted, sandbox resources are retained, and on resumption DSec replays the cached results of previously completed commands, which speeds up recovery and avoids errors from re-executing non-idempotent operations; (2) fine-grained provenance: the origin of every state change and its corresponding result can be traced; and (3) deterministic replay: any historical session can be reproduced exactly from its trajectory.
5.3.1. Evaluation Setup
Knowledge and Reasoning. Knowledge and reasoning datasets include MMLU-Pro (Wang et al., 2024b), GPQA (Rein et al., 2023), Humanity's Last Exam (HLE) (Phan et al., 2025), Simple-QA Verified (Haas et al., 2025), Chinese-SimpleQA (He et al., 2024), LiveCodeBench-v6 (Jain et al., 2024), CodeForces (Internal Benchmark), HMMT 2026 Feb, Apex (Balunović et al., 2025), Apex Shortlist (Balunović et al., 2025), IMOAnswerBench (Luong et al., 2025), and PutnamBench (Tsoukalas et al., 2024).
For code, we evaluate the DeepSeek-V4 series on LiveCodeBench-v6 and an internal Codeforces benchmark. For Codeforces, we collect 14 Codeforces Division 1 contests comprising 114 problems (May 2025 to November 2025). The Elo rating is computed as follows. For each contest, we generate 32 candidate solutions per problem. Independently for each problem, we sample 10 of these solutions without replacement and arrange them in a random order to form the submission sequence. Each submission is judged against a test suite constructed by domain experts. The score for a solved problem follows the penalty scheme of OpenAI (2025): the model receives the median score of the human participants who solved that problem with the same number of failed attempts. This yields a total contest score for each sampled submission sequence, which is converted into a contest rank and then into an estimated rating via the standard Codeforces rating system. The contest-level expected rating is defined as the expectation of this estimated rating over all possible random samplings and orderings of the 10 submissions per problem. The model's overall rating is the average of the contest-level expected ratings over the 14 contests.
For reasoning and knowledge tasks, we set the temperature to 1.0 and set the context window to 8K, 128K, and 384K tokens for the Non-think, High, and Max modes, respectively. For math tasks (e.g., HMMT, IMOAnswerBench, Apex, and HLE), we evaluate with the following template: "{question}\nPlease reason step by step, and put your final answer within \boxed{}." For math tasks with DeepSeek-V4-Pro-Max, we use the following template to elicit deeper reasoning: "Solve the following problem. The problem may ask you to prove a statement, or ask for an answer. If finding an answer is required, you should come up with the answer, and your final solution should also be a rigorous proof of that answer being valid.\n\n{question}".
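For a single problem, the expectation over samplings and orderings has a closed form, which the sketch below works out. It is a per-problem illustration only: the conversion from contest score to rank and rating is not reproduced, the function names are hypothetical, and the `score_by_failures` values passed in are placeholders rather than the median human scores used in the paper.

```python
from math import comb

def first_solve_distribution(n_candidates, n_correct, n_submit):
    """P(first accepted submission is attempt k) for k = 1..n_submit.

    n_submit solutions are drawn without replacement from n_candidates,
    of which n_correct are correct, and submitted in random order.
    """
    n_wrong = n_candidates - n_correct
    probs = []
    for k in range(1, n_submit + 1):
        # The first k-1 draws are all wrong, then draw k is correct.
        all_wrong = comb(n_wrong, k - 1) / comb(n_candidates, k - 1)
        probs.append(all_wrong * n_correct / (n_candidates - (k - 1)))
    return probs

def expected_problem_score(probs, score_by_failures):
    """score_by_failures[j] is the score credited after j failed attempts."""
    return sum(p * score_by_failures[j] for j, p in enumerate(probs))
```

With 32 candidates and 10 submissions, `first_solve_distribution(32, c, 10)` gives the weight of each "solved after j failures" outcome for a problem with c correct candidates; the probabilities sum to the chance that at least one of the 10 sampled solutions is correct.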
For formal math tasks, we evaluate in an agentic setting on Lean v4.28.0-rc1 (Moura and Ullrich, 2021), with access to the Lean compiler and a semantic tactic search engine, running up to 500 tool calls with max reasoning effort. In addition, we evaluate a more compute-intensive pipeline in which candidate natural-language solutions are first generated and filtered by self-verification (Shao et al., 2025), and the retained solutions are then provided as guidance to a formal agent for proving the corresponding Lean statement. This design uses informal reasoning to improve exploration while preserving strict correctness through formal verification. In both settings, a submission is counted as correct only if the strict verifier Comparator accepts it. Because the K2.6 and GLM-5.1 APIs were too heavily loaded to return responses to our queries, we leave some of their entries blank.
For search agent tasks (BrowseComp, HLE w/ tool), we also use an in-house harness with web search and Python tools, and set the maximum number of interaction steps to 500 and the maximum context length to 512K tokens. For BrowseComp, we use the same discard-all context management strategy as DeepSeek-V3.2 (DeepSeek-AI, 2025).
5.3.2. Evaluation Results
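Knowledge. In evaluations of general world knowledge, DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, establishes a new state of the art among open-source large language models. On SimpleQA-Verified, DeepSeek-V4-Pro-Max outperforms all existing open-source baselines by a margin of 20 absolute percentage points. Despite this progress, it still trails the leading proprietary model, Gemini-3.1-Pro. In educational knowledge and reasoning, DeepSeek-V4-Pro-Max marginally outperforms Kimi and GLM on the MMLU-Pro, GPQA, and HLE benchmarks, although it remains behind the leading proprietary models. Overall, DeepSeek-V4-Pro-Max marks a significant milestone in advancing the world knowledge capabilities of open-source models.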
Table 6 | Comparison between DeepSeek-V4-Pro-Max and closed/open source models. "Max", "xHigh", and "High" denote reasoning effort. The best results are highlighted in bold; the second-best results are underlined.

| Benchmark (Metric) | Opus-4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High | K2.6 Thinking | GLM-5.1 Thinking | DS-V4-Pro Max |
|---|---|---|---|---|---|---|
| Knowledge & Reasoning | | | | | | |
| MMLU-Pro (EM) | 89.1 | 87.5 | 91.0 | 87.1 | 86.0 | 87.5 |
| SimpleQA-Verified (Pass@1) | 46.2 | 45.3 | 75.6 | 36.9 | 38.1 | 57.9 |
| Chinese-SimpleQA (Pass@1) | 76.4 | 76.8 | 85.9 | 75.9 | 75.0 | 84.4 |
| GPQA Diamond (Pass@1) | 91.3 | 93.0 | 94.3 | 90.5 | 86.2 | 90.1 |
| HLE (Pass@1) | 40.0 | 39.8 | 44.4 | 36.4 | 34.7 | 37.7 |
L...
In addition, a significant performance gap exists between DeepSeek-V4-Flash and DeepSeek-V4-Pro on knowledge-based tasks; this is anticipated, as larger parameter counts facilitate greater knowledge retention during pre-training. Notably, both models demonstrate improved results on knowledge benchmarks when allocated higher reasoning effort.
Reasoning. DeepSeek-V4-Pro-Max outperforms all prior open-source models on every reasoning benchmark and matches state-of-the-art proprietary models on several metrics, while the smaller DeepSeek-V4-Flash-Max also surpasses K2.6-Thinking, the previous best open-source model, on code and math reasoning tasks. DeepSeek-V4-Pro and DeepSeek-V4-Flash also excel in competitive programming. According to our evaluation, their performance is comparable to GPT-5.4, which is the first time an open-source model has matched a proprietary model on this task. On the Codeforces leaderboard, DeepSeek-V4-Pro-Max currently ranks 23rd among human participants. In addition, DeepSeek-V4 shows strong performance on formal math tasks in both the agentic and the compute-intensive settings.
Table 7 | Comparison among different sizes and modes of DeepSeek-V4 series. "Non-Think", "High", and "Max" denote reasoning effort.
Columns: Benchmark (Metric); DeepSeek-V4-Flash (Non-Think, High, Max); DeepSeek-V4-Pro (Non-Think, High, Max).
Knowledge & Reasoning
MMLU-Pr...
Under an agentic setup, it achieves state-of-the-art results, shown in Figure 8, outperforming prior models such as Seed Prover (Chen et al., 2025). With a more compute-intensive pipeline, performance further improves, surpassing systems including Aristotle (Achim et al., 2025) and matching the best known results under this setting.
Agent. The DeepSeek-V4 series demonstrates strong agent performance in evaluations. For code agent tasks, DeepSeek-V4-Pro achieves results comparable to K2.6 and GLM-5.1, though all these open models still lag behind their closed-source counterparts. DeepSeek-V4-Flash underperforms DeepSeek-V4-Pro on coding tasks, especially on Terminal Bench 2.0. A similar trend is observed in the other agent evaluations. Notably, DeepSeek-V4-Pro performs well on MCPAtlas and Toolathlon, two evaluation test sets that cover a wide range of tools and MCP services, which indicates that our model generalizes well and does not perform well only on internal frameworks.
DeepSeek-V4-Pro outperforms Gemini-3.1-Pro on the MRCR task, which measures in-context retrieval, but remains behind Claude Opus 4.6. As illustrated in Figure 9, retrieval performance remains highly stable within a 128K context window. While a performance degradation becomes visible beyond the 128K mark, the model's retrieval capabilities at 1M tokens remain remarkably strong compared to both proprietary and open-source counterparts. Unlike MRCR, CorpusQA is closer to real-world scenarios. The evaluation results also indicate that DeepSeek-V4-Pro outperforms Gemini-3.1-Pro.
Reasoning Effort. As shown in Table 7, the Max mode, which uses a longer context and a reduced length penalty during RL, outperforms the High mode on the most challenging tasks. Figure 10 compares the performance and cost of DeepSeek-V4-Pro, DeepSeek-V4-Flash, and DeepSeek-V3.2 on representative reasoning and agentic tasks. By scaling test-time compute, the DeepSeek-V4 series achieves substantial improvements over its predecessor. Moreover, DeepSeek-V4-Pro exhibits higher token efficiency on reasoning tasks such as HLE.
5.4. Performance on Real-World Tasks
Standardized benchmarks often struggle to capture the complexities of diverse, real-world tasks, creating a gap between test results and actual user experience. To bridge this gap, we have developed proprietary internal metrics that prioritize real-world usage patterns over traditional benchmarks. This approach ensures that our optimizations translate into tangible benefits. Our evaluation framework specifically targets the primary use cases of the DeepSeek API and Chatbot, aligning model performance with practical demands.
5.4.1. Chinese Writing
One of the primary use cases for DeepSeek is Chinese writing. We conducted a rigorous evaluation on functional writing and creative writing. Table 12 presents a pairwise comparison between DeepSeek-V4-Pro and Gemini-3.1-Pro on functional writing tasks. These tasks consist of common daily writing queries, where prompts are typically concise and straightforward. Gemini-3.1-Pro was selected as the baseline, as it stands as the top-performing external model for Chinese writing in our evaluations. The results indicate that DeepSeek-V4-Pro outperforms the baseline with an overall win rate of 62.7% vs. 34.1%, primarily because Gemini occasionally lets its inherent stylistic preferences override the user's explicit requirements in Chinese writing scenarios.
5.4.2. Search
Search-augmented question answering is a core capability of the DeepSeek chatbot. On the DeepSeek web and app, the "non-think" mode employs Retrieval-Augmented Search (RAG), whereas the "thinking" mode utilizes agentic search.
Retrieval Augmented Search. We conducted a pairwise evaluation comparing DeepSeek-V4-Pro and DeepSeek-V3.2 across both objective and subjective Q&A categories. As presented in Table 11, DeepSeek-V4-Pro outperforms DeepSeek-V3.2 by a substantial margin, demonstrating a consistent advantage across both categories. The most pronounced gains are observed in single-value search and in planning and strategy tasks, which indicates that DeepSeek-V4-Pro excels at pinpointing factual answers and at producing structured plans from retrieved context. However, DeepSeek-V3.2 remains competitive on comparison and recommendation tasks, which suggests that DeepSeek-V4-Pro still has room to improve in scenarios that require balanced, multi-perspective reasoning over search results.
Agentic Search. Unlike standard RAG, agentic search lets the model iteratively invoke search and fetch tools for a single query, which significantly improves overall search performance. For the "thinking" mode of DeepSeek-Chat, we optimized agentic search to maximize response accuracy within a predefined "thinking budget". As shown in Table 9, agentic search consistently outperforms RAG, most notably on complex tasks. It also remains highly cost-effective: the cost of agentic search is only marginally higher than that of standard RAG (see Table 10).
5.4.3. White-Collar Task
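To rigorously evaluate the model's utility in sophisticated enterprise productivity scenarios, we constructed a comprehensive suite of 30 advanced Chinese professional tasks. These workflows deliberately encompass high-level cognitive demands, including in-depth information analysis, comprehensive document generation, and nuanced document editing, spanning a diverse spectrum of 13 critical industries (e.g., finance, education, law, and technology). The evaluation was conducted within an in-house agent harness equipped with basic tools, including Bash and web search. Given the open-ended nature of these tasks, automated metrics often fail to capture the nuances of a high-quality response. We therefore conducted a human evaluation comparing DeepSeek-V4-Pro-Max with Opus-4.6-Max. Annotators blindly rated model outputs along the following four dimensions:
• Task Completion: whether the core problem was successfully solved.
• Instruction Following: adherence to specific constraints and directions.
• Content Quality: factual accuracy, logical coherence, and professional tone.
• Formatting Aesthetics: layout readability and visual presentation.
As shown in Figure 11, DeepSeek-V4-Pro-Max outperforms Opus-4.6-Max on diverse Chinese white-collar tasks, achieving a non-loss rate of 63% and showing a consistent advantage across analysis, generation, and editing tasks. The per-dimension scores in Figure 12 highlight the model's main strengths in Task Completion and Content Quality. Specifically, DeepSeek-V4-Pro-Max proactively anticipates users' implicit intent by frequently providing supplementary insights and self-verification steps. The model also excels at long-form generation, producing in-depth, coherent narrative content rather than the oversimplified bullet lists that Opus-4.6-Max often generates. It also adheres closely to formal professional conventions, such as standardized Chinese hierarchical numbering. In Instruction Following, however, the model occasionally overlooks specific formatting constraints and scores slightly below Opus. It is also comparatively weaker at condensing large volumes of input text into concise summaries. Finally, its Formatting Aesthetics still has considerable room for improvement in the overall visual design of presentation slides.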
Figures 13, 14, and 15 present several test cases; due to the extensive length of certain outputs, only partial pages are displayed.
Figure 11 | Win-rate comparison across analysis, generation, editing tasks, and the overall performance. Win Rate: DeepSeek-V4-Pro-Max vs Opus-4.6-Max, reported as Win / Tie / Lose: analysis 55.0% / 8.0% / 37.0%; generation 52.0% / 10.0% / 38.0%; editing 47.0% / 18.0% / 35.0%; overall 53.0% / 10.0% / 37.0%.
[Figure 12: bar chart of scores (axis 70 to 100) for Task Completion, Instruction Following, Content Quality, Formatting Aesthetics, and Overall.]
5.4.4. Code Agent
To benchmark our coding agent capability, we curate tasks from real internal R&D workloads. We collect ∼200 challenging tasks from 50+ internal engineers, spanning feature development, bug fixing, refactoring, and diagnostics across diverse technology stacks including PyTorch, CUDA, Rust, and C++. Each task is accompanied by its original repository, the corresponding execution environment, and human-annotated scoring rubrics; after rigorous quality filtering, 30 tasks are retained as the evaluation set. As shown in Table 8, DeepSeek-V4-Pro significantly outperforms Claude Sonnet 4.5 and approaches the level of Claude Opus 4.5.
In a survey of DeepSeek developers and researchers (N = 85), all of whom use DeepSeek-V4-Pro for agentic coding in their daily work, respondents were asked whether DeepSeek-V4-Pro is ready to serve as their default and primary coding model compared with other frontier models. 52% said yes, 39% leaned toward yes, and fewer than 9% said no. Respondents find that DeepSeek-V4-Pro delivers satisfactory results on most tasks, but they also report subtle mistakes, misreadings of ambiguous prompts, and occasional overthinking.
6. Conclusion, Limitations, and Future Directions
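In this work, we present a preview version of the DeepSeek-V4 series, aiming at next-generation large language models that break the efficiency barrier of ultra-long-context processing. With a hybrid attention architecture that integrates CSA and HCA, the DeepSeek-V4 series achieves a dramatic leap in long-sequence efficiency. The architectural innovations, together with extensive infrastructure optimization, enable efficient native support for million-token contexts and establish a necessary foundation for future test-time scaling, long-horizon tasks, and emerging paradigms such as online learning. Evaluation results show that DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state of the art for open-source models. It substantially outperforms prior open-source models on knowledge benchmarks, delivers strong reasoning performance close to that of frontier proprietary models, and shows highly competitive agent capabilities. Meanwhile, DeepSeek-V4-Flash-Max reaches reasoning performance comparable to leading proprietary models while retaining a highly cost-efficient architecture. We believe the DeepSeek-V4 series opens a new era of million-token contexts for open-source models and paves the way toward greater efficiency, scale, and intelligence.
In pursuit of extreme long-context efficiency, the DeepSeek-V4 series adopts a bold architectural design. To minimize risk, we retained many preliminarily validated components and tricks. Although effective, these make the architecture relatively complex. In future iterations, we will conduct more comprehensive and principled studies to distill the architecture down to its most essential designs, making it more elegant and simple while preserving performance. Meanwhile, although Anticipatory Routing and SwiGLU Clamping have proven effective in mitigating training instability, their underlying principles are not yet well understood. We will actively study foundational problems in training stability and strengthen the monitoring of internal metrics, with the aim of establishing more principled and predictable methods for stable large-scale training.
In addition, beyond the MoE and sparse attention architectures, we will proactively explore new dimensions of model sparsity, such as sparser embedding modules (Cheng et al., 2026), to further improve computational and memory efficiency without sacrificing capability.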
We will also continuously investigate low-latency architectures and system techniques to make long-context deployment and interaction more responsive. Furthermore, we recognize the importance and practical value of long-horizon, multi-round agentic tasks, and will continue to iterate and explore in this direction. We are also working on incorporating multimodal capabilities into our models. Finally, we are committed to developing better data curation and synthesis strategies to consistently enhance model intelligence, robustness, and practical usability across an increasingly broad range of scenarios and tasks.
References
AA. GDPval-AA leaderboard, 2025. URL https://artificialanalysis.ai/methodology/intelligence-benchmarking#gdpval-aa.
T. Achim, A. Best, A. Bietti, K. Der, M. Fédérico, S. Gukov, D. Halpern-Leistner, K. Henningsgard, Y. Kudryashov, A. Meiburg, et al. Aristotle: IMO-level automated theorem proving. arXiv preprint arXiv:2510.01346, 2025.
A. Agache, M. Brooker, A. Florescu, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and D.-M. Popa. Firecracker: lightweight virtualization for serverless applications. In Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementatio...
Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmark, 2025.
C. Bandi, B. Hertzberg, G. Boo, T. Polakam, J. Da, S. Hassaan, M. Sharma, A. Park, E. Hernandez, D. Rambado, et al. MCP-Atlas: A large-scale benchmark for tool-use competency with real MCP servers. arXiv preprint arXiv:2602.00933, 2026.
F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, page 41, USA, 2005. USENIX Association.
I. Bello,...
The FACTS leaderboard: A comprehensive benchmark for large language model factuality. arXiv preprint arXiv:2512.10791, 2025.
X. Cheng, W. Zeng, D. Dai, Q. Chen, B. Wang, Z. Xie, K. Huang, X. Yu, Z. Hao, Y. Li, H. Zhang, H. Zhang, D. Zhao, and W. Liang. Conditional memory via scalable lookup: A new axis of sparsity for large language models. CoRR, abs/2601.07372, 2026. doi: 10.48550/ARXIV.2601.07372. URL https://doi.org/10.48550/arXiv.2601.07372.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066, 2024. URL https://doi.org/10.48550/arXiv.2401.06066.
T. Dao, D. Haziza, F. Massa, and G. Sizov. Flash-decoding for long-context i...
Hymba: A hybrid-head architecture for small language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=A1ztozypga.
X. Du, Y. Yao, K. Ma, B. Wang, T. Zheng, K. Zhu, M. Liu, Y. Liang, X. Jin, Z. Wei, et al. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739, 2025.
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedin... 2378. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1246. URL https://doi.org/10.18653/v1/n19-1246.
X. Gao, M. Dong, X. Miao, W. Du, C. Yu, and H. Chen. EROFS: a compression-friendly readonly file system for resource-scarce devices. In Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC ’19, page 149–162, USA, 2019. USENIX Association. ISBN 9781939133038.
A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and P. Minervini. Are we done with MMLU? CoRR, abs/2406.04127, 2024. URL https://...
Solar-Lezama, K. Sen, and I. Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks. Cited on, page 10, 2024.
M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Vol...
CorpusQA: A 10 million token benchmark for corpus-level analysis and reasoning. arXiv preprint arXiv:2601.14952, 2026.
T. Luong, D. Hwang, H. H. Nguyen, G. Ghiasi, Y. Chervonyi, I. Seo, J. Kim, G. Bingham, J. Lee, S. Mishra, A. Zhai, C. H. Hu, H. Michalewski, J. Kim, J. Ahn, J. Bae, X. Song, T. H. Trinh, Q. V. Le, and J. Jung. Towards robust mathematical reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025. URL https://aclanthology.org/2025.emnlp-main.1794/.
M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Sh...
GDPval: Evaluating AI model performance on real-world economically valuable tasks. arXiv preprint arXiv:2510.04374, 2025.
L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025.
W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. In T. Cohn, Y. He, and Y. Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 ...
Lipschultz, J. Newlan, J. Ji, K. Mohamed, K. Badola, K. Black, K. Millican, K. McDonell, K. Nguyen, K. Sodhia, K. Greene, L. L. Sjoesund, L. Usui, L. Sifre, L. Heuermann, L. cia Lago, L. McNealus, L. B. Soares, L. Kilpatrick, L. Dixon, L. L. B. Martins, M. Reid, M. Singh, M. Iverson, M. Gorner, M. Velloso, M. Wirth, M. Davidow, M. Miller, M. Rahtz, M. Watson, M. Risdal, M. Kazemi, M. Moynihan, M. Zhang, M. Kahng, M. Park, M. Rahman, M. Khatwani, N. Dao, N. shad Bardoliwalla, N. Devanathan, N. Dumai, N. Chauhan, O. Wahltinez, P. Botarda, P. Barnes, P. Barham, P. Michel, P. chong Jin, P. Georgiev...
2021. URL https://proceedings.neurips.cc/paper/2021/hash/92bf5e6240737e0326ea59846a83e076-Abstract.html.
B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolf, S. Dusan, V. Elango, M. Golub, A. Heinecke, P. James-Roxby, D. Jani, G. Kolhe, M. Langhammer, A. Li, L. Melnick, M. Mesmakhosroshahi, A. Rodriguez, M. Schulte, R. Shafipour, L. Shao, M. Siu, P. Dubey, P. Micikevicius, M. Naumov, C. Verrilli, R. Wittig, D. Burger, and E. Chung. Microscaling data formats for deep learning, 2023.
K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale, 2...
2019. URL http://arxiv.org/abs/1911.02150.
N. Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=fR3wGCk-IXp.
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 202...
2024. URL https://arxiv.org/abs/2407.11214.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. CoRR, abs/2408.15664, 2024a. URL https://doi.org/10.48550/arXiv.2408.15664.
L. Wang, Y. Cheng, Y. Shi, Z. Mo, Z. Tang, W. Xie, T. Wu, L. Ma, Y. Xia, J. Xue, et al. TileLang: Bridge programmability and performance in modern neural kernels. In The Fourteenth International Conference on Learning...
URL https://doi.org/10.18653/v1/2020.coling-main.419.
J. Yang, K. Lieret, C. E. Jimenez, A. Wettig, K. Khandpur, Y. Zhang, B. Hui, O. Press, L. Schmidt, and D. Yang. SWE-smith: Scaling data for software engineering agents, 2025. URL https://arxiv.org/abs/2504.21798.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pa...
Zhang, P. Yadav, et al. BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=YrycTjllL0.
Appendix
A. Author List and Acknowledgment
A.1. Author List
Authors are listed alphabetically by their first name. Names marked with * denote individuals who have departed from our team.
Research & Engineering: Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang*, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chenze Shao, Chong Ruan*, Conner Sun, Damai Dai, Daya Guo*, Dejian Yang, Deli Chen, Donghao Li, Erhang Li, Fangyun Lin, Fangzhou Yuan, Feiyu Xia, Fucong Dai, Guangbo Hao, Guanting Chen, Guoai Cao, Guolai Meng, Guowei Li, Han Yu, Han Z...
Ma, Yanfeng Luo, Yang Zhang, Yanhong Xu, Yanru Ma, Yanwen Huang, Yao Li, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yijia Wu, Yiliang Xiong, Ying He, Ying Zhou, Yingjia Luo, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiang Zhang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuanhao Li, Yuduan Wang, Yuhan Wu, Yuhao Meng, Yuheng Zou, YuKun Li, Yunfan Xiong, Yupeng Chen, Yuqian Cao, Yuqian Wang, Yushun Zhang, Yutong Lin, Yuxian Gu, Yuxiang Luo, Yuxia...
| Version | Tool Calls | Prefill (tokens) | Output (tokens) |
|---|---|---|---|
| V4 Agentic Search | 16.2 | 13649 | 1526 |
| V4 Retrieval Augmented Search | — | 10453 | 1308 |
Table 11 | Comparative Evaluation of DeepSeek-V4-Pro and DeepSeek-V3.2 on Search Q&A Tasks.

Internal Evaluation (内部综合评估)

| Category | Subcategory | # | V4 win | V3.2 win | tie | V4% | V3.2% | tie% |
|---|---|---|---|---|---|---|---|---|
| Objective Q&A (客观问答) | Single-value Search (单值信息查找) | 95 | 36 | 10 | 49 | 37.9 | 10.5 | 51.6 |
| Objective Q&A (客观问答) | Entity Search (实体信息查找) | 99 | 24 | 7 | 68 | 24.2 | 7.1 | 68.7 |
| Objective Q&A (客观问答) | Enumerative Search (枚举型信息查找) | 95 | 19 | 8 | 68 | 20.0 | 8.4 | 71.6 |
| Objective Q&A (客观问答) | Subtotal (小计) | 289 | 79 | 25 | 185 | 27.3 | 8.7 | 64.0 |
| Subjective Q&A (主观问答) | Causal Analysis (原因分析) | 100 | 28 | 5 | 67 | 28.0 | 5.0 | 67.0 |

C...
Internal Evaluation (内部综合评估)

| Category | Subcategory | # | DS win | Gem win | Tie | DS% | Gem% | Tie% |
|---|---|---|---|---|---|---|---|---|
| Business Writing (办公文本) | Report (报告) | 527 | 350 | 162 | 15 | 66.41 | 30.74 | 2.85 |
| Business Writing (办公文本) | Proposal (方案策划) | 291 | 181 | 103 | 7 | 62.20 | 35.40 | 2.41 |
| Business Writing (办公文本) | Education (教育培训) | 159 | 100 | 56 | 3 | 62.89 | 35.22 | 1.89 |
| Business Writing (办公文本) | Email & Letter (邮件书信) | 146 | 107 | 37 | 2 | 73.29 | 25.34 | 1.37 |
| Business Writing (办公文本) | Notice (通知公告) | 72 | 43 | 24 | 5 | 59.72 | 33.33 | 6.94 |
| Business Writing (办公文本) | Professional (专业文本) | 63 | 34 | 27 | 2 | 53.97 | 42.86 | 3.17 |
| Business Writing (办公文本) | Recruitment (招聘求职) | 42 | 27 | 15 | 0 | 64.29 | 35.71 | 0.00 |
| Business Writing (办公文本) | Technical (技术文本) | 29 | 22 | 7 | 0 | 75.86 | 24.14 | 0.00 |
| Business Writing (办公文本) | Review (介绍评价) | 20 | 15 | 5 | 0 | 75.00 | 25.00 | 0.00 |
| Business Writing (办公文本) | Subtotal (小计) | 1349 | 879 | 436 | 34 | 65.16 | 32.32 | 2.52 |

Media Writing (媒体文本): Social Medi...
Instruction Following (指令遵循) and Writing Quality (写作质量) results. Columns prefixed IF report Instruction Following; columns prefixed WQ report Writing Quality.

| Subcategory (文体) | # | IF DS | IF Gem | IF Tie | IF DS% | IF Gem% | IF Tie% | WQ DS | WQ Gem | WQ Tie | WQ DS% | WQ Gem% | WQ Tie% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fiction (小说故事) | 836 | 504 | 323 | 5 | 60.58 | 38.82 | 0.60 | 672 | 157 | 3 | 80.77 | 18.87 | 0.36 |
| General Fiction (泛小说故事) | 662 | 368 | 290 | 3 | 55.67 | 43.87 | 0.45 | 467 | 194 | 0 | 70.65 | 29.35 | 0.00 |
| Fan Fiction (同人文) | 410 | 253 | 150 | 3 | 62.32 | 36.95 | 0.74 | 338 | 67 | 1 | 83.25 | 16.50 | 0.25 |
| General Fan Fic. (泛同人文) | 202 | 111 | 90 | 1 | 54.95 | 44.55 | 0.50 | 161 | 40 | 1 | 79.70 | 19.80 | 0.50 |
| Narrative (记叙文) | 171 | 115 | 54 | 2 | 67.25 | 31.58 | 1.17 | 141 | 30 | 0 | 82.46 | 17.54 | 0.00 |
| General Prose (泛散文) | 124 | 83 | 40 | 1 | 66.94 | 32.26 | 0.81 | 88 | 36 | 0 | 70.97 | 29.03 | 0.00 |

Prose (散文): 112, 74, 38, 0, 6...