1 Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (OpenAI, 2024a; Anthropic, 2024; Google, 2024), progressively narrowing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024b, c; Guo et al., 2024; DeepSeek-AI, 2024a), LLaMA series (Touvron et al., 2023a, b; AI@Meta, 2024a, b), Qwen series (Qwen, 2023, 2024a, 2024b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts.
We also devote significant effort to optimizations of the training framework. Low-precision training has emerged as a promising solution for efficient training (Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b; Dettmers et al., 2022), its evolution being closely tied to advances in hardware capabilities (Micikevicius et al., 2022; Luo et al., 2024; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage.
During the post-training stage, we conduct Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During this stage, we distill the reasoning capability from the DeepSeek-R1 series of models, while carefully maintaining the balance between model accuracy and generation length. We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models on a range of standard and open-ended benchmarks.
Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective
• On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
• We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding to accelerate inference.

Pre-Training: Towards Ultimate Training Efficiency
• We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
• Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap and further reducing training costs.
• At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model; the subsequent training stages require only 0.1M GPU hours.

Post-Training: Knowledge Distillation from DeepSeek-R1
• We introduce a methodology to distill reasoning capabilities from long Chain-of-Thought (CoT) models, notably one of the DeepSeek-R1 series, into standard LLMs, particularly DeepSeek-V3, notably improving its reasoning performance while maintaining control over output style and length.
DeepSeek-V3 performs strongly across a wide range of benchmarks. A summary of its core evaluation results:
• Knowledge: (1) On educational benchmarks, DeepSeek-V3's performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.
• Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. (2) On coding-related tasks, DeepSeek-V3 emerges as a top-performing model on coding competition benchmarks such as LiveCodeBench.
In the remainder of this paper, we first present a detailed exposition of our model architecture (Section 2). Subsequently, we introduce our infrastructures, including compute clusters, the training framework, FP8 training support, and the inference deployment strategy, along with suggestions on hardware design (Section 3). Next, we describe our pre-training process (Section 4) and our post-training work (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).
2 Architecture
We first introduce the basic architecture of DeepSeek-V3, featuring Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. We then present a Multi-Token Prediction (MTP) training objective, which we observe to enhance overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).

The architectural design of DeepSeek-V3 follows three principles:
(1) Inference efficiency: the MLA architecture significantly reduces the KV cache footprint.
(2) Training cost-effectiveness: the DeepSeekMoE architecture enables parameter-efficient computation.
(3) Model performance: the MTP training objective improves the quality of the pre-trained model.

2.1 Basic Architecture

The basic architecture of DeepSeek-V3 remains within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3.

2.1.1 Multi-Head Latent Attention
For attention, DeepSeek-V3 adopts the MLA architecture. Let d denote the embedding dimension, n_h the number of attention heads, d_h the dimension per head, and \mathbf{h}_t \in \mathbb{R}^d the attention input of the t-th token. The core of MLA is a low-rank joint compression of the attention keys and values:

\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t,   (1)
[\mathbf{k}_{t,1}^{C}; \mathbf{k}_{t,2}^{C}; ...; \mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV},   (2)
\mathbf{k}_t^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_t),   (3)
\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}],   (4)
[\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; ...; \mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV},   (5)

where \mathbf{c}_t^{KV} \in \mathbb{R}^{d_c} is the compressed latent vector for keys and values; d_c (\ll d_h n_h) indicates the KV compression dimension; W^{DKV} \in \mathbb{R}^{d_c \times d} denotes the down-projection matrix; W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c} are the up-projection matrices for keys and values, respectively; and W^{KR} \in \mathbb{R}^{d_h^R \times d} is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE).

The core idea of MLA is to compress the high-dimensional keys and values into a low-dimensional latent space and decompress them back only when attention is computed. During generation, only \mathbf{c}_t^{KV} and the decoupled key \mathbf{k}_t^{R} need to be cached, which significantly reduces the KV cache memory and thus the inference cost.
For the attention queries, MLA also performs a low-rank compression, which reduces the activation memory during training:

\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t,   (6)
[\mathbf{q}_{t,1}^{C}; \mathbf{q}_{t,2}^{C}; ...; \mathbf{q}_{t,n_h}^{C}] = \mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q},   (7)
[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; ...; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \operatorname{RoPE}(W^{QR} \mathbf{c}_t^{Q}),   (8)
\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}],   (9)

where \mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'} is the compressed latent vector for queries; d_c' (\ll d_h n_h) denotes the query compression dimension; W^{DQ} \in \mathbb{R}^{d_c' \times d} and W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'} are the down- and up-projection matrices for queries, respectively; and W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'} produces the decoupled queries that carry RoPE. The attention outputs of all heads are finally combined through the output projection:

\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; ...; \mathbf{o}_{t,n_h}],   (11)

where W^{O} \in \mathbb{R}^{d \times d_h n_h} denotes the output projection matrix. This low-rank design preserves model performance while significantly reducing activation memory.
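To make the compression concrete, the following numpy sketch mirrors the key/value projections above for a single token. It is a minimal illustration under stated assumptions, not the released implementation: the dimensions follow DeepSeek-V3's reported settings (d = 7168, n_h = 128, d_h = 128, d_c = 512, d_h^R = 64), and the weight matrices are random stand-ins.

```python
import numpy as np

# Dimensions reported for DeepSeek-V3; the weights below are random stand-ins.
d, n_h, d_h, d_c, d_hR = 7168, 128, 128, 512, 64

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_c, d), dtype=np.float32) * 0.006        # down-projection
W_UK  = rng.standard_normal((n_h * d_h, d_c), dtype=np.float32) * 0.006  # up-projection (keys)
W_UV  = rng.standard_normal((n_h * d_h, d_c), dtype=np.float32) * 0.006  # up-projection (values)

h_t = rng.standard_normal(d, dtype=np.float32)   # attention input of token t

# Low-rank joint compression of keys and values (Eqs. 1, 2, 5).
c_kv = W_DKV @ h_t                      # latent vector, shape (512,)
k_C  = (W_UK @ c_kv).reshape(n_h, d_h)  # decompressed per-head keys
v_C  = (W_UV @ c_kv).reshape(n_h, d_h)  # decompressed per-head values

# Only c_kv plus the small decoupled RoPE key (d_hR values) is cached per
# token, instead of the full per-head keys and values (2 * n_h * d_h values).
cached  = d_c + d_hR                    # 512 + 64 = 576
full_kv = 2 * n_h * d_h                 # 32768
print(f"KV cache per token per layer: {cached} vs {full_kv} "
      f"(~{full_kv / cached:.0f}x smaller)")
```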
2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let \mathbf{u}_t denote the FFN input of the t-th token; the FFN output \mathbf{h}_t' is computed as:

\mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s} \operatorname{FFN}^{(s)}_i(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t} \operatorname{FFN}^{(r)}_i(\mathbf{u}_t),   (12)
g_{i,t} = g'_{i,t} / \sum_{j=1}^{N_r} g'_{j,t},   (13)
g'_{i,t} = s_{i,t} if s_{i,t} \in \operatorname{Topk}(\{s_{j,t} | 1 \le j \le N_r\}, K_r), and 0 otherwise,   (14)
s_{i,t} = \operatorname{Sigmoid}(\mathbf{u}_t^{\top} \mathbf{e}_i),   (15)

where N_s and N_r denote the numbers of shared and routed experts, respectively; K_r denotes the number of activated routed experts; g_{i,t} is the gating value for the i-th expert; s_{i,t} is the token-to-expert affinity; \mathbf{e}_i is the centroid vector of the i-th routed expert; and \operatorname{Topk}(\cdot, K) denotes the set comprising the K highest scores among the affinity scores calculated for the t-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.
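A minimal numpy sketch of this gating, with toy sizes (N_r = 8 routed experts, K_r = 2, not the production configuration) and random inputs standing in for real activations and centroids:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N_r, K_r, d = 8, 2, 16                  # toy sizes, not the production config
rng = np.random.default_rng(0)
e = rng.standard_normal((N_r, d))       # expert centroid vectors e_i
u_t = rng.standard_normal(d)            # FFN input of token t

s = sigmoid(e @ u_t)                    # affinity scores s_{i,t}   (Eq. 15)
top = np.argsort(s)[-K_r:]              # Topk selection             (Eq. 14)
g_prime = np.zeros(N_r)
g_prime[top] = s[top]
g = g_prime / g_prime.sum()             # normalize selected scores  (Eq. 13)
# h_t' would be u_t + shared-expert outputs + sum_i g[i]*expert_i(u_t)  (Eq. 12)
print(top, g[top])
```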
Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on an auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid an unbalanced load, but too large an auxiliary loss impairs model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a). Concretely, we introduce a bias term b_i for each routed expert and add it to the corresponding affinity score to determine the top-K routing:

g'_{i,t} = s_{i,t} if s_{i,t} + b_i \in \operatorname{Topk}(\{s_{j,t} + b_j | 1 \le j \le N_r\}, K_r), and 0 otherwise.   (16)

Note that the bias term is used only for routing; the gating value that is multiplied with the FFN output is still derived from the original affinity score s_{i,t}.
During training, we monitor the expert load at each step. At the end of the step, we decrease the bias term of an expert by γ if the expert is overloaded, and increase it by γ if the expert is underloaded, where γ is a hyper-parameter called the bias update speed. Through this dynamic adjustment, DeepSeek-V3 keeps a balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.

Complementary Sequence-Wise Auxiliary Loss. Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:

\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i,   (17)
f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}(s_{i,t} \in \operatorname{Topk}(\{s_{j,t} | 1 \le j \le N_r\}, K_r)),   (18)
s'_{i,t} = s_{i,t} / \sum_{j=1}^{N_r} s_{j,t},   (19)
P_i = \frac{1}{T} \sum_{t=1}^{T} s'_{i,t},   (20)

where the balance factor \alpha is a hyper-parameter, assigned an extremely small value for DeepSeek-V3; \mathbb{1}(\cdot) denotes the indicator function; and T denotes the number of tokens in a sequence. This loss encourages the expert load on each individual sequence to stay balanced.
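The following sketch illustrates the bias-based routing, the per-step bias update, and the sequence-wise loss on synthetic data. The batched shapes and the load statistic (fraction of routed tokens per expert) are illustrative assumptions; the source specifies only that overloaded experts have their bias decreased by γ and underloaded experts increased by γ.

```python
import numpy as np

N_r, K_r, T = 8, 2, 1024                # toy sizes
gamma = 0.001                            # bias update speed (hyper-parameter)
rng = np.random.default_rng(0)

b = np.zeros(N_r)                        # per-expert bias, used for routing only
s = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, N_r))))   # affinities s_{i,t}

# Top-K routing on biased scores (Eq. 16); gating still uses the raw s.
routed = np.argsort(s + b, axis=1)[:, -K_r:]

# Expert load = fraction of routed tokens handled by each expert.
load = np.bincount(routed.ravel(), minlength=N_r) / (T * K_r)
target = 1.0 / N_r
b -= gamma * np.sign(load - target)      # overloaded: bias down; underloaded: up

# Sequence-wise balance loss (Eqs. 17-20); Eq. 18 counts raw-affinity top-K.
alpha = 0.0001                           # extremely small balance factor
routed_raw = np.argsort(s, axis=1)[:, -K_r:]
f = N_r / (K_r * T) * np.bincount(routed_raw.ravel(), minlength=N_r)
P = (s / s.sum(axis=1, keepdims=True)).mean(axis=0)
L_bal = alpha * np.sum(f * P)
print(load.round(3), L_bal)
```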
In summary, DeepSeek-V3 keeps a good load balance during its full training. Consequently, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 does not drop tokens during inference either.

Figure 3: Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth.
2.2 Multi-Token Prediction

Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation. Different from Gloeckle et al. (2024), which parallelly predicts D additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth.
MTP Modules. Our MTP implementation uses D sequential modules to predict D additional tokens. For the i-th input token t_i at prediction depth k, the k-th MTP module first combines the representation from the previous depth with the embedding of the (i+k)-th token:

\mathbf{h}'^{k}_i = M_k [\operatorname{RMSNorm}(\mathbf{h}^{k-1}_i); \operatorname{RMSNorm}(\operatorname{Emb}(t_{i+k}))],   (21)

where [\cdot;\cdot] denotes concatenation and M_k is a projection matrix. Especially, when k = 1, \mathbf{h}^{k-1}_i refers to the representation given by the main model. Note that for each MTP module, its embedding layer is shared with the main model. The combined \mathbf{h}'^{k}_i serves as the input of the Transformer block at the k-th depth to produce the output representation at the current depth:

\mathbf{h}^{k}_{1:T-k} = \operatorname{TRM}_k(\mathbf{h}'^{k}_{1:T-k}),   (22)

where T represents the input sequence length and i:j denotes the slicing operation (inclusive of both boundaries). Finally, taking \mathbf{h}^{k}_i as the input, the shared output head computes the probability distribution for the k-th additional prediction token:

P^{k}_{i+k+1} = \operatorname{OutHead}(\mathbf{h}^{k}_i).   (23)

By predicting several future tokens at each position during training, this objective densifies the supervision and helps the model learn richer sequential structure, improving the quality of the pre-trained model.
MTP Training Objective. For each prediction depth, we compute a cross-entropy loss:

\mathcal{L}^{k}_{\text{MTP}} = \operatorname{CrossEntropy}(P^{k}_{2+k:T+1}, t_{2+k:T+1}) = -\frac{1}{T} \sum_{i=2+k}^{T+1} \log P^{k}_i[t_i],   (24)

where T denotes the input sequence length, t_i denotes the ground-truth token at the i-th position, and P^{k}_i[t_i] denotes the prediction probability of t_i given by the k-th MTP module. The depth k indicates how far the prediction target is shifted: the main model predicts the next token, while the k-th MTP module predicts the k-th additional future token. Jointly optimizing the losses at multiple depths exposes the model to richer sequential dependencies.
Finally, we compute the average of the MTP losses across all depths and multiply it by a weighting factor \lambda to obtain the overall MTP loss, which serves as an additional training objective for DeepSeek-V3:

\mathcal{L}_{\text{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}^{k}_{\text{MTP}}.   (25)

MTP in Inference. Our MTP strategy mainly aims to improve the performance of the main model, so during inference we can directly discard the MTP modules and the main model functions independently and normally. Additionally, the MTP modules can be repurposed for speculative decoding to further reduce generation latency.
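A toy sketch of the per-depth loss in Eq. (24) and the aggregation in Eq. (25), using random logits in place of real MTP-module outputs; D, T, λ, and the vocabulary size are illustrative values, not the production configuration:

```python
import numpy as np

D, T, V = 2, 16, 100                    # depths, sequence length, vocab (toy)
lam = 0.3                                # weighting factor lambda (illustrative)
rng = np.random.default_rng(0)
tokens = rng.integers(0, V, size=T + D + 1)    # ground-truth t_1 .. t_{T+D+1}

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

mtp_losses = []
for k in range(1, D + 1):
    # Stand-in for the k-th MTP module's logits at positions 2+k .. T+1.
    logits = rng.standard_normal((T - k, V))
    logp = log_softmax(logits)
    targets = tokens[1 + k : T + 1]      # tokens t_{2+k} .. t_{T+1} (0-indexed)
    nll = -logp[np.arange(T - k), targets].sum() / T    # Eq. (24)
    mtp_losses.append(nll)

L_mtp = lam / D * sum(mtp_losses)        # Eq. (25)
print(L_mtp)
```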
3 Infrastructures
3.1 Compute Clusters

DeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected by NVLink and NVSwitch within the node. Across different nodes, InfiniBand (IB) interconnects are utilized to facilitate communications.
3.2 Training Framework
The training of DeepSeek-V3 is supported by the HAI-LLM framework, an efficient and lightweight training framework crafted by our engineers from the ground up. On the whole, DeepSeek-V3 applies 16-way Pipeline Parallelism (PP) (Qi et al., 2023a), 64-way Expert Parallelism (EP) (Lepikhin et al., 2021) spanning 8 nodes, and ZeRO-1 Data Parallelism (DP) (Rajbhandari et al., 2020). To facilitate efficient training, we implement meticulous engineering optimizations. Notably, cross-node expert parallelism introduces heavy communication overhead, resulting in an inefficient computation-to-communication ratio of approximately 1:1. To tackle this challenge, we design an innovative pipeline parallelism algorithm called DualPipe, which not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces the pipeline bubbles.
The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, as in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component. As illustrated in Figure 4, for a pair of forward and backward chunks, we rearrange these components and manually adjust the ratio of GPU SMs dedicated to communication versus computation, so that both all-to-all and PP communication can be fully hidden during execution.
Even in more general scenarios without a heavy communication burden, DualPipe still exhibits efficiency advantages. Table 2 summarizes the pipeline bubbles and memory usage across different PP methods. Compared with ZB1P (Qi et al., 2023b) and 1F1B (Harlap et al., 2018), DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by \frac{1}{PP} times. Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages.
Efficient Cross-Node All-to-All Communication. We customize efficient cross-node all-to-all communication kernels (including dispatching and combining) in concert with the MoE gating algorithm and the network topology of our cluster. In our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, once its routing decision is made, it is first transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we endeavor to ensure that it is instantaneously forwarded via NVLink to the specific GPUs that host its target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped. In addition, since the communication kernels share SMs with computation, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
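A sketch of node-limited routing consistent with this description: each token may be dispatched to at most M = 4 nodes. The node-ranking rule used here (summing the highest K_r/M expert scores per node) follows the paper's node-limited routing description, but the grouping layout and tie handling are illustrative assumptions.

```python
import numpy as np

N_r, K_r = 256, 8          # routed experts, activated experts per token
nodes, M = 8, 4            # experts spread over 8 nodes; at most 4 nodes/token
per_node = N_r // nodes
rng = np.random.default_rng(0)
s = rng.random(N_r)        # affinity scores of one token for all routed experts

# Rank nodes by the sum of their top (K_r / M) expert scores.
by_node = s.reshape(nodes, per_node)
node_score = np.sort(by_node, axis=1)[:, -(K_r // M):].sum(axis=1)
keep = np.argsort(node_score)[-M:]                # the at-most-4 target nodes

# Mask out experts on other nodes, then pick the global top-K_r.
masked = np.full(N_r, -np.inf)
for n in keep:
    masked[n * per_node:(n + 1) * per_node] = s[n * per_node:(n + 1) * per_node]
experts = np.argsort(masked)[-K_r:]
print(sorted(set(int(e) // per_node for e in experts)))   # subset of `keep`
```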
3.2.3 Extremely Memory Saving with Minimal Overhead

In order to reduce the memory footprint during training, we employ the following techniques.

Recomputation of RMSNorm and MLA Up-Projection. We recompute all RMSNorm operations and MLA up-projections during back-propagation, thereby eliminating the need to persistently store their output activations. With a minor overhead, this strategy significantly reduces the memory requirements for storing activations.

Exponential Moving Average in CPU. During training, we preserve the Exponential Moving Average (EMA) of the model parameters for an early estimate of model performance after learning rate decay. The EMA parameters are stored in CPU memory and updated asynchronously after each training step, so maintaining them introduces no extra GPU memory or time overhead.
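A minimal sketch of the CPU-side EMA bookkeeping (the asynchrony is elided; the decay value and tensor sizes are illustrative assumptions, not values from the source):

```python
import torch

decay = 0.999                                      # illustrative EMA decay
params = {"w": torch.randn(1024, 1024)}            # stand-in for model weights
ema = {k: v.detach().to("cpu", copy=True) for k, v in params.items()}

def update_ema():
    # Called after each optimizer step; in practice the copy/update runs
    # asynchronously so it costs no GPU memory and hides behind training.
    with torch.no_grad():
        for k, v in params.items():
            ema[k].mul_(decay).add_(v.detach().cpu(), alpha=1 - decay)

for step in range(3):                              # toy training loop
    params["w"] += 0.01 * torch.randn_like(params["w"])   # fake weight update
    update_ema()
print(ema["w"].mean())
```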
3.3 FP8 Training
Figure 6: The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated.

Inspired by recent advances in low-precision training (Peng et al., 2023b; Dettmers et al., 2022; Noune et al., 2022), we propose a fine-grained mixed precision framework utilizing the FP8 data format for training DeepSeek-V3. While low-precision training holds great promise, it is often limited by the presence of outliers in activations, weights, and gradients (Sun et al., 2024; He et al.; Fishman et al., 2024). Although significant progress has been made in inference quantization (Xiao et al., 2023; Frantar et al., 2022), there are relatively few studies demonstrating successful application of low-precision techniques in large-scale language model pre-training (Fishman et al., 2024). To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with 1×128 elements or block-wise grouping with 128×128 elements.
3.3.1 Mixed Precision Framework

We propose a mixed precision framework for FP8 training. In this framework, most compute-intensive operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability. The overall framework is illustrated in Figure 6.

Firstly, in order to accelerate model training, the majority of core computation kernels, i.e., GEMM operations, are implemented in FP8 precision. These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32. As depicted in Figure 6, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8. This design theoretically doubles the computational speed compared with the original BF16 method. Additionally, the FP8 Wgrad GEMM allows activations to be stored in FP8 for use in the backward pass, which significantly reduces memory consumption.
Figure 7: (a) Our fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting the MMA results to CUDA Cores for high-precision accumulation at an interval of N_C = 128 elements.

3.3.2 Improved Precision from Quantization and Multiplication

Based on our mixed precision FP8 framework, we introduce several strategies to enhance low-precision training accuracy, focusing on both the quantization method and the multiplication process.

Fine-Grained Quantization. In low-precision training frameworks, overflows and underflows are common challenges due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits. As a standard practice, the input distribution is aligned to the representable range of the FP8 format by scaling the maximum absolute value of the input tensor to the maximum representable value of FP8, which makes quantization highly sensitive to outliers. Instead of quantizing per tensor, we group and scale elements at a finer granularity: 1×128 tiles for activations and 128×128 blocks for weights, which significantly reduces quantization errors and improves training accuracy.
This fine-grained quantization requires per-group scaling factors along the inner dimension of GEMM operations, which is not supported by the standard FP8 GEMM. However, combined with our precise FP32 accumulation strategy, it can be implemented efficiently. Notably, our fine-grained quantization strategy is highly consistent with the idea of microscaling formats (Rouhani et al., 2023b), while the Tensor Cores of NVIDIA next-generation GPUs (Blackwell series) have announced support for microscaling formats with smaller quantization granularity (NVIDIA, 2024a). We hope our design can serve as a reference for future work to keep pace with the latest GPU architectures.
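A numpy simulation of the tile- and block-wise scaling described above. Real FP8 casting depends on hardware dtypes; here quantization is emulated by scaling each group so that its maximum absolute value maps to 448, the largest finite value of FP8 E4M3, and the actual cast is elided:

```python
import numpy as np

FP8_MAX = 448.0   # largest finite value of FP8 E4M3

def quantize_tiles_1x128(act):
    """Per-tile scaling for activations: one scale per 1x128 tile."""
    rows, cols = act.shape
    tiles = act.reshape(rows, cols // 128, 128)
    scale = FP8_MAX / np.abs(tiles).max(axis=2, keepdims=True)
    return tiles * scale, scale            # scaled values would be cast to FP8

def quantize_blocks_128x128(w):
    """Per-block scaling for weights: one scale per 128x128 block."""
    r, c = w.shape
    blocks = w.reshape(r // 128, 128, c // 128, 128)
    scale = FP8_MAX / np.abs(blocks).max(axis=(1, 3), keepdims=True)
    return blocks * scale, scale

rng = np.random.default_rng(0)
act = rng.standard_normal((4, 256))
act[0, 7] = 1000.0                         # a feature outlier
q, s = quantize_tiles_1x128(act)
qw, sw = quantize_blocks_128x128(rng.standard_normal((128, 256)))
# The outlier only shrinks the scale of its own 1x128 tile; every other
# tile keeps its full FP8 dynamic range, unlike per-tensor scaling.
print(s.squeeze(-1).round(2))
```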
Increasing Accumulation Precision. Low-precision GEMM operations often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, commonly performed in FP32 (Kalamkar et al., 2019; Narang et al., 2017). However, we observe that the accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits, which is significantly lower than FP32 accumulation precision. To address this issue, we promote partial results from Tensor Cores to CUDA Cores at a fixed interval, where full-precision FP32 accumulation is performed. As mentioned before, our fine-grained quantization applies per-group scaling factors along the inner dimension K; these scaling factors can be efficiently multiplied on the CUDA Cores as the dequantization process with minimal additional computational cost.

It is worth noting that this modification reduces the WGMMA (Warpgroup-level Matrix Multiply-Accumulate) instruction issue rate for a single warpgroup. However, on the H800 architecture, it is typical for two WGMMA operations to persist concurrently: while one warpgroup performs the promotion operation, the other is able to execute the MMA operation. This design enables overlapping of the two operations and maintains high utilization of Tensor Cores. Based on our experiments, setting N_C = 128 elements, equivalent to 4 WGMMAs, represents the minimal accumulation interval that can significantly improve precision without introducing substantial overhead.
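The effect of interval promotion can be emulated in numpy: partial products are accumulated in a low-precision register (float16 here stands in for the Tensor Core's limited-precision accumulator) and flushed into an FP32 accumulator every N_C = 128 elements. This is a behavioral sketch, not the GPU kernel:

```python
import numpy as np

def dot_promoted(a, b, n_c=128):
    """Dot product with low-precision partials promoted to FP32 every n_c."""
    acc_fp32 = np.float32(0.0)
    for start in range(0, a.size, n_c):
        partial = np.float16(0.0)          # low-precision interval accumulator
        for x, y in zip(a[start:start + n_c], b[start:start + n_c]):
            partial = np.float16(partial + np.float16(x) * np.float16(y))
        acc_fp32 += np.float32(partial)    # promotion to CUDA-core FP32
    return acc_fp32

def dot_fp16(a, b):
    """Fully low-precision accumulation, for comparison."""
    partial = np.float16(0.0)
    for x, y in zip(a, b):
        partial = np.float16(partial + np.float16(x) * np.float16(y))
    return partial

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)
exact = np.dot(a.astype(np.float64), b.astype(np.float64))
print("promoted error:", abs(dot_promoted(a, b) - exact))
print("fp16-only error:", abs(np.float64(dot_fp16(a, b)) - exact))
```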
Online Quantization. Instead of the delayed quantization used in tensor-wise frameworks, which infers the current scale from a history of maximum absolute values, we compute the maximum absolute value online for each 1×128 activation tile or 128×128 weight block. Based on it, we derive the scaling factor and then quantize the activation or weight online into the FP8 format.

3.3.3 Low-Precision Storage and Communication

In conjunction with our FP8 training framework, we further reduce memory consumption and communication overhead by compressing cached activations and optimizer states into lower-precision formats.

Low-Precision Optimizer States. We adopt the BF16 data format instead of FP32 to track the first and second moments in the AdamW (Loshchilov and Hutter, 2017) optimizer, without incurring observable performance degradation. However, the master weights (stored by the optimizer) and gradients (used for batch size accumulation) are still retained in FP32 to ensure numerical stability throughout training.
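A sketch of one AdamW step with BF16 moment storage and FP32 master weights; the update math runs in FP32 and the moments are rounded back to BF16 when stored. Hyper-parameter values are illustrative, and bias correction is omitted for brevity:

```python
import torch

lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.95, 1e-8, 0.1   # illustrative values

master_w = torch.randn(1024, dtype=torch.float32)   # master weights in FP32
m = torch.zeros(1024, dtype=torch.bfloat16)         # first moment in BF16
v = torch.zeros(1024, dtype=torch.bfloat16)         # second moment in BF16

def adamw_step(grad_fp32):
    global master_w, m, v
    # Compute in FP32, then store the moments back in BF16.
    m32 = beta1 * m.float() + (1 - beta1) * grad_fp32
    v32 = beta2 * v.float() + (1 - beta2) * grad_fp32 ** 2
    m, v = m32.bfloat16(), v32.bfloat16()
    # Decoupled weight decay; the master weights stay in FP32.
    master_w = master_w - lr * (m32 / (v32.sqrt() + eps) + wd * master_w)

adamw_step(torch.randn(1024))
print(master_w.dtype, m.dtype, v.dtype)
```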
Low-Precision Activation. As illustrated in Figure 6, the Wgrad operation is performed in FP8. To reduce memory consumption, it is a natural choice to cache activations in FP8 format for the backward pass of the Linear operator, with several operators handled specially to keep the cost of high-precision training low.

Low-Precision Communication. Communication is a critical bottleneck in the training of MoE models. To alleviate this challenge, we quantize the activations before the MoE up-projections into FP8 and then apply the dispatch components, which is compatible with FP8 Fprop in the MoE up-projections. Like the inputs of the Linear after the attention operator, scaling factors for this activation are integral powers of 2. A similar strategy is applied to the activation gradient before the MoE down-projections. For both the forward and backward combine components, we retain them in BF16 to preserve training precision in critical parts of the training pipeline.
3.4 Inference and Deployment
We deploy DeepSeek-V3 on the H800 cluster, where GPUs within each node are interconnected using NVLink, and all GPUs across the cluster are fully interconnected via IB. To simultaneously ensure the Service-Level Objective (SLO) for online services and high throughput, we employ a deployment strategy that separates the prefilling and decoding stages.

3.4.1 Prefilling

The minimum deployment unit of the prefilling stage consists of 4 nodes with 32 GPUs. The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8), while the MoE part uses 32-way Expert Parallelism (EP32).
To achieve load balancing among the experts in the MoE part, we deploy redundant copies of high-load experts, detected from statistics collected during online deployment. For the prefilling stage we set 32 redundant experts: besides the original 8 experts it hosts, each GPU also hosts one additional redundant expert. Furthermore, to improve throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another. Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts) but only 9 are activated during each inference step, with the globally optimal routing scheme computed before the all-to-all operation at each layer begins.
3.4.2 Decoding

Since each GPU hosts only one expert in the decoding deployment, there is no need to rearrange experts. We are also exploring a dynamic redundancy strategy for decoding; however, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme, as well as fusion with the dispatch kernel to reduce overhead. Additionally, to enhance throughput and hide the overhead of all-to-all communication, we are exploring processing two micro-batches with similar computational workloads simultaneously in the decoding stage as well. Unlike prefilling, attention consumes a larger portion of time in decoding, so we overlap the attention of one micro-batch with the dispatch+MoE+combine of another. In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation.
3.5 Suggestions on Hardware Design
Based on our implementation of the all-to-all communication and the FP8 training scheme, we propose the following suggestions on chip design to AI hardware vendors.

3.5.1 Communication Hardware

In DeepSeek-V3, we implement the overlap between computation and communication to hide communication latency during computation. This significantly reduces the dependency on communication bandwidth compared to serial computation and communication. However, the current communication implementation relies on expensive SMs (e.g., we allocate 20 out of the 132 SMs available in the H800 GPU for this purpose), which limits the computational throughput. Specifically, the SMs primarily perform the following tasks for all-to-all communication:
• Forwarding data between the IB (InfiniBand) and NVLink domains while aggregating IB traffic destined for multiple GPUs within the same node from a single GPU.
• Transporting data between RDMA buffers (registered GPU memory regions) and input/output buffers.
• Executing reduce operations for all-to-all combine.
• Managing fine-grained memory layout during chunked data transfers to multiple experts across the IB and NVLink domains.
We aspire to see future vendors developing hardware that offloads these communication tasks from the valuable computation unit SMs, serving as a GPU co-processor or a network co-processor.
3.5.2 FP8 Precision in Tensor Cores

In the current Tensor Core implementation of the NVIDIA Hopper architecture, FP8 GEMM (General Matrix Multiply) employs fixed-point accumulation, aligning the mantissa products by right-shifting based on the maximum exponent before addition. Our experiments reveal that it only uses the highest 14 bits of each mantissa product after sign-fill right shifting, and truncates bits exceeding this range. However, to achieve precise FP32 results from the accumulation of, for example, 32 FP8×FP8 multiplications, at least 34-bit precision is required. Thus, we recommend that future chip designs increase the accumulation precision in Tensor Cores to support full-precision accumulation, or select an appropriate accumulation bit-width according to the accuracy requirements of training and inference algorithms. This approach ensures that errors remain within acceptable bounds while maintaining computational efficiency.
Support for Online Quantization. Current implementations struggle to effectively support online quantization, despite its effectiveness demonstrated in our research. In the existing process, we need to read 128 BF16 activation values (the output of the previous computation) from HBM (High Bandwidth Memory) for quantization, and the quantized FP8 values are then written back to HBM, only to be read again for MMA. To address this inefficiency, we recommend that future chips integrate FP8 cast and TMA (Tensor Memory Accelerator) access into a single fused operation, so quantization can be completed during the transfer of activations from global memory to shared memory, avoiding frequent memory reads and writes. Alternatively, a near-memory computing approach could be adopted, where compute logic is placed near the HBM; in this case, BF16 elements could be cast to FP8 directly as they are read from HBM into the GPU, reducing off-chip memory access by roughly 50%.

Support for Tile- and Block-Wise Quantization. Current GPUs only support per-tensor quantization, lacking native support for fine-grained quantization like our tile- and block-wise method. In the current implementation, when the N_C interval is reached, partial results must be copied from Tensor Cores to CUDA Cores, multiplied by the scaling factors, and added to FP32 registers on CUDA Cores. We recommend that future chips support fine-grained quantization natively, enabling Tensor Cores to receive scaling factors and implement MMA with group scaling.

Support for Transposed GEMM Operations. The current architecture makes it cumbersome to fuse matrix transposition with GEMM operations. In our workflow, activations during the forward pass are quantized into 1×128 FP8 tiles and stored; during the backward pass, the matrix needs to be read out, dequantized, transposed, re-quantized, and written back. This round trip could be avoided if future chips supported reading transposed matrices directly from shared memory before the MMA operation.
4 Pre-Training
4.1 Data Construction

Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese. Our data processing pipeline is also refined to minimize redundancy while maintaining corpus diversity. Inspired by Ding et al. (2024), we implement the document packing method for data integrity, but do not incorporate cross-sample attention masking during training. Finally, the training corpus for DeepSeek-V3 consists of 14.8T high-quality and diverse tokens in our tokenizer.
The tokenizer for DeepSeek-V3 employs byte-level BPE with an extended vocabulary of 128K tokens. Its pretokenizer introduces tokens that combine punctuation and line breaks, which can introduce a token boundary bias when the model processes multi-line prompts without terminal line breaks, particularly in few-shot evaluation settings. To address this issue, we randomly split a certain proportion of such combined tokens during training, which exposes the model to a wider array of special cases and mitigates this bias.

4.2 Hyper-Parameters

Model Hyper-Parameters. We set the number of Transformer layers to 61 and the hidden dimension to 7168. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads n_h to 128 and the per-head dimension d_h to 128. The KV compression dimension d_c is set to 512, and the query compression dimension d_c' to 1536. For the decoupled queries and key, we set the per-head dimension d_h^R to 64. We substitute all FFNs except for the first three layers with MoE layers; each MoE layer consists of 1 shared expert and 256 routed experts, with 8 routed experts activated per token, and the MTP depth D is set to 1. Under this configuration, DeepSeek-V3 comprises 671B total parameters, of which 37B are activated for each token.
Training Hyper-Parameters. As for the learning rate schedule, we first linearly increase it from 0 to 2.2×10⁻⁴ during the first 2K steps, then keep a constant learning rate of 2.2×10⁻⁴ until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to 2.2×10⁻⁵ over 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of 2.2×10⁻⁵ for the first 333B tokens, and switch to another constant learning rate of 7.3×10⁻⁶ for the remaining 167B tokens. The gradient clipping norm is set to 1.0. We also employ a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 during the training of the first 469B tokens, and then kept at 15360 for the remaining training.
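The full schedule can be written as a simple function of consumed tokens. This sketch matches the numbers above, except that the 2K-step linear warmup is approximated by an assumed token threshold:

```python
import math

def lr_schedule(tokens_b, warmup_b=10):
    """Piecewise LR schedule for DeepSeek-V3 pre-training (14.8T tokens).
    tokens_b: tokens consumed so far, in billions. The 2K-step linear
    warmup is approximated here by a warmup_b token threshold."""
    peak, low, final = 2.2e-4, 2.2e-5, 7.3e-6
    if tokens_b < warmup_b:                     # linear warmup (approximate)
        return peak * tokens_b / warmup_b
    if tokens_b < 10_000:                       # constant until 10T tokens
        return peak
    if tokens_b < 14_300:                       # cosine decay over 4.3T tokens
        p = (tokens_b - 10_000) / 4_300
        return low + 0.5 * (peak - low) * (1 + math.cos(math.pi * p))
    if tokens_b < 14_633:                       # constant 2.2e-5 for 333B
        return low
    return final                                # constant 7.3e-6 for last 167B

def batch_size(tokens_b):
    """Batch size ramp: 3072 -> 15360 over the first 469B tokens."""
    if tokens_b >= 469:
        return 15360
    return int(3072 + (15360 - 3072) * tokens_b / 469)

for t in (5, 9_000, 12_000, 14_500, 14_700):
    print(t, f"{lr_schedule(t):.2e}", batch_size(t))
```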
4.3 Long Context Extension

We adopt an approach similar to DeepSeek-V2 (DeepSeek-AI, 2024c) to enable long context capabilities in DeepSeek-V3. After the pre-training stage, we apply YaRN (Peng et al., 2023a) for context extension and perform two additional training phases, each comprising 1000 steps, to progressively expand the context window from 4K to 32K and then to 128K. The YaRN configuration is consistent with that used in DeepSeek-V2, being applied exclusively to the decoupled shared key \mathbf{k}^R_t. The hyper-parameters remain identical across both phases: scale s = 40, α = 1, β = 32, and scaling factor \sqrt{t} = 0.1 \ln s + 1. In the first phase, the sequence length is set to 32K with a batch size of 1920; in the second phase, the sequence length is increased to 128K and the batch size reduced to 480. The learning rate for both phases is 7.3×10⁻⁶, matching the final learning rate of the pre-training stage. Through this two-phase extension training, DeepSeek-V3 becomes capable of handling inputs up to 128K in length while maintaining strong performance.
4.4 Evaluations

4.4.1 Evaluation Benchmarks

The base model of DeepSeek-V3 is pre-trained on a multilingual corpus with English and Chinese constituting the majority, so we evaluate its performance on a series of benchmarks primarily in English and Chinese, as well as on a multilingual benchmark. Our evaluation is based on our internal evaluation framework integrated into HAI-LLM. The considered benchmarks include: multi-subject multiple-choice datasets such as MMLU (Hendrycks et al., 2021) and its variants, C-Eval, and CMMLU; question answering datasets such as TriviaQA and NaturalQuestions (Kwiatkowski et al., 2019); reading comprehension datasets including RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019a), and CMRC (Cui et al., 2019); reference disambiguation datasets including CLUEWSC (Xu et al., 2020) and WinoGrande (Sakaguchi et al., 2019); language modeling datasets including Pile (Gao et al., 2020); Chinese understanding and culture datasets including CCPM (Li et al., 2021); math datasets including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM (Shi et al., 2023), and CMath (Wei et al., 2023); code datasets including HumanEval, MBPP, LiveCodeBench-Base, and CRUXEval; and standardized exams including AGIEval.
We adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, MMLU-Redux, MMLU-Pro, MMMLU, ARC-Easy, ARC-Challenge, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, MGSM, HumanEval, MBPP, LiveCodeBench-Base, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models using different tokenizers.

Table 3 (excerpt): Comparison among DeepSeek-V3-Base and other representative open-source base models; only the rows recoverable from the source are shown.

Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base
ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3
HellaSwag (EM) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9
PIQA (EM) | 0-shot | 83.9 | 82.6 | 85.9 | 84.7
WinoGrande (EM) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9
RACE-Middle (EM) | 5-shot | 73.1 | 68.1 | 74.2 | 67.1
RACE-High (EM) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3
TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9
NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0
AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6
HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2
MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4
LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | 19.4
CRUXEval-I (EM) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3

All models are evaluated in our internal framework and share the same evaluation setting; scores with a gap not exceeding 0.3 are considered to be at the same level.
4.4.2 Evaluation Results

Overall, DeepSeek-V3-Base comprehensively outperforms DeepSeek-V2-Base and Qwen2.5 72B Base, and surpasses LLaMA-3.1 405B Base on the majority of benchmarks, essentially becoming the strongest open-source base model. From a more detailed perspective: (1) Compared with DeepSeek-V2-Base, owing to the improvements in our model architecture, the scale-up of model size and training tokens, and the enhancement of data quality, DeepSeek-V3-Base achieves significantly better performance, as expected. (2) Compared with Qwen2.5 72B Base, DeepSeek-V3-Base also shows better performance on most benchmarks. (3) Compared with LLaMA-3.1 405B Base, the largest open-source model with 11 times the activated parameters, DeepSeek-V3-Base exhibits much better performance on multilingual, code, and math benchmarks; on English and Chinese language benchmarks it shows competitive or better performance, and is especially strong on BBH, the MMLU series, DROP, C-Eval, CMMLU, and CCPM. The gains on math and programming tasks are particularly pronounced. Thanks to our efficient architectures and comprehensive engineering optimizations, DeepSeek-V3 also achieves extremely high training efficiency: under our training framework and infrastructures, training on each trillion tokens requires only 180K H800 GPU hours.
4.5 Discussion

4.5.1 Ablation Studies for Multi-Token Prediction

Table 4 shows the ablation results for the MTP strategy on two MoE model scales. The MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.

Table 4 (excerpt): Ablation results for the MTP strategy; column labels follow the small/large baseline-versus-MTP comparison described in the text.

Benchmark (Metric) | # Shots | Small MoE Baseline | Small MoE w/ MTP | Large MoE Baseline | Large MoE w/ MTP
BBH (EM) | 3-shot | 39.0 | 41.4 | 70.0 | 70.7
MMLU (EM) | 5-shot | 50.0 | 53.3 | 67.5 | 66.6
DROP (F1) | 1-shot | 39.2 | 41.3 | 68.5 | 70.6
TriviaQA (EM) | 5-shot | 56.9 | 57.7 | 67.0 | 67.3
NaturalQuestions (EM) | 5-shot | 22.7 | 22.3 | 27.2 | 28.5
HumanEval (Pass@1) | 0-shot | 20.7 | 26.8 | 44.5 | 53.7
MBPP (Pass@1) | 3-shot | 35.8 | 36.8 | 61.6 | 62.2
GSM8K (EM) | 8-shot | 25.4 | 31.4 | 72.3 | 74.0
MATH (EM) | 4-shot | 10.7 | 12.6 | 38.6 | 39.8
4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy

We validate the auxiliary-loss-free balancing strategy on two model scales. The hyper-parameters controlling the strength of the auxiliary losses in the baselines are the same as in DeepSeek-V2-Lite and DeepSeek-V2, respectively. On top of these two baseline models, keeping the training data and the other architectural choices the same, we remove all auxiliary losses and introduce the auxiliary-loss-free balancing strategy for comparison. From Table 5, we observe that the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.
4.5.3 Batch-Wise Load Balance vs. Sequence-Wise Load Balance

As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns, as expected. To further investigate the correlation between this flexibility and the advantage in model performance, we additionally design and validate a batch-wise auxiliary loss that encourages load balance on each training batch instead of on each sequence. The experimental results show that, when achieving a similar level of batch-wise load balance, the batch-wise auxiliary loss can also achieve model performance similar to the auxiliary-loss-free method.
We also compare the actual expert load with the theoretically balanced expert load. Due to space constraints, we only present the results of two layers as an example, with the results of all layers provided in Appendix C.
4.1 Data Construction
4.1 数据构建
与DeepSeek-V2相比,我们优化了预训练语料库,提高了数学和编程样本的比例,同时扩展了英语和中文之外的多语言覆盖范围。此外,我们的数据处理流水线更加精细化,以最小化数据噪声。
原文: Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese. Also, our data processing pipeline is refined to minimize redundancy while maintaining corpus diversity. Inspired by Ding et al. ( 2024 ) , we implement the document packing method for data integrity but do not incorporate cross-sample attention masking during training. Finally, the training corpus for DeepSeek-V3 consists of 14.8T high-quality and diverse tokens in our tokenizer. In the training process...
4.1 Data Construction
为了解决这个问题,我们在训练期间随机拆分一定比例的此类组合token,使模型接触到更广泛的特殊情况,减轻这种偏差。 DeepSeek团队通过创新的架构设计和训练方法,在该领域取得了显著进展。模型在相关基准测试中表现出色,验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献,推动了技术发展。未来将继续优化和改进相关技术。
原文: To address this issue, we randomly split a certain proportion of such combined tokens during training, which exposes the model to a wider array of special cases and mitigates this bias.
4.2 Hyper-Parameters
4.2 超参数
模型超参数:我们将Transformer层数设为61,隐藏维度设为7168。所有可学习参数以标准差0.006随机初始化。在MLA中,我们将注意力头数n_h设为128。 DeepSeek团队通过创新的架构设计和训练方法,在该领域取得了显著进展。模型在相关基准测试中表现出色,验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献,推动了技术发展。未来将继续优化和改进相关技术。
原文: Model Hyper-Parameters. We set the number of Transformer layers to 61 and the hidden dimension to 7168. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads n h n_{h} to 128 and the per-head dimension d h d_{h} to 128. The KV compression dimension d c d_{c} is set to 512, and the query compression dimension d c ′ d_{c}^{\prime} is set to 1536. For the decoupled queries and key, we set the per-head dimension d h R d_{h}^{R} to 64. We substitute all FFNs except for the first three layers with MoE layers. Each MoE layer...
4.2 Hyper-Parameters
在最后500B token的训练中,我们在前333B token保持恒定学习率2.2e-5,然后在剩余167B token切换到另一个恒定学习率7.3e-6。梯度裁剪阈值设为1.0。 DeepSeek团队通过创新的架构设计和训练方法,在该领域取得了显著进展。模型在相关基准测试中表现出色,验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献,推动了技术发展。未来将继续优化和改进相关技术。
原文: raining of the final 500B tokens, we keep a constant learning rate of 2.2 × 10 − 5 2.2\times 10^{-5} in the first 333B tokens, and switch to another constant learning rate of 7.3 × 10 − 6 7.3\times 10^{-6} in the remaining 167B tokens. The gradient clipping norm is set to 1.0. We employ a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then keeps 15360 in the remaining training. We leverage pipeline parallelism to deploy different layers of a model on different GPUs, and for each layer, the routed expe...
4.3 Long Context Extension
We adopt a similar approach to DeepSeek-V2 (DeepSeek-AI, 2024c) to enable long context capabilities in DeepSeek-V3. After the pre-training stage, we apply YaRN (Peng et al., 2023a) for context extension and perform two additional training phases, each comprising 1000 steps, to progressively expand the context window from 4K to 32K and then to 128K. The YaRN configuration is consistent with that used in DeepSeek-V2, being applied exclusively to the decoupled shared key k_t^R. The hyper-parameters remain identical across both phases, with the scale s = 40, α = 1…
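A compact way to express the stated extension recipe (two 1000-step phases, 4K to 32K to 128K, YaRN scale s = 40; the remaining YaRN hyper-parameters are truncated in the excerpt and therefore omitted here):

```python
# Sketch of the two-phase long-context extension schedule described above.
EXTENSION_PHASES = [
    {"steps": 1000, "context_window": 32_768,  "yarn_scale": 40},  # 4K -> 32K
    {"steps": 1000, "context_window": 131_072, "yarn_scale": 40},  # 32K -> 128K
]
```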
4.4 Evaluations
4.4.1 Evaluation Benchmarks

The base model of DeepSeek-V3 is pretrained on a multilingual corpus with English and Chinese constituting the majority, so we evaluate its performance on a series of benchmarks primarily in English and Chinese, as well as on a multilingual benchmark. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Considered benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese and double-underlined benchmarks are multilingual ones: Multi-subject multiple-choice datasets include MMLU (Hendrycks…
We adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, MMLU-Redux, MMLU-Pro, MMMLU, ARC-Easy, ARC-Challenge, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, MGSM, HumanEval, MBPP, LiveCodeBench-Base, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models using different tokenizers.
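The BPB metric makes models with different tokenizers comparable by normalizing the language-modeling loss by raw byte count rather than token count. A minimal sketch, assuming the evaluation harness supplies a summed negative log-likelihood in nats:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)
```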
(Table 3, closing rows; earlier rows truncated in the source.)

| Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base |
|---|---|---|---|---|---|
| … | …-shot | 77.4 | 76.7 | 79.7 | 78.6 |
| CCPM (EM) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0 |
| MMMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4 |

Table 3: Comparison among DeepSeek-V3-Base and other representative open-source base models. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3-Base achieves the best performance on most benchmarks, especially on math and code tasks.

4.4.2 Evaluation Results

In Table 3, we compare the base model of DeepSeek-V3 with the state-of-the-art open-source base models, including DeepSeek-V2-Base, Qwen2.5 72B Base, and LLaMA-3.1 405B Base.
Key findings: (1) DeepSeek-V3-Base comprehensively outperforms DeepSeek-V2-Base. (2) DeepSeek-V3-Base also shows better performance than Qwen2.5 72B. (3) Compared with LLaMA-3.1 405B Base, the largest open-source model with 11 times the activated parameters, DeepSeek-V3-Base also exhibits much better performance on multilingual, code, and math benchmarks. As for English and Chinese language benchmarks, DeepSeek-V3-Base shows competitive or better performance, and is especially good on BBH, MMLU-series, DROP, C-Eval, CMMLU, and CCPM. Due to our efficient architectures and comprehensive engineering optimizations, DeepSeek-V3 achieves extremely high training efficiency. Under our training f…
4.5 Discussion
4.5.1 Ablation Studies for Multi-Token Prediction

In Table 4, we show the ablation results for the MTP strategy. To be specific, we validate the MTP strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 540B tokens. On top of them, keeping the training data and the other architectures the same, we append a 1-depth MTP module onto them and train two models with the MTP strategy for comparison. Note…
(Ablation results for the auxiliary-loss-free balancing strategy at two scales; the table header is truncated in the source.)

| Benchmark (Metric) | # Shots | Small MoE Aux-Loss-Based | Small MoE Aux-Loss-Free | Large MoE Aux-Loss-Based | Large MoE Aux-Loss-Free |
|---|---|---|---|---|---|
| # Activated Params | - | 2.4B | 2.4B | 20.9B | 20.9B |
| # Total Params | - | 15.7B | 15.7B | 228.7B | 228.7B |
| # Training Tokens | - | 1.33T | 1.33T | 578B | 578B |
| Pile-test (BPB) | - | 0.727 | 0.724 | 0.656 | 0.652 |
| BBH (EM) | 3-shot | 37.3 | 39.3 | 66.7 | 67.9 |
| MMLU (EM) | 5-shot | 51.0 | 51.8 | 68.3 | 67.2 |
| DROP (F1) | 1-shot | 38.1 | 39.0 | 67.1 | 67.1 |
| TriviaQA (EM) | 5-shot | 58.3 | 58.5 | 66.7 | 67.7 |
| NaturalQuestions (EM) | 5-shot | 23.2 | 23.4 | 27.1 | 28.1 |
| HumanEval (Pass@1) | 0-shot | 22.0 | 22.6 | 40.2 | 46.3 |
| MBPP (Pass@1) | 3-shot | 36.6 | 35.8 | 59.2 | 61.2 |
| GSM8K (EM) | 8-shot | 27.1 | 29.6 | 70.7 | 74.5 |
| MATH (EM) | 4-shot | 10.9 | 11.1 | 37.… | … |

At both scales, the auxiliary-loss-free models match or outperform their auxiliary-loss-based counterparts on most of these benchmarks.
…when achieving a similar level of batch-wise load balance, the batch-wise auxiliary loss can also achieve similar model performance to the auxiliary-loss-free method. To be specific, in our experiments with 1B MoE models, the validation losses are: 2.258 (using a sequence-wise auxiliary loss), 2.253 (using the auxiliary-loss-free method), and 2.253 (using a batch-wise auxiliary loss). We also observe similar results on 3B MoE models: the model using a sequence-wise auxiliary loss achieves a validation loss of 2.085, and the models using the auxiliary-loss-free method or a batch-wise auxiliary loss achieve the same validation loss of 2.080. In addition, although the batch-wise load bal…
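The distinction between the two granularities can be made precise with a small sketch: a sequence-wise penalty constrains every sequence's expert load individually, while a batch-wise penalty constrains only the batch-averaged load. The absolute-deviation form of the imbalance measure below is our illustrative choice, not the paper's exact auxiliary loss:

```python
import torch

def imbalance(load: torch.Tensor) -> torch.Tensor:
    """Total deviation of expert load from the uniform share, over the last dim."""
    return (load - 1.0 / load.shape[-1]).abs().sum(dim=-1)

def sequence_wise_penalty(load_per_seq: torch.Tensor) -> torch.Tensor:
    # load_per_seq: [batch, n_experts]; every sequence must balance on its own.
    return imbalance(load_per_seq).mean()

def batch_wise_penalty(load_per_seq: torch.Tensor) -> torch.Tensor:
    # Only the batch-averaged load must balance; individual sequences are free
    # to specialize, which is the flexibility discussed above.
    return imbalance(load_per_seq.mean(dim=0))
```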
5 Post-Training
5.1 Supervised Fine-Tuning

We curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains, with each domain employing distinct data creation methods tailored to its specific requirements. Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-…
…data for the final model, where the expert models are used as data generation sources. This method ensures that the final training data retains the strengths of DeepSeek-R1 while producing responses that are concise and effective. Non-Reasoning Data. For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data. SFT Settings. We fine-tune DeepSeek-V3-Base for two epochs using the SFT dataset, using the cosine decay learning rate scheduling that starts at 5×10^{-6}…
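A sketch of such a cosine-decay schedule. The starting value 5e-6 is from the text, while the final value is truncated in the excerpt, so the floor below is a placeholder assumption rather than the paper's setting:

```python
import math

def sft_lr(step: int, total_steps: int, peak: float = 5e-6, floor: float = 0.0) -> float:
    """Cosine decay from `peak` to `floor` over `total_steps` (floor is assumed)."""
    progress = step / max(total_steps, 1)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```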
The reward model is trained from the DeepSeek-V3 SFT checkpoints. To enhance its reliability, we construct preference data that not only provides the final reward but also includes the chain-of-thought leading to the reward. This approach helps mitigate the risk of reward hacking in specific tasks.

5.2.2 Group Relative Policy Optimization

Similar to DeepSeek-V2 (DeepSeek-AI, 2024c), we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically with the same size as the policy model, and estimates the baseline from group scores ins…
…where A_i is the advantage, derived from the rewards \{r_1, r_2, \ldots, r_G\} corresponding to the outputs within each group:

A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \cdots, r_G\})}{\operatorname{std}(\{r_1, r_2, \cdots, r_G\})}. \qquad (28)

We incorporate prompts from diverse domains, such as coding, math, writing, role-playing, and question answering, during the RL process. This approach not only aligns the model more closely with human preferences but also enhances performance on b…
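A minimal sketch of this group-relative advantage computation (Equation 28); the epsilon guard against a zero standard deviation is our addition, not part of the paper's formula:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Standardize each reward against its own group's mean and std (Eq. 28)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)      # population std over the group
    return [(r - mean) / (std + 1e-8) for r in rewards]  # epsilon is our guard

# e.g. group_advantages([1.0, 0.0, 0.0, 1.0]) -> approximately [1.0, -1.0, -1.0, 1.0]
```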
…framework (https://github.com/openai/simple-evals). We utilize the Zero-Eval prompt format (Lin, 2024) for MMLU-Redux in a zero-shot setting. For other datasets, we follow their original evaluation protocols with default prompts as provided by the dataset creators. For code and math benchmarks, the HumanEval-Mul dataset includes 8 mainstream programming languages (Python, Java, Cpp, C#, JavaScript, TypeScript, PHP, and Bash) in total. We use CoT and non-CoT methods to evaluate model performance on LiveCodeBench, where the data are collected from August 2024 to November 2024. The Codeforces…
(Table 6, closing rows, spanning the coding, math, and Chinese benchmark groups; the header rows appear below in Section 5.3.1. Columns: DeepSeek-V2-0506, DeepSeek-V2.5-0905, Qwen2.5 72B-Inst., LLaMA-3.1 405B-Inst., Claude-3.5-Sonnet-1022, GPT-4o-0513, DeepSeek-V3.)

| Benchmark (Metric) | V2-0506 | V2.5-0905 | Qwen2.5 72B | LLaMA-3.1 405B | Claude-3.5-Sonnet | GPT-4o | DeepSeek-V3 |
|---|---|---|---|---|---|---|---|
| Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | 51.6 |
| SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | 50.8 | 38.8 | 42.0 |
| Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | 84.2 | 72.9 | 79.7 |
| Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | 49.6 |
| AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2 |
| MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2 |
| CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | 43.2 |
| CLUEWSC (EM) | 89.9 | 90.4 | 91.4 | 84.7 | 85.4 | 87.9 | 90.9 |
| C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | 86.5 |
| C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | 64.8 |

Table 6: Comparison between DeepSeek-V3 and other representative chat models. …
In long-context understanding benchmarks such as DROP, LongBench v2, and FRAMES, DeepSeek-V3 continues to demonstrate its position as a top-tier model. It achieves an impressive 91.6 F1 score in the 3-shot setting on DROP, outperforming all other models in this category. On FRAMES, a benchmark requiring question-answering over 100k token contexts, DeepSeek-V3 closely trails GPT-4o while outperforming all other models by a significant margin. This demonstrates the strong capability of DeepSeek-V3 in handling extremely long-context tasks. The long-context capability of DeepSeek-V3 is further…
On coding benchmarks, DeepSeek-V3 demonstrates superior performance, outperforming all baselines on benchmarks like HumanEval-Mul and LiveCodeBench. This success can be attributed to its advanced knowledge distillation technique, which effectively enhances its code generation and problem-solving capabilities in algorithm-focused tasks. On math benchmarks, DeepSeek-V3 demonstrates exceptional performance, significantly surpassing baselines and setting a new state-of-the-art for non-o1-like models. Specifically, on AIME, MATH-500, and CNMO 2024, DeepSeek-V3 outperforms the second-best model, Qwen2.5 72B, by approximately 10% in absolute scores, wh…
The results on AlpacaEval 2.0 and Arena-Hard are shown in Table 7. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024a), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. On Arena-Hard, DeepSeek-V3 achieves an impressive win rate of over 86% against the baseline GPT-4-0314, performing on par with top-tier models like Claude-Sonnet-3.5-1022. This underscores the robust capabilities of DeepSeek-V3, especially in dealing with complex prompts, including coding and debugging tasks. Furthermore, DeepSeek-V3 achieves a groundbreaking milestone as the first open-…
(Table 8; the category labels are truncated in the source and reconstructed here from RewardBench's standard categories; the last column is the average.)

| Model | Chat | Chat-Hard | Safety | Reasoning | Average |
|---|---|---|---|---|---|
| GPT-4o-0513 | 96.6 | 70.4 | 86.7 | 84.9 | 84.7 |
| GPT-4o-0806 | 96.1 | 76.1 | 88.1 | 86.6 | 86.7 |
| GPT-4o-1120 | 95.8 | 71.3 | 86.2 | 85.2 | 84.6 |
| Claude-3.5-sonnet-0620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2 |
| Claude-3.5-sonnet-1022 | 96.4 | 79.7 | 91.1 | 87.6 | 88.7 |
| DeepSeek-V3 | 96.9 | 79.8 | 87.0 | 84.3 | 87.0 |
| DeepSeek-V3 (maj@6) | 96.9 | 82.6 | 89.5 | 89.2 | 89.6 |

Table 8: Performances of GPT-4o, Claude-3.5-sonnet and DeepSeek-V3 on RewardBench. DeepSeek-V3 achieves an average of 87.0, on par with GPT-4o-0806 (86.7) and Claude-3.5-sonnet-1022 (88.7), and voting over six samples (maj@6) lifts it to 89.6.
5.4.2 Self-Rewarding

Rewards play a pivotal role in RL, steering the optimization process. In domains where verification through external tools is straightforward, such as some coding or mathematics scenarios, RL demonstrates exceptional efficacy. However, in more general scenarios, constructing a feedback mechanism through hard coding is impractical. During the development of DeepSeek-V3, for these broader contexts, we employ the constitutional AI approach (Bai et al., 2022), leveraging the voting evaluation results of DeepSeek-V3 itself as a feedback source. This method has produced notable alignment effects, significantly enhancing the performance of DeepSeek-V3 in subjective evaluations. By integrating additional constitutional inputs, DeepSeek-V3 can optimize towards the constitutional direction. We believe that this paradigm, which combines supplementary information with LLMs as a feedback source, is of paramount importance. The LLM serves as a versatile processor capable of…
5.2 Reinforcement Learning
5.2.1 Reward Model

We employ a rule-based Reward Model (RM) and a model-based RM in our RL process. Rule-Based RM. For questions that can be validated using specific rules, we adopt a rule-based reward system to determine the feedback. For instance, certain math problems have deterministic results, and we require the model to provide the final answer within a designated format (e.g., in a box), allowing us to apply rules to verify the correctness. Similarly, for LeetCode problems, we can utilize a compiler to generate feedback based on test cases. By leveraging rule-based validation wherever p…
We optimize the policy model by maximizing the following objective:

\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{[q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)]}\; \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\; \operatorname{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\left( \pi_{\theta} \,\|\, \pi_{ref} \right) \right),

where \varepsilon is the clipping hyper-parameter, \beta controls the strength of the KL penalty against the reference policy \pi_{ref}, \pi_{\theta}(o_i|q)/\pi_{\theta_{old}}(o_i|q) is the probability ratio between the current and old policies, and A_i is the group-relative advantage from Equation (28).
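To make the objective concrete, here is a minimal, illustrative GRPO loss in PyTorch. The clipping range eps and KL weight beta are placeholder values (the excerpt does not state them), and per-output log-probabilities and the KL term are assumed precomputed:

```python
import torch

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, kl: torch.Tensor,
              eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    """Negated GRPO objective for one group of G outputs (all inputs shape [G])."""
    ratio = torch.exp(logp_new - logp_old)               # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    objective = torch.minimum(unclipped, clipped) - beta * kl
    return -objective.mean()   # maximize J_GRPO == minimize its negation
```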
5.3 Evaluations
5.3.1 Evaluation Settings

Evaluation Benchmarks. Apart from the benchmark we used for base model testing, we further evaluate instructed models on IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), LongBench v2 (Bai et al., 2024), GPQA (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider (https://aider.chat), LiveCodeBench (Jain et al., 2024) (questions from August 2024 to November 2024), Codeforces (https://codeforces.com), Chinese National High School Mathematics Olympiad (CNMO 2024)…
…using the agentless framework (Xia et al., 2024). We use the "diff" format to evaluate the Aider-related benchmarks. For mathematical assessments, AIME and CNMO 2024 are evaluated with a temperature of 0.7, and the results are averaged over 16 runs, while MATH-500 employs greedy decoding. We allow all models to output a maximum of 8192 tokens for each benchmark.

(Table 6, header rows; the body is truncated here, and the closing rows appear earlier in this section.)

| Benchmark (Metric) | DeepSeek-V2-0506 | DeepSeek-V2.5-0905 | Qwen2.5 72B-Inst. | LLaMA-3.1 405B-Inst. | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 |
|---|---|---|---|---|---|---|---|
| Architecture | MoE | MoE | Dense | Dense | - | - | MoE |
| # Activated Params | 21B | 21B | 72B | 405B | - | - | 37B |
| # Total… | | | | | | | |
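The stated math-evaluation protocol (temperature 0.7, Pass@1 averaged over 16 runs) amounts to the following sketch, where generate and is_correct are hypothetical stand-ins for the sampling and grading steps:

```python
def pass_at_1(problem, generate, is_correct, runs: int = 16, temperature: float = 0.7) -> float:
    """Average Pass@1 over independent sampled runs, as described for AIME/CNMO."""
    hits = sum(is_correct(problem, generate(problem, temperature=temperature))
               for _ in range(runs))
    return hits / runs
```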
Datasets containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results.

5.3.2 Standard Evaluation

Table 6 presents the evaluation results, showcasing that DeepSeek-V3 stands as the best-performing open-source model. Additionally, it is competitive against frontier closed-source models like GPT-4o and Claude-3.5-Sonnet. English Benchmarks. MMLU is a widely recognized benchmark designed to assess th…
…just a few weeks before the launch of DeepSeek V3. On the factual knowledge benchmark, SimpleQA, DeepSeek-V3 falls behind GPT-4o and Claude-Sonnet, primarily due to its design focus and resource allocation. DeepSeek-V3 assigns more training tokens to learn Chinese knowledge, leading to exceptional performance on the C-SimpleQA. On the instruction-following benchmark, DeepSeek-V3 significantly outperforms its predecessor, DeepSeek-V2-series, highlighting its improved ability to understand and adhere to user-defined format constraints. Code and Math Benchmarks. Coding is a challenging and practica…
This highlights the effectiveness of the distillation technique from DeepSeek-R1, which has been proven highly beneficial for non-o1-like models. Chinese Benchmarks. Qwen and DeepSeek are two representative model series with robust support for both Chinese and English. On the factual benchmark Chinese SimpleQA, DeepSeek-V3 surpasses Qwen2.5-72B by 16.4 points, despite Qwen2.5 being trained on a larger corpus comprising 18T tokens, which are 20% more than the 14.8T tokens that DeepSeek-V3 is pre-trained on. On C-Eval, a representative benchmark for Chinese educational knowledge evaluation, and CLUEW…
…bridges the performance gap between open-source and closed-source models, setting a new standard for what open-source models can accomplish in challenging domains. Similarly, DeepSeek-V3 showcases exceptional performance on AlpacaEval 2.0, outperforming both closed-source and open-source models. This demonstrates its outstanding proficiency in writing tasks and handling straightforward question-answering scenarios. Notably, it surpasses DeepSeek-V2.5-0905 by a significant margin of 20%, highlighting substantial improvements in tackling simple tasks and showcasing the effectiveness of its advance…
5.4 Discussion
5.4.1 Distillation from DeepSeek-R1

We ablate the contribution of distillation from DeepSeek-R1 based on DeepSeek-V2.5. The baseline is trained on short CoT data, whereas its competitor uses data generated by the expert checkpoints described above. Table 9 demonstrates the effectiveness of the distillation data, showing significant improvements in both LiveCodeBench and MATH-500 benchmarks. Our experiments reveal an interesting trade-off: the distillation leads to better performance but also substantially increases the average response length. To maintain a balance between model accuracy and c…
6 Conclusion, Limitations, and Future Directions
In this paper, we introduce DeepSeek-V3, a large MoE language model with 671B total parameters and 37B activated parameters, trained on 14.8T tokens. In addition to the MLA and DeepSeekMoE architectures, it also pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. The training of DeepSeek-V3 is cost-effective due to the support of FP8 training and meticulous engineering optimizations. The post-training also makes a success in distilling the reasoning capability from the DeepSeek-R1 series of models. Comprehen…
• …further improve both the training and inference efficiency, striving to approach efficient support for infinite context length. Additionally, we will try to break through the architectural limitations of Transformer, thereby pushing the boundaries of its modeling capabilities.
• We will continuously iterate on the quantity and quality of our training data, and explore the incorporation of additional training signal sources, aiming to drive data scaling across a more comprehensive range of dimensions.
• We will consistently explore and iterate on the deep thinking capabilities of our models, aiming…
Appendix A Contributions and Acknowledgments
Research & Engineering
Aixin Liu Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang* Chong Ruan Damai Dai Daya Guo Dejian Yang Deli Chen Erhang Li Fangyun Lin Fucong Dai Fuli Luo* Guangbo Hao Guanting Chen Guowei Li H. Zhang Han Bao* Hanwei Xu Haocheng Wang* Haowei Zhang Honghui Ding Huajian Xin* Huazuo Gao Hui Qu Jianzhong Guo Jiashi Li Jiawei Wang* Jingchang Chen Jingyang Yuan Junjie Qiu Junlong Li Junxiao Song Kai Dong Kai Hu* Kaige Gao Kang Guan Kexin Huang Kuai Yu Lean Wang Lecong Zhang Liang Zhao Litong Wang Liyue Zhang Mingchuan Zhang Minghua Zhang Ming…
…Ma Zhen Huang Zhipeng Xu Zhongyu Zhang
Business & Compliance
Dongjie Ji Jian Liang Jin Chen Leyi Xia Miaojun Wang Mingming Li Peng Zhang Shaoqing Wu Shengfeng Ye T. Wang W.L. Xiao Wei An Xianzu Wang Xinxia Shan Ying Tang Yukun Zha Yuting Yan Zhen Zhang
Within each role, authors are listed alphabetically by the first name. Names marked with * denote individuals who have departed from our team.
Appendix B Ablation Studies for Low-Precision Training
Figure 10: Loss curves comparison between BF16 and FP8 training. Results are smoothed by Exponential Moving Average (EMA) with a coefficient of 0.9.
B.1 FP8 v.s. BF16 Training
We validate our FP8 mixed precision framework with a comparison to BF16 training on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising approximately 16B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising approximately 230B total parameters on around 0.9T tokens. We show the training curves in Figure 10 and demonstrate that the relative error remains below 0.25% with our high-precision accumulation and fine-grained quantization strategies.
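The comparison behind Figure 10 can be reproduced in outline: smooth both loss curves with an EMA (coefficient 0.9, per the caption) and measure the relative deviation of FP8 from the BF16 reference. Reporting the maximum deviation is our choice of summary statistic, not necessarily the paper's:

```python
def ema(values, coeff: float = 0.9):
    """Exponential moving average with the stated smoothing coefficient."""
    out, acc = [], None
    for v in values:
        acc = v if acc is None else coeff * acc + (1 - coeff) * v
        out.append(acc)
    return out

def max_relative_error(fp8_loss, bf16_loss) -> float:
    """Largest relative gap between the smoothed FP8 and BF16 loss curves."""
    return max(abs(a - b) / b for a, b in zip(ema(fp8_loss), ema(bf16_loss)))
```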
B.2 Discussion About Block-Wise Quantization
Although our tile-wise fine-grained quantization effectively mitigates the error introduced by feature outliers, it requires different groupings for activation quantization, i.e., 1x128 in the forward pass and 128x1 for the backward pass. A similar process is also required for the activation gradient. A straightforward strategy is to apply block-wise quantization per 128x128 elements like the way we quantize the model weights. In this way, only transposition is required for the backward pass. Therefore, we conduct an experiment where all tensors associated with Dgrad are quantized on a block-wise basis. The re…
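To illustrate the groupings under discussion, the sketch below computes per-group absolute-maximum scales for 1x128 activation tiles and 128x128 weight blocks; the actual FP8 casting and the paper's exact scaling conventions are omitted:

```python
import torch

def tile_scales_1x128(x: torch.Tensor) -> torch.Tensor:
    """x: [rows, cols], cols divisible by 128 -> one scale per 1x128 tile."""
    tiles = x.reshape(x.shape[0], -1, 128)
    return tiles.abs().amax(dim=-1)            # shape [rows, cols // 128]

def block_scales_128x128(w: torch.Tensor) -> torch.Tensor:
    """w: [rows, cols], both divisible by 128 -> one scale per 128x128 block."""
    blocks = w.reshape(w.shape[0] // 128, 128, w.shape[1] // 128, 128)
    return blocks.abs().amax(dim=(1, 3))       # shape [rows//128, cols//128]
```

With block-wise grouping, the same scales serve the backward pass after a simple transposition, which is the convenience the experiment above sets out to test.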
Appendix C Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models
We record the expert load of the 16B auxiliary-loss-based baseline and the auxiliary-loss-free model on the Pile test set. The auxiliary-loss-free model tends to have greater expert specialization across all layers, as demonstrated in Figure 11.

Figure 11 ((a) Layers 1-7, (b) Layers 7-13, (c) Layers 13-19, (d) Layers 19-25, (e) Layers 25-27): Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set. The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio b…