DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Original: DeepSeek-AI research@deepseek.com Abstract We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost throu...

Original: in inference throughput. These constraints present significant challenges that impede the widespread adoption and utilization of LLMs. In order to tackle this problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model, characterized by economical training and efficient inference through an innovative Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. We optimize the attention modules and Feed-Forward Networks (FFNs) within the Transformer framewor...

Original: ion overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 features strong performance (Figure 1), economical training costs, and efficient inference throughput (Figure 1), simultaneously. Figure 2: Illustration of the architecture of DeepSeek-V2. MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture. We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek ...

Original: 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), 8.97 overall score on MT-Bench (Zheng et al., 2023), and 7.91 overall score on AlignBench (Liu et al., 2023). The English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-tier performance among open-source chat models. In addition, the evaluation on AlignBench indicates that in Chinese, DeepSeek-V2 Chat (RL) outperforms all of open-source models, and even beats most of closed-source models. In order to facilitate further research and development on MLA and DeepSeekMoE, we also relea...

Original: ent inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA and DeepSeekMoE in this section. For other tiny details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B (DeepSeek-AI, 2024). 2.1 Multi-Head Latent Attention: Boosting Inference Efficiency Conventional Tran...


1 Introduction

In the past few years, Large Language Models (LLMs) have developed rapidly, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as its number of parameters increases.

Original: In the past few years, Large Language Models (LLMs) (OpenAI, 2022, 2023; Anthropic, 2023; Google, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number of parameters increases, allowing it to exhibit emergent capabilities across various tasks (Wei et al., 2022). However, the improvement comes at the cost of larger computing resources for training and a potential decrease in inference throughput. These constraints present significant challenges that impede the...

This trend has pushed model sizes ever larger. However, large-scale models also incur substantial compute costs and memory requirements.

Original: Networks (FFNs), we follow the DeepSeekMoE architecture (Dai et al., 2024), which adopts fine-grained expert segmentation and shared expert isolation for higher potential in expert specialization. The DeepSeekMoE architecture demonstrates great advantages compared with conventional MoE architectures like GShard (Lepikhin et al., 2021), enabling us to train strong models at an economical cost. As we employ expert parallelism during training, we also devise supplementary mechanisms to control communication overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 feat...

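This limits the broad deployment of LLMs in real-world applications. To reduce compute cost while preserving performance, we propose a new architectural design. Through the innovative Multi-head Latent Attention (MLA) mechanism and the DeepSeekMoE Mixture-of-Experts architecture, DeepSeek-V2 significantly reduces training cost and inference latency while maintaining strong performance. This design makes the practical deployment of large language models more feasible and efficient, and contributes to the open-source AI community. Our experimental results show that DeepSeek-V2 performs well across multiple benchmarks, which validates the effectiveness of the architecture.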

Original: omes the strongest open-source MoE language model. Figure 1 highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. In addition, as shown in Figure 1, compared with DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), 8.97 overall s...


Original: and outline our future work (Section 5).

2 Architecture

Figure 1 highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. Compared with DeepSeek 67B, it saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.

Original: By and large, DeepSeek-V2 is still in the Transformer architecture (Vaswani et al., 2017), where each Transformer block consists of an attention module and a Feed-Forward Network (FFN). However, for both the attention module and the FFN, we design and employ innovative architectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to eliminate the bottleneck of inference-time key-value cache, thus supporting efficient inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training str...

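2 Method

2.1 Multi-Head Latent Attention (MLA)

MLA is one of the core innovations of DeepSeek-V2. Conventional Multi-Head Attention (MHA) must store a large Key-Value (KV) cache during inference, which becomes a memory bottleneck that limits the maximum batch size and sequence length. MLA compresses the KV cache into a latent vector, which significantly reduces memory usage.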

Original: standard MHA mechanism as background. Let $d$ be the embedding dimension, $n_{h}$ be the number of attention heads, $d_{h}$ be the dimension per head, and $\mathbf{h}_{t}\in\mathbb{R}^{d}$ be the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_{t},\mathbf{k}_{t},\mathbf{v}_{t}\in\mathbb{R}^{d_{h}n_{h}}$ through three matrices $W^{Q},W^{K},W^{V}\in\mathbb{R}^{d_{h}n_{h}\times d}$ ...

MLA works by mapping keys and values into a low-dimensional latent space and caching only the compressed latent vector during inference. Without sacrificing performance, this reduces the KV cache by 93.3% compared with DeepSeek 67B.
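To make the cache comparison concrete, the following minimal sketch counts KV-cache elements per token using the formulas quoted in the report: $2n_{h}d_{h}l$ for MHA, $2n_{g}d_{h}l$ for GQA, $2d_{h}l$ for MQA, and $(d_{c}+d_{h}^{R})l$ for MLA with $d_{c}=4d_{h}$ and $d_{h}^{R}=d_{h}/2$. The concrete values of `n_h`, `d_h`, `n_layers`, and `n_g` are illustrative assumptions chosen for this example, not settings taken from the text above.

```python
from fractions import Fraction


def kv_cache_per_token(n_h, d_h, n_layers, n_g):
    """Return KV-cache element counts per token for MHA, GQA, MQA and MLA."""
    d_c = 4 * d_h                # KV compression dimension (report: d_c = 4 * d_h)
    d_h_rope = Fraction(d_h, 2)  # decoupled RoPE key dimension (report: d_h / 2)
    return {
        "MHA": 2 * n_h * d_h * n_layers,
        "GQA": 2 * n_g * d_h * n_layers,
        "MQA": 2 * d_h * n_layers,
        "MLA": (d_c + d_h_rope) * n_layers,
    }


# Illustrative (assumed) sizes: 128 heads of dimension 128, 60 layers, 8 GQA groups.
n_h, d_h, n_layers, n_g = 128, 128, 60, 8
counts = kv_cache_per_token(n_h, d_h, n_layers, n_g)

# Number of GQA groups whose cache would match MLA: (d_c + d_h^R) / (2 * d_h).
equivalent_gqa_groups = counts["MLA"] / (2 * d_h * n_layers)

for name, value in counts.items():
    print(f"{name}: {int(value)} elements per token")
print(f"MLA matches GQA with {float(equivalent_gqa_groups)} groups")
```

The equivalent group count depends only on the ratios $d_{c}/d_{h}$ and $d_{h}^{R}/d_{h}$, so it is the same for any head dimension or layer count.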

Original: ...$=\sum_{j=1}^{t}\operatorname{Softmax}_{j}\left(\frac{\mathbf{q}_{t,i}^{T}\mathbf{k}_{j,i}}{\sqrt{d_{h}}}\right)\mathbf{v}_{j,i}$, (7) $\mathbf{u}_{t}=W^{O}[\mathbf{o}_{t,1};\mathbf{o}_{t,2};\dots;\mathbf{o}_{t,n_{h}}]$, (8) where $\mathbf{q}_{t,i},\mathbf{k}_{t,i},\mathbf{v}_{t,i}\in\mathbb{R}^{d_{h}}$ ...

Specifically, MLA jointly compresses keys and values into a single latent vector $\mathbf{c}_{t}^{KV}$ with a down-projection matrix $W^{DKV}$, and recovers them with the up-projection matrices $W^{UK}$ and $W^{UV}$. During inference, $W^{UK}$ can be absorbed into $W^{Q}$ and $W^{UV}$ into $W^{O}$, so keys and values never need to be computed explicitly for attention.
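The absorption of the key up-projection can be checked numerically. The sketch below is a single-head toy example without RoPE, using small made-up dimensions and random matrices (all sizes and helper names are assumptions for illustration). It shows that the attention score computed from an explicitly reconstructed key equals the score computed directly against the cached latent vector once $W^{UK}$ is folded into the query side, because $\mathbf{q}^{T}(W^{UK}\mathbf{c}) = ((W^{UK})^{T}\mathbf{q})^{T}\mathbf{c}$.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, d_c, d_k = 16, 4, 8                 # hypothetical sizes, with d_c much smaller than d

W_DKV = random_matrix(d_c, d, rng)     # down-projection: hidden state -> latent
W_UK = random_matrix(d_k, d_c, rng)    # up-projection: latent -> key
h = [rng.gauss(0.0, 1.0) for _ in range(d)]     # hidden state of a cached token
q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]   # query of the current token

c_kv = matvec(W_DKV, h)                # the only vector that needs to be cached

# Path 1: reconstruct the key explicitly, then take the dot product.
score_explicit = dot(q, matvec(W_UK, c_kv))

# Path 2: fold W_UK into the query side and score against the latent directly.
q_absorbed = matvec(transpose(W_UK), q)
score_absorbed = dot(q_absorbed, c_kv)

print(abs(score_explicit - score_absorbed) < 1e-9)
```

The same argument applies to $W^{UV}$ and $W^{O}$ on the value side. It stops working once a position-dependent RoPE matrix sits between the two projections, which is the motivation the report gives for its decoupled RoPE strategy.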

Original: {KV}, (10) $\mathbf{v}_{t}^{C}=W^{UV}\mathbf{c}_{t}^{KV}$, (11) where $\mathbf{c}_{t}^{KV}\in\mathbb{R}^{d_{c}}$ is the compressed latent vector for keys and values; $d_{c}(\ll d_{h}n_{h})$ denotes the KV compression dimension; $W^{DKV}\in\mathbb{R}^{d_{c}\times d}$ ...

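2.2 DeepSeekMoE

DeepSeekMoE is an efficient Mixture-of-Experts architecture that dynamically routes each input token to the most suitable experts. This sparse activation lets the model carry a very large total parameter count (236B) while activating only a small fraction of it (21B) for each token.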

Original: Q}, (13) where $\mathbf{c}_{t}^{Q}\in\mathbb{R}^{d_{c}^{\prime}}$ is the compressed latent vector for queries; $d_{c}^{\prime}(\ll d_{h}n_{h})$ denotes the query compression dimension; and $W^{DQ}\in\mathbb{R}^{d_{c}^{\prime}\times d},W^{UQ}\in\mathbb{R}^{d_{h}n_{h}\times d_{c}^{\prime}}$ ...

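DeepSeekMoE has two key ideas: segmenting experts into finer granularity for higher expert specialization, and isolating some shared experts to mitigate knowledge redundancy among routed experts. To keep the load balanced so that every expert is fully trained and utilized, training additionally uses three auxiliary losses: expert-level balance, device-level balance, and communication balance.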
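As a rough illustration of how such a balance term behaves, the sketch below computes an expert-level balance loss from the two quantities the report defines: $f_{i}=\frac{N_{r}}{K_{r}T}\sum_{t}\mathbb{1}(\text{Token } t \text{ selects Expert } i)$ and $P_{i}=\frac{1}{T}\sum_{t}s_{i,t}$. It assumes the expert-level loss has the same sum-of-products form, $\alpha_{1}\sum_{i}f_{i}P_{i}$, as the device-level loss quoted in the report. The function name and the toy inputs are hypothetical.

```python
def expert_balance_loss(affinity, selected, k_r, alpha_1):
    """Expert-level balance loss, assumed to be alpha_1 * sum_i f_i * P_i.

    affinity[t][i] is the token-to-expert affinity s_{i,t}.
    selected[t] is the set of routed-expert indices chosen for token t.
    """
    num_tokens = len(affinity)
    n_r = len(affinity[0])
    loss = 0.0
    for i in range(n_r):
        hits = sum(1 for t in range(num_tokens) if i in selected[t])
        f_i = n_r / (k_r * num_tokens) * hits                         # selection frequency
        p_i = sum(affinity[t][i] for t in range(num_tokens)) / num_tokens  # mean affinity
        loss += f_i * p_i
    return alpha_1 * loss


# Toy case with 4 tokens, 4 routed experts, and top-1 routing.
uniform = [[0.25] * 4 for _ in range(4)]
balanced = expert_balance_loss(uniform, [{0}, {1}, {2}, {3}], k_r=1, alpha_1=0.003)

skewed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
collapsed = expert_balance_loss(skewed, [{0}] * 4, k_r=1, alpha_1=0.003)

print(balanced, collapsed)
```

When every token is routed to the same expert, the loss is larger than in the evenly spread case, so minimizing it pushes the router toward a balanced load.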

Original: ...$\mathbf{k}_{t}^{R}\in\mathbb{R}^{d_{h}^{R}}$ to carry RoPE, where $d_{h}^{R}$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation: $[\mathbf{q}_{t,1}^{R};\mathbf{q}_{t,2}^{R};\dots;\mathbf{q}_{t,n_{h}}^{R}]=\mathbf{q}_{t}^{R}$ ...

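Each DeepSeekMoE layer consists of $N_{s}$ shared experts and $N_{r}$ routed experts, each of which is a Feed-Forward Network (FFN). For each input token, the router selects the top-$K_{r}$ routed experts by token-to-expert affinity and adds their outputs, weighted by the gate values, to the outputs of the shared experts. In DeepSeek-V2, each MoE layer has 2 shared experts and 160 routed experts, of which 6 are activated for each token.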
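The routing rule can be sketched in a few lines. The example below follows the structure of the report's layer output (residual input, plus all shared experts, plus gated top-$K_{r}$ routed experts) and assumes the affinity $s_{i,t}$ is a softmax over dot products between the token and each expert centroid $\mathbf{e}_{i}$. The experts here are toy scaling functions, and every name and number is hypothetical.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(u, shared_experts, routed_experts, centroids, k_r):
    """Return (output, selected expert indices) for one token vector u."""
    # Token-to-expert affinities (assumed: softmax over u . e_i).
    scores = softmax([sum(a * b for a, b in zip(u, e)) for e in centroids])
    # Keep the k_r routed experts with the highest affinity.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k_r]

    out = list(u)  # residual connection
    for ffn in shared_experts:          # shared experts are always active
        out = [o + y for o, y in zip(out, ffn(u))]
    for i in top:                       # routed experts are weighted by their gate value
        out = [o + scores[i] * y for o, y in zip(out, routed_experts[i](u))]
    return out, top


# Toy setup: one shared expert and four routed experts that only scale the input.
shared = [lambda v: [0.5 * x for x in v]]
routed = [lambda v, w=w: [w * x for x in v] for w in (1.0, 2.0, 3.0, 4.0)]
centroids = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0], [-1.0, 0.0]]

out, top = moe_layer([1.0, 0.0], shared, routed, centroids, k_r=2)
print(top, out)
```

Only the selected routed experts are evaluated, which is what keeps the activated parameter count small relative to the total.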

Original: ...$=W^{O}[\mathbf{o}_{t,1};\mathbf{o}_{t,2};\dots;\mathbf{o}_{t,n_{h}}]$, (19) where $W^{QR}\in\mathbb{R}^{d_{h}^{R}n_{h}\times d_{c}^{\prime}}$ and $W^{KR}\in\mathbb{R}^{d_{h}^{R}\times d}$ are matrices to produce the decoupled queries and key, respectively; $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoP...

In practice, training DeepSeek-V2 costs 42.5% less than training the dense DeepSeek 67B (172.8K versus 300.6K GPU hours per trillion tokens on the H800 cluster) while maintaining performance comparable to dense models. This makes it possible to train models with more parameters without a proportional increase in compute.

Original: ...$d_{c}$ and $d_{h}^{R}$ denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively. The amount of KV cache is measured by the number of elements, regardless of the storage precision. For DeepSeek-V2, $d_{c}$ is set to $4d_{h}$ and $d_{h}^{R}$ is set to $\frac{d_{h}}{2}$. So, its KV cache is equal to GQA with only 2.25 groups, but its performance is stronger than MHA. 2.1.4 Comparison of Key-Value Cache...

Table 2 compares DeepSeek-V2 under different configurations, showing the trade-offs among parameter count, training cost, and inference latency.

Original: ...$\operatorname{FFN}^{(s)}_{i}(\mathbf{u}_{t})+\sum_{i=1}^{N_{r}}g_{i,t}\operatorname{FFN}^{(r)}_{i}(\mathbf{u}_{t})$, (20) $g_{i,t}=\begin{cases}s_{i,t},&s_{i,t}\in\operatorname{Topk}(\{s_{j,t}\mid 1\leqslant j\leqslant N_{r}\},K_{r}),\\ 0,&\text{otherwise},\end{cases}$ (21) $s_{i,t}$ = S...

3 Training

3.1 Pre-Training

DeepSeek-V2 is pre-trained on a high-quality, multi-source corpus of 8.1T tokens, drawn from web text, code, mathematics, scientific literature, and other sources. We use the standard maximum-likelihood training objective.

Original: ered by its target experts. Due to the fine-grained expert segmentation in DeepSeekMoE, the number of activated experts can be large, so the MoE-related communication will be more costly if we apply expert parallelism. For DeepSeek-V2, beyond the naive top-K selection of routed experts, we additionally ensure that the target experts of each token will be distributed on at most $M$ devices. To be specific, for each token, we first select $M$ devices that have experts with the highest affinity scores in them. Then, we perform top-K selection among experts on these $M$ devices. In practice,...
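The two-step selection described in this excerpt can be sketched as follows. The code assumes one plausible reading of "devices that have experts with the highest affinity scores": each device is ranked by the best affinity among the experts it hosts. The function name, the expert-to-device mapping, and the scores are hypothetical.

```python
def device_limited_topk(scores, expert_to_device, m, k):
    """Pick top-k experts for one token, restricted to at most m devices.

    scores[i] is the token's affinity for routed expert i.
    expert_to_device[i] is the device hosting expert i.
    """
    # Step 1: rank devices by the highest affinity among their experts (assumed rule).
    best_per_device = {}
    for i, score in enumerate(scores):
        device = expert_to_device[i]
        best_per_device[device] = max(best_per_device.get(device, float("-inf")), score)
    chosen = set(sorted(best_per_device, key=best_per_device.get, reverse=True)[:m])

    # Step 2: ordinary top-k, but only among experts on the chosen devices.
    eligible = [i for i in range(len(scores)) if expert_to_device[i] in chosen]
    return sorted(eligible, key=lambda i: scores[i], reverse=True)[:k]


# Six experts spread over three devices; allow 2 devices and pick 3 experts.
scores = [0.30, 0.05, 0.25, 0.02, 0.20, 0.18]
expert_to_device = [0, 0, 1, 1, 2, 2]

limited = device_limited_topk(scores, expert_to_device, m=2, k=3)
unrestricted = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
print(limited, unrestricted)
```

In this toy case, plain top-3 routing would touch three devices, while the device-limited variant trades the third-best expert for one on an already selected device, which bounds the number of devices each token is sent to.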

During pre-training, we use a dynamic learning-rate schedule that decays gradually from the initial learning rate to a target value. We also apply gradient clipping and weight decay to keep training stable.

Original: ...$=\frac{N_{r}}{K_{r}T}\sum_{t=1}^{T}\mathds{1}(\text{Token $t$ selects Expert $i$})$, (24) $P_{i}=\frac{1}{T}\sum_{t=1}^{T}s_{i,t}$, (25) where $\alpha_{1}$ is a hyper-parameter called expert-level balance factor; $\mathds{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence. Dev...

The pre-training dataset is built with a strict quality-control pipeline that includes deduplication, filtering, and diversity sampling. We pay particular attention to the quality of code and math data, because these domains are critical to model performance.

Original: each device is balanced. Although the device-limited routing mechanism guarantees that the sending communication of each device is bounded, if a certain device receives more tokens than other devices, the practical communication efficiency will also be affected. In order to mitigate this issue, we design a communication balance loss as follows: $\mathcal{L}_{\mathrm{CommBal}}=\alpha_{3}\sum_{i=1}^{D}f_{i}^{\prime\prime}P_{i}^{\prime\prime}$ ...

3.2 Supervised Fine-Tuning

The supervised fine-tuning (SFT) stage uses carefully curated, high-quality instruction data covering a variety of tasks and languages. We adopt a multi-turn dialogue format so that the model can better understand and respond to user requests.

Original: st computes the average computational budget for each device, which means that the capacity factor for each device is equivalent to 1.0. Then, inspired by Riquelme et al. (2021), we drop tokens with the lowest affinity scores on each device until reaching the computational budget. In addition, we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped. In this way, we can flexibly decide whether to drop tokens during inference according to the efficiency requirements, and always ensure consistency between training and inference.
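A minimal sketch of the dropping rule described in this excerpt: the per-device budget is the average load (capacity factor 1.0), and each device keeps only its highest-affinity tokens up to that budget. For brevity the sketch omits the protection of roughly 10% of training sequences from dropping. The function name and the toy loads are hypothetical.

```python
def drop_tokens(device_tokens, budget):
    """Keep the `budget` tokens with the highest affinity on one device.

    device_tokens is a list of (token_id, affinity) pairs routed to the device.
    Returns the kept token ids in ascending order.
    """
    kept = sorted(device_tokens, key=lambda pair: pair[1], reverse=True)[:budget]
    return sorted(token_id for token_id, _ in kept)


# Toy routing result: device 0 is overloaded, device 1 is under its budget.
loads = {
    0: [(0, 0.9), (1, 0.2), (2, 0.6), (3, 0.4)],
    1: [(4, 0.7), (5, 0.3)],
}

# Capacity factor 1.0: every device may process the average number of tokens.
budget = sum(len(tokens) for tokens in loads.values()) // len(loads)
kept = {device: drop_tokens(tokens, budget) for device, tokens in loads.items()}
print(budget, kept)
```

Only the overloaded device drops anything, and it drops its lowest-affinity token first.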

2.1 Multi-Head Latent Attention: Boosting Inference Efficiency

The fine-tuning dataset draws on multiple sources, including public datasets such as Alpaca, Dolly, and OA-SFT, as well as high-quality data that we constructed ourselves. All data go through strict quality control and human review.

Original: Conventional Transformer models usually adopt Multi-Head Attention (MHA) (Vaswani et al., 2017), but during generation, its heavy Key-Value (KV) cache will become the bottleneck that limits the inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023) are proposed. They require a smaller magnitude of KV cache, but their performance does not match MHA (we provide the ablation of MHA, GQA and MQA in Appendix D.1). For DeepSeek-V2, we design an innovative attention mechanism called Multi-head La...

During fine-tuning, we use a warmup-then-decay learning-rate schedule: the learning rate starts small, increases gradually to the target value, and then decays. This helps the model retain the knowledge it acquired during pre-training.

Original: ...$\mathbf{v}_{t}=W^{V}\mathbf{h}_{t}$, (3) Then, $\mathbf{q}_{t},\mathbf{k}_{t},\mathbf{v}_{t}$ will be sliced into $n_{h}$ heads for the multi-head attention computation: $[\mathbf{q}_{t,1};\mathbf{q}_{t,2};\dots;\mathbf{q}_{t,n_{h}}]=\mathbf{q}_{t}$, (4) $[\mathbf{k}_{t,1};\mathbf{k}_{t,2};\dots;\mathbf{k}_{t,n_{h}}]=\mathbf{k}_{t}$, ...

Table 3 shows the performance gains of DeepSeek-V2 after supervised fine-tuning, illustrating how strongly fine-tuning affects model performance.

Original: ...$2n_{h}d_{h}l$ elements for each token. In model deployment, this heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length. Figure 3: Simplified illustration of Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Multi-head Latent Attention (MLA). Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference. 2.1.2 Low-Rank Key-Value Joint Compression The core of MLA is the low-rank joint compression for keys and values to reduce KV c...

4 Experimental Evaluation

4.1 Benchmark Results

We evaluate DeepSeek-V2 on multiple standard benchmarks, including MMLU, GSM8K, and HumanEval.

Original: KV cache has only $d_{c}l$ elements, where $l$ denotes the number of layers. In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$, and $W^{UV}$ can be absorbed into $W^{O}$, we even do not need to compute keys and values out for attention. Figure 3 intuitively illustrates how the KV joint compression in MLA reduces the KV cache. Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even if it cann...

Table 4 reports detailed MMLU results for DeepSeek-V2 across subject areas. Our model surpasses existing open-source models in most areas.

Original: E is position-sensitive for both keys and queries. If we apply RoPE for the keys $\mathbf{k}_{t}^{C}$, $W^{UK}$ in Equation 10 will be coupled with a position-sensitive RoPE matrix. In this way, $W^{UK}$ cannot be absorbed into $W^{Q}$ any more during inference, since a RoPE matrix related to the currently generating token will lie between $W^{Q}$ and $W^{UK}$ and matrix multiplication does not obey a commutative law. As a result, we must recompute the ke...

On the math reasoning benchmark GSM8K, DeepSeek-V2 also performs strongly, showing its potential on complex reasoning tasks.

Original: ...$=[\mathbf{q}_{t,i}^{C};\mathbf{q}_{t,i}^{R}]$, (16) $\mathbf{k}_{t,i}=[\mathbf{k}_{t,i}^{C};\mathbf{k}_{t}^{R}]$, (17) $\mathbf{o}_{t,i}=\sum_{j=1}^{t}\operatorname{Softmax}_{j}\left(\frac{\mathbf{q}_{t,i}^{T}\mathbf{k}_{j,i}}{\sqrt{d_{h}+d_{h}^{R}}}\right)\mathbf{v}_{j,i}^{C}$, ...

On the code generation benchmark HumanEval, DeepSeek-V2 likewise performs well, demonstrating its effectiveness on programming tasks.

Original: ...nism | KV Cache per Token (# Element) | Capability
Multi-Head Attention (MHA) | $2n_{h}d_{h}l$ | Strong
Grouped-Query Attention (GQA) | $2n_{g}d_{h}l$ | Moderate
Multi-Query Attention (MQA) | $2d_{h}l$ | Weak
MLA (Ours) | $(d_{c}+d_{h}^{R})l\approx\frac{9}{2}d_{h}l$ | ...

2.2 DeepSeekMoE: Training Strong Models at Economical Costs

4.2 Inference Efficiency

Inference efficiency is one of our key design goals for DeepSeek-V2. Table 5 compares the inference speed of different models.

Original: 2.2.1 Basic Architecture For FFNs, we employ the DeepSeekMoE architecture (Dai et al., 2024). DeepSeekMoE has two key ideas: segmenting experts into finer granularity for higher expert specialization and more accurate knowledge acquisition, and isolating some shared experts for mitigating knowledge redundancy among routed experts. With the same number of activated and total expert parameters, DeepSeekMoE can outperform conventional MoE architectures like GShard (Lepikhin et al., 2021) by a large margin. Let $\mathbf{u}_{t}$ be the FFN input of the $t$-th token, we compute...

Our experiments show that DeepSeek-V2 maintains strong performance while raising the maximum generation throughput to 5.76 times that of the dense DeepSeek 67B. This gain comes mainly from the efficient designs of MLA and DeepSeekMoE.

Original: ed experts, respectively; $\operatorname{FFN}^{(s)}_{i}(\cdot)$ and $\operatorname{FFN}^{(r)}_{i}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_{r}$ denotes the number of activated routed experts; $g_{i,t}$ is the gate value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_{i}$ is the centroid of the $i$-th routed expert in this layer; and ...

4.3 Training Cost

The training cost of DeepSeek-V2 is markedly lower than that of a dense model with comparable performance. Table 6 gives a detailed comparison of training costs.
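The headline saving can be reproduced from the GPU-hour figures the report quotes for training on each trillion tokens on the H800 cluster: 300.6K GPU hours for DeepSeek 67B and 172.8K for DeepSeek-V2. The short check below only redoes that arithmetic. The variable names are illustrative.

```python
# GPU hours per trillion training tokens, as quoted in the report.
dense_gpu_hours = 300.6e3    # DeepSeek 67B (dense)
sparse_gpu_hours = 172.8e3   # DeepSeek-V2 (sparse MoE)

# Relative saving of the sparse model over the dense one.
saving = 1 - sparse_gpu_hours / dense_gpu_hours
saving_percent = round(saving * 100, 1)

print(f"Training cost saved: {saving_percent}%")
```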

Original: alanced load will raise the risk of routing collapse (Shazeer et al., 2017), preventing some experts being fully trained and utilized. Secondly, when expert parallelism is employed, unbalanced load will diminish computation efficiency. During the training of DeepSeek-V2, we design three kinds of auxiliary losses, for controlling expert-level load balance ($\mathcal{L}_{\mathrm{ExpBal}}$), device-level load balance ($\mathcal{L}_{\mathrm{DevBal}}$), and communication balance ($\mathcal{L}_{\mathrm{CommBal}}$ ...

Training uses efficient parallelism strategies, including pipeline parallelism, expert parallelism, and data parallelism. These strategies allow efficient training on a large-scale cluster.

Original: ...$\{\mathcal{E}_{1},\mathcal{E}_{2},\dots,\mathcal{E}_{D}\}$, and deploy each group on a single device. The device-level balance loss is computed as follows: $\mathcal{L}_{\mathrm{DevBal}}=\alpha_{2}\sum_{i=1}^{D}f_{i}^{\prime}P_{i}^{\prime}$, (26) $f_{i}^{\prime}=\frac{1}{|\mathcal{E}_{i}|}\sum_{j\in\mathcal{E}_{i}}f_{j}$, ...

5 Ablation Studies

5.1 Effect of MLA

We conduct ablation studies to assess the effect of MLA on model performance. The results show that MLA significantly reduces inference memory usage without sacrificing performance.

Original: ...$\mathds{1}(\text{Token $t$ is sent to Device $i$})$, (30) $P_{i}^{\prime\prime}=\sum_{j\in\mathcal{E}_{i}}P_{j}$, (31) where $\alpha_{3}$ is a hyper-parameter called communication balance factor. The device-limited routing mechanism operates on the principle of ensuring that each device transmits at most $MT$ hidden states to other devices. Simultaneously, the communication balance loss is employed to encourage each device to receive around M ...

3 Pre-Training

Table 7 compares different attention mechanisms and demonstrates the effectiveness of MLA.

Original: 3.1 Experimental Setups 3.1.1 Data Construction While maintaining the same data processing stages as for DeepSeek 67B (DeepSeek-AI, 2024), we extend the amount of data and elevate the data quality. In order to enlarge our pre-training corpus, we explore the potential of the internet data and optimize our cleaning processes, thus recovering a large amount of mistakenly deleted data. Moreover, we incorporate more Chinese data, aiming to better leverage the corpus available on the Chinese internet. In addition to the amount of data, we also focus on the data quality. We enrich our pre-training ...

5.2 Effect of DeepSeekMoE

We also assess the effect of DeepSeekMoE on model performance. Experiments show that the MoE architecture significantly reduces training cost while maintaining performance.

Original: ...$d_{h}^{R}$ to 64. Following Dai et al. (2024), we substitute all FFNs except for the first layer with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated for each token. In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (...

We further study how the number of experts and the routing strategy affect performance, which provides guidance for future designs.

Original: rts will be uniformly deployed on 8 devices ($D=8$). As for the device-limited routing, each token will be sent to at most 3 devices ($M=3$). As for balance losses, we set $\alpha_{1}$ to 0.003, $\alpha_{2}$ to 0.05, and $\alpha_{3}$ to 0.02. We employ the token-dropping strategy during training for acceleration, but do not drop any tokens for evaluation. 3.1.3 Infrastructures DeepSeek-V2 is trained based on the HAI-LLM framework (High-flyer, 2023), an efficient and light-weight training framework developed internally by our...

5.3 Effect of Data Composition

We assess how different data compositions affect model performance and find that code and math data contribute substantially to the model's reasoning ability.

Original: 128K. YaRN was specifically applied to the decoupled shared key $\mathbf{k}^{R}_{t}$ as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale $s$ to 40, $\alpha$ to 1, $\beta$ to 32, and the target maximum context length to 160K. Under these settings, we can expect the model to respond well for a context length of 128K. Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy. The factor $\sqrt{t}$ is computed as $\sqrt{t}=0.0707\ln s$ ...

Table 8 reports the ablation results for different data compositions.

Original: Lai et al. (2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019), and CMRC (Cui et al., 2019). Reference disambiguation datasets include WinoGrande (Sakaguchi et al., 2019) and CLUEWSC (Xu et al., 2020). Language modeling datasets include Pile (Gao et al., 2020). Chinese understanding and culture datasets include CHID (Zheng et al., 2019) and CCPM (Li et al., 2021). Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMath (Wei et al., 2023). Code datasets include HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and CRUXEva...

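6 Conclusion

In this technical report, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts language model. Through the innovative designs of MLA and DeepSeekMoE, DeepSeek-V2 significantly reduces training cost and inference latency while maintaining strong performance.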

Original: ... 86.6 87.9 84.2
PIQA (Acc.) | 0-shot | 83.6 | 83.3 | 83.6 | 85.0 | 83.7
WinoGrande (Acc.) | 5-shot | 84.9 | 82.4 | 83.7 | 85.7 | 84.9
RACE-Middle (Acc.) | 5-shot | 69.9 | 63.4 | 73.3 | 73.3 | 73.1
RACE-High (Acc.) | 5-shot | 50.7 | 47.0 | 56.7 | 57.9 | 52.7
TriviaQA (EM) | 5-shot | 78.9 | 73.1 | 82.1 | 81.6 | 79.9
NaturalQuestions (EM) | 5-shot | 36.6 | 35.6 | 39.6 | 40.2 | 38.7
AGIEval (Acc.) | 0-shot | 41.3 | 64.4 | 43.4 | 49.8 | 51.2
Code
HumanEval (Pass@1) | 0-shot | 45.1 | 43.9 | 53.1 | 48.2 | 48.8
MBPP (Pass@1) | 3-shot | 57.4 | 53.6 | 64.2 | 68.6 | 66.6
CRUXEval-I (Acc.) | 2-shot | 42.5 | 44.3 | 52.4 | 49.4 | 52.8
CRUXEval-O (Acc.) | 2-shot | 41.0 | 42.3 | 52.8 | 54.3 | 49.8
Math
GSM8K (EM) | 8-shot | 63.4 | 77.9 | 80.3 | 83.0...

We believe DeepSeek-V2 will make an important contribution to the AI research community and advance the development of open-source AI models. In future work, we will continue to improve the model architecture and training methods to further raise model performance.

Original: are DeepSeek-V2 with its open-source counterparts one by one. (1) Compared with Qwen1.5 72B, another model that supports both Chinese and English, DeepSeek-V2 demonstrates overwhelming advantages on the majority of English, code, and math benchmarks. As for Chinese benchmarks, Qwen1.5 72B shows better performance on multi-subject multiple-choice tasks while DeepSeek-V2 is comparable or better on others. Note that for the CHID benchmark, the tokenizer of Qwen1.5 72B will encounter errors in our evaluation framework, so we leave the CHID score blank for Qwen1.5 72B. (2) Compared with Mixtral 8x2...

Appendix A: Experimental Details

This appendix describes the experimental setup and hyper-parameter configuration in detail.

Original: MoE model will introduce additional communication overheads, through our operator and communication optimizations, the training for DeepSeek-V2 can attain a relatively high Model FLOPs Utilization (MFU). During our practical training on the H800 cluster, for training on each trillion tokens, DeepSeek 67B requires 300.6K GPU hours, while DeepSeek-V2 needs only 172.8K GPU hours, i.e., sparse DeepSeek-V2 can save 42.5% training costs compared with dense DeepSeek 67B. Inference Efficiency. In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of...

3.1 Experimental Setups

Appendix B: Ethical Considerations

We discuss the ethical issues and potential risks the model may raise, together with the mitigation measures we have taken.

Original: 3.1.1 Data Construction While maintaining the same data processing stages as for DeepSeek 67B (DeepSeek-AI, 2024), we extend the amount of data and elevate the data quality. In order to enlarge our pre-training corpus, we explore the potential of the internet data and optimize our cleaning processes, thus recovering a large amount of mistakenly deleted data. Moreover, we incorporate more Chinese data, aiming to better leverage the corpus available on the Chinese internet. In addition to the amount of data, we also focus on the data quality. We enrich our pre-training corpus with high-quality...

Appendix C: Additional Experimental Results

This appendix provides further experimental results and analysis.

Original: owing Dai et al. (2024), we substitute all FFNs except for the first layer with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated for each token. In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (i.e., the compressed la...