DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
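Haoyu Lu, Wen Liu, Bo Zhang et al., DeepSeek-AI. Abstract: We present DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference throughput. DeepSeek-V2 adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture, which significantly reduce training cost and inference latency while maintaining high performance. Our model surpasses existing open-source models on a wide range of benchmarks while offering excellent inference efficiency.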
[Original] DeepSeek-AI research@deepseek.com Abstract We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost throu...
[Original] in inference throughput. These constraints present significant challenges that impede the widespread adoption and utilization of LLMs. To tackle this problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model, characterized by economical training and efficient inference through an innovative Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. We optimize the attention modules and Feed-Forward Networks (FFNs) within the Transformer framewor...
[Original] ion overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 features strong performance (Figure 1), economical training costs, and efficient inference throughput (Figure 1), simultaneously. Figure 2: Illustration of the architecture of DeepSeek-V2. MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture. We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek ...
[Original] 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), 8.97 overall score on MT-Bench (Zheng et al., 2023), and 7.91 overall score on AlignBench (Liu et al., 2023). The English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-tier performance among open-source chat models. In addition, the evaluation on AlignBench indicates that in Chinese, DeepSeek-V2 Chat (RL) outperforms all open-source models and even beats most closed-source models. To facilitate further research and development on MLA and DeepSeekMoE, we also relea...
[Original] ent inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we introduce the details of MLA and DeepSeekMoE in this section. For other minor details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B (DeepSeek-AI, 2024). 2.1 Multi-Head Latent Attention: Boosting Inference Efficiency Conventional Tran...
[Original] In the past few years, Large Language Models (LLMs) (OpenAI, 2022, 2023; Anthropic, 2023; Google, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number of parameters increases, allowing it to exhibit emergent capabilities across various tasks (Wei et al., 2022). However, the improvement comes at the cost of larger computing resources for training and a potential decrease in inference throughput. These constraints present significant challenges that impede the...
1 Introduction
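For Feed-Forward Networks (FFNs), we adopt the DeepSeekMoE architecture (Dai et al., 2024), which uses fine-grained expert segmentation and shared expert isolation to raise the potential for expert specialization. Compared with conventional MoE architectures such as GShard (Lepikhin et al., 2021), DeepSeekMoE shows significant advantages, enabling us to train strong models at low cost. Because we use expert parallelism during training, we also design supplementary mechanisms to control communication overhead and ensure load balance. By combining these two techniques, DeepSeek-V2 achieves…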
[Original] Networks (FFNs), we follow the DeepSeekMoE architecture (Dai et al., 2024), which adopts fine-grained expert segmentation and shared expert isolation for higher potential in expert specialization. The DeepSeekMoE architecture demonstrates great advantages compared with conventional MoE architectures like GShard (Lepikhin et al., 2021), enabling us to train strong models at an economical cost. As we employ expert parallelism during training, we also devise supplementary mechanisms to control communication overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 feat...
[Original] omes the strongest open-source MoE language model. Figure 1 highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. In addition, as shown in Figure 1, compared with DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), 8.97 overall s...
[Original] standard MHA mechanism as background. Let $d$ be the embedding dimension, $n_h$ be the number of attention heads, $d_h$ be the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ be the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t \in \mathbb{R}^{d_h n_h}$ through three matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$ ...
[Original] {KV}, (10) $\mathbf{v}_t^{C} = W^{UV}\mathbf{c}_t^{KV}$, (11) where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c (\ll d_h n_h)$ denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ ...
[Original] Q}, (13) where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c' (\ll d_h n_h)$ denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d_c' \times d}$, $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ ...
[Original] $\mathbf{k}_t^{R} \in \mathbb{R}^{d_h^R}$ to carry RoPE, where $d_h^R$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation: $[\mathbf{q}_{t,1}^{R};\mathbf{q}_{t,2}^{R};\ldots;\mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R}$ ...
[Original] $= W^{O}[\mathbf{o}_{t,1};\mathbf{o}_{t,2};\ldots;\mathbf{o}_{t,n_h}]$, (19) where $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ and $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ are matrices to produce the decoupled queries and key, respectively; $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoP...
[Original] $d_c$ and $d_h^R$ denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively. The amount of KV cache is measured by the number of elements, regardless of the storage precision. For DeepSeek-V2, $d_c$ is set to $4d_h$ and $d_h^R$ is set to $\frac{d_h}{2}$. So, its KV cache is equal to GQA with only 2.25 groups, but its performance is stronger than MHA. 2.1.4 Comparison of Key-Value Cache...
[Original] ered by its target experts. Due to the fine-grained expert segmentation in DeepSeekMoE, the number of activated experts can be large, so the MoE-related communication will be more costly if we apply expert parallelism. For DeepSeek-V2, beyond the naive top-K selection of routed experts, we additionally ensure that the target experts of each token will be distributed on at most $M$ devices. To be specific, for each token, we first select $M$ devices that have experts with the highest affinity scores in them. Then, we perform top-K selection among experts on these $M$ devices. In practice,...
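The two-step selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `device_limited_topk`, the contiguous expert-to-device layout, and the choice to rank each device by its single best expert affinity are all assumptions made for the example.

```python
from typing import List

def device_limited_topk(scores: List[float], experts_per_device: int,
                        m_devices: int, k: int) -> List[int]:
    """Pick K routed experts for one token, restricted to at most M devices.

    scores[i] is the token-to-expert affinity of expert i. Experts are assumed
    (hypothetically) to be laid out contiguously, so expert i lives on device
    i // experts_per_device.
    """
    n_devices = len(scores) // experts_per_device
    # Step 1: rank each device by the best affinity among its experts
    # (one possible reading of "devices that have experts with the highest
    # affinity scores in them").
    device_best = [
        max(scores[d * experts_per_device:(d + 1) * experts_per_device])
        for d in range(n_devices)
    ]
    kept = sorted(range(n_devices), key=lambda d: device_best[d], reverse=True)[:m_devices]
    kept_set = set(kept)
    # Step 2: ordinary top-K, but only over experts on the kept devices.
    candidates = [i for i in range(len(scores)) if i // experts_per_device in kept_set]
    chosen = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    return sorted(chosen)

# 8 experts on 4 devices; a naive top-3 would pick experts 0, 2 and 4 (3 devices).
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.6, 0.05, 0.3]
picked = device_limited_topk(scores, experts_per_device=2, m_devices=2, k=3)
print(picked)  # [0, 2, 3]
```

With `m_devices=2` the token is confined to devices 0 and 1, so expert 3 replaces expert 4 even though expert 4 has the higher affinity. That is the trade the mechanism makes: slightly worse expert choice in exchange for bounded communication.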
[Original] $= \frac{N_r}{K_r T}\sum_{t=1}^{T}\mathds{1}(\text{Token $t$ selects Expert $i$})$, (24) $P_i = \frac{1}{T}\sum_{t=1}^{T} s_{i,t}$, (25) where $\alpha_1$ is a hyper-parameter called expert-level balance factor; $\mathds{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence. Dev...
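To make Eq. (24) and Eq. (25) concrete, the following sketch computes the two per-expert statistics for a toy sequence. The helper names and the toy data are hypothetical. The final combination `alpha1 * sum(f_i * P_i)` is an assumption that mirrors the form of the device-level loss, since the expert-level loss formula itself is cut off in this excerpt.

```python
from typing import List, Tuple

def expert_balance_stats(affinity: List[List[float]], selected: List[List[int]],
                         k_r: int) -> Tuple[List[float], List[float]]:
    """Return (f, P) for one sequence.

    affinity[t][i] is s_{i,t}; selected[t] lists the K_r experts chosen for token t.
    """
    T = len(affinity)
    n_r = len(affinity[0])
    counts = [0] * n_r
    for experts in selected:
        for i in experts:
            counts[i] += 1
    # f_i = N_r / (K_r * T) * (number of tokens that select expert i), Eq. (24)
    f = [n_r / (k_r * T) * c for c in counts]
    # P_i = mean affinity of expert i over the sequence, Eq. (25)
    P = [sum(affinity[t][i] for t in range(T)) / T for i in range(n_r)]
    return f, P

def expert_balance_loss(f: List[float], P: List[float], alpha1: float) -> float:
    # Assumed form: alpha_1 * sum_i f_i * P_i
    return alpha1 * sum(fi * pi for fi, pi in zip(f, P))

affinity = [[0.5, 0.3, 0.1, 0.1],
            [0.4, 0.4, 0.1, 0.1],
            [0.1, 0.2, 0.3, 0.4],
            [0.25, 0.25, 0.25, 0.25]]
selected = [[0, 1], [0, 1], [2, 3], [0, 1]]
f, P = expert_balance_stats(affinity, selected, k_r=2)
loss = expert_balance_loss(f, P, alpha1=0.003)
print(f)  # [1.5, 1.5, 0.5, 0.5]
```

The normalization in Eq. (24) makes every `f_i` equal to 1 under perfectly uniform routing, so values of 1.5 and 0.5 here indicate that experts 0 and 1 are overloaded.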
[Original] each device is balanced. Although the device-limited routing mechanism guarantees that the sending communication of each device is bounded, if a certain device receives more tokens than other devices, the practical communication efficiency will also be affected. In order to mitigate this issue, we design a communication balance loss as follows: $\mathcal{L}_{\mathrm{CommBal}} = \alpha_3 \sum_{i=1}^{D} f_i'' P_i''$ ...
[Original] st computes the average computational budget for each device, which means that the capacity factor for each device is equivalent to 1.0. Then, inspired by Riquelme et al. (2021), we drop tokens with the lowest affinity scores on each device until reaching the computational budget. In addition, we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped. In this way, we can flexibly decide whether to drop tokens during inference according to the efficiency requirements, and always ensure consistency between training and inference.
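A rough per-device sketch of this token-dropping rule follows. The function name, the tuple layout, and the decision to let protected tokens exceed the budget are illustrative assumptions; the excerpt does not say how the two constraints interact.

```python
from typing import List, Tuple

def drop_tokens_on_device(assignments: List[Tuple[int, float, bool]],
                          budget: int) -> List[int]:
    """Return the ids of the tokens kept on one device.

    Each assignment is (token_id, affinity, protected). `protected` marks tokens
    from the sequences that are exempt from dropping.
    """
    protected = [a for a in assignments if a[2]]
    droppable = [a for a in assignments if not a[2]]
    # Protected tokens are always kept (assumed to take priority over the budget).
    room = max(budget - len(protected), 0)
    # Keep the highest-affinity droppable tokens; the lowest-affinity ones are dropped.
    droppable.sort(key=lambda a: a[1], reverse=True)
    kept = protected + droppable[:room]
    return sorted(a[0] for a in kept)

assignments = [(0, 0.9, False), (1, 0.2, False), (2, 0.1, True),
               (3, 0.5, False), (4, 0.4, False)]
kept_ids = drop_tokens_on_device(assignments, budget=3)
print(kept_ids)  # [0, 2, 3]
```

Token 2 survives despite having the lowest affinity because it is protected, while tokens 1 and 4 are dropped to meet the budget of 3.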
[Original] Conventional Transformer models usually adopt Multi-Head Attention (MHA) (Vaswani et al., 2017), but during generation, its heavy Key-Value (KV) cache becomes the bottleneck that limits inference efficiency. To reduce the KV cache, Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023) have been proposed. They require a smaller KV cache, but their performance does not match MHA (we provide the ablation of MHA, GQA and MQA in Appendix D.1). For DeepSeek-V2, we design an innovative attention mechanism called Multi-head La...
[Original] $\mathbf{v}_t = W^{V}\mathbf{h}_t$, (3) Then, $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation: $[\mathbf{q}_{t,1};\mathbf{q}_{t,2};\ldots;\mathbf{q}_{t,n_h}] = \mathbf{q}_t$, (4) $[\mathbf{k}_{t,1};\mathbf{k}_{t,2};\ldots;\mathbf{k}_{t,n_h}] = \mathbf{k}_t$, ...
[Original] KV cache has only $d_c l$ elements, where $l$ denotes the number of layers. In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$, and $W^{UV}$ can be absorbed into $W^{O}$, we do not even need to compute keys and values explicitly for attention. Figure 3 intuitively illustrates how the KV joint compression in MLA reduces the KV cache. Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even if it cann...
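The absorption claim can be checked numerically for a single head. The sketch below is a toy verification, not the model's code: it ignores RoPE, multi-head slicing, and query compression, and every size and name (`W_Q_absorbed`, `rand_matrix`, the dimensions 6, 4, 2) is an assumption chosen for the example. It shows that the attention logit computed from a rebuilt key equals the logit computed directly against the cached latent once $W^{UK}$ is folded into the query projection.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def matmul(A, B):
    Bt = transpose(B)
    return [[dot(row, col) for col in Bt] for row in A]

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, d_h, d_c = 6, 4, 2            # toy sizes, chosen only for illustration
W_Q = rand_matrix(d_h, d)        # query projection for one head
W_UK = rand_matrix(d_h, d_c)     # key up-projection for the same head
h_t = [random.uniform(-1.0, 1.0) for _ in range(d)]     # current token input
c_kv = [random.uniform(-1.0, 1.0) for _ in range(d_c)]  # cached latent of a past token

# Naive path: rebuild the key from the cached latent, then take q . k.
q = matvec(W_Q, h_t)
k = matvec(W_UK, c_kv)
score_naive = dot(q, k)

# Absorbed path: fold W_UK into the query projection once, offline.
W_Q_absorbed = matmul(transpose(W_UK), W_Q)   # shape d_c x d
score_absorbed = dot(matvec(W_Q_absorbed, h_t), c_kv)

print(abs(score_naive - score_absorbed) < 1e-9)  # True
```

The identity used is $(W^{Q}\mathbf{h}_t)^{\top}(W^{UK}\mathbf{c}^{KV}) = ((W^{UK})^{\top}W^{Q}\mathbf{h}_t)^{\top}\mathbf{c}^{KV}$, so only the $d_c$-dimensional latent has to be read from the cache at generation time.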
[Original] $= [\mathbf{q}_{t,i}^{C};\mathbf{q}_{t,i}^{R}]$, (16) $\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C};\mathbf{k}_t^{R}]$, (17) $\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^{T}\mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right)\mathbf{v}_{j,i}^{C}$, ...
[Original] nism | KV Cache per Token (# Element) | Capability
Multi-Head Attention (MHA) | $2 n_h d_h l$ | Strong
Grouped-Query Attention (GQA) | $2 n_g d_h l$ | Moderate
Multi-Query Attention (MQA) | $2 d_h l$ | Weak
MLA (Ours) | $(d_c + d_h^R) l \approx \frac{9}{2} d_h l$ | ...
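The element counts in this comparison are easy to evaluate for a concrete configuration. In the sketch below, the values `n_h=128`, `d_h=128`, `n_layers=60`, and `n_g=8` are illustrative assumptions rather than figures quoted in this excerpt; only the relations $d_c = 4d_h$ and $d_h^R = d_h/2$ come from the text.

```python
def kv_cache_elements(n_h: int, d_h: int, n_layers: int, n_g: int,
                      d_c: int, d_h_rope: int) -> dict:
    """KV cache per token, in elements, for each attention mechanism."""
    return {
        "MHA": 2 * n_h * d_h * n_layers,
        "GQA": 2 * n_g * d_h * n_layers,
        "MQA": 2 * d_h * n_layers,
        "MLA": (d_c + d_h_rope) * n_layers,
    }

d_h = 128  # assumed per-head dimension
sizes = kv_cache_elements(n_h=128, d_h=d_h, n_layers=60, n_g=8,
                          d_c=4 * d_h, d_h_rope=d_h // 2)
for name, n in sizes.items():
    print(f"{name}: {n} elements per token")

# MLA stores 4.5 * d_h * l elements, i.e. the same as GQA with 2.25 groups.
equivalent_groups = sizes["MLA"] / (2 * d_h * 60)
print(equivalent_groups)  # 2.25
```

Under these assumed sizes MLA caches 34,560 elements per token against 1,966,080 for MHA, which illustrates why the latent cache is so much smaller even though the exact saving depends on the baseline configuration.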
2.2 DeepSeekMoE: Training Strong Models at Economical Costs
[Original] 2.2.1 Basic Architecture For FFNs, we employ the DeepSeekMoE architecture (Dai et al., 2024). DeepSeekMoE has two key ideas: segmenting experts into finer granularity for higher expert specialization and more accurate knowledge acquisition, and isolating some shared experts for mitigating knowledge redundancy among routed experts. With the same number of activated and total expert parameters, DeepSeekMoE can outperform conventional MoE architectures like GShard (Lepikhin et al., 2021) by a large margin. Let $\mathbf{u}_t$ be the FFN input of the $t$-th token, we compute...
[Original] ed experts, respectively; $\operatorname{FFN}^{(s)}_{i}(\cdot)$ and $\operatorname{FFN}^{(r)}_{i}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gate value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid of the $i$-th routed expert in this layer; and ...
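The quantities defined above fit together roughly as in the following sketch of one MoE layer for a single token. It is an illustration under stated assumptions, not the authors' code: affinities are taken as a softmax over the dot products $\mathbf{u}_t \cdot \mathbf{e}_i$, the gate equals the affinity for the top-$K_r$ experts and zero otherwise, and the experts are stand-in scaling functions. All names (`moe_forward`, `scale_by`) are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]

def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(u: Vector, shared: List[Expert], routed: List[Expert],
                centroids: List[Vector], k_r: int) -> Tuple[Vector, List[int]]:
    # s_i: affinity of this token to each routed expert (assumed softmax of u . e_i).
    s = softmax([sum(a * b for a, b in zip(u, e)) for e in centroids])
    # g_i = s_i for the K_r highest-affinity experts, 0 for the rest.
    top = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:k_r]
    out = list(u)  # residual connection
    for ffn in shared:          # shared experts always run
        out = [o + y for o, y in zip(out, ffn(u))]
    for i in top:               # routed experts run only when selected
        out = [o + s[i] * y for o, y in zip(out, routed[i](u))]
    return out, top

def scale_by(factor: float) -> Expert:
    """Stand-in expert: a real expert would be a small FFN."""
    return lambda x: [factor * v for v in x]

u = [1.0, 0.0]
shared = [scale_by(2.0)]
routed = [scale_by(1.0), scale_by(10.0), scale_by(100.0)]
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
out, top = moe_forward(u, shared, routed, centroids, k_r=1)
print(top)  # [0]
```

Only routed expert 0 is evaluated for this token, which is the source of the sparsity: the compute cost scales with the shared experts plus $K_r$, not with the total number of routed experts.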
[Original] alanced load will raise the risk of routing collapse (Shazeer et al., 2017), preventing some experts from being fully trained and utilized. Secondly, when expert parallelism is employed, unbalanced load will diminish computation efficiency. During the training of DeepSeek-V2, we design three kinds of auxiliary losses, for controlling expert-level load balance ($\mathcal{L}_{\mathrm{ExpBal}}$), device-level load balance ($\mathcal{L}_{\mathrm{DevBal}}$), and communication balance ($\mathcal{L}_{\mathrm{CommBal}}$ ...
[Original] $\{\mathcal{E}_{1},\mathcal{E}_{2},\ldots,\mathcal{E}_{D}\}$, and deploy each group on a single device. The device-level balance loss is computed as follows: $\mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i' P_i'$, (26) $f_i' = \frac{1}{|\mathcal{E}_i|}\sum_{j\in\mathcal{E}_i} f_j$, ...
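The device-level loss aggregates per-expert statistics over each device group $\mathcal{E}_i$. A small sketch follows, taking the per-expert $f_j$ and $P_j$ as given inputs. The per-device $f_i'$ follows the averaging formula above; the per-device $P_i'$ is assumed to be the sum of $P_j$ over the group, by analogy with the communication balance loss, because its definition is cut off in this excerpt. The function name and toy numbers are hypothetical.

```python
from typing import List

def device_balance_loss(f: List[float], P: List[float],
                        groups: List[List[int]], alpha2: float) -> float:
    """groups[i] lists the routed-expert indices deployed on device i."""
    total = 0.0
    for experts in groups:
        # f'_i: average load fraction of the experts on this device.
        f_dev = sum(f[j] for j in experts) / len(experts)
        # P'_i: summed mean affinity of the experts on this device (assumed form).
        p_dev = sum(P[j] for j in experts)
        total += f_dev * p_dev
    return alpha2 * total

f = [1.5, 1.5, 0.5, 0.5]              # toy per-expert load fractions
P = [0.3125, 0.2875, 0.1875, 0.2125]  # toy per-expert mean affinities
dev_loss = device_balance_loss(f, P, groups=[[0, 1], [2, 3]], alpha2=0.05)
print(round(dev_loss, 6))  # 0.055
```

Because the statistics are pooled per device, two unbalanced experts on the same device can cancel out; the loss only penalizes imbalance that shows up as uneven computation across devices.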
[Original] $\mathds{1}(\text{Token $t$ is sent to Device $i$})$, (30) $P_i'' = \sum_{j\in\mathcal{E}_i} P_j$, (31) where $\alpha_3$ is a hyper-parameter called communication balance factor. The device-limited routing mechanism operates on the principle of ensuring that each device transmits at most $MT$ hidden states to other devices. Simultaneously, the communication balance loss is employed to encourage each device to receive around M ...
[Original] $d_h^R$ to 64. Following Dai et al. (2024), we substitute all FFNs except for the first layer with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated for each token. In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (...
[Original] rts will be uniformly deployed on 8 devices ($D=8$). As for the device-limited routing, each token will be sent to at most 3 devices ($M=3$). As for balance losses, we set $\alpha_1$ to 0.003, $\alpha_2$ to 0.05, and $\alpha_3$ to 0.02. We employ the token-dropping strategy during training for acceleration, but do not drop any tokens for evaluation. 3.1.3 Infrastructures DeepSeek-V2 is trained based on the HAI-LLM framework (High-flyer, 2023), an efficient and light-weight training framework developed internally by our...
[Original] 128K. YaRN was specifically applied to the decoupled shared key $\mathbf{k}^{R}_{t}$ as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale $s$ to 40, $\alpha$ to 1, $\beta$ to 32, and the target maximum context length to 160K. Under these settings, we can expect the model to respond well for a context length of 128K. Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy. The factor $\sqrt{t}$ is computed as $\sqrt{t} = 0.0707 \ln s$ ...
[Original] are DeepSeek-V2 with its open-source counterparts one by one. (1) Compared with Qwen1.5 72B, another model that supports both Chinese and English, DeepSeek-V2 demonstrates overwhelming advantages on the majority of English, code, and math benchmarks. As for Chinese benchmarks, Qwen1.5 72B shows better performance on multi-subject multiple-choice tasks while DeepSeek-V2 is comparable or better on others. Note that for the CHID benchmark, the tokenizer of Qwen1.5 72B encounters errors in our evaluation framework, so we leave the CHID score blank for Qwen1.5 72B. (2) Compared with Mixtral 8x2...
[Original] MoE model will introduce additional communication overheads, through our operator and communication optimizations, the training for DeepSeek-V2 can attain a relatively high Model FLOPs Utilization (MFU). During our practical training on the H800 cluster, for training on each trillion tokens, DeepSeek 67B requires 300.6K GPU hours, while DeepSeek-V2 needs only 172.8K GPU hours, i.e., sparse DeepSeek-V2 can save 42.5% of training costs compared with dense DeepSeek 67B. Inference Efficiency. In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of...
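The 42.5% figure follows directly from the two GPU-hour numbers quoted above, as this short check shows (variable names are illustrative):

```python
gpu_hours_dense = 300.6e3  # DeepSeek 67B, GPU hours per trillion tokens
gpu_hours_v2 = 172.8e3     # DeepSeek-V2, GPU hours per trillion tokens

saving = 1 - gpu_hours_v2 / gpu_hours_dense
print(f"{saving:.1%}")  # 42.5%
```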
[Original] 3.1.1 Data Construction While maintaining the same data processing stages as for DeepSeek 67B (DeepSeek-AI, 2024), we extend the amount of data and elevate the data quality. In order to enlarge our pre-training corpus, we explore the potential of the internet data and optimize our cleaning processes, thus recovering a large amount of mistakenly deleted data. Moreover, we incorporate more Chinese data, aiming to better leverage the corpus available on the Chinese internet. In addition to the amount of data, we also focus on the data quality. We enrich our pre-training corpus with high-quality...
[Original] owing Dai et al. (2024), we substitute all FFNs except for the first layer with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated for each token. In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (i.e., the compressed la...
[Original] ployed on 8 devices ($D=8$). As for the device-limited routing, each token will be sent to at most 3 devices ($M=3$). As for balance losses, we set $\alpha_1$ to 0.003, $\alpha_2$ to 0.05, and $\alpha_3$ to 0.02. We employ the token-dropping strategy during training for acceleration, but do not drop any tokens for evaluation. 3.1.3 Infrastructures DeepSeek-V2 is trained based on the HAI-LLM framework (High-flyer, 2023), an efficient and light-weight training framework developed internally by our engineers. It employs a...
[Original] lly applied to the decoupled shared key $\mathbf{k}^{R}_{t}$ as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale $s$ to 40, $\alpha$ to 1, $\beta$ to 32, and the target maximum context length to 160K. Under these settings, we can expect the model to respond well for a context length of 128K. Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy. The factor $\sqrt{t}$ is computed as $\sqrt{t} = 0.0707 \ln s + 1$ ...
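For the stated scale $s = 40$, the length scaling factor can be evaluated directly. The sketch below only plugs the quoted constants into the formula; the function name `yarn_sqrt_t` is hypothetical, and how the factor is then applied inside attention is not covered by this excerpt.

```python
import math

def yarn_sqrt_t(scale: float) -> float:
    """Length scaling factor sqrt(t) = 0.0707 * ln(s) + 1."""
    return 0.0707 * math.log(scale) + 1.0

sqrt_t = yarn_sqrt_t(40.0)
print(round(sqrt_t, 4))  # 1.2608
```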
[Original] 3.2.1 Evaluation Benchmarks DeepSeek-V2 is pretrained on a bilingual corpus, so we evaluate it on a series of benchmarks in English and Chinese. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Included benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese: Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023). Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 202...
[Original] e perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models with different tokenizers. For an intuitive overview of these benchmarks, we additionally provide our evaluation formats for each benchmark in Appendix G. 3.2.2 Evaluation Results
Benchmark (Metric) | # Shots | DeepSeek 67B | Qwen1.5 72B | Mixtral 8x22B | LLaMA 3 70B | DeepSeek-V2
Architecture | - | Dense | Dense | MoE | Dense | MoE
# Activated Params | - | 67B | 72B | 39B | 70B | 21B
# Total Params | - | 67B | 72B | 141B | 70B | 236B
English: Pile-test (BPB) | - | 0.642 | 0.637 | 0.623 | 0.602 | 0.606
BBH (EM) ...
[Original] are the same evaluation setting. Bold denotes the best and underline denotes the second-best. Scores with a gap smaller than 0.3 are regarded as at the same level. With only 21B activated parameters, DeepSeek-V2 achieves top-tier performance among open-source models. In Table 2, we compare DeepSeek-V2 with several representative open-source models, including DeepSeek 67B (DeepSeek-AI, 2024) (our previous release), Qwen1.5 72B (Bai et al., 2023), LLaMA3 70B (AI@Meta, 2024), and Mixtral 8x22B (Mistral, 2024). We evaluate all these models with our internal evaluation framework, and ensure...
[Original] s. Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B. However, even with far fewer training tokens and activated parameters, DeepSeek-V2 still demonstrates code and math capability comparable to LLaMA3 70B. Also, as a bilingual language model, DeepSeek-V2 outperforms LLaMA3 70B overwhelmingly on Chinese benchmarks. Finally, it is worth mentioning that certain prior studies (Hu et al., 2024) incorporate SFT data during the pre-training stage, whereas DeepSeek-V2 has never been exposed to SFT data during pre-training. 3.2.3 Traini...
[Original] 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second.
[Original] 4.1 Supervised Fine-Tuning Building upon our prior research (DeepSeek-AI, 2024), we curate our instruction tuning datasets to include 1.5M instances, comprising 1.2M instances for helpfulness and 0.3M instances for safety. In comparison to the initial version, we improve the data quality to mitigate hallucinatory responses and enhance writing proficiency. We fine-tune DeepSeek-V2 for 2 epochs, and the learning rate is set to $5\times 10^{-6}$. For the evaluation of DeepSeek-V2 Chat (SFT), we mainly include generation-based benchmarks, except for several represen...
[Original] $\{o_{1},o_{2},\cdots,o_{G}\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective: $\mathcal{J}_{GRPO}(\theta) = \mathbb{E}[q\sim P(Q),\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{old}}(O|q)]\ \frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}A_i,\ \operatorname{clip}\left(\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},1-\varepsilon,1+\varepsilon\right)A_i\right)-\beta\,\mathbb{D}_{KL}\left(\pi_{\theta}||\pi_{ref}\right)\right)$, ...
[Original] h prompts, exhibits unique characteristics that are distinct from the training on general data. For example, the mathematical and coding abilities of our model can keep improving over a longer period of training steps. Therefore, we employ a two-stage RL training strategy, which first performs reasoning alignment, and then performs human preference alignment. In the first reasoning alignment stage, we train a reward model $RM_{reasoning}$ for code and math reasoning tasks, and optimize the policy model with the feedback of R...
[Original] n adjustments. We obtain code preference data based on compiler feedback, and mathematical preference data based on the ground-truth labels. For reward model training, we initialize the reward models with DeepSeek-V2 Chat (SFT) and train them with either a point-wise or a pair-wise loss. In our experiments, we observe that the RL training can fully tap into and activate the potential of our model, enabling it to select the correct and satisfactory answer from possible responses. Optimizations for Training Efficiency. Conducting RL training on extremely large models places high demands on the t...
[Original] .5 72B Chat, and find that DeepSeek-V2 Chat (SFT) surpasses Qwen1.5 72B Chat on almost all English, math, and code benchmarks. On Chinese benchmarks, DeepSeek-V2 Chat (SFT) demonstrates slightly lower scores than Qwen1.5 72B Chat on multi-subject multiple-choice tasks, consistent with the performance observed from their base versions. When compared with the state-of-the-art open-source MoE model, Mixtral 8x22B Instruct, DeepSeek-V2 Chat (SFT) exhibits better performance on most benchmarks, except for NaturalQuestions and IFEval. Furthermore, in comparison to the state-of-the-art open-source...
[Original] onversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric. In addition, we evaluate the Chinese open-ended generation capability based on AlignBench. As presented in Table 5, DeepSeek-V2 Chat (RL) exhibits a slight advantage over DeepSeek-V2 Chat (SFT). Notably, DeepSeek-V2 Chat (SFT) surpasses all open-source Chinese models by a significant margin. It significantly outperforms the second-best open-source model, Qwen1.5 72B Chat, on both Chinese reasoning and language. Moreover, both DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) outperform GPT-4-0613 a...
[Original] t cannot be entirely eliminated. Our observation underscores the critical need for sufficient data to equip an LLM with desired capabilities. Moreover, the quality of SFT data is also crucial, especially for tasks involving writing or open-ended questions. Alignment Tax of Reinforcement Learning. During human preference alignment, we observe a significant performance enhancement on the open-ended generation benchmarks, in terms of the scores rated by both AI and human evaluators. However, we also notice a phenomenon of “alignment tax” (Ouyang et al., 2022), i.e., the alignment process can ne...
[Original] Building upon our prior research (DeepSeek-AI, 2024), we curate our instruction tuning datasets to include 1.5M instances, comprising 1.2M instances for helpfulness and 0.3M instances for safety. In comparison to the initial version, we improve the data quality to mitigate hallucinatory responses and enhance writing proficiency. We fine-tune DeepSeek-V2 for 2 epochs, and the learning rate is set to $5\times 10^{-6}$. For the evaluation of DeepSeek-V2 Chat (SFT), we mainly include generation-based benchmarks, except for several representative multiple-choice task...
[Original] In order to further unlock the potential of DeepSeek-V2 and align it with human preference, we conduct Reinforcement Learning (RL) to adjust its preference. Reinforcement Learning Algorithm. In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which forgoes the critic model, typically the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$ ...
[Original] $\{r_{1},r_{2},\ldots,r_{G}\}$ corresponding to the outputs within each group: $A_{i}=\frac{r_{i}-\operatorname{mean}(\{r_{1},r_{2},\cdots,r_{G}\})}{\operatorname{std}(\{r_{1},r_{2},\cdots,r_{G}\})}$. (34) Training Strategy. In our preliminary experiments, we find that the RL training on reasoning data, such as code and math prompts, exhibits unique characterist...
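Eq. (34) standardizes each reward against the other rewards in its group, which is what lets GRPO drop the critic. A minimal sketch follows. Using the population standard deviation and returning zeros when all rewards are equal are assumptions made here, since the excerpt specifies neither; the function name is hypothetical.

```python
import statistics
from typing import List

def group_advantages(rewards: List[float]) -> List[float]:
    """A_i = (r_i - mean(r)) / std(r), computed within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumed; sample std is also plausible)
    if std == 0.0:
        # All outputs scored the same, so there is no learning signal (assumed handling).
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # [1.0, -1.0, -1.0, 1.0]
```

Outputs that beat the group average receive a positive advantage and the rest a negative one, so the baseline comes from sibling samples for the same question rather than from a learned value model.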
[原文]Evaluations on Standard Benchmarks. First, we evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on standard benchmarks. Notably, DeepSeek-V2 Chat (SFT) demonstrates substantial improvements in GSM8K, MATH, and HumanEval evaluations compared with its base version. This progress can be attributed to the inclusion of our SFT data, which comprises a considerable volume of math- and code-related content. In addition, DeepSeek-V2 Chat (RL) further boosts the performance on math and code benchmarks. We show more code and math evaluations in Appendix F. As for the comparisons with other mo...
[原文]Amount of SFT Data. The necessity of a large SFT corpus has been a topic of intense debate. Previous works (Young et al., 2024; Zhou et al., 2024) argue that fewer than 10K instances of SFT data are enough to produce satisfactory results. However, in our experiments, we observe a significant performance decline on the IFEval benchmark if we use fewer than 10K instances. A possible explanation is that a language model needs a certain amount of data to develop specific skills. Although the requisite data amount may diminish as the model size increases, it ...
[原文]In this paper, we introduce DeepSeek-V2, a large MoE language model that supports 128K context length. In addition to strong performance, it is also characterized by economical training and efficient inference, benefiting from its innovative architecture including MLA and DeepSeekMoE. In practice, compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. Evaluation results further demonstrate that with only 21B activated parameters, ...
[原文]e dedicated to creating a positive and beneficial impact on society. • Currently, DeepSeek-V2 is designed to support the text modality exclusively. In our forward-looking agenda, we intend to enable our model to support multiple modalities, enhancing its versatility and utility in a wider range of scenarios.
[原文]Research & Engineering Aixin Liu Bingxuan Wang Bo Liu Chenggang Zhao Chengqi Deng Chong Ruan Damai Dai Daya Guo Dejian Yang Deli Chen Erhang Li Fangyun Lin Fuli Luo Guangbo Hao Guanting Chen Guowei Li H. Zhang Hanwei Xu Hao Yang Haowei Zhang Honghui Ding Huajian Xin Huazuo Gao Hui Qu Jianzhong Guo Jiashi Li Jingyang Yuan Junjie Qiu Junxiao Song Kai Dong Kaige Gao Kang Guan Lean Wang Lecong Zhang Liang Zhao Liyue Zhang Mingchuan Zhang Minghua Zhang Minghui Tang Panpan Huang Peiyi Wang Qihao Zhu Qinyu Chen Qiushi Du Ruiqi Ge Ruizhe Pan Runxin Xu Shanghao Lu Shangyan Zhou Shanhuang Chen Shengfeng...
[原文]Su for his helpful discussion on position embedding. We thank all those who have contributed to DeepSeek-V2 but are not mentioned in the paper. DeepSeek believes that innovation, novelty, and curiosity are essential in the path to AGI.
Appendix B DeepSeek-V2-Lite: A 16B Model Equipped with MLA and DeepSeekMoE
B.1 Model Description
[原文]Architectures. DeepSeek-V2-Lite has 27 layers and a hidden dimension of 2048. It also employs MLA and has 16 attention heads, where each head has a dimension of 128. Its KV compression dimension is 512, but slightly different from DeepSeek-V2, it does not compress the queries. For the decoupled queries and key, it has a per-head dimension of 64. DeepSeek-V2-Lite also employs DeepSeekMoE, and all FFNs except for the first layer are replaced with MoE layers. Each MoE layer consists of 2 shared experts and 64 routed experts, where the intermediate hidden dimension of each expert is 1408. Among th...
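To make the effect of these dimensions concrete, the sketch below counts the cached elements per token for the DeepSeek-V2-Lite shape. It is an illustration under stated assumptions rather than a measurement: it assumes MLA caches only the compressed KV latent (dimension 512) plus one shared decoupled key (dimension 64) per layer, and it compares against a hypothetical standard MHA model with the same 27 layers, 16 heads, and head dimension 128. The function name and parameter names are invented for this example.

```python
def kv_cache_elems_per_token(n_layers, n_heads, head_dim,
                             kv_latent_dim, rope_key_dim):
    """Return (mha_elems, mla_elems) cached per token across all layers.

    Assumption: MHA stores full keys and values for every head, while
    MLA stores one compressed latent plus one shared decoupled key.
    """
    mha = 2 * n_heads * head_dim * n_layers
    mla = (kv_latent_dim + rope_key_dim) * n_layers
    return mha, mla


# DeepSeek-V2-Lite dimensions quoted in the architecture description.
mha_elems, mla_elems = kv_cache_elems_per_token(
    n_layers=27, n_heads=16, head_dim=128,
    kv_latent_dim=512, rope_key_dim=64,
)
ratio = mla_elems / mha_elems
print(f"MHA: {mha_elems}, MLA: {mla_elems}, ratio: {ratio:.4f}")
```

Under these assumptions the MLA cache is about 14% of the size of the same-shaped MHA cache; the byte count additionally depends on the storage precision, which is not modeled here.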
[原文]...st 2K steps. Subsequently, the learning rate is multiplied by 0.316 after training about 80% of tokens, and again by 0.316 after training about 90% of tokens. The maximum learning rate is set to $4.2\times 10^{-4}$, and the gradient clipping norm is set to 1.0. We do not employ the batch size scheduling strategy for it, and it is trained with a constant batch size of 4608 sequences. During pre-training, we set the maximum sequence length to 4K, and train DeepSeek-V2-Lite on 5.7T tokens. We leverage pipeline parallelism to deploy different layers of it on different devices, but for e...
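The step-decay schedule described above can be sketched as a small function. The decay factor 0.316, the 80% and 90% thresholds, the 2K-step warm-up length, and the maximum learning rate come from the text; the linear shape of the warm-up, the function name `lite_learning_rate`, and the choice to express progress as a fraction of tokens seen are assumptions made for illustration.

```python
def lite_learning_rate(step, tokens_seen_frac, max_lr=4.2e-4,
                       warmup_steps=2000, decay=0.316):
    """Learning rate at a given optimizer step and training progress.

    step:             0-based optimizer step.
    tokens_seen_frac: fraction of the total training tokens consumed so far.
    """
    if step < warmup_steps:
        # Assumption: linear warm-up; the source only gives its length.
        return max_lr * (step + 1) / warmup_steps
    lr = max_lr
    if tokens_seen_frac >= 0.8:   # first decay after about 80% of tokens
        lr *= decay
    if tokens_seen_frac >= 0.9:   # second decay after about 90% of tokens
        lr *= decay
    return lr


for frac in (0.5, 0.85, 0.95):
    print(frac, lite_learning_rate(step=10_000, tokens_seen_frac=frac))
```

Since 0.316 is close to the square root of 0.1, the two decays together reduce the learning rate by roughly a factor of ten.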
[原文]Base Model. We evaluate the performance of DeepSeek-V2-Lite and compare it with our previous small-size base models in Table 6. DeepSeek-V2-Lite exhibits overwhelming performance advantages, especially in reasoning, coding, and math.
Chat Model. We evaluate the performance of DeepSeek-V2-Lite Chat and compare it with our previous small-size chat models in Table 7. DeepSeek-V2-Lite also outperforms our previous small-size chat models by a large margin.
[原文]In order to demonstrate the complete computation process of MLA, we provide its full formulas in the following:
$$\mathbf{c}_{t}^{Q}=W^{DQ}\mathbf{h}_{t}, \tag{37}$$
$$[\mathbf{q}_{t,1}^{C};\mathbf{q}_{t,2}^{C};\ldots;\mathbf{q}_{t,n_{h}}^{C}]=\mathbf{q}_{t}^{C}=W^{UQ}\mathbf{c}_{t}^{Q},$$
...
[原文]... (42)
$$\boxed{\mathbf{k}_{t}^{R}}=\operatorname{RoPE}(W^{KR}\mathbf{h}_{t}), \tag{43}$$
$$\mathbf{k}_{t,i}=[\mathbf{k}_{t,i}^{C};\mathbf{k}_{t}^{R}], \tag{44}$$
$[\mathbf{v}_{t,1}^{C};\mathbf{v}_{t,2}^{C};\ldots;\mathbf{v}_{t,n_{h}}^{C}]=\mathbf{v}_{t}^{C}$ ...
[原文]can absorb $W^{UK}$ into $W^{UQ}$, and $W^{UV}$ into $W^{O}$. Since this optimization is related to only model parameters, it can be completed offline at once. Through this optimization, we avoid the computational overhead for recomputing $\mathbf{k}_{t}^{C}$ and $\mathbf{v}_{t}^{C}$ during inference.
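The absorption of $W^{UK}$ into $W^{UQ}$ rests on the associativity of matrix multiplication: the score $(W^{UQ}\mathbf{c}^{Q})^{\top}(W^{UK}\mathbf{c}^{KV})$ equals $((W^{UK})^{\top}W^{UQ}\,\mathbf{c}^{Q})^{\top}\mathbf{c}^{KV}$, so the product of the two projection matrices can be precomputed. The pure-Python sketch below checks this identity numerically for a single head. It is a toy illustration under assumptions: the dimensions and helper names are invented, the weights are random, and the decoupled RoPE part and the softmax scaling are left out.

```python
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


rng = random.Random(0)
d_head, d_cq, d_ckv = 4, 3, 2  # toy sizes, not the real model dimensions


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


W_UQ = rand_matrix(d_head, d_cq)    # up-projection for queries (one head)
W_UK = rand_matrix(d_head, d_ckv)   # up-projection for keys (one head)
c_q = [rng.uniform(-1, 1) for _ in range(d_cq)]     # query latent
c_kv = [rng.uniform(-1, 1) for _ in range(d_ckv)]   # cached KV latent

# Naive path: materialize the key from the cached latent at inference time.
score_naive = dot(matvec(W_UQ, c_q), matvec(W_UK, c_kv))

# Absorbed path: fold W_UK into the query projection once, offline,
# then score directly against the cached latent.
W_absorbed = matmul(transpose(W_UK), W_UQ)   # shape (d_ckv, d_cq)
score_absorbed = dot(matvec(W_absorbed, c_q), c_kv)

print(score_naive, score_absorbed)
```

The same argument applies to folding $W^{UV}$ into $W^{O}$ on the output side, since the attention weights only form a linear combination of the cached latents before the up-projection is applied.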
D.1 Ablation of MHA, GQA, and MQA
[原文]...) | 5-shot | 51.6 | 50.9 | 57.9 | 59.2
CMMLU (Acc.) | 5-shot | 52.3 | 53.4 | 60.7 | 62.5
Table 9: Comparison between MLA and MHA on hard benchmarks. DeepSeek-V2 shows better performance than MHA, but requires a significantly smaller amount of KV cache.
[原文]We show the evaluation results for 7B dense models with MHA, GQA, and MQA on four hard benchmarks in Table 8. All three models are trained on 1.33T tokens and share the same architecture except for the attention mechanisms. In addition, for a fair comparison, we align their parameter counts to around 7B by adjusting the number of layers. From the table, we can see that MHA demonstrates significant advantages over GQA and MQA on these benchmarks.
Benchmark (Metric) | # Shots | Dense 7B w/ MQA | Dense 7B w/ GQA (8 Groups) | Dense 7B w/ MHA
# Params | - | 7.1B | 6.9B | 6.9B
BBH (EM) | 3-shot | ...
[原文]In Table 9, we show the evaluation results for MoE models equipped with MLA and MHA, respectively, on four hard benchmarks. For a solid conclusion, we train and evaluate models at two scales. The two small MoE models comprise about 16B total parameters, and we train them on 1.33T tokens. The two large MoE models comprise about 250B total parameters, and we train them on 420B tokens. Within each scale, the two models share the same architecture except for the attention mechanisms. From the table, we can observe that MLA shows better performance than MHA. More importa...
Appendix E Discussion About Pre-Training Data Debiasing
[原文]During pre-training data preparation, we identify and filter out contentious content, such as values influenced by regional cultures, so that our model does not exhibit unnecessary subjective biases on these controversial topics. Consequently, we observe that DeepSeek-V2 performs slightly worse on the test sets that are closely associated with specific regional cultures. For example, when evaluated on MMLU, although DeepSeek-V2 achieves comparable or superior performance on the majority of test sets compared with its competitors like Mixtral 8x22B, it still lags behind on the Humanity-Moral subset, ...
Appendix F Additional Evaluations on Math and Code
[原文]We present our evaluation formats for each benchmark in Table 12 - 37 , respectively. PROMPT 以下是一道中国高考生物选择题,请选择正确的答案。 问题:下列有关高尔基体、线粒体和叶绿体的叙述, 正确的是 选项:(A)三者都存在于蓝藻中 (B)三者都含有 DNA (C)三者都是 ATP 合成的场所 (D)三者的膜结构中都含有蛋白质 答案:从A到D, 我们应选择 Table 12: An example of AGIEval. PROMPT Question: A sample in a cylindrical container has a cylindrical shape and a fixed volume. The state of matter of the sample _ A. must be solid B. could be either solid or liquid C. must be liquid D. could be either liquid or gas Answer: B Question: The speed of sound is generally greatest in _ A. solids and lowest in liquids B. soli...
[原文]ify this expression "Z" as follows: "Z = True and False and not True and True = A and B" where "A = True and False" and "B = not True and True". Let’s evaluate A: A = True and False = False. Let’s evaluate B: B = not True and True = not (True and True) = not (True) = False. Plugging in A and B, we get: Z = A and B = False and False = False. So the answer is False. Q: not not ( not ( False ) ) is A: Let’s think step by step. Remember that (i) expressions inside brackets are always evaluated first and that (ii) the order of operations from highest priority to lowest priority is "not", "and", "or...