[原文]Damai Dai ∗1,2 Chengqi Deng 1 Chenggang Zhao ∗1,3 R.X. Xu 1 Huazuo Gao 1 Deli Chen 1 Jiashi Li 1 Wangding Zeng 1 Xingkai Yu ∗1,4 Y. Wu 1 Zhenda Xie 1 Y.K. Li 1 Panpan Huang 1 Fuli Luo 1 Chong Ruan 1 Zhifang Sui 2 Wenfeng Liang 1 1 DeepSeek-AI 2 National Key Laboratory for Multimedia Information Processing Peking University 3 Institute for Interdisciplinary Information Sciences Tsinghua University 4 National Key Laboratory for Novel Software Technology Nanjing University {daidamai, szf}@pku.edu.cn {wenfeng.liang}@deepseek.com https://github.com/deepseek-ai/DeepSeek-MoE Abstract In the era of la...
DeepSeekMoE: Towards Ultimate Expert Specialization in
Mixture-of-Experts Language Models
[原文]ntly validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations. Figure 1: Comparison between DeepSeekMoE 16B and open source models on the Open LLM Leaderboard. The red dashed line is linearly fitted from data points of all models except DeepSeekMoE 16B. DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin, and achieves comparable performance with LLaMA2 7B, which has approximately 2.5 times the activated parameters. 1 Int...
[原文]wledge. Conventional MoE architectures substitute the Feed-Forward Networks (FFNs) in a Transformer with MoE layers. Each MoE layer consists of multiple experts, with each structurally identical to a standard FFN, and each token is assigned to one (Fedus et al., 2021 ) or two (Lepikhin et al., 2021 ) experts. This architecture manifests two potential issues: (1) Knowledge Hybridity : existing MoE practices often employ a limited number of experts (e.g., 8 or 16), and thus tokens assigned to a specific expert will be likely to cover diverse knowledge. Consequently, the designated expert will in...
[原文]to a more accurate and targeted knowledge acquisition. (2) Shared Expert Isolation: we isolate certain experts to serve as shared experts that are always activated, aiming at capturing and consolidating common knowledge across varying contexts. Through compressing common knowledge into these shared experts, redundancy among other routed experts will be mitigated. This can enhance the parameter efficiency and ensure that each routed expert retains specialized by focusing on distinctive aspects. These architectural innovations in DeepSeekMoE offer opportunities to train a parameter-efficient MoE...
MoE with open source models and the evaluations demonstrate that DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin, and achieves comparable performance with LLaMA2 7B (Touvron et al., 2023b), which has approximately 2.5 times the activated parameters. Figure 1 demonstrates the evaluation results on the Open LLM Leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Additionally, we conduct supervised fine-tuning (SFT) for alignment, transforming the model into a chat model. Evaluation results show tha...
[原文]g a comparable performance with DeepSeek 67B. • Alignment for MoE. We successfully perform supervised fine-tuning on DeepSeekMoE 16B to create an aligned chat model, showcasing the adaptability and versatility of DeepSeekMoE 16B. • Public Release. In the spirit of open research, we release the model checkpoint of DeepSeekMoE 16B to the public. Notably, this model can be deployed on a single GPU with 40GB of memory without the need for quantization. 2 Preliminaries: Mixture-of-Experts for Transformers We first introduce a generic MoE architecture commonly used in Transformer language models. A ...
1 Introduction
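Recent research and practice have empirically demonstrated that, with sufficient training data available, scaling language models with increased parameters and computational budgets can yield remarkably stronger models (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023a; Hoffmann et al., 2022). It must be acknowledged, however, that scaling models to an extremely large size also incurs exceedingly high computational costs. Given these costs, the Mixture-of-Experts (MoE) architecture (Jacobs et al., 1991; Jordan and Jacobs, 1994; Shazeer et al., 2017) has emerged as a popular solution: it enables parameter scaling while keeping computational costs at a modest level. Recent applications of MoE architectures to Transformers (Vaswani et al., 2017) have successfully scaled language models to a substantial size (Fedus et al., 2021; Lepikhin et al., 2021; Du et al., 2022; Zoph, 2022) with remarkable performance. These achievements underscore the considerable potential and promise of MoE language models. Despite this promise, existing MoE architectures potentially suffer from knowledge hybridity and knowledge redundancy, which limit expert specialization, i.e., each expert acquiring non-overlapping and focused knowledge. Conventional MoE architectures substitute the Feed-Forward Networks (FFNs) in a Transformer with MoE layers. Each MoE layer consists of multiple experts, each structurally identical to a standard FFN, and each token is assigned to one (Fedus et al., 2021) or two (Lepikhin et al., 2021) experts. This architecture manifests two potential issues: (1) Knowledge Hybridity: existing MoE practices often employ a limited number of experts (e.g., 8 or 16), and thus tokens assigned to a specific expert are likely to cover diverse knowledge. Consequently, the designated expert will tend to assemble vastly different types of knowledge in its parameters, which are hard to utilize simultaneously.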
[原文]Recent research and practices have empirically demonstrated that, with sufficient training data available, scaling language models with increased parameters and computational budgets can yield remarkably stronger models (Brown et al., 2020 ; OpenAI, 2023 ; Touvron et al., 2023a ; Hoffmann et al., 2022 ) . It is imperative to acknowledge, however, that the endeavor to scale models to an extremely large scale is also associated with exceedingly high computational costs. Considering the substantial costs, the Mixture-of-Experts (MoE) architecture (Jacobs et al., 1991 ; Jordan and Jacobs, 1994 ; S...
[原文]semble vastly different types of knowledge in its parameters, which are hard to utilize simultaneously. (2) Knowledge Redundancy : tokens assigned to different experts may require common knowledge. As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, thereby leading to redundancy in expert parameters. These issues collectively hinder the expert specialization in existing MoE practices, preventing them from reaching the theoretical upper-bound performance of MoE models. In response to the aforementioned issues, we introduce DeepSeekMoE , an in...
model where each expert is highly specialized. Starting from a modest scale with 2B parameters, we validate the advantages of the DeepSeekMoE architecture. We conduct evaluations on 12 zero-shot or few-shot benchmarks spanning diverse tasks. Empirical results indicate that DeepSeekMoE 2B surpasses GShard 2B (Lepikhin et al., 2021) by a substantial margin, and even matches GShard 2.9B, a larger MoE model with $1.5\times$ expert parameters and computation. Remarkably, we find that DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with an equivalent number of parameters,...
[原文]MoE Chat 16B also achieves comparable performance with DeepSeek Chat 7B and LLaMA2 SFT 7B in the chat setting. Encouraged by these results, we further undertake a preliminary endeavor to scale up DeepSeekMoE to 145B. The experimental results still validate its substantial advantages over the GShard architecture consistently. In addition, it shows performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations. Our contributions are summarized as follows: • Architectural Innovation. We introduce DeepSeekMoE, an innovative MoE architecture aiming at achieving ultima...
2 Preliminaries: Mixture-of-Experts for Transformers
We first introduce a generic MoE architecture commonly used in Transformer language models. A standard Transformer language model is constructed by stacking $L$ layers of standard Transformer blocks, where each block can be represented as follows:
$$\mathbf{u}_{1:T}^{l} = \operatorname{Self-Att}\left(\mathbf{h}_{1:T}^{l-1}\right) + \mathbf{h}_{1:T}^{l-1}, \quad (1)$$
$\mathbf{h}_{t}^{l}$ ...
tate $\mathbf{h}_{t}^{l}$ is expressed as:
$$\mathbf{h}_{t}^{l} = \sum_{i=1}^{N}\left(g_{i,t}\operatorname{FFN}_{i}\left(\mathbf{u}_{t}^{l}\right)\right) + \mathbf{u}_{t}^{l}, \quad (3)$$
$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid 1 \leqslant j \leqslant N\}, K), \\ 0, & \text{otherwise}, \end{cases}$$
...
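The generic MoE layer above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy "experts" (simple scaling functions), and the centroid values are all hypothetical, and the affinity is assumed to be a softmax over the dot products between the token and each expert centroid, as in the softmax equations that appear elsewhere in this document.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk_gates(affinities, k):
    """Keep the k largest token-to-expert affinities and zero out the rest."""
    ranked = sorted(range(len(affinities)), key=lambda i: affinities[i], reverse=True)
    kept = set(ranked[:k])
    return [s if i in kept else 0.0 for i, s in enumerate(affinities)]

def moe_layer(u, centroids, experts, k):
    """Return (h, gates) for one token: h = sum_i g_i * FFN_i(u) + u."""
    logits = [sum(a * b for a, b in zip(u, e)) for e in centroids]
    gates = topk_gates(softmax(logits), k)
    h = list(u)  # residual connection
    for gate, ffn in zip(gates, experts):
        if gate > 0.0:  # only the selected experts are evaluated
            h = [hi + gate * oi for hi, oi in zip(h, ffn(u))]
    return h, gates

# Toy usage: four hypothetical "experts" that just scale their input.
u = [1.0, 0.0]
centroids = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0, 4.0)]
h, gates = moe_layer(u, centroids, experts, k=2)
```

Because the gate vector is sparse, only `k` of the experts are ever called for a token, which is where the computational saving of an MoE layer comes from.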
[原文]On top of the generic MoE architecture outlined in Section 2 , we introduce DeepSeekMoE, which is specifically designed to exploit the potential of expert specialization. As illustrated in Figure 2 , our architecture incorporates two principal strategies: fine-grained expert segmentation and shared expert isolation. Both of these strategies are designed to elevate the level of expert specialization. 3.1 Fine-Grained Expert Segmentation In scenarios where the number of experts is limited, tokens assigned to a particular expert will be more likely to cover diverse types of knowledge. As a conseq...
$$\mathbf{h}_{t}^{l} = \sum_{i=1}^{mN}\left(g_{i,t}\operatorname{FFN}_{i}\left(\mathbf{u}_{t}^{l}\right)\right) + \mathbf{u}_{t}^{l}, \quad (6)$$
$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid 1 \leqslant j \leqslant mN\}, mK), \\ 0, & \text{otherwise}, \end{cases} \quad (7)$$
$s_{i,t}$ ...
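3.1 Fine-Grained Expert Segmentation
In scenarios where the number of experts is limited, tokens assigned to a particular expert are more likely to cover diverse types of knowledge. As a consequence, the designated expert will try to learn vastly different types of knowledge in its parameters, and these are hard to utilize simultaneously. However, if each token can be routed to more experts, diverse knowledge gains the potential to be decomposed and learned by different experts separately. In this setting, each expert can still retain a high level of specialization, contributing to a more focused knowledge distribution across experts. In pursuit of this goal, while keeping the number of expert parameters and the computational cost constant, we segment the experts at a finer granularity. Finer expert segmentation enables a more flexible and adaptable combination of activated experts. Specifically, on top of the typical MoE architecture shown in Figure 2(a), we segment each expert FFN into $m$ smaller experts by reducing the FFN intermediate hidden dimension to $1/m$ of its original size. Since each expert becomes smaller, we correspondingly increase the number of activated experts $m$-fold to keep the computation cost the same, as illustrated in Figure 2(b).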
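The bookkeeping behind fine-grained segmentation can be checked with a short calculation. The sketch below is illustrative only: it assumes an FFN made of two weight matrices with biases ignored, an intermediate dimension `d_ff` of 5120, and a configuration of 16 experts with top-2 routing split by a factor of 4. None of these values are stated in the passage above, apart from the hidden dimension 1280 used in the validation experiments.

```python
import math

def ffn_params(d_model, d_ff):
    """Weights in a two-matrix FFN (d_model -> d_ff -> d_model), biases ignored."""
    return 2 * d_model * d_ff

def segment(num_experts, top_k, d_ff, m):
    """Split every expert into m smaller ones and activate m times as many."""
    if d_ff % m != 0:
        raise ValueError("d_ff must be divisible by m")
    return num_experts * m, top_k * m, d_ff // m

d_model, d_ff = 1280, 5120   # d_ff is an illustrative assumption
N, K, m = 16, 2, 4           # illustrative configuration
fine_N, fine_K, fine_d_ff = segment(N, K, d_ff, m)

# Total and activated expert parameters are unchanged by the split.
total_before = N * ffn_params(d_model, d_ff)
total_after = fine_N * ffn_params(d_model, fine_d_ff)
active_before = K * ffn_params(d_model, d_ff)
active_after = fine_K * ffn_params(d_model, fine_d_ff)

# The number of possible expert combinations per token grows sharply.
combos_before = math.comb(N, K)            # 120
combos_after = math.comb(fine_N, fine_K)   # 4426165368
```

Under these assumptions the parameter and compute budgets stay fixed while the number of selectable expert combinations rises from 120 to about 4.4 billion, which is the sense in which the combination of activated experts becomes more flexible.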
ormation. As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, thereby resulting in redundancy in expert parameters. However, if there are shared experts dedicated to capturing and consolidating common knowledge across varying contexts, the parameter redundancy among other routed experts will be alleviated. This alleviation of redundancy will contribute to a more parameter-efficient model with more specialized experts. Towards this objective, in addition to the fine-grained expert segmentation strategy, we further isolate $K_{s}$ ...
$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid K_{s}+1 \leqslant j \leqslant mN\}, mK-K_{s}), \\ 0, & \text{otherwise}, \end{cases} \quad (10)$$
$$s_{i,t} = \operatorname{Softmax}_{i}\left({\mathbf{u}_{t}^{l}}^{T}\mathbf{e}_{i}^{l}\right). \quad (11)$$
Finally, in DeepSeekMoE, the number of shared experts is $K_{s}$, the total number of routed experts is $mN-K_{s}$, and the number of nonzero gates is $mK-K_{s}$ ...
3 DeepSeekMoE Architecture
$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid K_{s}+1 \leqslant j \leqslant mN\}, mK-K_{s}), \\ 0, & \text{otherwise}, \end{cases} \quad (10)$$
$$s_{i,t} = \operatorname{Softmax}_{i}\left({\mathbf{u}_{t}^{l}}^{T}\mathbf{e}_{i}^{l}\right). \quad (11)$$
Finally, in DeepSeekMoE, the number of shared experts is $K_{s}$, the total number of routed experts is $mN-K_{s}$, and the number of nonzero gates is $mK-K_{s}$. It is worth noting that the prototype of shared expert isolation can be credited to Rajbhandari et al. (2022). The key distinction is that they derive this strategy from an engineering perspective, while we approach it from an algorithmic standpoint.
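The routing rule in Equations (10) and (11) can be sketched as follows. This is a hedged illustration, not the released implementation: the function names and toy experts are hypothetical, the softmax is assumed to run over routed experts only, and shared experts are assumed to contribute with an implicit weight of 1 because they are always activated.

```python
import math

def routed_gates(logits, num_routed_active):
    """Softmax affinities over routed experts, keeping only the top entries."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:num_routed_active])
    return [s if i in kept else 0.0 for i, s in enumerate(scores)]

def deepseekmoe_layer(u, shared_experts, routed_experts, centroids, m_k):
    """One token through K_s shared experts plus (mK - K_s) routed experts."""
    k_s = len(shared_experts)
    h = list(u)  # residual connection
    for ffn in shared_experts:  # always active (assumed ungated)
        h = [hi + oi for hi, oi in zip(h, ffn(u))]
    logits = [sum(a * b for a, b in zip(u, e)) for e in centroids]
    gates = routed_gates(logits, m_k - k_s)
    for gate, ffn in zip(gates, routed_experts):
        if gate > 0.0:
            h = [hi + gate * oi for hi, oi in zip(h, ffn(u))]
    return h, gates

# Toy usage: 1 shared expert, 6 routed experts, mK = 3 activated in total.
u = [1.0]
shared = [lambda x: [10.0]]
routed = [lambda x, c=c: [float(c)] for c in range(6)]
centroids = [[0.1 * i] for i in range(6)]
h, gates = deepseekmoe_layer(u, shared, routed, centroids, m_k=3)
```

With one shared expert and `m_k=3`, exactly two routed experts receive a nonzero gate, so the total number of experts evaluated per token stays at `m_k`.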
$$\ldots\,\frac{\cdots}{\cdots\,T}\sum_{t=1}^{T}\mathds{1}(\text{Token } t \text{ selects Expert } i), \quad (13)$$
$$P_{i} = \frac{1}{T}\sum_{t=1}^{T} s_{i,t}, \quad (14)$$
where $\alpha_{1}$ is a hyper-parameter called expert-level balance factor, $N^{\prime}$ is equal to $(mN-K_{s})$ and $K^{\prime}$ is equal to $(mK-K_{s})$ for brevity. $\mathds{1}(\cdot)$ denotes the indicator function. D...
$\alpha_{2}$ is a hyper-parameter called device-level balance factor. In practice, we set a small expert-level balance factor to mitigate the risk of routing collapse, and meanwhile set a larger device-level balance factor to promote balanced computation across the devices.
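To make the two balance losses concrete, the sketch below computes both on a toy batch. Several details are assumptions, because the corresponding equations are truncated in this document: the selection frequency is taken to be scaled by $N^{\prime}/(K^{\prime}T)$, and the device-level statistics are taken to be the mean of $f$ and the sum of $P$ over the experts placed on each device. Function names and toy values are hypothetical.

```python
def load_stats(selected, scores, k_active):
    """Per-expert selection frequency f and mean affinity P over a batch.

    selected[t] is the set of routed experts chosen by token t,
    scores[t][i] is the affinity s_{i,t}, k_active plays the role of K'.
    """
    num_tokens = len(scores)
    num_routed = len(scores[0])  # plays the role of N'
    scale = num_routed / (k_active * num_tokens)  # assumed prefactor
    f = [scale * sum(1 for chosen in selected if i in chosen)
         for i in range(num_routed)]
    p = [sum(row[i] for row in scores) / num_tokens for i in range(num_routed)]
    return f, p

def expert_balance_loss(f, p, alpha1):
    """alpha1 * sum_i f_i * P_i over routed experts."""
    return alpha1 * sum(fi * pi for fi, pi in zip(f, p))

def device_balance_loss(f, p, groups, alpha2):
    """alpha2 * sum over devices of f'_i * P'_i (aggregation is assumed)."""
    total = 0.0
    for group in groups:
        f_dev = sum(f[j] for j in group) / len(group)
        p_dev = sum(p[j] for j in group)
        total += f_dev * p_dev
    return alpha2 * total

# Toy batch: 2 tokens, 4 routed experts, 2 active per token, 2 devices.
groups = [[0, 1], [2, 3]]
balanced = load_stats([{0, 1}, {2, 3}], [[0.25] * 4, [0.25] * 4], k_active=2)
collapsed = load_stats([{0, 1}, {0, 1}], [[0.4, 0.4, 0.1, 0.1]] * 2, k_active=2)

balanced_expert = expert_balance_loss(*balanced, alpha1=0.01)
collapsed_expert = expert_balance_loss(*collapsed, alpha1=0.01)
balanced_device = device_balance_loss(*balanced, groups, alpha2=0.05)
collapsed_device = device_balance_loss(*collapsed, groups, alpha2=0.05)
```

In this toy batch both losses are lowest when tokens are spread evenly and grow when routing concentrates on the same experts, which is the behaviour the balance factors are meant to penalize. The `alpha2` value here is purely illustrative.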
[原文]In scenarios where the number of experts is limited, tokens assigned to a particular expert will be more likely to cover diverse types of knowledge. As a consequence, the designated expert will intend to learn vastly different types of knowledge in its parameters, and they are hard to be simultaneously utilized. However, if each token can be routed to more experts, diverse knowledge will gain the potential to be decomposed and learned in different experts respectively. In this context, each expert can still retain a high level of expert specialization, contributing to a more focused knowledge ...
$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid 1 \leqslant j \leqslant mN\}, mK), \\ 0, & \text{otherwise}, \end{cases} \quad (7)$$
$$s_{i,t} = \operatorname{Softmax}_{i}\left({\mathbf{u}_{t}^{l}}^{T}\mathbf{e}_{i}^{l}\right), \quad (8)$$
where the total number of expert parameters is equal to $N$ times the number of parameters in a standard FFN, and $mN$ denotes the total number of fine-grained expert...
[原文]With a conventional routing strategy, tokens assigned to different experts may necessitate some common knowledge or information. As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, thereby resulting in redundancy in expert parameters. However, if there are shared experts dedicated to capturing and consolidating common knowledge across varying contexts, the parameter redundancy among other routed experts will be alleviated. This alleviation of redundancy will contribute to a more parameter-efficient model with more specialized experts. Toward...
$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid K_{s}+1 \leqslant j \leqslant mN\}, mK-K_{s}), \\ 0, & \text{otherwise}, \end{cases} \quad (10)$$
$$s_{i,t} = \operatorname{Softmax}_{i}\left({\mathbf{u}_{t}^{l}}^{T}\mathbf{e}_{i}^{l}\right). \quad (11)$$
Finally, in DeepSeekMoE, the number of shared experts is $K_{s}$, the total numbe...
Automatically learned routing strategies may encounter the issue of load imbalance, which manifests two notable defects. Firstly, there is a risk of routing collapse (Shazeer et al., 2017), i.e., the model always selects only a few experts, preventing other experts from sufficient training. Secondly, if experts are distributed across multiple devices, load imbalance can exacerbate computation bottlenecks. Expert-Level Balance Loss. In order to mitigate the risk of routing collapse, we also employ an expert-level balance loss. The computation of the balance loss is as follows: $\mathcal{L}_{\mathrm{ExpBal}}$ ...
mance. Instead, our primary objective is to ensure balanced computation across the devices. If we partition all routed experts into $D$ groups $\{\mathcal{E}_{1},\mathcal{E}_{2},...,\mathcal{E}_{D}\}$, and deploy each group on a single device, the device-level balance loss is computed as follows:
$$\mathcal{L}_{\mathrm{DevBal}} = \alpha_{2}\sum_{i=1}^{D} f_{i}^{\prime} P_{i}^{\prime},$$
...
[原文]4.1 Experimental Setup 4.1.1 Training Data and Tokenization Our training data is sampled from a large-scale multilingual corpus created by DeepSeek-AI. The corpus primarily focuses on English and Chinese but also encompasses other languages. It is derived from diverse sources, including web text, mathematical material, coding scripts, published literature, and various other textual materials. For the purpose of validation experiments, we sample a subset containing 100B tokens from the corpus to train our models. For tokenization, we utilize the HuggingFace Tokenizer 2 2 2 https://github.com/hu...
[原文]ers Model Settings. In the validation experiments, we set the number of Transformer layers to 9 and the hidden dimension to 1280. We employ the multi-head attention mechanism with a total of 10 attention heads, where each head has a dimension of 128. For initialization, all learnable parameters are randomly initialized with a standard deviation of 0.006. We substitute all FFNs with MoE layers, and ensure that the total number of expert parameters equals 16 times that of a standard FFN. Additionally, we keep the activated expert parameters, including shared expert parameters and activated route...
[原文]we do not drop any tokens during training and do not employ the device-level balance loss. In order to prevent routing collapse, we set an expert-level balance factor of 0.01. For readability, we also present an overview table of hyper-parameters for DeepSeekMoE across different sizes in Appendix A . 4.1.4 Evaluation Benchmarks We conduct evaluations on a wide range of benchmarks covering various types of tasks. We list the benchmarks as follows. Language Modeling. For language modeling, we evaluate the models on the test set of Pile (Gao et al., 2020 ) , and the evaluation metric is the cross...
4 Validation Experiments
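During training, we do not drop any tokens and do not employ the device-level balance loss. To prevent routing collapse, we set the expert-level balance factor to 0.01. For readability, we also present an overview table of hyper-parameters for DeepSeekMoE across different sizes in Appendix A.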
[原文]ACE-high (Acc.) 5-shot 29.0 30.0 30.9 30.4 31.7 HumanEval (Pass@1) 0-shot 0.0 1.2 2.4 3.7 4.9 MBPP (Pass@1) 3-shot 0.2 0.6 0.4 0.2 2.2 TriviaQA (EM) 5-shot 4.9 6.5 8.9 10.2 16.6 NaturalQuestions (EM) 5-shot 1.4 1.4 2.5 3.2 5.7 Table 1: Evaluation results for validation experiments. Bold font indicates the best. Compared with other MoE architectures, DeepSeekMoE exhibits a substantial performance advantage. 4.2 Evaluations Baselines. Including DeepSeekMoE, we compare five models for validation experiments. Dense denotes a standard dense Transformer language model with 0.2B total parameters. Has...
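4.1.4 Evaluation Benchmarks
We conduct evaluations on a wide range of benchmarks covering various types of tasks. We list the benchmarks as follows.
Language Modeling. For language modeling, we evaluate the models on the test set of Pile (Gao et al., 2020), and the evaluation metric is the cross-entropy loss.
Language Understanding and Reasoning. For language understanding and reasoning, we consider HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC-challenge and ARC-easy (Clark et al., 2018). The evaluation metric for these tasks is accuracy.
Reading Comprehension. For reading comprehension, we use RACE-high and RACE-middle (Lai et al., 2017), and the evaluation metric is accuracy.
Code Generation. For code generation, we evaluate the models on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). The evaluation metric is Pass@1, which represents the pass rate with only one generation attempt.
Closed-Book Question Answering. For closed-book question answering, we consider TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019). The evaluation metric is the exact match (EM) rate.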
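As a rough illustration of the two less familiar metrics, the sketch below computes an EM rate and a Pass@1 rate from toy inputs. It is not the evaluation harness used in the paper: the normalization (lowercasing and whitespace collapsing), the single reference answer per question, and the function names are all simplifying assumptions.

```python
def exact_match_rate(predictions, references):
    """Share of predictions equal to the reference after light normalization."""
    def normalize(text):
        return " ".join(text.lower().split())
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

def pass_at_1(passed):
    """Share of problems whose single generated solution passed its tests."""
    return sum(passed) / len(passed)

em = exact_match_rate(["Paris", " rome "], ["paris", "Milan"])
p1 = pass_at_1([True, False, False, True])
```

Pass@1 here is simply the fraction of problems solved by the one generated attempt, matching the description of a single generation attempt above.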
re it with larger baselines with more total parameters or activated parameters. The comparisons enable us to estimate the required model size of GShard or dense baselines to achieve equivalent performance to DeepSeekMoE. Comparison with GShard$\times 1.5$. Table 2 shows the comparison between DeepSeekMoE and a larger GShard model with 1.5 times the expert size, which results in 1.5 times both expert parameters and expert computation. Overall, we observe that DeepSeekMoE achieves comparable performance with GShard$\times 1.5$, underscoring the significant advan...
so, we provide additional comparisons with Dense$\times 4$ in Appendix B. Figure 3: Ablation studies for DeepSeekMoE. The performance is normalized by the best performance for clarity in presentation. All compared models have the same number of parameters and activated parameters. We can find that fine-grained expert segmentation and shared expert isolation both contribute to stronger overall performance. 4.4 Ablation Studies In order to substantiate the effectiveness of the fine-grained expert segmentation and shared expert isolation strategies, we conduct ablation studies for De...
[原文]ted experts. Based on the finest granularity with 64 total experts and keeping the number of total experts and activated experts constant, we attempt to isolate 1, 2, and 4 experts as shared ones. We find that different ratios of the shared experts and routed experts do not significantly impact the performance, and 1, 2, and 4 shared experts achieve a Pile loss of 1.808, 1.806, and 1.811, respectively. Considering that the ratio of 1:3 yields a marginally better Pile loss, when scaling up DeepSeekMoE, we keep the ratio between shared experts and activated routed experts as 1:3. 4.5 Analysis on...
[原文]mong its expert parameters, so it can buffer the performance drop when top routed experts are disabled. Shared Experts Are Irreplaceable by Routed Experts. In order to investigate the role of the shared expert in DeepSeekMoE, we disable it and activate one more routed expert. The evaluation on Pile shows a significant increase in the Pile loss, rising from 1.808 to 2.414, even though we maintain the same computational cost. This result highlights the crucial function of the shared expert and indicates that the shared expert captures fundamental and essential knowledge not shared with routed ex...
[原文]erts are activated. The evaluation results shown in Figure 6 demonstrate that, even with the same total expert parameters and only half of the activated expert parameters, DeepSeekMoE still outperforms GShard. This highlights the ability of DeepSeekMoE to leverage expert parameters more efficiently, i.e., the proportion of effective parameters in the activated experts is much higher than that of GShard.
4.1.1 Training Data and Tokenization Our training data is sampled from a large-scale multilingual corpus created by DeepSeek-AI. The corpus primarily focuses on English and Chinese but also encompasses other languages. It is derived from diverse sources, including web text, mathematical material, coding scripts, published literature, and various other textual materials. For the purpose of validation experiments, we sample a subset containing 100B tokens from the corpus to train our models. For tokenization, we utilize the HuggingFace Tokenizer (https://github.com/huggingface/tokenizers) to...
4.1 Experimental Setup
[原文]the validation experiments, we set the number of Transformer layers to 9 and the hidden dimension to 1280. We employ the multi-head attention mechanism with a total of 10 attention heads, where each head has a dimension of 128. For initialization, all learnable parameters are randomly initialized with a standard deviation of 0.006. We substitute all FFNs with MoE layers, and ensure that the total number of expert parameters equals 16 times that of a standard FFN. Additionally, we keep the activated expert parameters, including shared expert parameters and activated routed expert parameters, as...
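4.1.1 Training Data and Tokenization
Our training data is sampled from a large-scale multilingual corpus created by DeepSeek-AI. The corpus primarily focuses on English and Chinese but also encompasses other languages. It is derived from diverse sources, including web text, mathematical material, coding scripts, published literature, and various other textual materials. For the validation experiments, we sample a subset containing 100B tokens from the corpus to train our models. For tokenization, we use the HuggingFace Tokenizer tools (https://github.com/huggingface/tokenizers) to train byte pair encoding (BPE) (Sennrich et al., 2016) tokenizers on a smaller subset of the training corpus. In the validation experiments, we prepare a tokenizer with a vocabulary size of 8K, and the vocabulary size is scaled up when training larger models.
4.1.2 Infrastructures
We conduct experiments based on HAI-LLM (High-Flyer, 2023), an efficient and light-weight training framework that integrates multiple parallelism strategies, including tensor parallelism (Shoeybi et al., 2019; Narayanan et al., 2021; Korthikanti et al., 2023), ZeRO data parallelism (Rajbhandari et al., 2020), PipeDream pipeline parallelism (Harlap et al., 2018), and, more specifically, expert parallelism (Lepikhin et al., 2021) implemented by combining data and tensor parallelism. To optimize performance, we develop GPU kernels with CUDA and Triton (Tillet et al., 2019) for the gating algorithms and for fusing computations across linear layers in different experts. All experiments are carried out on clusters equipped with NVIDIA A100 or H800 GPUs. Each node in the A100 cluster contains 8 GPUs connected pairwise via the NVLink bridge.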
[原文]ens during training and do not employ the device-level balance loss. In order to prevent routing collapse, we set an expert-level balance factor of 0.01. For readability, we also present an overview table of hyper-parameters for DeepSeekMoE across different sizes in Appendix A . 4.1.4 Evaluation Benchmarks We conduct evaluations on a wide range of benchmarks covering various types of tasks. We list the benchmarks as follows. Language Modeling. For language modeling, we evaluate the models on the test set of Pile (Gao et al., 2020 ) , and the evaluation metric is the cross-entropy loss. Languag...
4.3 DeepSeekMoE Aligns Closely with the upper bound of MoE Models
We have demonstrated that DeepSeekMoE outperforms the dense baseline and other MoE architectures. In order to provide a more precise understanding of the performance of DeepSeekMoE, we compare it with larger baselines with more total parameters or activated parameters. The comparisons enable us to estimate the required model size of GShard or dense baselines to achieve equivalent performance to DeepSeekMoE. Comparison with GShard$\times 1.5$. Table 2 shows the comparison between DeepSeekMoE and a larger GShard model with 1.5 times the expert size, which results in 1.5 times bo...
results suggest that, at least at the scale of about 2B parameters and 100B training tokens, the performance of DeepSeekMoE aligns closely with the theoretical upper bound of MoE models. Also, we provide additional comparisons with Dense$\times 4$ in Appendix B. Figure 3: Ablation studies for DeepSeekMoE. The performance is normalized by the best performance for clarity in presentation. All compared models have the same number of parameters and activated parameters. We can find that fine-grained expert segmentation and shared expert isolation both contribute to stronger overall p...
[原文]In order to substantiate the effectiveness of the fine-grained expert segmentation and shared expert isolation strategies, we conduct ablation studies for DeepSeekMoE and present the results in Figure 3 . For a fair comparison, we ensure all models included in the comparison have the same number of total parameters and activated parameters. Shared Expert Isolation. In order to evaluate the influence of the shared expert isolation strategy, we isolate one expert as the shared one based on GShard. From Figure 3 , we observe that compared with GShard, the intentional isolation of a shared expert ...
[原文]lds a marginally better Pile loss, when scaling up DeepSeekMoE, we keep the ratio between shared experts and activated routed experts as 1:3.
4.5 Analysis on Expert Specialization
[原文]In this section, we conduct an empirical analysis on the expert specialization of DeepSeekMoE 2B. DeepSeekMoE 2B in this section refers to the model reported in Table 1 , i.e., comprising 2.0B total parameters, with 1 shared expert and 7 out of 63 routed experts being activated. Figure 4: Pile loss with regard to different ratios of disabled top routed experts. Notably, DeepSeekMoE exhibits greater sensitivity to the ratio of disabled top routed experts, indicating lower redundancy among routed experts in DeepSeekMoE. DeepSeekMoE Exhibits Lower Redundancy Among Routed Experts. In order to asse...
[原文]aceable by routed ones. Figure 5: Pile loss with regard to different numbers of activated routed experts in DeepSeekMoE. With only 4 routed experts activated, DeepSeekMoE achieves a Pile loss comparable with GShard. Figure 6: Comparison between GShard and DeepSeekMoE with half the activated experts (trained from scratch). With the same total expert parameters and only half of the activated expert parameters, DeepSeekMoE still outperforms GShard. DeepSeekMoE Acquires Knowledge More Accurately. In order to validate our claim that higher flexibility in combining activated experts contributes to a...
5 Scaling up to DeepSeekMoE 16B
[原文]With the DeepSeekMoE architecture, we scale up our MoE model to a larger scale with 16B total parameters and train it on 2T tokens. Our results demonstrate that compared with LLaMA2 7B, DeepSeekMoE 16B achieves superior performance with only about 40% of computations. 5.1 Experimental Setup 5.1.1 Training Data and Tokenization We sample the training data from the same corpus as described in Section 4.1.1 . Different from the validation experiments, we sample a larger amount of data with 2T tokens, aligning with the number of training tokens of LLaMA2 7B. We also use the HuggingFace Tokenizer t...
2}=0.95, and $\mathrm{weight\_decay}=0.1$. The learning rate is also scheduled using a warmup-and-step-decay strategy. Initially, the learning rate linearly increases from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 at 80% of the training steps, and again by 0.316 at 90% of the training steps. The maximum learning rate for DeepSeekMoE 16B is set to $4.2\times 10^{-4}$, and the gradient clipping norm is set to 1.0. The batch size is set to 4.5K, and with a ma...
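The warmup-and-step-decay schedule described above can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, the decay is assumed to take effect exactly at the 80% and 90% step boundaries, and the total of 100,000 steps in the usage lines is a placeholder, since the actual number of training steps is not visible in this excerpt.

```python
def learning_rate(step, total_steps, max_lr=4.2e-4, warmup_steps=2000, decay=0.316):
    """Linear warmup to max_lr, then multiply by `decay` at 80% and 90% of training."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    lr = max_lr
    # Integer comparisons avoid floating-point surprises at the boundaries.
    if step * 10 >= total_steps * 8:
        lr *= decay
    if step * 10 >= total_steps * 9:
        lr *= decay
    return lr

total = 100_000  # placeholder value, not taken from the paper
lr_start = learning_rate(0, total)
lr_mid_warmup = learning_rate(1_000, total)
lr_plateau = learning_rate(50_000, total)
lr_after_first_decay = learning_rate(80_000, total)
lr_after_second_decay = learning_rate(95_000, total)
```

Since 0.316 is close to the square root of 0.1, the two decay steps together reduce the learning rate by roughly a factor of ten.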
[原文]we additionally consider DROP (Dua et al., 2019 ) . The evaluation metric is the Exactly Matching (EM) rate. Math Reasoning. For math reasoning, we additionally incorporate GSM8K (Cobbe et al., 2021 ) and MATH (Hendrycks et al., 2021 ) , using EM as the evaluation metric. Multi-Subject Multiple-Choice. For multi-subject multiple-choice, we additionally evaluate the models on MMLU (Hendrycks et al., 2020 ) . The evaluation metric is accuracy. Disambiguation. For disambiguation, we additionally consider WinoGrande (Sakaguchi et al., 2019 ) and the evaluation metric is accuracy. Chinese Benchmark...
[原文]ile DeepSeek 7B has 2.5B attention parameters). Our earlier investigation on DeepSeek 7B reveals a positive correlation between the attention capacity and performance on multiple-choice tasks. For example, DeepSeek 7B MQA, which is equipped with the multi-query attention mechanism (Shazeer, 2019 ) , also struggled in MMLU-like tasks. In addition, for a more comprehensive understanding of the training process of DeepSeekMoE 16B, we also provide the benchmark curves of DeepSeekMoE 16B and DeepSeek 7B (Dense) during training in Appendix C for reference. Critically, due to the modest number of par...
[原文]model with 6.7B parameters. Both DeepSeekMoE 16B and LLaMA2 7B are pretrained on 2T tokens. Compared with LLaMA2 7B, DeepSeekMoE has 245% of total parameters but only needs 39.6% of computations. The results on our internal benchmarks are presented in Table 4 , leading to the following observations. (1) Among the evaluated benchmarks, with only about 40% of computations, DeepSeekMoE 16B outperforms LLaMA2 7B on the majority of benchmarks. (2) The math reasoning and code generation capabilities of DeepSeekMoE 16B are stronger than LLaMA2 7B, attributed to the enriched presence of mathematical a...
[原文]5.1.1 Training Data and Tokenization We sample the training data from the same corpus as described in Section 4.1.1 . Different from the validation experiments, we sample a larger amount of data with 2T tokens, aligning with the number of training tokens of LLaMA2 7B. We also use the HuggingFace Tokenizer tools to train a BPE tokenizer, but the vocabulary size is set to 100K for DeepSeekMoE 16B. 5.1.2 Hyper-Parameters Model Settings. For DeepSeekMoE 16B, we set the number of Transformer layers to 28 and the hidden dimension to 2048. We employ the multi-head attention mechanism with a total of ...
[原文]rate is multiplied by 0.316 at 80% of the training steps, and again by 0.316 at 90% of the training steps. The maximum learning rate for DeepSeekMoE 16B is set to $4.2\times 10^{-4}$, and the gradient clipping norm is set to 1.0. The batch size is set to 4.5K, and with a maximum sequence length of 4K, each training batch contains 18M tokens. Correspondingly, the total number of training steps is set to 106,449 to achieve 2T training tokens. Due to the abundance of training data, we do not use dropout during training. We leverage pipeline parallelism to deploy d...
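The token budget quoted above can be checked with a few lines of arithmetic. The sketch assumes that "4.5K" and "4K" mean 4608 sequences and 4096 tokens, which are the exact values listed in Table 7 of Appendix A; the variable names are illustrative only.

```python
SEQ_LEN = 4096         # "4K" maximum sequence length (exact value from Table 7)
BATCH_SEQS = 4608      # "4.5K" sequences per batch (exact value from Table 7)
TRAIN_STEPS = 106_449  # training steps reported for DeepSeekMoE 16B

tokens_per_batch = SEQ_LEN * BATCH_SEQS
total_tokens = tokens_per_batch * TRAIN_STEPS

# 18 * 2**20 tokens per batch, i.e. the "18M" quoted in the text.
print(f"tokens per batch: {tokens_per_batch:,}")
# Slightly above 2 * 10**12, i.e. the "2T" training tokens.
print(f"total tokens:     {total_tokens:,}")
```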
[原文]ple-Choice. For multi-subject multiple-choice, we additionally evaluate the models on MMLU (Hendrycks et al., 2020 ) . The evaluation metric is accuracy. Disambiguation. For disambiguation, we additionally consider WinoGrande (Sakaguchi et al., 2019 ) and the evaluation metric is accuracy. Chinese Benchmarks. Since DeepSeekMoE 16B is pretrained on a bilingual corpus, we also evaluate it on four Chinese benchmarks. CLUEWSC (Xu et al., 2020 ) is a Chinese disambiguation benchmark. CEval (Huang et al., 2023 ) and CMMLU (Li et al., 2023 ) are two Chinese multi-subject multiple-choice benchmarks wi...
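5.1.3 Evaluation Benchmarks
In addition to the benchmarks used in the validation experiments, we incorporate additional benchmarks for a more comprehensive evaluation. The differences from the benchmarks used in the validation experiments are as follows.
Language Modeling. For language modeling, we also evaluate the models on the test set of Pile (Gao et al., 2020). Since the tokenizer used in DeepSeekMoE 16B is different from that used in LLaMA2 7B, for a fair comparison we use bits per byte (BPB) as the evaluation metric.
Reading Comprehension. For reading comprehension, we additionally consider DROP (Dua et al., 2019). The evaluation metric is the exact match (EM) rate.
Math Reasoning. For math reasoning, we additionally incorporate GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), using EM as the evaluation metric.
Multi-Subject Multiple-Choice. For multi-subject multiple-choice, we additionally evaluate the models on MMLU (Hendrycks et al., 2020). The evaluation metric is accuracy.
Disambiguation. For disambiguation, we additionally consider WinoGrande (Sakaguchi et al., 2019), and the evaluation metric is accuracy.
Chinese Benchmarks. Since DeepSeekMoE 16B is pretrained on a bilingual corpus, we also evaluate it on four Chinese benchmarks. CLUEWSC (Xu et al., 2020) is a Chinese disambiguation benchmark. CEval (Huang et al., 2023) and CMMLU (Li et al., 2023) are two Chinese multi-subject multiple-choice benchmarks with a form similar to MMLU. CHID (Zheng et al., 2019) is a Chinese idiom completion benchmark that aims to evaluate the understanding of Chinese culture. The evaluation metrics for these Chinese benchmarks are accuracy or EM.
Open LLM Leaderboard. We evaluate all of the aforementioned benchmarks with our internal evaluation framework. To compare DeepSeekMoE 16B with open source models fairly and conveniently, we additionally evaluate DeepSeekMoE 16B on the Open LLM Leaderboard. The Open LLM Leaderboard is a public leaderboard supported by HuggingFace that consists of six tasks: ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), TruthfulQA (Lin et al., 2022), Winogrande (Sakaguchi et al., 2019), and GSM8K (Cobbe et al., 2021).
5.2 Evaluations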
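The Pile comparison relies on bits per byte (BPB) because a per-token metric is not comparable across different tokenizers. A minimal sketch of the usual conversion follows; it assumes the model's summed negative log-likelihood is available in nats and that bytes are counted on the UTF-8 encoding of the text. The function name `bits_per_byte` is hypothetical and is not taken from the authors' evaluation framework.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte."""
    # Byte count is tokenizer-independent (assumption: UTF-8 encoding).
    n_bytes = len(text.encode("utf-8"))
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (math.log(2) * n_bytes)
```

Normalising by bytes rather than tokens means that a model with a larger vocabulary, which emits fewer tokens for the same text, is neither rewarded nor penalised for its tokenization.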
[原文]2022a ) . (3) Compared with the excellent performance on other tasks, DeepSeekMoE exhibits limitations in addressing multiple-choice tasks. This inadequacy stems from the limited attention parameters in DeepSeekMoE 16B (DeepSeekMoE 16B has only about 0.5B attention parameters, while DeepSeek 7B has 2.5B attention parameters). Our earlier investigation on DeepSeek 7B reveals a positive correlation between the attention capacity and performance on multiple-choice tasks. For example, DeepSeek 7B MQA, which is equipped with the multi-query attention mechanism (Shazeer, 2019 ) , also struggled in M...
[原文]forms LLaMA2 7B on the majority of benchmarks. 5.2.2 Comparison with Open Source Models Internal Comparison with LLaMA2 7B. In the realm of open source models, we mainly compare DeepSeekMoE 16B with LLaMA2 7B (Touvron et al., 2023b ) , a well-known and strong open source language model with 6.7B parameters. Both DeepSeekMoE 16B and LLaMA2 7B are pretrained on 2T tokens. Compared with LLaMA2 7B, DeepSeekMoE has 245% of total parameters but only needs 39.6% of computations. The results on our internal benchmarks are presented in Table 4 , leading to the following observations. (1) Among the eval...
[原文]n results, as presented in Figure 1 , show that DeepSeekMoE 16B consistently outperforms models with similar activated parameters by a large margin. Moreover, it achieves comparable performance with LLaMA2 7B, which has approximately 2.5 times the activated parameters.
6 Alignment for DeepSeekMoE 16B
[原文]Previous research indicates that MoE models typically do not show significant gains from fine-tuning (Fedus et al., 2021; Artetxe et al., 2022). However, Shen et al. (2023) present findings suggesting that MoE models can indeed benefit from instruction tuning. In order to assess whether DeepSeekMoE 16B can benefit from fine-tuning, we conduct supervised fine-tuning to construct a chat model based on DeepSeekMoE 16B. The experimental results reveal that DeepSeekMoE Chat 16B also achieves comparable performance with LLaMA2 SFT 7B and DeepSeek Chat 7B. 6.1 Experimental Setup Training Data....
[原文]nsuming nearly 40% of computations, achieves comparable performance with 7B dense models across language understanding and reasoning (PIQA, ARC, BBH), machine reading comprehension (RACE), mathematical (GSM8K, MATH), and knowledge-intensive tasks (TriviaQA, NaturalQuestions). (2) On code generation tasks, DeepSeekMoE Chat 16B significantly outperforms LLaMA2 SFT 7B, demonstrating notable improvements on HumanEval and MBPP. In addition, it also surpasses DeepSeek Chat 7B. (3) On multiple-choice question answering benchmarks including MMLU, CEval, and CMMLU, DeepSeekMoE Chat 16B still falls behi...
[原文]Training Data. For training the chat model, we conduct supervised fine-tuning (SFT) on our in-house curated data, comprising 1.4M training examples. This dataset spans a broad range of categories including math, code, writing, question answering, reasoning, summarization, and more. The majority of our SFT training data is in English and Chinese, rendering the chat model versatile and applicable in bilingual scenarios. Hyper-Parameters. During supervised fine-tuning, we set the batch size to 1024 examples and conduct training over 8 epochs using the AdamW optimizer (Loshchilov and Hutter, 2019 ...
6.1 Experimental Setup
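Training Data. For training the chat model, we conduct supervised fine-tuning (SFT) on our in-house curated data, comprising 1.4M training examples. This dataset spans a broad range of categories including math, code, writing, question answering, reasoning, summarization, and more. The majority of our SFT training data is in English and Chinese, which makes the chat model versatile and applicable in bilingual scenarios.
Hyper-Parameters. During supervised fine-tuning, we set the batch size to 1024 examples and train for 8 epochs using the AdamW optimizer (Loshchilov and Hutter, 2019). We employ a maximum sequence length of 4K, and pack the training examples as densely as possible until reaching the sequence length limit. We do not use dropout for supervised fine-tuning, and simply set a constant learning rate of $10^{-5}$ without any learning rate scheduling strategy.
Evaluation Benchmarks. For the evaluation of the chat models, we employ benchmarks similar to those used in Section 5.1.3, with the following adjustments: (1) We exclude Pile (Gao et al., 2020), since chat models are seldom employed for pure language modeling. (2) We exclude CHID (Zheng et al., 2019) because its results are unstable, which makes it difficult to draw reliable conclusions. (3) We additionally include BBH (Suzgun et al., 2022) to provide a more comprehensive assessment of the reasoning ability of the chat models.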
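The text says only that training examples are packed "as densely as possible" up to the 4K limit and does not name a packing algorithm. The sketch below shows one simple possibility, a greedy sequential packer; it is an assumption for illustration, and the function name `pack_examples` is hypothetical.

```python
def pack_examples(lengths, max_len=4096):
    """Greedily group example indices so each pack holds at most max_len tokens.

    `lengths` is a list of per-example token counts. Examples are taken in
    order; a new pack is started whenever the next example would overflow.
    """
    packs, current, used = [], [], 0
    for idx, n_tokens in enumerate(lengths):
        if n_tokens > max_len:
            raise ValueError(f"example {idx} exceeds max_len={max_len}")
        if used + n_tokens > max_len:
            # Close the current pack and start a fresh one.
            packs.append(current)
            current, used = [], 0
        current.append(idx)
        used += n_tokens
    if current:
        packs.append(current)
    return packs
```

A sequential packer keeps the original example order; a bin-packing heuristic that sorts examples by length would usually waste fewer padding tokens at the cost of reordering the data.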
[原文]
| Metric | # Shot | LLaMA2 SFT 7B | DeepSeek Chat 7B | DeepSeekMoE Chat 16B |
| --- | --- | --- | --- | --- |
| ... | ... | ...5 | 14.7 | 15.2 |
| HumanEval (Pass@1) | 0-shot | 35.4 | 45.1 | 45.7 |
| MBPP (Pass@1) | 3-shot | 27.8 | 39.0 | 46.2 |
| TriviaQA (EM) | 5-shot | 60.1 | 59.5 | 63.3 |
| NaturalQuestions (EM) | 0-shot | 35.2 | 32.7 | 35.1 |
| MMLU (Acc.) | 0-shot | 50.0 | 49.7 | 47.2 |
| WinoGrande (Acc.) | 0-shot | 65.1 | 68.4 | 69.0 |
| CLUEWSC (EM) | 5-shot | 48.4 | 66.2 | 68.2 |
| CEval (Acc.) | 0-shot | 35.1 | 44.7 | 40.0 |
| CMMLU (Acc.) | 0-shot | 36.9 | 51.2 | 49.3 |
Table 5: Comparison among LLaMA2 SFT 7B, DeepSeek Chat 7B and DeepSeekMoE Chat 16B, with all of these three models fine-tuned on the same SFT data. Compared with both 7B dense models, DeepSeekMoE Chat 16B still achieves comparable or better performance...
[原文]Baselines. In order to validate the potential of DeepSeekMoE 16B after alignment, we conduct supervised fine-tuning for LLaMA2 7B, DeepSeek 7B, and DeepSeekMoE 16B, using exactly the same fine-tuning data to ensure fairness. Correspondingly, we construct three chat models: LLaMA2 SFT 7B (we use the name LLaMA2 SFT to distinguish it from the official LLaMA2 Chat model (Touvron et al., 2023b)), DeepSeek Chat 7B, and DeepSeekMoE Chat 16B. Subsequently, we compare DeepSeekMoE Chat 16B with the other two dense chat models (with about 2.5 times the FLOPs) across a wide range of downs...
[原文]Encouraged by the outstanding performance of DeepSeekMoE 16B, we further undertake a preliminary endeavor to scale up DeepSeekMoE to 145B. In this initial study, DeepSeekMoE 145B is trained on 245B tokens, but it has demonstrated consistent advantages over the GShard architecture and shown promise to match or exceed the performance of DeepSeek 67B (Dense). Furthermore, upon the completion of the final version and full training of DeepSeekMoE 145B, we also plan to make it publicly available. 7.1 Experimental Setup Training Data and Tokenization. For DeepSeekMoE 145B, we employ exactly the same ...
[原文]alue during the first 2K steps. Subsequently, the learning rate remains constant for the rest of the training process. The maximum learning rate for DeepSeekMoE 145B is set to $3.0\times 10^{-4}$, and the gradient clipping norm is set to 1.0. The batch size is set to 4.5K, and with a maximum sequence length of 4K, each training batch contains 18M tokens. We train DeepSeekMoE 145B for 13,000 steps, achieving 245B training tokens. Also, we do not use dropout during training. We leverage pipeline parallelism to deploy different layers of a model on different device...
[原文]hyper-parameters. Results. From the evaluation results presented in Table 6 , we have the following observations: (1) Despite having comparable total parameters and computations, DeepSeekMoE 145B significantly outperforms GShard 137B, highlighting the advantages of the DeepSeekMoE architecture again. (2) On the whole, with only 28.5% of computations, DeepSeekMoE 145B achieves comparable performance with DeepSeek 67B (Dense). Consistent with the findings from DeepSeekMoE 16B, DeepSeekMoE 145B exhibits remarkable strengths in language modeling and knowledge-intensive tasks, but with limitations ...
[原文]Training Data and Tokenization. For DeepSeekMoE 145B, we employ exactly the same training corpus and tokenizer as DeepSeekMoE 16B, with the only difference being that DeepSeekMoE 145B is trained on 245B tokens for an initial study. Model Settings. For DeepSeekMoE 145B, we set the number of Transformer layers to 62 and the hidden dimension to 4096. We employ the multi-head attention mechanism with a total of 32 attention heads, where each head has a dimension of 128. As for initialization, all learnable parameters are randomly initialized with a standard deviation of 0.006. As in DeepSeekMoE 16...
[原文]ge pipeline parallelism to deploy different layers of a model on different devices, and for each layer, all the routed experts will be uniformly deployed on 4 devices (i.e., expert parallelism combined with data parallelism). Since we employ expert parallelism for DeepSeekMoE 145B, the device-level load balance should be considered to reduce the computational bottleneck. In response, we set the device-level balance factor to 0.05 to encourage balanced computation across devices. Also, we still set a small expert-level balance factor of 0.003 to prevent routing collapse. Evaluation Benchmarks. ...
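The two balance factors above weight auxiliary losses whose formulas are not restated in this section. The sketch below therefore assumes a commonly used form: each loss sums, over experts or device groups, the product of a normalised selection frequency and a mean routing probability. The function name, argument layout, and the contiguous assignment of experts to devices are all hypothetical choices made for illustration.

```python
def balance_losses(affinity, topk_idx, n_devices,
                   alpha_expert=0.003, alpha_device=0.05):
    """Sketch of expert-level and device-level balance losses (assumed form).

    affinity: T x N nested list of routing probabilities per token.
    topk_idx: T lists, each holding the K expert ids selected for that token.
    Experts are assumed to be split into contiguous, equal-sized device groups.
    """
    T, N, K = len(affinity), len(affinity[0]), len(topk_idx[0])
    counts = [0] * N
    for chosen in topk_idx:
        for i in chosen:
            counts[i] += 1
    # f_i equals 1.0 for every expert when routing is perfectly uniform.
    f = [N * c / (K * T) for c in counts]
    # P_i is the mean routing probability assigned to expert i.
    P = [sum(row[i] for row in affinity) / T for i in range(N)]
    expert_loss = alpha_expert * sum(fi * pi for fi, pi in zip(f, P))

    per_dev = N // n_devices
    device_sum = 0.0
    for d in range(n_devices):
        group = range(d * per_dev, (d + 1) * per_dev)
        f_dev = sum(f[i] for i in group) / per_dev  # mean frequency on device
        p_dev = sum(P[i] for i in group)            # total probability on device
        device_sum += f_dev * p_dev
    return expert_loss, alpha_device * device_sum
```

Under this form, perfectly uniform routing gives losses equal to the balance factors themselves (0.003 and 0.05), and any imbalance raises them. The larger device-level factor reflects the stated priority of balancing computation across devices, while the small expert-level factor only guards against routing collapse.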
[原文]ek 67B (Dense) and MoE models at the scale of about 140B total parameters. In the lines of “# Experts” and “# Activated Experts”, $a+b$ denotes $a$ shared experts and $b$ routed experts, respectively. Bold font indicates the best or near the best performance excluding the last column. DeepSeekMoE 145B, and even DeepSeekMoE 142B (Half Activated) that has only a half of activated expert parameters, outperform GShard 137B by a large margin. Moreover, with 28.5% of computations, DeepSeekMoE 145B achieves comparable performance with DeepSeek 67B.
7.2 Evaluations
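ek 67B (Dense) and MoE models at the scale of about 140B total parameters. In the lines of “# Experts” and “# Activated Experts”, $a+b$ denotes $a$ shared experts and $b$ routed experts, respectively. Bold font indicates the best or near-best performance excluding the last column. DeepSeekMoE 145B, and even DeepSeekMoE 142B (Half Activated), which has only half of the activated expert parameters, outperform GShard 137B by a large margin. Moreover, with only 28.5% of computations, DeepSeekMoE 145B achieves performance comparable to DeepSeek 67B.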
[原文]Baselines. Apart from DeepSeekMoE 145B , we consider three additional models for comparison. DeepSeek 67B (Dense) is a dense model with 67.4B total parameters (refer to DeepSeek-AI ( 2024 ) for the model and training details). GShard 137B shares the same hidden dimension and number of layers as DeepSeekMoE 145B, but follows the GShard architecture. Note that DeepSeekMoE 145B aligns the intermediate hidden dimension in each expert to a multiple of 64 for computation efficiency, so its model size is 6% larger than GShard 137B. DeepSeekMoE 142B (Half Activated) has a similar architecture to DeepS...
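Aligning a dimension to a multiple of 64, as described for the expert intermediate dimension, amounts to rounding up to the next multiple. A minimal sketch follows; the helper name `align_up` and the sample dimension 1365 are hypothetical and are not values from the paper.

```python
def align_up(dim: int, multiple: int = 64) -> int:
    """Round dim up to the nearest multiple (hypothetical helper)."""
    # Ceiling division using integer arithmetic only.
    return -(-dim // multiple) * multiple


# Illustrative input only: 1365 is not a dimension reported in the paper.
print(align_up(1365))
```

Rounding up rather than down never shrinks the expert, which is consistent with the aligned model being somewhat larger than its unaligned counterpart.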
[原文]The Mixture of Experts (MoE) technique was first proposed by Jacobs et al. (1991) and Jordan and Jacobs (1994) to deal with different samples with independent expert modules. Shazeer et al. (2017) introduce MoE into language model training and build large-scale LSTM-based (Hochreiter and Schmidhuber, 1997) MoE models. As the Transformer became the most popular architecture for NLP, many attempts extend FFNs in a Transformer as MoE layers to build MoE language models. GShard (Lepikhin et al., 2021) and Switch Transformer (Fedus et al., 2021) are pioneers that employ learnable top-2 or top-1...
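The Mixture of Experts (MoE) technique was first proposed by Jacobs et al. (1991) and Jordan and Jacobs (1994) to handle different samples with independent expert modules. Shazeer et al. (2017) introduced MoE into language model training and built large-scale LSTM-based (Hochreiter and Schmidhuber, 1997) MoE models. As the Transformer became the most popular architecture for NLP, many works extended the FFNs in a Transformer into MoE layers to build MoE language models. GShard (Lepikhin et al., 2021) and Switch Transformer (Fedus et al., 2021) are pioneers that employ learnable top-2 or top-1 routing strategies to scale MoE language models to an extremely large scale. Hash Layer (Roller et al., 2021) and StableMoE (Dai et al., 2022b) use fixed routing strategies for more stable routing and training. Zhou et al. (2022) propose an expert-choice routing strategy, in which each token can be assigned to a different number of experts. Zoph (2022) focuses on training instability and fine-tuning difficulty in MoE models, and proposes ST-MoE to overcome these challenges. Beyond research on MoE architectures and training strategies, recent years have also seen numerous large-scale language or multimodal models (Lin et al., 2021; Du et al., 2022; Ren et al., 2023; Xue et al., 2023) built on existing MoE architectures. By and large, most previous MoE models are based on conventional top-1 or top-2 routing strategies, leaving considerable room for improving expert specialization. In response, our DeepSeekMoE architecture aims to push expert specialization to the utmost extent.
9 Conclusion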
[原文]In this paper, we introduce the DeepSeekMoE architecture for MoE language models, with the objective of achieving ultimate expert specialization. Through fine-grained expert segmentation and shared expert isolation, DeepSeekMoE achieves significantly higher expert specialization and performance compared with prevailing MoE architectures. Starting with a modest scale of 2B parameters, we validate the advantages of DeepSeekMoE, demonstrating its capability to approach the upper bound performance for MoE models. Furthermore, we provide empirical evidence to show that DeepSeekMoE has a higher leve...
Appendix A Overview of Hyper-Parameters
[原文]We present the overview of hyper-parameters for DeepSeekMoE across various sizes in Table 7.
| # Params | # Layers | Hidden Size | # Attn Heads | # Shared Experts | # Routed Experts | Relative Expert Size | Sequence Length | Batch Size (Sequence) | Learning Rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2.0B | 9 | 1280 | 10 | 1 | 63 (7 activated) | 0.25 | 2048 | 2048 | 1.08e-3 |
| 16.4B | 28 | 2048 | 16 | 2 | 64 (6 activated) | 0.25 | 4096 | 4608 | 4.2e-4 |
| 144.6B | 62 | 4096 | 32 | 4 | 128 (12 activated) | 0.125 | 4096 | 4608 | 3.0e-4 |
Table 7: Overview of hyper-parameters for DeepSeekMoE across various sizes. The relative expert size is in comparison to a standard FFN.
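For readers who want to work with these settings programmatically, the rows of Table 7 can be held in a small data structure from which derived quantities follow. This is an illustrative sketch only: the class name `MoEConfig` and its field names are hypothetical, and only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """One row of Table 7 (field names are hypothetical)."""
    name: str
    layers: int
    hidden: int
    attn_heads: int
    shared_experts: int
    routed_experts: int
    activated_routed: int

    @property
    def head_dim(self) -> int:
        # Per-head dimension implied by hidden size and head count.
        return self.hidden // self.attn_heads

    @property
    def experts_per_token(self) -> int:
        # Shared experts are always active; routed experts are selected per token.
        return self.shared_experts + self.activated_routed


CONFIGS = [
    MoEConfig("2.0B", 9, 1280, 10, 1, 63, 7),
    MoEConfig("16.4B", 28, 2048, 16, 2, 64, 6),
    MoEConfig("144.6B", 62, 4096, 32, 4, 128, 12),
]

for cfg in CONFIGS:
    print(cfg.name, cfg.head_dim, cfg.experts_per_token)
```

All three sizes imply a head dimension of 128, which matches the per-head dimension stated in the model settings for DeepSeekMoE 145B.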
Appendix B Comparing DeepSeekMoE with Larger Models
Appendix C Training Benchmark Curves of DeepSeekMoE 16B
[原文]We present the benchmark curves during training of DeepSeekMoE 16B and DeepSeek 7B (Dense) in Figure 7 for reference.
Figure 7: Benchmark curves during training of DeepSeekMoE 16B and DeepSeek 7B (Dense).