DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

arXiv: 2401.06066 (2024-01-10)

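Summary

DeepSeekMoE is a Mixture-of-Experts (MoE) language model architecture built around two strategies. Fine-grained expert segmentation splits each expert into several smaller ones and activates proportionally more of them, which allows more flexible combinations of activated experts. Shared expert isolation keeps a few always-active experts that capture common knowledge, which reduces redundancy among the routed experts. Together the two strategies push each expert toward focused, non-overlapping knowledge. According to the paper, DeepSeekMoE 16B reaches performance comparable to LLaMA2 7B with only about 40% of the computation.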

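Authors: Damai Dai* (1,2), Chengqi Deng (1), Chenggang Zhao* (1,3), R.X. Xu (1), Huazuo Gao (1), Deli Chen (1), Jiashi Li (1), Wangding Zeng (1), Xingkai Yu* (1,4), Y. Wu (1), Zhenda Xie (1), Y.K. Li (1), Panpan Huang (1), Fuli Luo (1), Chong Ruan (1), Zhifang Sui (2), Wenfeng Liang (1)

Affiliations: (1) DeepSeek-AI; (2) National Key Laboratory for Multimedia Information Processing, Peking University; (3) Institute for Interdisciplinary Information Sciences, Tsinghua University; (4) National Key Laboratory for Novel Software Technology, Nanjing University

Contact: {daidamai, szf}@pku.edu.cn, {wenfeng.liang}@deepseek.com
Code: https://github.com/deepseek-ai/DeepSeek-MoE

Abstract

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e., each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts; (2) isolating K_s experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which sets the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.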

Figure 1: Comparison between DeepSeekMoE 16B and open source models on the Open LLM Leaderboard. The red dashed line is linearly fitted from data points of all models except DeepSeekMoE 16B. DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin, and achieves comparable performance with LLaMA2 7B, which has approximately 2.5 times the activated parameters.

1 Introduction

Recent research and practices have empirically demonstrated that, with sufficient training data available, scaling language models with increased parameters and computational budgets can yield remarkably stronger models (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023a; Hoffmann et al., 2022). Scaling models to an extremely large size, however, also comes with exceedingly high computational costs. Given these costs, the Mixture-of-Experts (MoE) architecture (e.g., Jacobs et al., 1991; Jordan and Jacobs, 1994) has emerged as a popular solution. It enables parameter scaling while keeping computational costs at a modest level. Recent applications of MoE architectures in Transformers have successfully scaled language models to a substantial size with remarkable performance, which underscores the considerable potential of MoE language models. Despite this promise, existing MoE architectures potentially suffer from knowledge hybridity and knowledge redundancy, which limit expert specialization, i.e., each expert acquiring non-overlapping and focused knowledge.

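Conventional MoE architectures substitute the Feed-Forward Networks (FFNs) in a Transformer with MoE layers. Each MoE layer consists of multiple experts, each structurally identical to a standard FFN, and each token is assigned to one (Fedus et al., 2021) or two (Lepikhin et al., 2021) experts. This architecture manifests two potential issues:

(1) Knowledge Hybridity: existing MoE practices often employ a limited number of experts (e.g., 8 or 16), so tokens assigned to a specific expert are likely to cover diverse knowledge. Consequently, the designated expert will tend to assemble vastly different types of knowledge in its parameters, which are hard to utilize simultaneously.

(2) Knowledge Redundancy: tokens assigned to different experts may require common knowledge. As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, leading to redundancy in expert parameters.

These issues collectively hinder expert specialization in existing MoE practices and prevent them from reaching the theoretical upper-bound performance of MoE models. In response, we introduce DeepSeekMoE, an innovative MoE architecture specifically designed towards ultimate expert specialization. Our architecture involves two principal strategies:

(1) Fine-Grained Expert Segmentation: while keeping the number of parameters constant, we segment the experts into a finer grain by splitting the FFN intermediate hidden dimension. Correspondingly, keeping the computational cost constant, we also activate more fine-grained experts to enable a more flexible and adaptable combination of activated experts. Fine-grained expert segmentation allows diverse knowledge to be decomposed more finely and learned more precisely in different experts, with each expert retaining a higher level of specialization, which leads to more accurate and targeted knowledge acquisition.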

(2) Shared Expert Isolation: we isolate certain experts to serve as shared experts that are always activated, aiming at capturing and consolidating common knowledge across varying contexts. By compressing common knowledge into these shared experts, redundancy among the other routed experts is mitigated. This enhances parameter efficiency and ensures that each routed expert stays specialized by focusing on distinctive aspects.

These architectural innovations offer opportunities to train a parameter-efficient MoE language model where each expert is highly specialized. Starting from a modest scale with 2B parameters, we validate the advantages of the DeepSeekMoE architecture. We conduct evaluations on 12 zero-shot or few-shot benchmarks spanning diverse tasks. Empirical results indicate that DeepSeekMoE 2B surpasses GShard 2B (Lepikhin et al., 2021) by a substantial margin, and even matches GShard 2.9B, a larger MoE model with 1.5 times the expert parameters and computation. Remarkably, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with an equivalent number of parameters, which sets the strict upper bound of MoE language models. To gain deeper insight, we conduct elaborate ablation studies and an analysis of expert specialization for DeepSeekMoE. These studies validate the effectiveness of fine-grained expert segmentation and shared expert isolation, and provide empirical support for the claim that DeepSeekMoE can achieve a high level of expert specialization.

Leveraging our architecture, we subsequently scale up the model parameters to 16B and train DeepSeekMoE 16B on a large-scale corpus with 2T tokens. Evaluation results reveal that, with only about 40% of computations, DeepSeekMoE 16B achieves comparable performance with DeepSeek 7B, a dense model trained on the same 2T corpus.

We also compare DeepSeekMoE with open source models. The evaluations demonstrate that DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin, and achieves comparable performance with LLaMA2 7B (Touvron et al., 2023b), which has approximately 2.5 times the activated parameters. Figure 1 shows the evaluation results on the Open LLM Leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Additionally, we conduct supervised fine-tuning (SFT) for alignment, transforming the model into a chat model. Evaluation results show that DeepSeekMoE Chat 16B also achieves comparable performance with DeepSeek Chat 7B and LLaMA2 SFT 7B in the chat setting. Encouraged by these results, we further undertake a preliminary endeavor to scale up DeepSeekMoE to 145B. The experimental results consistently validate its substantial advantages over the GShard architecture. In addition, it shows performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.

Our contributions are summarized as follows:

- Architectural Innovation. We introduce DeepSeekMoE, an innovative MoE architecture aiming at ultimate expert specialization, which employs two principal strategies: fine-grained expert segmentation and shared expert isolation.
- Empirical Validation. We conduct extensive experiments to empirically validate the effectiveness of the DeepSeekMoE architecture. Experimental results validate the high level of expert specialization in DeepSeekMoE 2B, and indicate that DeepSeekMoE 2B can nearly approach the upper-bound performance for MoE models.
- Scalability. We scale up DeepSeekMoE to train a 16B model and show that, with only about 40% of computations, DeepSeekMoE 16B achieves comparable performance with DeepSeek 7B and LLaMA2 7B. We also undertake a preliminary endeavor to scale up DeepSeekMoE to 145B, highlighting its consistent advantages over the GShard architecture and showing performance comparable with DeepSeek 67B.

- Alignment for MoE. We successfully perform supervised fine-tuning on DeepSeekMoE 16B to create an aligned chat model, showcasing the adaptability and versatility of DeepSeekMoE 16B.
- Public Release. In the spirit of open research, we release the model checkpoint of DeepSeekMoE 16B to the public. Notably, this model can be deployed on a single GPU with 40GB of memory without the need for quantization.

2 Preliminaries: Mixture-of-Experts for Transformers

We first introduce a generic MoE architecture commonly used in Transformer language models. A standard Transformer language model is constructed by stacking $L$ layers of standard Transformer blocks, where each block can be represented as follows:

$$\mathbf{u}_{1:T}^{l} = \operatorname{Self-Att}\left(\mathbf{h}_{1:T}^{l-1}\right) + \mathbf{h}_{1:T}^{l-1}, \tag{1}$$

$$\mathbf{h}_{t}^{l} = \operatorname{FFN}\left(\mathbf{u}_{t}^{l}\right) + \mathbf{u}_{t}^{l}, \tag{2}$$

where $T$ denotes the sequence length, $\mathbf{u}_{1:T}^{l}$ are the hidden states of all tokens after the $l$-th attention module, and $\mathbf{h}_{t}^{l}$ is the output hidden state of the $t$-th token after the $l$-th Transformer block. Layer normalization is omitted for brevity. A typical FFN is a two-layer MLP, for example $\operatorname{FFN}(\mathbf{x}) = W_2\,\operatorname{GELU}(W_1\mathbf{x})$.

An MoE language model substitutes FFNs in a Transformer with MoE layers. An MoE layer is composed of multiple experts, each structurally identical to a standard FFN, and each token is assigned to one or two experts. If the $l$-th FFN is substituted with an MoE layer, the output hidden state $\mathbf{h}_{t}^{l}$ is expressed as:

$$\mathbf{h}_{t}^{l} = \sum_{i=1}^{N}\left(g_{i,t}\operatorname{FFN}_{i}\left(\mathbf{u}_{t}^{l}\right)\right) + \mathbf{u}_{t}^{l}, \tag{3}$$

$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t}\in\operatorname{Topk}(\{s_{j,t} \mid 1\leqslant j\leqslant N\}, K),\\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

$$s_{i,t} = \operatorname{Softmax}_{i}\left({\mathbf{u}_{t}^{l}}^{T}\mathbf{e}_{i}^{l}\right), \tag{5}$$

where $N$ is the total number of experts, $\operatorname{FFN}_{i}(\cdot)$ is the $i$-th expert FFN, $g_{i,t}$ is the gate value for the $i$-th expert, $s_{i,t}$ is the token-to-expert affinity, $\operatorname{Topk}(\cdot, K)$ is the set of the $K$ highest affinity scores among those calculated for the $t$-th token and all $N$ experts, and $\mathbf{e}_{i}^{l}$ is the learnable centroid (embedding vector) of the $i$-th expert in the $l$-th layer. Because only $K$ of the $N$ gate values are nonzero, each token is computed by only $K$ experts. Under the standard top-2 routing strategy, each token is assigned to the two experts with the highest scores. This kind of architecture has been applied in large MoE language models such as Switch Transformer (top-1 routing) and GShard (top-2 routing).
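To make the top-K gating of Eqs. (3)-(5) concrete, here is a minimal pure-Python sketch for a single token. It is an illustration under stated assumptions, not the paper's implementation: the "experts" are hypothetical scaling functions standing in for FFNs, and the centroid vectors are made-up toy values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_gates(u, centroids, k):
    """Eqs. (4)-(5): gate g_i equals s_i for the k largest s_i, else 0."""
    scores = softmax([sum(a * b for a, b in zip(u, e)) for e in centroids])
    # Rank experts by affinity; ties are broken by the lower index.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    keep = set(order[:k])
    return [scores[i] if i in keep else 0.0 for i in range(len(scores))]


def moe_layer(u, experts, centroids, k):
    """Eq. (3): h = sum_i g_i * FFN_i(u) + u for one token vector u."""
    gates = topk_gates(u, centroids, k)
    h = list(u)  # residual term u
    for g, expert in zip(gates, experts):
        if g > 0.0:  # only the selected experts are evaluated
            h = [hj + g * oj for hj, oj in zip(h, expert(u))]
    return h, gates


def scale_by(c):
    """Hypothetical stand-in for an expert FFN: multiply the input by c."""
    return lambda x: [c * v for v in x]


experts = [scale_by(c) for c in (1.0, 2.0, 3.0, 4.0)]
centroids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]  # toy values
h, gates = moe_layer([0.5, 2.0], experts, centroids, k=2)
print(gates)
print(h)
```

With these toy values the router picks experts 1 and 2 (zero-based), so the other two gates are exactly zero and their experts are never called, which is where the compute saving of sparse routing comes from.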

Figure 2: Illustration of DeepSeekMoE. Subfigure (a) showcases an MoE layer with the conventional top-2 routing strategy. Subfigure (b) illustrates the fine-grained expert segmentation strategy. Subfigure (c) demonstrates the integration of the shared expert isolation strategy, constituting the complete DeepSeekMoE architecture. Across these three architectures, the number of expert parameters and the computational cost remain constant.

3 DeepSeekMoE Architecture

On top of the generic MoE architecture outlined in Section 2, we introduce DeepSeekMoE, which is specifically designed to exploit the potential of expert specialization. As illustrated in Figure 2, our architecture incorporates two principal strategies: fine-grained expert segmentation and shared expert isolation. Both strategies are designed to elevate the level of expert specialization.

Fine-grained expert segmentation. We segment the $N$ experts into $mN$ finer-grained experts. Specifically, the FFN intermediate hidden dimension $d_{ff}$ of each expert is split into $m$ parts of size $d_{ff}/m$, so each fine-grained expert holds $1/m$ of the parameters of an original expert. We then activate $mK$ fine-grained experts instead of the conventional $K$.

Shared expert isolation. Of the $mN$ experts, we isolate $K_s$ as shared experts that are always activated. The remaining $mN - K_s$ experts serve as routed experts, selected dynamically by the router according to the content of each token.
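The claim that segmentation keeps parameters and compute constant can be checked with simple counting. The sketch below assumes a bias-free two-matrix FFN (d_model to d_ff and back) and a hypothetical d_ff of 5120; only the hidden dimension 1280 and the 16-expert, top-2 baseline are taken from the text.

```python
def ffn_params(d_model, d_ff):
    """Weight count of a two-matrix FFN (d_model -> d_ff -> d_model); biases ignored."""
    return 2 * d_model * d_ff


def expert_budget(d_model, d_ff, n_experts, k_active, m=1):
    """Expert counts and parameter totals after splitting every expert m ways."""
    if d_ff % m != 0:
        raise ValueError("d_ff must be divisible by m")
    per_expert = ffn_params(d_model, d_ff // m)  # each fine-grained expert is 1/m the size
    return {
        "experts": m * n_experts,
        "active": m * k_active,
        "total_params": m * n_experts * per_expert,
        "active_params": m * k_active * per_expert,
    }


# d_ff = 5120 is a hypothetical value chosen for illustration.
coarse = expert_budget(d_model=1280, d_ff=5120, n_experts=16, k_active=2, m=1)
fine = expert_budget(d_model=1280, d_ff=5120, n_experts=16, k_active=2, m=4)
print(coarse)
print(fine)
```

Splitting with m = 4 turns 16 experts with 2 active into 64 experts with 8 active, while both the total and the activated parameter counts stay the same.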

With fine-grained expert segmentation, the output of an MoE layer is formulated as:

$$\mathbf{h}_{t}^{l} = \sum_{i=1}^{mN}\left(g_{i,t}\operatorname{FFN}_{i}\left(\mathbf{u}_{t}^{l}\right)\right) + \mathbf{u}_{t}^{l}, \tag{6}$$

$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t}\in\operatorname{Topk}(\{s_{j,t} \mid 1\leqslant j\leqslant mN\}, mK),\\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

$$s_{i,t} = \operatorname{Softmax}_{i}\left({\mathbf{u}_{t}^{l}}^{T}\mathbf{e}_{i}^{l}\right), \tag{8}$$

where $s_{i,t}$ is the affinity score the router assigns to expert $i$ for token $t$. The key idea is that each fine-grained expert concentrates on a narrower slice of knowledge, while activating more experts allows different specializations to be combined more flexibly. Each expert can therefore learn a specific kind of knowledge more precisely, which mitigates knowledge hybridity.

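Shared expert isolation. With a conventional routing strategy, tokens assigned to different experts may need some common knowledge, so multiple experts may converge in acquiring shared knowledge in their respective parameters, which results in redundancy. If shared experts are dedicated to capturing and consolidating common knowledge across varying contexts, the parameter redundancy among the routed experts is alleviated, which yields a more parameter-efficient model with more specialized experts. Towards this objective, in addition to fine-grained expert segmentation, we isolate $K_s$ experts to serve as shared experts that are always activated. To keep the computational cost constant, the number of activated routed experts is reduced by $K_s$. The MoE layer then becomes:

$$\mathbf{h}_{t}^{l} = \sum_{i=1}^{K_s}\operatorname{FFN}_{i}\left(\mathbf{u}_{t}^{l}\right) + \sum_{i=K_s+1}^{mN}\left(g_{i,t}\operatorname{FFN}_{i}\left(\mathbf{u}_{t}^{l}\right)\right) + \mathbf{u}_{t}^{l}, \tag{9}$$

where the first $K_s$ experts are shared experts (always active) and the remaining $mN - K_s$ experts are routed experts (selected dynamically by the router). The motivation is that many tokens need the same basic knowledge, such as grammar and general concepts, and learning it repeatedly in different experts is wasteful. Compressing it into shared experts lets the routed experts focus on more specific knowledge, which improves both parameter efficiency and specialization.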

The gate values of the routed experts are computed as:

$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t}\in\operatorname{Topk}(\{s_{j,t} \mid K_s+1\leqslant j\leqslant mN\}, mK-K_s),\\ 0, & \text{otherwise}, \end{cases} \tag{10}$$

$$s_{i,t} = \operatorname{Softmax}_{i}\left({\mathbf{u}_{t}^{l}}^{T}\mathbf{e}_{i}^{l}\right), \tag{11}$$

where $\mathbf{e}_{i}^{l}$ is the learnable centroid of expert $i$. Shared experts bypass the router, which is equivalent to a gate value fixed at 1. Finally, in DeepSeekMoE, the number of shared experts is $K_s$, the total number of routed experts is $mN - K_s$, and the number of nonzero gates is $mK - K_s$.

We also introduce load balance losses so that the load stays reasonably even across experts. They come in two parts: an expert-level balance loss and a device-level balance loss.
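The combination of always-on shared experts and gated routed experts can be sketched for a single token as follows. This is a toy illustration, not the released model code: the experts are hypothetical scaling functions, the centroids are made-up values, and the softmax is assumed to be taken over the routed experts only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def routed_gates(u, centroids, n_active):
    """Keep the n_active = mK - K_s largest affinity scores, zero the rest."""
    scores = softmax([sum(a * b for a, b in zip(u, e)) for e in centroids])
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    keep = set(order[:n_active])
    return [scores[i] if i in keep else 0.0 for i in range(len(scores))]


def deepseekmoe_layer(u, shared, routed, centroids, n_active):
    """Shared experts always fire; routed experts are weighted by their gates."""
    h = list(u)  # residual term u
    for expert in shared:
        h = [hj + oj for hj, oj in zip(h, expert(u))]
    gates = routed_gates(u, centroids, n_active)
    for g, expert in zip(gates, routed):
        if g > 0.0:  # skip experts the router did not select
            h = [hj + g * oj for hj, oj in zip(h, expert(u))]
    return h, gates


def scale_by(c):
    """Hypothetical stand-in for an expert FFN: multiply the input by c."""
    return lambda x: [c * v for v in x]


# mN = 8 experts in total, K_s = 1 shared, mK = 3 activated, so 2 routed experts fire.
shared = [scale_by(0.5)]
routed = [scale_by(float(c)) for c in range(1, 8)]
centroids = [[0.1 * i, 0.0] for i in range(7)]  # toy centroids
h, gates = deepseekmoe_layer([1.0, -1.0], shared, routed, centroids, n_active=2)
print(gates)
print(h)
```

Note that the shared expert contributes with an implicit weight of 1 regardless of the router, while only two of the seven routed experts contribute at all.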

The expert-level balance loss is computed as:

$$\mathcal{L}_{\mathrm{ExpBal}} = \alpha_{1}\sum_{i=1}^{N'} f_{i}P_{i}, \tag{12}$$

$$f_{i} = \frac{N'}{K'T}\sum_{t=1}^{T}\mathbb{1}(\text{Token } t \text{ selects Expert } i), \tag{13}$$

$$P_{i} = \frac{1}{T}\sum_{t=1}^{T} s_{i,t}, \tag{14}$$

where $\alpha_{1}$ is a hyper-parameter called the expert-level balance factor, $N'$ is equal to $(mN - K_s)$ and $K'$ is equal to $(mK - K_s)$ for brevity, and $\mathbb{1}(\cdot)$ denotes the indicator function. A device-level balance loss of the same form is computed over groups of routed experts deployed on the same device, weighted by a device-level balance factor $\alpha_{2}$.
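A small numeric sketch shows how the expert-level balance loss reacts to routing collapse. It follows the published form of the loss (alpha_1 times the sum over routed experts of f_i * P_i, with f_i the normalized selection frequency and P_i the mean affinity); the token scores and selections below are made-up toy data, and alpha_1 = 0.01 is the value the text reports for the validation experiments.

```python
def expert_balance_loss(scores, selected, k_prime, alpha1):
    """scores[t][i] is s_{i,t}; selected[t] is the set of routed experts chosen for token t."""
    num_tokens = len(scores)
    n_prime = len(scores[0])
    # f_i: how often expert i is selected, scaled so that uniform routing gives f_i = 1.
    f = [
        n_prime / (k_prime * num_tokens) * sum(1 for t in range(num_tokens) if i in selected[t])
        for i in range(n_prime)
    ]
    # P_i: mean affinity score of expert i over all tokens.
    p = [sum(scores[t][i] for t in range(num_tokens)) / num_tokens for i in range(n_prime)]
    return alpha1 * sum(fi * pi for fi, pi in zip(f, p))


ALPHA1 = 0.01

# Balanced toy case: 4 routed experts, 2 active per token, every expert used once.
balanced = expert_balance_loss(
    scores=[[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],
    selected=[{0, 1}, {2, 3}],
    k_prime=2,
    alpha1=ALPHA1,
)

# Collapsed toy case: both tokens pick experts 0 and 1 with high affinity.
collapsed = expert_balance_loss(
    scores=[[0.4, 0.4, 0.1, 0.1], [0.4, 0.4, 0.1, 0.1]],
    selected=[{0, 1}, {0, 1}],
    k_prime=2,
    alpha1=ALPHA1,
)
print(balanced, collapsed)
```

In the balanced case every f_i is 1 and the P_i sum to 1, so the loss equals alpha_1. In the collapsed case the frequently chosen experts also carry the largest affinities, so the loss is larger.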

Here $\alpha_{2}$ is a hyper-parameter called the device-level balance factor. In practice, we set a small expert-level balance factor to mitigate the risk of routing collapse, and meanwhile set a larger device-level balance factor to promote balanced computation across the devices.

3.1 Fine-Grained Expert Segmentation

In scenarios where the number of experts is limited, tokens assigned to a particular expert are more likely to cover diverse types of knowledge. As a consequence, the designated expert will tend to learn vastly different types of knowledge in its parameters, which are hard to utilize simultaneously. However, if each token can be routed to more experts, diverse knowledge can be decomposed and learned in different experts respectively. Each expert can then still retain a high level of specialization, which contributes to a more focused knowledge distribution across experts.

To this end, we propose the fine-grained expert segmentation strategy. We split the FFN intermediate hidden dimension into $m$ parts, creating $mN$ finer-grained experts, each with $1/m$ of the intermediate dimension of an original expert. We then activate $mK$ fine-grained experts. This design brings two advantages:

(1) Finer expert specialization: each fine-grained expert can focus on a narrower slice of knowledge, which enables more precise knowledge learning.

(2) More flexible knowledge combination: activating more fine-grained experts allows different specializations to be combined more flexibly to fit the needs of each token.
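The gain in combinatorial flexibility is easy to quantify. The sketch below counts the distinct sets of activated experts before and after segmentation, using N = 16, K = 2, and m = 4 as illustrative values.

```python
import math


def routing_combinations(n_experts, k_active, m=1):
    """Number of distinct sets of activated experts after splitting each expert m ways."""
    return math.comb(m * n_experts, m * k_active)


coarse = routing_combinations(16, 2)       # choose 2 of 16 experts
fine = routing_combinations(16, 2, m=4)    # choose 8 of 64 fine-grained experts
print(coarse)  # 120
print(fine)    # 4426165368
```

With the same parameter and compute budget, the number of possible expert combinations grows from 120 to more than four billion.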

With fine-grained expert segmentation, the gate values are computed as:

$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t}\in\operatorname{Topk}(\{s_{j,t} \mid 1\leqslant j\leqslant mN\}, mK),\\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

$$s_{i,t} = \operatorname{Softmax}_{i}\left({\mathbf{u}_{t}^{l}}^{T}\mathbf{e}_{i}^{l}\right), \tag{8}$$

where the total number of expert parameters is equal to $N$ times the number of parameters in a standard FFN, and $mN$ denotes the total number of fine-grained experts. Notably, fine-grained expert segmentation increases both the number of experts and the number of activated experts (the number of nonzero gates rises to $mK$) without increasing the total parameter count, which is what enables the finer specialization.

3.2 Shared Expert Isolation

With a conventional routing strategy, tokens assigned to different experts may need some common knowledge or information. As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, which results in redundancy in expert parameters. In contrast, we isolate $K_s$ shared experts that are always activated and are responsible for capturing and consolidating common knowledge across varying contexts. Shared expert isolation has three effects:

(1) Less redundancy among routed experts: common knowledge is handled by the shared experts, so routed experts can focus on specific domains.

(2) Higher parameter efficiency: routed experts no longer need to relearn common knowledge in their own parameters, so parameters are used more effectively.

(3) Stronger expert specialization: each routed expert can concentrate on distinctive aspects, because common knowledge is already covered by the shared experts.

The notable difference from conventional MoE architectures is that there all experts are routed experts, whereas DeepSeekMoE divides experts into shared experts and routed experts, with each kind taking on a different role.

3.3 Load Balance Consideration

Automatically learned routing strategies may encounter the issue of load imbalance, which manifests two notable defects. Firstly, there is a risk of routing collapse (Shazeer et al., 2017), i.e., the model always selects only a few experts, preventing the other experts from being sufficiently trained. Secondly, if experts are distributed across multiple devices, load imbalance can exacerbate computation bottlenecks and hurt training efficiency.

To address these issues, we employ balance losses at two levels. The expert-level balance loss mitigates the risk of routing collapse by encouraging routed experts to be used evenly. The device-level balance loss encourages all devices to carry a similar computational load.

Device-Level Balance Loss. When the aim is to alleviate computation bottlenecks, strict balance at the expert level is unnecessary, because excessive constraints on load balance compromise model performance. Instead, our primary objective is to ensure balanced computation across the devices. If we partition all routed experts into $D$ groups $\{\mathcal{E}_{1}, \mathcal{E}_{2}, \ldots, \mathcal{E}_{D}\}$ and deploy each group on a single device, the device-level balance loss is computed as follows:

$$\mathcal{L}_{\mathrm{DevBal}} = \alpha_{2}\sum_{i=1}^{D} f_{i}' P_{i}', \tag{15}$$

$$f_{i}' = \frac{1}{|\mathcal{E}_{i}|}\sum_{j\in\mathcal{E}_{i}} f_{j}, \tag{16}$$

$$P_{i}' = \sum_{j\in\mathcal{E}_{i}} P_{j}, \tag{17}$$

where $f_{j}$ and $P_{j}$ are the per-expert selection frequency and mean affinity used in the expert-level balance loss, and $\alpha_{2}$ is a hyper-parameter called the device-level balance factor. In practice, we set a small expert-level balance factor to mitigate the risk of routing collapse, and meanwhile set a larger device-level balance factor to promote balanced computation across the devices.
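The device-level loss can be sketched by pooling per-expert statistics within each device group. The sketch assumes the published definitions (group frequency is the mean of the member f_j, group affinity is the sum of the member P_j); the per-expert values, the grouping, and alpha_2 = 0.05 are hypothetical toy choices.

```python
def device_balance_loss(f, p, groups, alpha2):
    """f[j], p[j]: per-expert frequency and mean affinity; groups: expert indices per device."""
    f_dev = [sum(f[j] for j in group) / len(group) for group in groups]  # mean frequency
    p_dev = [sum(p[j] for j in group) for group in groups]               # summed affinity
    return alpha2 * sum(fd * pd for fd, pd in zip(f_dev, p_dev))


ALPHA2 = 0.05  # hypothetical value for illustration

# Toy per-expert statistics for 4 routed experts; expert 0 is overloaded, expert 1 idle.
f = [2.0, 0.0, 1.0, 1.0]
p = [0.4, 0.1, 0.25, 0.25]

# Placing the overloaded and the idle expert on the same device evens out the device load.
same_device = device_balance_loss(f, p, groups=[[0, 1], [2, 3]], alpha2=ALPHA2)
# Splitting them across devices leaves one device busier than the other.
split_devices = device_balance_loss(f, p, groups=[[0, 2], [1, 3]], alpha2=ALPHA2)
print(same_device, split_devices)
```

The first grouping yields the minimum value alpha_2 even though the individual experts are unevenly used, which illustrates why the device-level constraint is looser than the expert-level one.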

4 Validation Experiments

4.1 Experimental Setup

4.1.1 Training Data and Tokenization

Our training data is sampled from a large-scale multilingual corpus created by DeepSeek-AI. The corpus primarily focuses on English and Chinese but also encompasses other languages. It is derived from diverse sources, including web text, mathematical material, coding scripts, published literature, and various other textual materials. For the validation experiments, we sample a subset containing 100B tokens from the corpus to train our models. For tokenization, we use the HuggingFace Tokenizer tools (https://github.com/huggingface/tokenizers) to train byte pair encoding (BPE) tokenizers on a smaller subset of the training corpus. In the validation experiments, we prepare a tokenizer with a vocabulary size of 8K, and the vocabulary size is scaled up when training larger models.

Model Settings. In the validation experiments, we set the number of Transformer layers to 9 and the hidden dimension to 1280. We employ the multi-head attention mechanism with a total of 10 attention heads, where each head has a dimension of 128. For initialization, all learnable parameters are randomly initialized with a standard deviation of 0.006. We substitute all FFNs with MoE layers, and ensure that the total number of expert parameters equals 16 times that of a standard FFN. Additionally, we keep the activated expert parameters, including shared expert parameters and activated routed expert parameters, at 2 times that of a standard FFN.

Training Settings. We train with the AdamW optimizer. The learning rate follows a warmup-and-step-decay schedule: it increases linearly to its maximum of 1.08e-3 during the first 2K steps and is then multiplied by 0.316 at 80% and again at 90% of the training steps. The batch size is 2K sequences, and with a maximum sequence length of 2K, each training batch contains 4M tokens, so 25,000 training steps cover the 100B training tokens.

In the validation experiments, we do not drop any tokens during training and do not employ the device-level balance loss. In order to prevent routing collapse, we set an expert-level balance factor of 0.01. For readability, we also present an overview table of hyper-parameters for DeepSeekMoE across different sizes in Appendix A.

4.1.4 Evaluation Benchmarks

We conduct evaluations on a wide range of benchmarks covering various types of tasks. For language modeling, we evaluate the models on the test set of Pile (Gao et al., 2020), and the evaluation metric is the cross-entropy loss.

Table 1: Evaluation results for validation experiments. In the original table, bold font indicates the best result. Compared with other MoE architectures, DeepSeekMoE exhibits a substantial performance advantage. Only the rows recoverable from the source excerpt are shown.

Metric                   # Shot   Dense   Hash Layer   Switch   GShard   DeepSeekMoE
RACE-high (Acc.)         5-shot   29.0    30.0         30.9     30.4     31.7
HumanEval (Pass@1)       0-shot   0.0     1.2          2.4      3.7      4.9
MBPP (Pass@1)            3-shot   0.2     0.6          0.4      0.2      2.2
TriviaQA (EM)            5-shot   4.9     6.5          8.9      10.2     16.6
NaturalQuestions (EM)    5-shot   1.4     1.4          2.5      3.2      5.7

On every row shown, DeepSeekMoE 2B scores highest, for example 4.9 versus 3.7 for GShard on HumanEval and 16.6 versus 10.2 on TriviaQA. Its performance approaches that of GShard 2.9B, a larger MoE model, which points to the parameter efficiency of the architecture.

Table 1 supports three observations: (1) With sparse architectures and more total parameters, Hash Layer and Switch Transformer achieve significantly stronger performance than the dense baseline with the same number of activated parameters. (2) Compared with Hash Layer and Switch Transformer, GShard has more activated parameters and achieves slightly better performance than Switch Transformer. (3) With the same number of total parameters and activated parameters, DeepSeekMoE demonstrates overwhelming advantages over GShard. These results showcase the superiority of the DeepSeekMoE architecture within the existing landscape of MoE architectures.

Table 2 (configuration rows recoverable from the source excerpt), where a + b denotes a shared experts and b routed experts:

Metric                   # Shot   GShard×1.5   Dense×16   DeepSeekMoE
Relative Expert Size     N/A      1.5          1          0.25
# Experts                N/A      0 + 16       16 + 0     1 + 63
# Activated Experts      N/A      0 + 2        16 + 0     1 + 7
# Total Expert Params    N/A      2.83B        1.89B      1.89B
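The Table 2 configuration rows can be sanity-checked with a little arithmetic, measuring everything in units of one standard FFN. The sketch assumes that "a + b" means a shared plus b routed experts and that relative expert size multiplies directly into parameter count; the dictionary layout is only an illustration.

```python
from fractions import Fraction

# Relative expert size and (shared, routed) counts copied from the Table 2 rows.
configs = {
    "GShard x1.5": {"size": Fraction(3, 2), "experts": (0, 16), "active": (0, 2)},
    "Dense x16": {"size": Fraction(1), "experts": (16, 0), "active": (16, 0)},
    "DeepSeekMoE": {"size": Fraction(1, 4), "experts": (1, 63), "active": (1, 7)},
}


def ffn_units(cfg, key):
    """Expert parameters in multiples of one standard FFN."""
    return cfg["size"] * sum(cfg[key])


totals = {name: ffn_units(cfg, "experts") for name, cfg in configs.items()}
actives = {name: ffn_units(cfg, "active") for name, cfg in configs.items()}
print(totals)
print(actives)

# Reported totals from the table: 2.83B for GShard x1.5 versus 1.89B for DeepSeekMoE.
reported_ratio = 2.83 / 1.89
print(round(reported_ratio, 3))
```

DeepSeekMoE and Dense×16 both hold 16 FFN units of expert parameters, but DeepSeekMoE activates only 2 of them per token, while GShard×1.5 holds 24 units and activates 3. The 24:16 ratio matches the reported 2.83B versus 1.89B.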

To gauge the performance of DeepSeekMoE more precisely, we also compare it with larger baselines that have more total parameters or activated parameters. The comparisons enable us to estimate the model size that GShard or dense baselines require to achieve performance equivalent to DeepSeekMoE.

Comparison with GShard×1.5. Table 2 shows the comparison between DeepSeekMoE and a larger GShard model with 1.5 times the expert size, which results in 1.5 times both expert parameters and expert computation. Overall, we observe that DeepSeekMoE achieves comparable performance with GShard×1.5, underscoring the significant advantage inherent in the DeepSeekMoE architecture.

Comparison with Dense×16. DeepSeekMoE 2B nearly approaches the performance of the dense model with the same total number of parameters, which sets the upper bound of MoE models.

We also provide additional comparisons with Dense×4 in Appendix B.

Figure 3: Ablation studies for DeepSeekMoE. The performance is normalized by the best performance for clarity in presentation. All compared models have the same number of parameters and activated parameters. Fine-grained expert segmentation and shared expert isolation both contribute to stronger overall performance.

4.4 Ablation Studies

To substantiate the effectiveness of the fine-grained expert segmentation and shared expert isolation strategies, we conduct ablation studies for DeepSeekMoE.

Ratios Between Shared and Routed Experts. Based on the finest granularity with 64 total experts, and keeping the number of total experts and activated experts constant, we attempt to isolate 1, 2, and 4 experts as shared ones. We find that different ratios of shared experts to routed experts do not significantly impact the performance: 1, 2, and 4 shared experts achieve a Pile loss of 1.808, 1.806, and 1.811, respectively. This suggests that the shared expert isolation strategy is robust to this choice. Considering that the ratio of 1:3 yields a marginally better Pile loss, we keep the ratio between shared experts and activated routed experts at 1:3 when scaling up DeepSeekMoE.
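The link between "2 shared experts" and "a ratio of 1:3" follows from the activated-expert budget. The sketch below assumes 8 activated experts in total, a figure taken from the 1 + 7 activated-experts entry in Table 2 rather than from this paragraph; the Pile losses are the ones reported above.

```python
from fractions import Fraction

ACTIVATED_EXPERTS = 8  # assumption: 1 shared + 7 routed experts are activated in Table 2

# Pile loss reported for each number of shared experts.
pile_loss = {1: 1.808, 2: 1.806, 4: 1.811}


def shared_to_routed_ratio(n_shared, n_activated=ACTIVATED_EXPERTS):
    """Ratio of shared experts to activated routed experts."""
    return Fraction(n_shared, n_activated - n_shared)


best = min(pile_loss, key=pile_loss.get)
for n_shared, loss in pile_loss.items():
    print(n_shared, shared_to_routed_ratio(n_shared), loss)
print("best:", best, "ratio:", shared_to_routed_ratio(best))
```

Under that assumption, 2 shared experts leave 6 activated routed experts, which gives the 1:3 ratio, and the spread between the best and worst setting is only 0.005 in Pile loss.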

Shared Experts Are Irreplaceable by Routed Experts. In order to investigate the role of the shared expert in DeepSeekMoE, we disable it and activate one more routed expert. The evaluation on Pile shows a significant increase in the Pile loss, rising from 1.808 to 2.414, even though we maintain the same computational cost. This result highlights the crucial function of the shared expert and indicates that it captures fundamental and essential knowledge that is not shared with the routed experts, which makes it irreplaceable by routed ones.

The evaluation results shown in Figure 6 demonstrate that, even with the same total expert parameters and only half of the activated expert parameters, DeepSeekMoE still outperforms GShard. This highlights the ability of DeepSeekMoE to leverage expert parameters more efficiently, i.e., the proportion of effective parameters in the activated experts is much higher than that of GShard.

4.1 Experimental Setup

（4.1 Experimental Setup的详细内容，翻译见上面对应章节） DeepSeek团队通过创新的架构设计和训练方法，在该领域取得了显著进展。模型在相关基准测试中表现出色，验证了这一方法的有效性。这一成果为开源AI社区做出了重要贡献，推动了技术发展。未来将继续优化和改进相关技术。

原文: 4.1.1 Training Data and Tokenization Our training data is sampled from a large-scale multilingual corpus created by DeepSeek-AI. The corpus primarily focuses on English and Chinese but also encompasses other languages. It is derived from diverse sources, including web text, mathematical material, coding scripts, published literature, and various other textual materials. For the purpose of validation experiments, we sample a subset containing 100B tokens from the corpus to train our models. For tokenization, we utilize the HuggingFace Tokenizer 2 2 2 https://github.com/huggingface/tokenizers to...

原文: the validation experiments, we set the number of Transformer layers to 9 and the hidden dimension to 1280. We employ the multi-head attention mechanism with a total of 10 attention heads, where each head has a dimension of 128. For initialization, all learnable parameters are randomly initialized with a standard deviation of 0.006. We substitute all FFNs with MoE layers, and ensure that the total number of expert parameters equals 16 times that of a standard FFN. Additionally, we keep the activated expert parameters, including shared expert parameters and activated routed expert parameters, as...

原文: ens during training and do not employ the device-level balance loss. In order to prevent routing collapse, we set an expert-level balance factor of 0.01. For readability, we also present an overview table of hyper-parameters for DeepSeekMoE across different sizes in Appendix A . 4.1.4 Evaluation Benchmarks We conduct evaluations on a wide range of benchmarks covering various types of tasks. We list the benchmarks as follows. Language Modeling. For language modeling, we evaluate the models on the test set of Pile (Gao et al., 2020 ) , and the evaluation metric is the cross-entropy loss. Languag...

原文: 29.0 | 30.0 | 30.9 | 30.4 | 31.7
HumanEval (Pass@1) | 0-shot | 0.0 | 1.2 | 2.4 | 3.7 | 4.9
MBPP (Pass@1) | 3-shot | 0.2 | 0.6 | 0.4 | 0.2 | 2.2
TriviaQA (EM) | 5-shot | 4.9 | 6.5 | 8.9 | 10.2 | 16.6
NaturalQuestions (EM) | 5-shot | 1.4 | 1.4 | 2.5 | 3.2 | 5.7

Table 1: Evaluation results for validation experiments. Bold font indicates the best. Compared with other MoE architectures, DeepSeekMoE exhibits a substantial performance advantage.

4.2 Evaluations

原文: Baselines. Including DeepSeekMoE, we compare five models for validation experiments. Dense denotes a standard dense Transformer language model with 0.2B total parameters. Hash Layer (Roller et al., 2021 ) is an MoE architecture based on top-1 hash routing, with 2.0B total parameters and 0.2B activated parameters, aligned with the dense baseline. Switch Transformer (Fedus et al., 2021 ) is another well-known MoE architecture based on top-1 learnable routing, with total parameters and activated parameters the same as Hash Layer. GShard (Lepikhin et al., 2021 ) employs a top-2 learnable routing s...

原文: SeekMoE
Relative Expert Size | N/A | 1.5 | 1 | 0.25
# Experts | N/A | 0 + 16 | 16 + 0 | 1 + 63
# Activated Experts | N/A | 0 + 2 | 16 + 0 | 1 + 7
# Total Expert Params | N/A | 2.83B | 1.89B | 1.89B
# Activated Expert Params | N/A | 0.35B | 1.89B | 0.24B
FLOPs per 2K Tokens | N/A | 5.8T | 24.6T | 4.3T
# Training Tokens | N/A | 100B | 100B | 100B
Pile (Loss) | N/A | 1.808 | 1.806 | 1.808
HellaSwag (Acc.) | 0-shot | 54.4 | 55.1 | 54.8
PIQA (Acc.) | 0-shot | 71.1 | 71.9 | 72.3
ARC-easy (Acc.) | 0-shot | 47.3 | 51.9 | 49.4
ARC-challenge (Acc.) | 0-shot | 34.1 | 33.8 | 34.3
RACE-middle (Acc.) | 5-shot | 46.4 | 46.3 | 44.0
RACE-high (Acc.) | 5-shot | 32.4 | 33.0 | 31.7
HumanEval (Pass@1) | 0-shot | 3.0 | 4.3 | 4.9
MBP...

4.3 DeepSeekMoE Aligns Closely with the Upper Bound of MoE Models

原文: We have demonstrated that DeepSeekMoE outperforms the dense baseline and other MoE architectures. In order to provide a more precise understanding of the performance of DeepSeekMoE, we compare it with larger baselines with more total parameters or activated parameters. The comparisons enable us to estimate the required model size of GShard or dense baselines to achieve equivalent performance to DeepSeekMoE. Comparison with GShard×1.5. Table 2 shows the comparison between DeepSeekMoE and a larger GShard model with 1.5 times the expert size, which results in 1.5 times bo...

原文: results suggest that, at least at the scale of about 2B parameters and 100B training tokens, the performance of DeepSeekMoE aligns closely with the theoretical upper bound of MoE models. Also, we provide additional comparisons with Dense×4 in Appendix B.

Figure 3: Ablation studies for DeepSeekMoE. The performance is normalized by the best performance for clarity in presentation. All compared models have the same number of parameters and activated parameters. We can find that fine-grained expert segmentation and shared expert isolation both contribute to stronger overall p...

4.4 Ablation Studies

原文: In order to substantiate the effectiveness of the fine-grained expert segmentation and shared expert isolation strategies, we conduct ablation studies for DeepSeekMoE and present the results in Figure 3 . For a fair comparison, we ensure all models included in the comparison have the same number of total parameters and activated parameters. Shared Expert Isolation. In order to evaluate the influence of the shared expert isolation strategy, we isolate one expert as the shared one based on GShard. From Figure 3 , we observe that compared with GShard, the intentional isolation of a shared expert ...

原文: lds a marginally better Pile loss, when scaling up DeepSeekMoE, we keep the ratio between shared experts and activated routed experts as 1:3.

4.5 Analysis on Expert Specialization

原文: In this section, we conduct an empirical analysis on the expert specialization of DeepSeekMoE 2B. DeepSeekMoE 2B in this section refers to the model reported in Table 1 , i.e., comprising 2.0B total parameters, with 1 shared expert and 7 out of 63 routed experts being activated. Figure 4: Pile loss with regard to different ratios of disabled top routed experts. Notably, DeepSeekMoE exhibits greater sensitivity to the ratio of disabled top routed experts, indicating lower redundancy among routed experts in DeepSeekMoE. DeepSeekMoE Exhibits Lower Redundancy Among Routed Experts. In order to asse...
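
The configuration named above (1 shared expert that is always active, plus 7 of 63 routed experts per token) can be sketched as a gating step. This is only a minimal illustration, not the authors' implementation: the helper name `route_token`, the softmax over router logits, and the random logits are all assumptions made for the example.

```python
import math
import random

NUM_SHARED = 1   # shared experts, always active (from the text)
NUM_ROUTED = 63  # routed experts (from the text)
TOP_K = 7        # routed experts activated per token (from the text)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return {routed_expert_index: gate_weight} for the top-k routed experts.

    Hypothetical helper: affinities are a softmax over the router logits,
    and only the k largest are kept; every other routed expert is skipped.
    """
    scores = softmax(router_logits)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {i: scores[i] for i in ranked[:top_k]}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]  # stand-in router output
gates = route_token(logits)
active_experts = NUM_SHARED + len(gates)

print(sorted(gates))    # indices of the selected routed experts
print(active_experts)   # 8 experts run for this token (1 shared + 7 routed)
```

Because the shared expert bypasses the router entirely, disabling it and activating one more routed expert keeps the compute the same but changes which parameters see every token, which is the intervention the text describes.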

原文: aceable by routed ones. Figure 5: Pile loss with regard to different numbers of activated routed experts in DeepSeekMoE. With only 4 routed experts activated, DeepSeekMoE achieves a Pile loss comparable with GShard. Figure 6: Comparison between GShard and DeepSeekMoE with half the activated experts (trained from scratch). With the same total expert parameters and only half of the activated expert parameters, DeepSeekMoE still outperforms GShard. DeepSeekMoE Acquires Knowledge More Accurately. In order to validate our claim that higher flexibility in combining activated experts contributes to a...

5 Scaling up to DeepSeekMoE 16B

原文: With the DeepSeekMoE architecture, we scale up our MoE model to a larger scale with 16B total parameters and train it on 2T tokens. Our results demonstrate that compared with LLaMA2 7B, DeepSeekMoE 16B achieves superior performance with only about 40% of computations. 5.1 Experimental Setup 5.1.1 Training Data and Tokenization We sample the training data from the same corpus as described in Section 4.1.1 . Different from the validation experiments, we sample a larger amount of data with 2T tokens, aligning with the number of training tokens of LLaMA2 7B. We also use the HuggingFace Tokenizer t...

原文: 2}=0.95, and $\mathrm{weight\_decay}=0.1$. The learning rate is also scheduled using a warmup-and-step-decay strategy. Initially, the learning rate linearly increases from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 at 80% of the training steps, and again by 0.316 at 90% of the training steps. The maximum learning rate for DeepSeekMoE 16B is set to $4.2 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. The batch size is set to 4.5K, and with a ma...

原文: we additionally consider DROP (Dua et al., 2019 ) . The evaluation metric is the Exactly Matching (EM) rate. Math Reasoning. For math reasoning, we additionally incorporate GSM8K (Cobbe et al., 2021 ) and MATH (Hendrycks et al., 2021 ) , using EM as the evaluation metric. Multi-Subject Multiple-Choice. For multi-subject multiple-choice, we additionally evaluate the models on MMLU (Hendrycks et al., 2020 ) . The evaluation metric is accuracy. Disambiguation. For disambiguation, we additionally consider WinoGrande (Sakaguchi et al., 2019 ) and the evaluation metric is accuracy. Chinese Benchmark...

原文: 0-shot | 67.9 | 68.1
ARC-challenge (Acc.) | 0-shot | 48.1 | 49.8
RACE-middle (Acc.) | 5-shot | 63.2 | 61.9
RACE-high (Acc.) | 5-shot | 46.5 | 46.4
DROP (EM) | 1-shot | 34.9 | 32.9
GSM8K (EM) | 8-shot | 17.4 | 18.8
MATH (EM) | 4-shot | 3.3 | 4.3
HumanEval (Pass@1) | 0-shot | 26.2 | 26.8
MBPP (Pass@1) | 3-shot | 39.0 | 39.2
TriviaQA (EM) | 5-shot | 59.7 | 64.8
NaturalQuestions (EM) | 5-shot | 22.2 | 25.5
MMLU (Acc.) | 5-shot | 48.2 | 45.0
WinoGrande (Acc.) | 0-shot | 70.5 | 70.2
CLUEWSC (EM) | 5-shot | 73.1 | 72.1
CEval (Acc.) | 5-shot | 45.0 | 40.6
CMMLU (Acc.) | 5-shot | 47.2 | 42.5
CHID (Acc.) | 0-shot | 89.3 | 89.4

Table 3: Comparison between DeepSeek 7B and DeepSeekMoE 16B. Bold font indi...

原文: ile DeepSeek 7B has 2.5B attention parameters). Our earlier investigation on DeepSeek 7B reveals a positive correlation between the attention capacity and performance on multiple-choice tasks. For example, DeepSeek 7B MQA, which is equipped with the multi-query attention mechanism (Shazeer, 2019 ) , also struggled in MMLU-like tasks. In addition, for a more comprehensive understanding of the training process of DeepSeekMoE 16B, we also provide the benchmark curves of DeepSeekMoE 16B and DeepSeek 7B (Dense) during training in Appendix C for reference. Critically, due to the modest number of par...

原文: model with 6.7B parameters. Both DeepSeekMoE 16B and LLaMA2 7B are pretrained on 2T tokens. Compared with LLaMA2 7B, DeepSeekMoE has 245% of total parameters but only needs 39.6% of computations. The results on our internal benchmarks are presented in Table 4 , leading to the following observations. (1) Among the evaluated benchmarks, with only about 40% of computations, DeepSeekMoE 16B outperforms LLaMA2 7B on the majority of benchmarks. (2) The math reasoning and code generation capabilities of DeepSeekMoE 16B are stronger than LLaMA2 7B, attributed to the enriched presence of mathematical a...

5.1 Experimental Setup

原文: 5.1.1 Training Data and Tokenization We sample the training data from the same corpus as described in Section 4.1.1 . Different from the validation experiments, we sample a larger amount of data with 2T tokens, aligning with the number of training tokens of LLaMA2 7B. We also use the HuggingFace Tokenizer tools to train a BPE tokenizer, but the vocabulary size is set to 100K for DeepSeekMoE 16B. 5.1.2 Hyper-Parameters Model Settings. For DeepSeekMoE 16B, we set the number of Transformer layers to 28 and the hidden dimension to 2048. We employ the multi-head attention mechanism with a total of ...

原文: rate is multiplied by 0.316 at 80% of the training steps, and again by 0.316 at 90% of the training steps. The maximum learning rate for DeepSeekMoE 16B is set to $4.2 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. The batch size is set to 4.5K, and with a maximum sequence length of 4K, each training batch contains 18M tokens. Correspondingly, the total number of training steps is set to 106,449 to achieve 2T training tokens. Due to the abundance of training data, we do not use dropout during training. We leverage pipeline parallelism to deploy d...
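
The warmup-and-step-decay schedule described for DeepSeekMoE 16B (linear warmup over the first 2K steps, then a multiplication by 0.316 at 80% and again at 90% of training) can be written as a small function. This is a sketch under assumptions: the function name `learning_rate`, the zero-based step index, and the use of `>=` at the decay boundaries are illustrative choices that the text does not specify.

```python
def learning_rate(step, total_steps=106_449, max_lr=4.2e-4,
                  warmup_steps=2_000, decay_factor=0.316):
    """Warmup-and-step-decay learning rate for a given training step.

    Numeric defaults are the values quoted in the text for DeepSeekMoE 16B;
    boundary handling is an assumption.
    """
    if step < warmup_steps:
        # Linear increase from 0 to max_lr during warmup.
        return max_lr * step / warmup_steps
    lr = max_lr
    if step >= 0.8 * total_steps:
        lr *= decay_factor  # first decay at 80% of training
    if step >= 0.9 * total_steps:
        lr *= decay_factor  # second decay at 90% of training
    return lr


for probe in (0, 1_000, 50_000, 90_000, 100_000):
    print(probe, learning_rate(probe))
```

Since 0.316 is close to the square root of 0.1, the two decays together leave the final learning rate at roughly one tenth of the maximum.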

原文: ple-Choice. For multi-subject multiple-choice, we additionally evaluate the models on MMLU (Hendrycks et al., 2020 ) . The evaluation metric is accuracy. Disambiguation. For disambiguation, we additionally consider WinoGrande (Sakaguchi et al., 2019 ) and the evaluation metric is accuracy. Chinese Benchmarks. Since DeepSeekMoE 16B is pretrained on a bilingual corpus, we also evaluate it on four Chinese benchmarks. CLUEWSC (Xu et al., 2020 ) is a Chinese disambiguation benchmark. CEval (Huang et al., 2023 ) and CMMLU (Li et al., 2023 ) are two Chinese multi-subject multiple-choice benchmarks wi...

5.2 Evaluations

原文:
Metric | # Shot | DeepSeek 7B (Dense) | DeepSeekMoE 16B
# Total Params | N/A | 6.9B | 16.4B
# Activated Params | N/A | 6.9B | 2.8B
FLOPs per 4K Tokens | N/A | 183.5T | 74.4T
# Training Tokens | N/A | 2T | 2T
Pile (BPB) | N/A | 0.75 | 0.74
HellaSwag (Acc.) | 0-shot | 75.4 | 77.1
PIQA (Acc.) | 0-shot | 79.2 | 80.2
ARC-easy (Acc.) | 0-shot | 67.9 | 68.1
ARC-challenge (Acc.) | 0-shot | 48.1 | 49.8
RACE-middle (Acc.) | 5-shot | 63.2 | 61.9
RACE-high (Acc.) | 5-shot | 46.5 | 46.4
DROP (EM) | 1-shot | 34.9 | 32.9
GSM8K (EM) | 8-shot | 17.4 | 18.8
MATH (EM) | 4-shot | 3.3 | 4.3
HumanEval (Pass@1) | 0-shot | 26.2 | 26.8
MBPP (Pass@1) | 3-shot | 39.0 | 39.2
TriviaQA (EM) | 5-shot | 59.7 | 64.8
NaturalQuestion...

原文: 2022a ) . (3) Compared with the excellent performance on other tasks, DeepSeekMoE exhibits limitations in addressing multiple-choice tasks. This inadequacy stems from the limited attention parameters in DeepSeekMoE 16B (DeepSeekMoE 16B has only about 0.5B attention parameters, while DeepSeek 7B has 2.5B attention parameters). Our earlier investigation on DeepSeek 7B reveals a positive correlation between the attention capacity and performance on multiple-choice tasks. For example, DeepSeek 7B MQA, which is equipped with the multi-query attention mechanism (Shazeer, 2019 ) , also struggled in M...

原文: forms LLaMA2 7B on the majority of benchmarks. 5.2.2 Comparison with Open Source Models Internal Comparison with LLaMA2 7B. In the realm of open source models, we mainly compare DeepSeekMoE 16B with LLaMA2 7B (Touvron et al., 2023b ) , a well-known and strong open source language model with 6.7B parameters. Both DeepSeekMoE 16B and LLaMA2 7B are pretrained on 2T tokens. Compared with LLaMA2 7B, DeepSeekMoE has 245% of total parameters but only needs 39.6% of computations. The results on our internal benchmarks are presented in Table 4 , leading to the following observations. (1) Among the eval...

原文: n results, as presented in Figure 1 , show that DeepSeekMoE 16B consistently outperforms models with similar activated parameters by a large margin. Moreover, it achieves comparable performance with LLaMA2 7B, which has approximately 2.5 times the activated parameters.
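
The ratios quoted in this comparison (245% of the total parameters, 39.6% of the computation) can be checked from the reported figures. The snippet below is a worked calculation only; the dictionary layout is an assumption, and the 187.9T FLOPs value for the LLaMA2 7B architecture is taken from the chat-model comparison table (Table 5) that appears later in this document.

```python
# Figures quoted in the text and tables
# (parameters in billions, FLOPs in trillions per 4K tokens).
llama2_7b = {"total_params": 6.7, "activated_params": 6.7, "flops": 187.9}
deepseekmoe_16b = {"total_params": 16.4, "activated_params": 2.8, "flops": 74.4}

param_ratio = deepseekmoe_16b["total_params"] / llama2_7b["total_params"]
compute_ratio = deepseekmoe_16b["flops"] / llama2_7b["flops"]
activated_ratio = llama2_7b["activated_params"] / deepseekmoe_16b["activated_params"]

print(f"total parameters: {param_ratio:.0%} of LLaMA2 7B")
print(f"computation:      {compute_ratio:.1%} of LLaMA2 7B")
print(f"LLaMA2 7B activates {activated_ratio:.1f}x as many parameters")
```

The first two lines reproduce the quoted 245% and 39.6%. The activated-parameter ratio computed from these rounded figures comes out near 2.4, which the text describes as approximately 2.5.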

6 Alignment for DeepSeekMoE 16B

原文: Previous research indicates that MoE models typically do not emerge significant gains from fine-tuning (Fedus et al., 2021 ; Artetxe et al., 2022 ) . However, Shen et al. ( 2023 ) present findings suggesting that MoE models can indeed benefit from instruction tuning. In order to assess whether DeepSeekMoE 16B can benefit from fine-tuning, we conduct supervised fine-tuning to construct a chat model based on DeepSeekMoE 16B. The experimental results reveal that DeepSeekMoE Chat 16B also achieves comparable performance with LLaMA2 SFT 7B and DeepSeek Chat 7B. 6.1 Experimental Setup Training Data....

原文: essment of the reasoning ability of the chat models.

Metric | # Shot | LLaMA2 SFT 7B | DeepSeek Chat 7B | DeepSeekMoE Chat 16B
# Total Params | N/A | 6.7B | 6.9B | 16.4B
# Activated Params | N/A | 6.7B | 6.9B | 2.8B
FLOPs per 4K Tokens | N/A | 187.9T | 183.5T | 74.4T
HellaSwag (Acc.) | 0-shot | 67.9 | 71.0 | 72.2
PIQA (Acc.) | 0-shot | 76.9 | 78.4 | 79.7
ARC-easy (Acc.) | 0-shot | 69.7 | 70.2 | 69.9
ARC-challenge (Acc.) | 0-shot | 50.8 | 50.2 | 50.0
BBH (EM) | 3-shot | 39.3 | 43.1 | 42.2
RACE-middle (Acc.) | 5-shot | 63.9 | 66.1 | 64.8
RACE-high (Acc.) | 5-shot | 49.6 | 50.8 | 50.6
DROP (EM) | 1-shot | 40.0 | 41.7 | 33.8
GSM8K (EM) | 0-shot | 63.4 | 62.6 | 62.2
MATH (EM) | 4-shot | 13.5 | 14.7 | 15.2
Hu...

原文: nsuming nearly 40% of computations, achieves comparable performance with 7B dense models across language understanding and reasoning (PIQA, ARC, BBH), machine reading comprehension (RACE), mathematical (GSM8K, MATH), and knowledge-intensive tasks (TriviaQA, NaturalQuestions). (2) On code generation tasks, DeepSeekMoE Chat 16B significantly outperforms LLaMA2 SFT 7B, demonstrating notable improvements on HumanEval and MBPP. In addition, it also surpasses DeepSeek Chat 7B. (3) On multiple-choice question answering benchmarks including MMLU, CEval, and CMMLU, DeepSeekMoE Chat 16B still falls behi...

6.1 Experimental Setup

原文: Training Data. For training the chat model, we conduct supervised fine-tuning (SFT) on our in-house curated data, comprising 1.4M training examples. This dataset spans a broad range of categories including math, code, writing, question answering, reasoning, summarization, and more. The majority of our SFT training data is in English and Chinese, rendering the chat model versatile and applicable in bilingual scenarios. Hyper-Parameters. During supervised fine-tuning, we set the batch size to 1024 examples and conduct training over 8 epochs using the AdamW optimizer (Loshchilov and Hutter, 2019 ...

原文: 5 | 14.7 | 15.2
HumanEval (Pass@1) | 0-shot | 35.4 | 45.1 | 45.7
MBPP (Pass@1) | 3-shot | 27.8 | 39.0 | 46.2
TriviaQA (EM) | 5-shot | 60.1 | 59.5 | 63.3
NaturalQuestions (EM) | 0-shot | 35.2 | 32.7 | 35.1
MMLU (Acc.) | 0-shot | 50.0 | 49.7 | 47.2
WinoGrande (Acc.) | 0-shot | 65.1 | 68.4 | 69.0
CLUEWSC (EM) | 5-shot | 48.4 | 66.2 | 68.2
CEval (Acc.) | 0-shot | 35.1 | 44.7 | 40.0
CMMLU (Acc.) | 0-shot | 36.9 | 51.2 | 49.3

Table 5: Comparison among LLaMA2 SFT 7B, DeepSeek Chat 7B and DeepSeekMoE Chat 16B, with all of these three models fine-tuned on the same SFT data. Compared with both 7B dense models, DeepSeekMoE Chat 16B still achieves comparable or better performance...

6.2 Evaluations

原文: Baselines. In order to validate the potential of DeepSeekMoE 16B after alignment, we conduct supervised fine-tuning for LLaMA2 7B, DeepSeek 7B, and DeepSeekMoE 16B, where we utilize totally the same fine-tuning data to ensure fairness. Correspondingly, we construct three chat models, including LLaMA2 SFT 7B (footnote 3: We use LLaMA2 SFT to distinguish from the official LLaMA2 Chat (Touvron et al., 2023b) model.), DeepSeek Chat 7B, and DeepSeekMoE Chat 16B. Subsequently, we compare DeepSeekMoE Chat 16B with the other two dense chat models (with about 2.5 times the FLOPs) across a wide range of downs...

原文: alidates its consistent advantages in achieving comparable performance with dense models while using only about 40% of computations.

7 DeepSeekMoE 145B Ongoing

原文: Encouraged by the outstanding performance of DeepSeekMoE 16B, we further undertake a preliminary endeavor to scale up DeepSeekMoE to 145B. In this initial study, DeepSeekMoE 145B is trained on 245B tokens, but it has demonstrated consistent advantages over the GShard architecture and shown promise to match or exceed the performance of DeepSeek 67B (Dense). Furthermore, upon the completion of the final version and full training of DeepSeekMoE 145B, we also plan to make it publicly available. 7.1 Experimental Setup Training Data and Tokenization. For DeepSeekMoE 145B, we employ exactly the same ...

原文: alue during the first 2K steps. Subsequently, the learning rate keeps constant during the remaining training process. The maximum learning rate for DeepSeekMoE 145B is set to $3.0 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. The batch size is set to 4.5K, and with a maximum sequence length of 4K, each training batch contains 18M tokens. We train DeepSeekMoE 145B for 13,000 steps, achieving 245B training tokens. Also, we do not use dropout during training. We leverage pipeline parallelism to deploy different layers of a model on different device...
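
The token counts quoted here follow from the batch settings. The calculation below is a sketch that assumes "4.5K" sequences means 4608 and "4K" tokens means 4096, which are the values listed in the Appendix A hyper-parameter table; the variable names are illustrative.

```python
BATCH_SEQUENCES = 4608   # "4.5K" sequences per batch (Appendix A value)
SEQUENCE_LENGTH = 4096   # "4K" tokens per sequence (Appendix A value)

tokens_per_batch = BATCH_SEQUENCES * SEQUENCE_LENGTH

# Training steps quoted in the text for the two larger models.
steps = {"DeepSeekMoE 16B": 106_449, "DeepSeekMoE 145B": 13_000}
total_tokens = {name: n * tokens_per_batch for name, n in steps.items()}

print(f"tokens per batch: {tokens_per_batch / 2**20:.0f}M")
for name, tokens in total_tokens.items():
    print(f"{name}: {tokens / 1e9:.0f}B training tokens")
```

Under these assumptions each batch holds 18 × 2^20 tokens (the "18M" in the text), and the two step counts give roughly 2T and 245B training tokens respectively.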

原文: 8K (EM) | 8-shot | 11.8 | 6.4 | 12.2 | 13.8
MATH (EM) | 4-shot | 2.1 | 1.6 | 3.1 | 2.8
HumanEval (Pass@1) | 0-shot | 23.8 | 17.7 | 19.5 | 23.2
MBPP (Pass@1) | 3-shot | 33.6 | 27.6 | 33.2 | 32.0
TriviaQA (EM) | 5-shot | 57.2 | 52.5 | 61.1 | 59.8
NaturalQuestions (EM) | 5-shot | 22.6 | 19.0 | 25.0 | 23.5
MMLU (Acc.) | 5-shot | 45.1 | 26.3 | 39.4 | 37.5
WinoGrande (Acc.) | 0-shot | 70.7 | 67.6 | 71.9 | 70.8
CLUEWSC (EM) | 5-shot | 69.1 | 65.7 | 71.9 | 72.6
CEval (Acc.) | 5-shot | 40.3 | 26.2 | 37.1 | 32.8
CMMLU (Acc.) | 5-shot | 40.6 | 25.4 | 35.9 | 31.9
CHID (Acc.) | 0-shot | 88.5 | 86.9 | 90.3 | 88.3

Table 6: Comparison among DeepSeek 67B (Dense) and MoE models at the scale of about 140B total parameters. In the...

原文: hyper-parameters. Results. From the evaluation results presented in Table 6 , we have the following observations: (1) Despite having comparable total parameters and computations, DeepSeekMoE 145B significantly outperforms GShard 137B, highlighting the advantages of the DeepSeekMoE architecture again. (2) On the whole, with only 28.5% of computations, DeepSeekMoE 145B achieves comparable performance with DeepSeek 67B (Dense). Consistent with the findings from DeepSeekMoE 16B, DeepSeekMoE 145B exhibits remarkable strengths in language modeling and knowledge-intensive tasks, but with limitations ...

7.1 Experimental Setup

原文: Training Data and Tokenization. For DeepSeekMoE 145B, we employ exactly the same training corpus and tokenizer as DeepSeekMoE 16B, with the only difference being that DeepSeekMoE 145B is trained on 245B tokens for an initial study. Model Settings. For DeepSeekMoE 145B, we set the number of Transformer layers to 62 and the hidden dimension to 4096. We employ the multi-head attention mechanism with a total of 32 attention heads, where each head has a dimension of 128. As for initialization, all learnable parameters are randomly initialized with a standard deviation of 0.006. As in DeepSeekMoE 16...

原文: ge pipeline parallelism to deploy different layers of a model on different devices, and for each layer, all the routed experts will be uniformly deployed on 4 devices (i.e., expert parallelism combined with data parallelism). Since we employ expert parallelism for DeepSeekMoE 145B, the device-level load balance should be considered to reduce the computational bottleneck. In response, we set the device-level balance factor to 0.05 to encourage balanced computation across devices. Also, we still set a small expert-level balance factor of 0.003 to prevent routing collapse. Evaluation Benchmarks. ...
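
To make the role of the expert-level balance factor concrete, here is a hedged sketch of an auxiliary balance loss. It assumes the common form in which the loss is the factor times the sum over routed experts of f_i · P_i, where f_i is the normalized fraction of tokens routed to expert i and P_i is the mean gate score of expert i. The exact definitions used by the paper live in its method section, which is not part of this excerpt, so the function `expert_balance_loss` and its toy inputs are assumptions for illustration.

```python
def expert_balance_loss(gate_scores, top_k, alpha):
    """Auxiliary loss that grows when routing concentrates on few experts.

    gate_scores[t][i] is the affinity of token t for routed expert i
    (each row sums to 1). Assumed form: alpha * sum_i f_i * P_i.
    """
    num_tokens = len(gate_scores)
    num_experts = len(gate_scores[0])
    counts = [0] * num_experts
    mean_score = [0.0] * num_experts
    for row in gate_scores:
        ranked = sorted(range(num_experts), key=lambda i: row[i], reverse=True)
        for i in ranked[:top_k]:
            counts[i] += 1
        for i, score in enumerate(row):
            mean_score[i] += score / num_tokens
    # Scaled so that perfectly uniform routing gives f_i = 1 for every expert.
    f = [num_experts * c / (top_k * num_tokens) for c in counts]
    return alpha * sum(fi * pi for fi, pi in zip(f, mean_score))


ALPHA = 0.003  # expert-level balance factor quoted for DeepSeekMoE 145B

# Toy data: 4 tokens, 4 routed experts, top-1 routing.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4  # every token picks expert 0

loss_balanced = expert_balance_loss(balanced, top_k=1, alpha=ALPHA)
loss_collapsed = expert_balance_loss(collapsed, top_k=1, alpha=ALPHA)
print(loss_balanced, loss_collapsed)
```

In this toy case the collapsed routing produces the larger loss, which is the pressure that discourages routing collapse. A device-level variant (factor 0.05 in the text) would apply the same idea to groups of experts placed on the same device rather than to individual experts.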

原文: ek 67B (Dense) and MoE models at the scale of about 140B total parameters. In the lines of “# Experts” and “# Activated Experts”, $a + b$ denotes $a$ shared experts and $b$ routed experts, respectively. Bold font indicates the best or near the best performance excluding the last column. DeepSeekMoE 145B, and even DeepSeekMoE 142B (Half Activated) that has only a half of activated expert parameters, outperform GShard 137B by a large margin. Moreover, with 28.5% of computations, DeepSeekMoE 145B achieves comparable performance with DeepSeek 67B.

7.2 Evaluations

原文: Baselines. Apart from DeepSeekMoE 145B , we consider three additional models for comparison. DeepSeek 67B (Dense) is a dense model with 67.4B total parameters (refer to DeepSeek-AI ( 2024 ) for the model and training details). GShard 137B shares the same hidden dimension and number of layers as DeepSeekMoE 145B, but follows the GShard architecture. Note that DeepSeekMoE 145B aligns the intermediate hidden dimension in each expert to a multiple of 64 for computation efficiency, so its model size is 6% larger than GShard 137B. DeepSeekMoE 142B (Half Activated) has a similar architecture to DeepS...
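
Aligning an expert's intermediate hidden dimension to a multiple of 64, as mentioned above, is a simple rounding step. The sketch below assumes rounding up; the text does not say whether the authors round up or to the nearest multiple, and the function name `align_to_multiple` and the sample dimensions are hypothetical.

```python
def align_to_multiple(dim, multiple=64):
    """Round dim up to the next multiple (assumed rounding direction)."""
    return ((dim + multiple - 1) // multiple) * multiple


# Hypothetical unaligned intermediate sizes, for illustration only.
for raw_dim in (1000, 1365, 1408):
    print(raw_dim, "->", align_to_multiple(raw_dim))
```

Rounding up adds a small number of parameters per expert, which is consistent with the text's note that the aligned model ends up somewhat larger than its GShard counterpart.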

8 Related Work

原文: The Mixture of Experts (MoE) technique is first proposed by Jacobs et al. ( 1991 ); Jordan and Jacobs ( 1994 ) to deal with different samples with independent expert modules. Shazeer et al. ( 2017 ) introduce MoE into language model training and build a large-scale LSTM-based (Hochreiter and Schmidhuber, 1997 ) MoE models. As Transformer become the most popular architecture for NLP, many attempts extend FFNs in a Transformer as MoE layers to build MoE language models. GShard (Lepikhin et al., 2021 ) and Switch Transformer (Fedus et al., 2021 ) are pioneers which employ learnable top-2 or top-1...

9 Conclusion

原文: In this paper, we introduce the DeepSeekMoE architecture for MoE language models, with the objective of achieving ultimate expert specialization. Through fine-grained expert segmentation and shared expert isolation, DeepSeekMoE achieves significantly higher expert specialization and performance compared with prevailing MoE architectures. Starting with a modest scale of 2B parameters, we validate the advantages of DeepSeekMoE, demonstrating its capability to approach the upper bound performance for MoE models. Furthermore, we provide empirical evidence to show that DeepSeekMoE has a higher leve...

Appendix A Overview of Hyper-Parameters

原文: We present the overview of hyper-parameters for DeepSeekMoE across various sizes in Table 7.

# Params | # Layers | Hidden Size | # Attn Heads | # Shared Experts | # Routed Experts | Relative Expert Size | Sequence Length | Batch Size (Sequence) | Learning Rate
2.0B | 9 | 1280 | 10 | 1 | 63 (7 activated) | 0.25 | 2048 | 2048 | 1.08e-3
16.4B | 28 | 2048 | 16 | 2 | 64 (6 activated) | 0.25 | 4096 | 4608 | 4.2e-4
144.6B | 62 | 4096 | 32 | 4 | 128 (12 activated) | 0.125 | 4096 | 4608 | 3.0e-4

Table 7: Overview of hyper-parameters for DeepSeekMoE across various sizes. The relative expert size is in comparison to a standard FFN.
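
The expert counts and relative expert sizes in Table 7 can be converted into "standard FFN equivalents" to see how the three configurations compare. This is a worked calculation on the table's numbers; the dictionary keys are illustrative names, not identifiers from the paper.

```python
# Expert settings copied from Table 7.
configs = {
    "2.0B":   {"shared": 1, "routed": 63,  "routed_active": 7,  "rel_size": 0.25},
    "16.4B":  {"shared": 2, "routed": 64,  "routed_active": 6,  "rel_size": 0.25},
    "144.6B": {"shared": 4, "routed": 128, "routed_active": 12, "rel_size": 0.125},
}

summary = {}
for name, c in configs.items():
    # Total and activated expert capacity, measured in standard-FFN units.
    total_ffn = (c["shared"] + c["routed"]) * c["rel_size"]
    active_ffn = (c["shared"] + c["routed_active"]) * c["rel_size"]
    summary[name] = (total_ffn, active_ffn)
    print(f"{name}: total = {total_ffn} FFNs, activated = {active_ffn} FFNs, "
          f"shared:activated-routed = {c['shared']}:{c['routed_active']}")
```

Every configuration activates the equivalent of 2 standard FFNs per layer while holding about 16 FFNs' worth of expert parameters, and the two larger models keep the 1:3 ratio of shared to activated routed experts (2:6 and 4:12).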

Appendix B Comparing DeepSeekMoE with Larger Models

原文: Comparisons among DeepSeekMoE, GShard×1.2, and GShard×1.5 are shown in Table 8. Comparisons among DeepSeekMoE, Dense×4, and Dense×16 are shown in Table 9.

Metric | # Shot | GShard×1.2 | GShard×1.5 | DeepSeekMoE
Relative Expert Size | N/A | 1.2 | 1.5 | 0.25
# Experts | N/A | 0 + 16 | 0 + 16 | 1 + 63
# Activated Experts | N/A | 0 + 2 | 0 + 2 | 1 + 7
# Total Expert Params | N/A | 2.3B | 2.8B | 1.9B
# Activated Expert Params | N/A | 0.28B | 0.35B | 0.24B
# Training Tokens | N/A | 100B | 100B | 100B
Pile (Loss) | N/...

原文: ×1.5, and show results in Table 10. At a larger scale, DeepSeekMoE even outperforms GShard×1.5 distinctly.

Metric | # Shot | GShard×1.2 | GShard×1.5 | DeepSeekMoE
Relative Expert Size | N/A | 1.2 | 1.5 | 0.25
# Experts | N/A | 0 + 16 | 0 + 16 | 1 + 63
# Activated Experts | N/A | 0 + 2 | 0 + 2 | 1 + 7
# Total Expert Params | N/A | 15.9B | 19.8B | 13.3B
# Activated Expert Params | N/A | 2.37B | 2.82B | 2.05B
# Training Tokens | N/A | 100B | 100B | 100B
HellaSwag (Acc.) | 0-shot | 66.6 | 67.7 | 69.1
PIQA (Acc.) | 0-shot | 75.6 | 76.0 | 75.7
ARC-easy (Acc.) | 0-shot | 56.8 | 56.8 | ...

Appendix C Training Benchmark Curves of DeepSeekMoE 16B

原文: We present the benchmark curves during training of DeepSeekMoE 16B and DeepSeek 7B (Dense) in Figure 7 for reference. Figure 7: Benchmark curves during training of DeepSeekMoE 16B and DeepSeek 7B (Dense).