ESFT: Expert-Specialized Fine-Tuning for Mixture-of-Experts Models

arXiv: 2407.01906 | 2024-07-02

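Summary

ESFT is an efficient expert-specialized fine-tuning strategy for MoE models. It builds on the division of labor among specialized experts in the MoE architecture and provides a way to control precisely which experts are updated during fine-tuning, so that experts unrelated to the task are not disturbed. The model keeps its large parameter count while adapting to the task efficiently and precisely.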

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

Original: Zihan Wang (1, 2), Deli Chen (1), Damai Dai (1), Runxin Xu (1), Zhuoshu Li (1), Y. Wu (1). Affiliations: (1) DeepSeek AI; (2) Northwestern University. Contact: {zw, victorchen}@deepseek.com. Work done during internship at DeepSeek.
Abstract: Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although there have been various PEFT methods for dense-architecture LLMs, PEFT for sparse-architecture LLMs is still underexplored. In this work, we study the PEFT method for LLMs with the Mixture-of-Experts (MoE) architecture and the contents of this work are mainly t...

Original: rease (Meta, 2024; Mistral, 2024a; DeepSeek, 2024; Qwen, 2024), parameter-efficient fine-tuning (PEFT) methods (Han et al., 2024) are becoming increasingly important in adapting pre-trained LLMs to downstream customization tasks. However, existing works on PEFT like low-rank adaptation (LoRA) and P-Tuning (Hu et al., 2021; Liu et al., 2021) have primarily focused on dense-architecture LLMs, with research on sparse-architecture LLMs still being markedly insufficient. In this work, we focus on exploring PEFT techniques within the Mixture-of-Experts (MoE) LLMs (Mistral, 2024b; Databr...

Original: ces storage by up to 90% and training time by up to 30% compared to full-parameter fine-tuning, as shown in §5.2. Besides, we delve deeper into the working mechanism of the ESFT method. We analyze the expert selection process in §6.1 and demonstrate how ESFT leverages specialized experts effectively, as selecting 5-15% of experts can achieve promising performance in different tasks. We investigate the efficiency of ESFT under different computational constraints in §6.2, showcasing its ability to leverage training resources efficiently compared to other PEFT methods like LoRA. Our studies in...

Original: l., 2020; Gheini et al., 2021; He et al., 2023; Vucetic et al., 2022) and unstructured training (Liao et al., 2023; Ansell et al., 2021; Sung et al., 2021; Xu et al., 2021). (3) Applying low-rank adaptation: LoRA (Hu et al., 2021; Fomenko et al., 2024) is a widely used PEFT method, which decomposes the original weight matrices into low-rank components. Subsequent works (Zhang et al., 2023a; Ding et al., 2023; Lin et al., 2024; Liu et al., 2023) have introduced numerous improvements to the original LoRA method. However, the study of PEFT in sparse models is still scarce. In this w...

Original: relevant to the task for efficient tuning. 3 Methods. Figure 1: Comparison between Expert-Specialized Fine-Tuning (ESFT) and other fine-tuning methods. FFT trains all parameters. LoRA combines pre-trained weights with low-rank matrices to reduce training costs. ESFT only trains a subset of experts in a Mixture-of-Experts (MoE) architecture, optimizing efficiency and task specialization. 3.1 Preliminaries: Mixture-of-Experts for Transformers. Mixture-of-Experts (MoE) for Transformers replaces Feed-Forward Networks (FFNs) with MoE layers. Each MoE layer consists of multiple experts structurally iden...

1 Introduction

Original: As the parameter scale of large language models (LLMs) continues to increase (Meta, 2024; Mistral, 2024a; DeepSeek, 2024; Qwen, 2024), parameter-efficient fine-tuning (PEFT) methods (Han et al., 2024) are becoming increasingly important in adapting pre-trained LLMs to downstream customization tasks. However, existing works on PEFT like low-rank adaptation (LoRA) and P-Tuning (Hu et al., 2021; Liu et al., 2021) have primarily focused on dense-architecture LLMs, with research on sparse-architecture LLMs still being markedly insufficient. In this work, we focus on exploring PEFT techn...

Original: nly trains the parameters of the selected experts, which effectively reduces storage by up to 90% and training time by up to 30% compared to full-parameter fine-tuning, as shown in §5.2. Besides, we delve deeper into the working mechanism of the ESFT method. We analyze the expert selection process in §6.1 and demonstrate how ESFT leverages specialized experts effectively, as selecting 5-15% of experts can achieve promising performance in different tasks. We investigate the efficiency of ESFT under different computational constraints in §6.2, showcasing its ability to leverage training resou...

2 Related Work

Original: and inference costs. Based on the granularity of experts, existing large MoE models can generally be divided into two categories: coarse- and fine-grained expert LLMs. Most existing MoE LLMs (Lepikhin et al., 2021; Fedus et al., 2021; Roller et al., 2021; Dai et al., 2022; Shen et al., 2024) have coarse-grained experts where the number of experts is very limited. For example, 2 out of 8 experts are activated for the Mixtral MoE series (Mistral, 2024a, b) and Grok-V1 (XAI, 2024). As a result, a single expert has to learn complicated patterns from different domain tasks simultaneously. To a...

2.1 Parameter-efficient fine-tuning for dense architectural LLMs

Original: The goal of parameter-efficient fine-tuning (Han et al., 2024) is to efficiently customize LLMs for downstream tasks, while existing studies primarily focus on dense architectural LLMs. PEFT methods for dense models can generally be categorized into three approaches: (1) Adding new parameters: methods of this kind fix the existing model parameters and fine-tune the model on a small number of newly added parameters. Adapter (Houlsby et al., 2019; Pfeiffer et al., 2020; He et al., 2021; Wang et al., 2022) and Soft Prompt (Li and Liang, 2021; Liu et al., 2021; Zhang et al., 2023b; Lester...

2.2 Coarse- and Fine-grained MoE LLMs

Original: Compared to dense LLMs (e.g., LLaMA series; Meta, 2023b, a), MoE LLMs (e.g., Mixtral series; Mistral, 2024a, b) can increase model size while saving training and inference costs. Based on the granularity of experts, existing large MoE models can generally be divided into two categories: coarse- and fine-grained expert LLMs. Most existing MoE LLMs (Lepikhin et al., 2021; Fedus et al., 2021; Roller et al., 2021; Dai et al., 2022; Shen et al., 2024) have coarse-grained experts where the number of experts is very limited. For example, 2 out of 8 experts are activated for Mixtral MoE serie...

3 Methods

Original: Figure 1: Comparison between Expert-Specialized Fine-Tuning (ESFT) and other fine-tuning methods. FFT trains all parameters. LoRA combines pre-trained weights with low-rank matrices to reduce training costs. ESFT only trains a subset of experts in a Mixture-of-Experts (MoE) architecture, optimizing efficiency and task specialization. 3.1 Preliminaries: Mixture-of-Experts for Transformers. Mixture-of-Experts (MoE) for Transformers replaces Feed-Forward Networks (FFNs) with MoE layers. Each MoE layer consists of multiple experts structurally identical to an FFN. Tokens are assigned to and processed ...

Original: $s_{i,t}$ denotes the token-to-expert affinity, $\text{TopK}(\cdot,K)$ denotes the set comprising the $K$ highest affinity scores among those calculated for the $t$-th token and all $N$ experts, and $\mathbf{e}_{i}^{l}$ is the centroid of the $i$-th expert in the $l$-th layer. Recently, DeepSeekMoE (Dai et al., 2024) proposes enhancements to the MoE architecture through several techniques, including (1) Fine-grained segmentation, segmenting each expert into multiple smaller ones and keeping the same fraction of experts to ...
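The gating terms defined above can be illustrated with a small pure-Python sketch. The excerpt truncates before the defining equations, so two things here are assumptions: the affinity is taken as a softmax over dot products between the token and each expert centroid, and the gate keeps the affinity for the Top-K experts and is zero otherwise. The function names `softmax` and `topk_gates` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def topk_gates(token, centroids, k):
    """Gate values for one token: the affinity s_i if expert i is in the Top-K, else 0."""
    # Assumed affinity form: softmax over token-centroid dot products.
    logits = [sum(u * e for u, e in zip(token, c)) for c in centroids]
    scores = softmax(logits)
    # Indices of the K experts with the highest affinity.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    keep = set(top)
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]

token = [1.0, 0.0]
centroids = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
gates = topk_gates(token, centroids, k=2)
print(gates)  # experts 0 and 2 receive non-zero gates
```

With these toy centroids the dot products are 2, 0, 1 and -1, so only experts 0 and 2 are activated for the token.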

Original: $\text{FFN}^{n}_{i}$ denote the shared and non-shared experts, respectively. Each expert is segmented into $m$ ones, with $N$ and $K$ also multiplied by $m$ times compared to the coarse-grained architecture. 3.2 Probing Task-Specific Expert Specialization in MoE Models. Despite the significant success of MoE LLMs, a clear understanding of the underlying mechanism remains elusive. We conduct probing experiments to understand how non-shared experts are utilized across various tasks. These tasks, as detailed in §4.1, include general domains like math and code, as well as specialize...

Original: ared Top-6 routed experts across tasks. The values are averaged by layer, indicating that the sets of experts used for the same task are consistent while different tasks are distinct. 3.3 Expert-Specialized Fine-tuning (ESFT). The highly specialized expert system suggests that different experts can be optimized for specific tasks. Inspired by this, we propose Expert-Specialized Fine-Tuning (ESFT) for MoE LLM customization, which selectively fine-tunes the most relevant experts for downstream tasks to enhance computational efficiency and maintain expert specialization. Figure 1 illustrates the d...

Original: $$g_{i}^{l}=\frac{1}{N_{s}}\sum_{j=1}^{N_{s}}\frac{1}{L_{j}}\sum_{k=1}^{L_{j}}g_{i,k}^{l}, \quad (6)$$ where $L_{j}$ is the length of the input sequence $x_{j}$ in the sampled data $D_{s}$. Token Selection Ratio (ESFT-Token): This score calculates the ratio of tokens for which expert $e_{i}$ is selected. It is defined as: $$r_{i}^{l}=\frac{1}{N_{s}}\sum_{j=1}^{N_{s}}\frac{1}{L_{j}}\sum_{k=1}^{L_{j}}\frac{\mathbb{1}(g_{i,k}^{l}>0)}{K},$$ ...
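The two relevance scores above can be computed directly from recorded gate values. The following is a minimal sketch for a single layer, assuming the gate values are already available as a nested list `gate_values[j][k][i]` (sample `j`, token `k`, expert `i`); that layout and the function names are illustrative choices, not taken from the paper.

```python
def average_gate_score(gate_values):
    """Per-expert mean gate value: average over tokens within a sample, then over samples."""
    n_experts = len(gate_values[0][0])
    scores = [0.0] * n_experts
    for sample in gate_values:
        for i in range(n_experts):
            scores[i] += sum(tok[i] for tok in sample) / len(sample)
    return [s / len(gate_values) for s in scores]

def token_selection_ratio(gate_values, k):
    """Per-expert share of tokens that select the expert, divided by K as in the formula."""
    n_experts = len(gate_values[0][0])
    ratios = [0.0] * n_experts
    for sample in gate_values:
        for i in range(n_experts):
            hits = sum(1 for tok in sample if tok[i] > 0)
            ratios[i] += hits / k / len(sample)
    return [r / len(gate_values) for r in ratios]

# Two samples (2 tokens and 1 token), three experts, K = 2 experts per token.
gate_values = [
    [[0.5, 0.25, 0.0], [0.5, 0.0, 0.25]],
    [[0.0, 0.5, 0.25]],
]
gate_scores = average_gate_score(gate_values)
ratios = token_selection_ratio(gate_values, k=2)
print(gate_scores, ratios)
```

Because every token selects exactly K experts, the token selection ratios sum to 1 across experts in this toy example.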

Original: nd inference, tokens can be assigned to any expert. However, only the selected experts $E_{s}^{l}$ in each layer can be updated; other experts and modules remain frozen.
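The update rule above amounts to a per-parameter trainable flag. A minimal sketch follows, assuming parameters are named with a hypothetical `layers.<l>.experts.<i>.<rest>` pattern; real checkpoints may use different names, so the parsing would need to be adapted.

```python
def trainable_mask(param_names, selected):
    """Return {name: bool}; True only for parameters of a selected routed expert.

    selected maps a layer index to the set of selected expert indices in that layer.
    The 'layers.<l>.experts.<i>...' naming scheme is an assumption for illustration.
    """
    mask = {}
    for name in param_names:
        parts = name.split(".")
        mask[name] = (
            len(parts) >= 4
            and parts[0] == "layers"
            and parts[2] == "experts"
            and int(parts[3]) in selected.get(int(parts[1]), set())
        )
    return mask

names = [
    "embed.weight",
    "layers.0.attn.q_proj.weight",
    "layers.0.gate.weight",
    "layers.0.shared_experts.0.weight",
    "layers.0.experts.0.weight",
    "layers.0.experts.1.weight",
    "layers.1.experts.1.weight",
]
mask = trainable_mask(names, selected={0: {1}, 1: {1}})
trainable = [n for n, flag in mask.items() if flag]
print(trainable)
```

In a framework such as PyTorch, a mask like this would typically be applied by setting each parameter's `requires_grad` flag; routing itself is left untouched, so tokens can still reach frozen experts.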

3.1 Preliminaries: Mixture-of-Experts for Transformers

Original: Mixture-of-Experts (MoE) for Transformers replaces Feed-Forward Networks (FFNs) with MoE layers. Each MoE layer consists of multiple experts structurally identical to an FFN. Tokens are assigned to and processed by a subset of the most relevant experts based on their affinity scores, ensuring computational efficiency in MoE layers. The output hidden state $\mathbf{h}_{t}^{l}$ of the $t$-th token in the $l$-th MoE layer is computed as: $$\mathbf{h}_{t}^{l}=\sum_{i=1}^{N}\left(g_{i,t}\,\text{FFN}^{n}_{i}(\mathbf{u}_{t}^{l})\right)+\mathbf{u}_{t}^{l},$$ ...

Original: proposes enhancements to the MoE architecture through several techniques, including (1) Fine-grained segmentation, segmenting each expert into multiple smaller ones and keeping the same fraction of experts to process each token, allowing specialization in different knowledge types while maintaining the same computational cost. (2) Shared expert isolation, leveraging shared experts that process all tokens to capture common knowledge, reducing parameter redundancy and enhancing efficiency. The output of an MoE layer in DeepSeekMoE is: $\mathbf{h}_{t}^{l}=\sum_{i=1}^{K_{s}}\text{FFN}^{s}_{i}(\mathbf{u}_{t}^{l})+\sum_{i=1}^{N}(g_{i,t}$...
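To make the role of shared and routed experts concrete, here is a minimal sketch of a layer output. The equation above is truncated in the excerpt, so the sketch assumes the remaining terms follow the pattern of the standard MoE output: gated routed experts plus a residual connection. Experts are stand-in scaling functions, and all names are hypothetical.

```python
def moe_layer_output(u, gates, routed_experts, shared_experts):
    """Residual + every shared expert's output + gate-weighted outputs of activated routed experts."""
    h = list(u)  # residual connection
    for ffn in shared_experts:
        # Shared experts process every token, with no gate.
        h = [a + b for a, b in zip(h, ffn(u))]
    for g, ffn in zip(gates, routed_experts):
        if g > 0.0:
            # Only routed experts with a non-zero gate are evaluated.
            h = [a + g * b for a, b in zip(h, ffn(u))]
    return h

def scale(c):
    """Toy 'expert' that multiplies its input by a constant."""
    return lambda u: [c * x for x in u]

u = [1.0, 2.0]
gates = [0.5, 0.0, 0.25]
h = moe_layer_output(u, gates, [scale(2.0), scale(100.0), scale(4.0)], [scale(1.0)])
print(h)
```

The second routed expert has a zero gate, so its large scaling factor never affects the output; this is the property that lets unselected experts stay cheap at inference.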

3.2 Probing Task-Specific Expert Specialization in MoE Models

Original: Despite the significant success of MoE LLMs, a clear understanding of the underlying mechanism remains elusive. We conduct probing experiments to understand how non-shared experts are utilized across various tasks. These tasks, as detailed in §4.1, include general domains like math and code, as well as specialized domains like intent recognition, summarization, legal judgment prediction, and translation. These experiments reveal the expert specialization in MoE models in two aspects: Expert Routing is Concentrated in the Same Task: We investigate the distribution of normalized gate values, i....

3.3 Expert-Specialized Fine-tuning (ESFT)

Original: The highly specialized expert system suggests that different experts can be optimized for specific tasks. Inspired by this, we propose Expert-Specialized Fine-Tuning (ESFT) for MoE LLM customization, which selectively fine-tunes the most relevant experts for downstream tasks to enhance computational efficiency and maintain expert specialization. Figure 1 illustrates the differences between our method and existing methods. Below, we introduce our method step by step. Data Sampling: We randomly sample a subset $D_{s}=\{(x_{i},y_{i})\}_{i=1}^{N_{s}}$ ...

Original: $L_{j}$ is the length of the input sequence $x_{j}$ in the sampled data $D_{s}$. Token Selection Ratio (ESFT-Token): This score calculates the ratio of tokens for which expert $e_{i}$ is selected. It is defined as: $$r_{i}^{l}=\frac{1}{N_{s}}\sum_{j=1}^{N_{s}}\frac{1}{L_{j}}\sum_{k=1}^{L_{j}}\frac{\mathbb{1}(g_{i,k}^{l}>0)}{K},$$ ...

4 Experiment Setup

Original:
FT | 80.9 ± 1.1 | 65.9 ± 0.7 | 34.2 ± 4.1 | 55.5 ± 1.0 | 58.8 ± 0.9 | 67.9 ± 3.8 | 48.4 ± 2.4 | 58.8 ± 1.3
LoRA | 74.3 ± 7.7 | 63.4 ± 5.4 | 38.7 ± 2.5 | 55.5 ± 1.2 | 57.0 ± 1.5 | 72.8 ± 1.9 | 51.8 ± 2.3 | 59.1 ± 2.5
ESFT-Token | 80.9 ± 0.9 | 66.7 ± 1.8 | 40.7 ± 1.3 | 57.1 ± 0.5 | 5...

Original: meter Fine-Tuning (FFT) and Low-Rank Adaptation (LoRA; Hu et al., 2021). For LoRA, we add low-rank matrices to all parameters for training except token embeddings and the language modeling head. We maintain a 1:1 ratio of task-specific data to alignment data for all methods, which we find is highly effective in preserving general abilities obtained from the alignment phase for FFT and LoRA. However, for our ESFT method, not adopting this data mixing strategy may even better maintain general ability. We detail this in Appendix F. All experiments are done on the HFAI cluster (footnote 5: https://do...

4.1 Main Evaluation

Original: We evaluate our ESFT method on two common LLM customization scenarios: (1) improving the model's specific ability in a domain where the model may already have decent performance; (2) adapting the model to a possibly narrow but unfamiliar specialized task. 4.1.1 Tasks for Model Enhancement. We choose two domain-specific tasks, i.e., Math and Code, to evaluate how our method can enhance the model's existing abilities. The two domains receive wide attention in current LLM research and are suitable for evaluation, as many pre-trained models can perform decently, while there is significant potential for ...

Original: overing a diverse range of abilities that most models can excel at after training but not without training: (1) Text-to-JSON Intent Recognition in the BDCI-21 Smart HCI NLU Challenge (footnote 1: https://www.datafountain.cn/competitions/511), which requires converting text instructions into JSON format for home appliances. (2) Text Summarization in the BDCI-21 Summarization Challenge (footnote 2: https://www.datafountain.cn/competitions/536), which summarizes customer service call transcripts. (3) Legal Judgment Prediction in the BDCI-21 Law Event Prediction Challenge (footnote 3: https://www.datafountain.cn/c...

4.2 General Ability Evaluation

Original: We select a broad range of benchmarks to evaluate the extent to which the models' general abilities are preserved after training on new tasks. These benchmarks include MMLU (Hendrycks et al., 2021b), TriviaQA (Joshi et al., 2017), HellaSwag (Zellers et al., 2019), ARC-Challenge (Clark et al., 2018), IFEval (Zhou et al., 2023), CEval (Huang et al., 2023), and CLUEWSC (Xu et al., 2020), covering comprehensive model ability evaluations across various domains including natural language understanding, question answering, instruction following, and common sense reasoning. CLUEWSC Trivi...

4.3 Backbone Model and Training Settings

Original: We use the backbone architecture of DeepSeek-V2-Lite (DeepSeek, 2024) for all experiments. The model includes a fine-grained set of 66 experts for each transformer layer. This makes it uniquely suitable at the time of this study for our method, which benefits from expert specialization. We train the model on a carefully curated alignment dataset that excludes math and code data and take the resulting checkpoint as our vanilla model for subsequent experiments. This alignment phase can activate model ability across various domains while keeping Math/Code ability elementary to better verify t...

5 Results

Original: and storage space requirements: Figure 4: Number of experts trained in ESFT across layers and tasks. Earlier computed layers are numbered smaller. Most tasks and layers train 5-15% of experts, demonstrating ESFT's effectiveness in selecting task-related experts. Figure 5: Computational efficiency results. Blue bars show the training time and green lines show storage space. ESFT performs efficiently in terms of training time and storage space. Training Time: The average training time for ESFT-Token and ESFT-Gate is 19.8 minutes and 20.9 minutes, respectively. The FFT method takes significantly l...

5.1 Benchmark Performance Results

Original: The results in Table 1 and Table 2 support several conclusions. All methods can improve model performance in customization tasks compared to the vanilla model, while they may cause a performance decrease in general tasks. Generally, the performance increase is higher in model adaptation tasks than in model enhancement tasks. For customization ability evaluation, ESFT surpasses LoRA significantly and is competitive with FFT. As shown in Table 1, ESFT-Token and ESFT-Gate achieve near-best results in model enhancement tasks like Math, and ESFT-Gate achieves the best performance in the Humane...

5.2 Computational Efficiency Results

Original: The results in Figure 6 demonstrate that ESFT exhibits several advantages in terms of training time and storage space requirements: Figure 4: Number of experts trained in ESFT across layers and tasks. Earlier computed layers are numbered smaller. Most tasks and layers train 5-15% of experts, demonstrating ESFT's effectiveness in selecting task-related experts. Figure 5: Computational efficiency results. Blue bars show the training time and green lines show storage space. ESFT performs efficiently in terms of training time and storage space. Training Time: The average training time for ESFT-Tok...

6 Analysis

Original:
Non-shared Experts | Shared Experts | Non-expert Parameters | Trainable Parameters | Specialized Ability | General Ability | Average
ALL | ✓ | ✓ | 15.7B | 51.0 | 58.8 | 54.9
Relevant | ✓ | × | 1.85B | 49.8 | 60.7 | 55.3
Relevant | × | × | 1.4B | 49.4 | 61.5 | 55.4
× | ✓ | × | 450M | 47.4 | 61.2 | 54.3
× | ✓ | ✓ | 1.3B | 49.0 | 60.0 | 54.5
Relevant | ✓ | ✓ | 2.7B | 50.8 | 60.3 | 55.6
× | × | × | - | 33.8 | 62.4 | 48.1
Table 3: Comparisons of different model configs based on whether training shared or non-shared parameters. R...

Original: SFT-Token generally employs fewer experts while better maintaining general performance, comparable to ESFT-Gate in tasks like Math, Intent, and Law. (3) The number of experts varies by task, with more specialized tasks like Math and Translation using fewer experts; our method's performance on these tasks exceeds LoRA by the largest margin, indicating that our method is especially suitable for more specialized tasks. (4) For most tasks, few experts are chosen in the middle layers, indicating that expert distribution is more concentrated in these layers. 6.2 ESFT Leverages Training Resources E...

Original: pecialized ability and more stable general ability. (3) ESFT-Token peaks in both specialized and general ability at $p$=0.5, while ESFT-Gate peaks at $p$=0.3 for specialized and $p$=0.1 for general ability. (4) ESFT-Token and ESFT-Gate performance saturates at $p$=0.2 and $p$=0.1, respectively, indicating that most expert choices may be less relevant to task performance. We delve deeper into this in Appendix E. 6.3 Selectively Training Non-Shared Parameters is the Key to ESFT. In our proposed ESFT method, we only fine-tune a subset of non-shared experts. This section provides de...
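The threshold $p$ discussed above controls how many experts are trained. The excerpt does not show the selection rule itself, so the sketch below assumes one plausible reading: experts are taken in descending order of relevance score until their cumulative score reaches $p$. Both the rule and the name `select_experts` are assumptions for illustration.

```python
def select_experts(scores, p):
    """Pick experts by descending score until the cumulative score reaches p (assumed rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        if total >= p:
            break
        chosen.append(i)
        total += scores[i]
    return chosen

scores = [0.5, 0.3, 0.15, 0.05]  # toy relevance scores for four experts in one layer
small = select_experts(scores, p=0.2)
large = select_experts(scores, p=0.7)
print(small, large)
```

Under this reading a larger $p$ can only add experts, which matches the intuition that $p$ trades training cost against coverage of task-relevant experts.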

Original: overfitting on downstream tasks and forgetting on general tasks compared to training non-shared parameters. Training task-relevant non-shared experts should be highly prioritized. Training relevant experts achieves at least 55.3, while other settings achieve at most 54.9, even with higher demands of up to 15.7B parameters. Therefore, fine-tuning these experts is highly prioritized for model customization. We propose two major training strategies based on these conclusions: 1. Prioritize specialized ability: Train all shared parameters and task-relevant non-shared experts to maximize the enhancemen...

Original: B) to group experts, simulating coarse-grained segmentation. Experts in the same group share the average affinity score. We maintain the computational cost by selecting a constant 1/8 of experts for each token. Experiment results of the Math domain in Figure 7 show that as the group size increases, our method's performance decreases more severely than FFT, while the training cost (i.e., trainable experts) rises. These findings indicate that our method, and even effective LLM customization, relies heavily on a fine-grained segmented LLM architecture with more specialized experts.
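The score-sharing step of this simulation can be sketched in a few lines. This is an illustrative sketch only: the group assignments are given as input, and the function name `grouped_affinity` is hypothetical.

```python
def grouped_affinity(scores, groups):
    """Replace each expert's affinity with the mean affinity of its group."""
    shared = list(scores)
    for group in groups:
        mean = sum(scores[i] for i in group) / len(group)
        for i in group:
            shared[i] = mean
    return shared

scores = [0.5, 0.25, 0.125, 0.125]  # toy per-expert affinities for one token
groups = [[0, 1], [2, 3]]           # two groups of two experts each
shared_scores = grouped_affinity(scores, groups)
print(shared_scores)
```

Because grouped experts have identical scores, they are routed together, which is what makes a group behave like one coarse-grained expert.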

6.1 ESFT Leverages Specialized Experts Effectively

Original: We analyze the number of experts ESFT trains across tasks and layers to understand its expert selection process. Results are shown in Figure 4. From the results, we have several observations: (1) The average number of experts used per task across layers ranges from 2 to 15 out of 66, indicating ESFT can have 75%-95% fewer trainable parameters than FFT. (2) ESFT-Token generally employs fewer experts while better maintaining general performance, comparable to ESFT-Gate in tasks like Math, Intent, and Law. (3) The number of experts varies by task, with more specialized tasks like Math and Transl...

6.2 ESFT Leverages Training Resources Efficiently

Original:
Column groups: Math Ability (MATH, GSM8K); Code Ability (Humaneval, MBPP); Specialized Tasks (Intent, Summary, Law, Translation).
Method | MATH | GSM8K | Humaneval | MBPP | Intent | Summary | Law | Translation | Average
ESFT-Token | 22.6 | 66.0 | 41.5 | 42.6 | 75.6 | 65.4 | 45.7 | 36.2 | 49.4
Δ of rand | -1.0 | -3.7 | -2.5 | 0.2 | -2.6 | -1.7 | 1.3 | -13.5 | -2.8
ESFT-Gate | 23.2 | 64.9 | 43.3 | 41.8 | 78.6 | 65.8 | 49.1 | 35.2 | 50.2
Δ of rand | -1.7 | -3.2 | -4.3 | 1.6 | -5.0 | 0.3 | -2.9 | -20.4 | -4.4
Table 4: Performance comparison between original experts and random experts. Replacing high-affinity experts with random ones significantly harms model performance across different tasks.
Both ESFT and LoRA have a training efficiency hyperparameter ($p$...

6.3 Selectively Training Non-Shared Parameters is the Key to ESFT

Original: In our proposed ESFT method, we only fine-tune a subset of non-shared experts. This section provides detailed discussions of several variants of our method that may also train shared parameters. The variables are based on:
• Whether all non-shared experts or a task-relevant subset of them (we use the Token Selection Ratio and set $p$=0.2) are trained.
• Whether shared experts are trained.
• Whether other parameters, including gates, attention layers, and embeddings, are trained.
The results are shown in Table 3. We report average trainable parameters across all tasks, performance of specia...

Original: ed ability: Train all shared parameters and task-relevant non-shared experts to maximize the enhancement of specialized performance. 2. Balance specialized ability, general ability, and computational efficiency: Train only task-relevant non-shared experts to minimize parameter costs while maximizing the maintenance of general ability. Figure 7: Experiment results for grouped experts. As the experts become more coarse-grained, ESFT degrades more severely than FFT.

6.4 Analysis of Key Modules in ESFT

Original: In this section, we analyze and demonstrate that the effectiveness of our method lies in two modules: (1) our proposed expert relevance score functions and (2) the fine-grained expert segmentation of the MoE model architecture. Expert Relevance Score Function: In this work, we propose Average Gate Score and Token Selection Ratio as expert relevance score functions to filter relevant experts for different tasks. To demonstrate their effectiveness, we replace the experts obtained from these functions with random experts while keeping the number of activated experts per layer the same. Results in ...

7 Conclusion

Original: In this work, we study parameter-efficient fine-tuning methods for sparse large language models with the Mixture-of-Experts (MoE) architecture. We first observe that tasks from different domains are handled by distinct combinations of experts. We then propose selecting the most relevant experts for downstream tasks using two metrics: average gate score and token selection ratio. Experimental results show that our method significantly reduces training costs while matching or surpassing full-parameter fine-tuning results. Further analysis confirms that our method enhances the specialization of t...

Limitations

Original: Firstly, due to the limitation of the availability of other fine-grained MoE models, our method was only tested on the DeepSeek-V2-Lite MoE model. The conclusions drawn from this model require further validation when applied to other contexts. Besides, due to the lack of parameter-wise and structurally aligned MoE models with different expert granularities, we used a simulation approach by binding several groups of experts to compare coarse-grained and fine-grained MoE methods.

Appendix A Examples for Specialized Tasks

Original: Table 5 presents task examples as prompts and corresponding reference responses for each specialized task, including intent recognition, text summarization, legal judgment prediction, and low-resource translation.

Appendix B Strategy for Grouping Experts

Original: To group experts together and simulate coarse-grained mixture-of-experts transformer models, we calculate expert similarity and group the experts by maximizing in-group similarities using a greedy search algorithm. We sample data from the alignment dataset, containing 32 samples each with a sequence length of 4096, to calculate the similarity between experts. We initialize a co-occurrence matrix for all expert pairs as a zero matrix. For each pair of experts that occur simultaneously in a token’s Top-6 expert choices, we increment their score by 1 in the matrix. After iterating through the dat...
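
The counting step described above can be sketched as follows. The excerpt is cut off before it explains how the matrix is post-processed and how the greedy grouping proceeds, so the sketch covers only the co-occurrence counts; `cooccurrence_matrix` and its argument names are hypothetical.

```python
from itertools import combinations


def cooccurrence_matrix(token_top_experts, n_experts):
    """Count how often each pair of experts is routed to the same token.

    token_top_experts: for every token, the expert indices in its top-k
    routing choice (top-6 in the setting described above).
    Returns a symmetric n_experts x n_experts matrix with a zero diagonal.
    """
    matrix = [[0] * n_experts for _ in range(n_experts)]
    for experts in token_top_experts:
        # Each unordered pair within one token's choice scores +1.
        for a, b in combinations(sorted(set(experts)), 2):
            matrix[a][b] += 1
            matrix[b][a] += 1
    return matrix
```

For example, `cooccurrence_matrix([[0, 1, 2], [0, 1, 3]], 4)` records experts 0 and 1 as co-occurring twice, while experts 2 and 3 never co-occur.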

Appendix C Analysis of Expert Affinity Sample Size

Original: Figure 8: Results of the shared Top-6 routed experts in two independent samples of a task. The x-axis represents the sample size, and the y-axis shows the shared Top-6 routed experts averaged by model layers.

To evaluate the amount of data needed to identify the most relevant experts for a task, we independently sample two sets of data from the training set for each of the six tasks and calculate the shared Top-6 experts between the two sets. The results are shown in Figure 8. As the sample size reaches $2^{17}$ (i.e., 32 samples with a sequence length of 4096), all tasks ...
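
A minimal sketch of the stability measure described above, assuming each sample yields one relevance score per expert per layer (how those scores are produced is outside this sketch). The function names are hypothetical. The sample size quoted in the text follows from 32 × 4096 = 2^5 × 2^12 = 2^17 tokens.

```python
def top_experts(scores, k=6):
    """Set of the k highest-scoring expert indices in one layer."""
    order = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return set(order[:k])


def mean_shared_top_k(scores_a, scores_b, k=6):
    """Average, over layers, of how many top-k experts two samples share.

    scores_a, scores_b: per-layer lists of per-expert relevance scores
    computed from two independently drawn data samples.
    """
    shared = [
        len(top_experts(a, k) & top_experts(b, k))
        for a, b in zip(scores_a, scores_b)
    ]
    return sum(shared) / len(shared)
```

A value close to `k` means the two samples agree on nearly all of the most relevant experts, which is the sense in which a sample size is "enough".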

Appendix D Detailed Results for Ablations on Training Shared Parameters

Original: We present two tables that summarize the performance of various methods with different configurations for training shared or non-shared parameters. Table 6 shows results on general tasks, and Table 7 focuses on specialized tasks. The results indicate that training only task-relevant non-shared experts consistently maintains the best general task performance. Additionally, training task-relevant non-shared experts and all shared parameters yields the best specialized task performance, short of full-parameter fine-tuning.

Appendix E Qualitative Examples of the Expert Choices

Original: We present qualitative examples of the amount that routed experts are trainable among all tokens for each task in the accompanying figure. Each subfigure demonstrates examples drawn from a task. Deeper tokens indicate more trainable experts across all 26 layers (top-6 experts per layer). The parameter $p$ is set to 0.2 for the token selection ratio. Results show that our method, even handling only about 20% of expert choices, covers a wide range of key task-relevant words. For example, in the intent recognition task, the deepest tokens are “意图” (Intent); in the legal judgment task, the...

Appendix F The Impact of Mixing Alignment Data for Training

Original: We adopt a 1:1 ratio for downstream task data and alignment data for all methods during training to better maintain general task performance. This manual ratio is kept constant to avoid the significant additional costs associated with fine-tuning the ratio for each task. In this section, we present performance comparisons across various methods and tasks to reveal the impact of mixing alignment data during training. Table 9 presents the performance on downstream specialized tasks, and Table 10 shows the performance on general tasks. The results indicate that FFT and LoRA benefit from the inclu...
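
One way to realize a 1:1 mix is sketched below. The text does not say whether the ratio is measured in examples or tokens, or how a size mismatch between the two sources is handled; the sketch assumes an example-level ratio and subsamples the larger source. `mix_one_to_one` is a hypothetical helper.

```python
import random


def mix_one_to_one(task_data, alignment_data, seed=0):
    """Build a shuffled training set with equal counts from both sources.

    Assumption: the ratio is counted in examples, and the larger source
    is subsampled down to the size of the smaller one.
    """
    rng = random.Random(seed)
    n = min(len(task_data), len(alignment_data))
    mixed = rng.sample(task_data, n) + rng.sample(alignment_data, n)
    rng.shuffle(mixed)
    return mixed
```

Fixing the seed keeps the mixture identical across the methods being compared, so differences in results are not caused by different data draws.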

Appendix G Evaluation Instructions for Specialized Tasks

Original: Table 11 presents the detailed criteria to evaluate specialized tasks including text summarization, legal judgment prediction, and low-resource translation. Each task includes specific instructions on assessing predicted answers against reference answers, focusing on aspects such as content accuracy, completeness, relevance, and consistency.

Appendix H Evaluating Math and Code as General Tasks

Original: We investigate the Math and Code performance of models trained on adaptation tasks (i.e., Intent, Summary, Law, Translation), as these domains reflect the model’s general ability if not specifically trained on them. We report numbers with the setting of training on only downstream task data. Results in Table 8 show that FFT and LoRA would lead to significant performance drops in the Math and Code domain, having average performance drops of 9.0 and 12.4, respectively. Notably, our ESFT method retains performance significantly better compared to FFT and LoRA, with an average performance drop of ...
