[原文]Parameter-efficient fine-tuning (PEFT) is cru-
cial for customizing Large Language Models
(LLMs) with constrained resources. Although
there have been various PEFT methods for
dense-architecture LLMs, PEFT for sparse-
architecture LLMs is still underexplored. In
this work, we study the PEFT method for
LLMs with the Mixture-of-Experts (MoE) ar-
chitecture and the contents of this work are
mainly threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks, and find that the routing distribution for a specific task tends to be highly concentrated, whil...
Introduction
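As the parameter scale of large language models (LLMs) continues to increase (Meta, 2024; Mistral, 2024a; DeepSeek, 2024; Qwen, 2024), parameter-efficient fine-tuning (PEFT) methods (Han et al., 2024) are becoming increasingly important for adapting pre-trained LLMs to downstream customization tasks. However, existing PEFT work such as low-rank adaptation (LoRA) and P-Tuning (Hu et al., 2021; Liu et al., 2021) has focused primarily on dense-architecture LLMs, and research on sparse-architecture LLMs remains markedly insufficient. In this work, we focus on PEFT techniques for Mixture-of-Experts (MoE) LLMs (Mistral, 2024b; Databricks, 2024), as described in §3.1. Unlike dense models, where all tasks are handled by the same parameters, in the MoE architecture different tasks are handled by different activated experts (Lepikhin et al., 2021; Fedus et al., 2021). Observations indicate that task specialization in expert systems is the key to MoE LLM performance (Dai et al., 2024). We further illustrate this specialization in §3.2: the experts activated by data from the same task are concentrated, while those for different tasks vary significantly, suggesting that MoE models use specialized expert combinations to handle different tasks. Motivated by this, we propose Expert-Specialized Fine-Tuning (ESFT), as illustrated in §3.3. ESFT tunes only the experts with the highest affinity to the task, while freezing the parameters of other experts and modules.
The primary advantages of ESFT lie in two aspects: (1) Maintaining expert specialization: ESFT avoids the loss of specialization that occurs in full-parameter fine-tuning, where experts not adept at the task also update their parameters. Experimental results in §5.1 show that ESFT achieves aligned or even superior performance on downstream tasks compared to full-parameter fine-tuning, and better preserves performance on general tasks. (2) Saving computation resources: ESFT trains only the parameters of the selected experts, which, as shown in §5.2, reduces storage by up to 90% and training time by up to 30% compared to full-parameter fine-tuning. In addition, we delve into the working mechanism of the ESFT method.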
[原文]As the parameter scale of large language mod-
els (LLMs) continues to increase (Meta, 2024;
Mistral, 2024a; DeepSeek, 2024; Qwen, 2024),
parameter-efficient fine-tuning (PEFT) methods
(Han et al., 2024) are becoming increasingly impor-
tant in adapting pre-trained LLMs to downstream
customization tasks. However, existing works on
PEFT like low-rank adaptation (LoRA) and P-Tuning (Hu et al., 2021; Liu et al., 2021) have primarily focused on dense-architecture LLMs, with research on sparse-architecture LLMs still being markedly insufficient. In this w...
*Work done during internship at DeepSeek.
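We analyze the expert selection process in §6.1 and demonstrate how ESFT leverages specialized experts effectively: selecting only 5-15% of the experts achieves promising performance across different tasks. In §6.2, we investigate the efficiency of ESFT under different computational constraints, showing its advantage over other PEFT methods such as LoRA in using training resources efficiently. Our study in §6.3 analyzes the effects of shared and non-shared parameters on specialized and general performance, pointing out that ESFT should prioritize selectively training non-shared parameters. Through the ablation studies in §6.4, we highlight the importance of our expert relevance scores and of the fine-grained expert segmentation architecture.
The goal of parameter-efficient fine-tuning (Han et al., 2024) is to efficiently customize LLMs for downstream tasks, while existing studies primarily focus on dense-architecture LLMs. PEFT methods for dense models can generally be divided into three categories: (1) Adding new parameters: these methods fix the existing model parameters and fine-tune only a small number of newly added parameters. Adapter (Houlsby et al., 2019; Pfeiffer et al., 2020; He et al., 2021; Wang et al., 2022) and Soft Prompt (Li and Liang, 2021; Liu et al., 2021; Zhang et al., 2023b; Lester et al., 2021) are two typical representatives of this category. (2) Selecting existing parameters: these methods fine-tune only a small portion of the existing parameters while keeping the majority fixed. Depending on whether the trainable parameter space is continuous, they can be further divided into structured training (Guo et al., 2020; Gheini et al., 2021; He et al., 2023; Vucetic et al., 2022) and unstructured training (Liao et al., 2023; Ansell et al., 2021; Sung et al., 2021; Xu et al., 2021). (3) Applying low-rank adaptation: LoRA (Hu et al., 2021; Fomenko et al., 2024) is a widely used PEFT method that adapts the original weight matrices through low-rank components. Follow-up studies (Zhang et al., 2023a; Ding et al., 2023; Lin et al., 2024; Liu et al., 2023) have made numerous improvements to the vanilla LoRA method. However, PEFT studies for sparse models remain scarce.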
arXiv:2407.01906v2 [cs.CL] 5 Jul 2024
[原文]We analyze the expert selection process in §6.1 and demonstrate how ESFT leverages specialized experts effectively, as selecting 5-15% experts can achieve promising performance in different tasks. We investigate the
efficiency of ESFT under different computational
constraints in §6.2, showcasing its ability to lever-
age training resources efficiently compared to other
PEFT methods like LoRA. Our studies in §6.3 an-
alyze the effects of shared and non-shared parame-
ters in the model on specialized and general perfor-
mance, pointing out the priority...
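In this work, we select and tune a subset of the experts based on their affinity to the downstream task, a selection dimension unique to the sparse MoE architecture.
Compared to dense LLMs (e.g., LLaMA series, Meta, 2023b,a), MoE LLMs (e.g., Mixtral series, Mistral, 2024a,b) can increase model size while saving training and inference costs. Based on the granularity of experts, existing large MoE models can generally be divided into two categories: coarse- and fine-grained expert LLMs. Most existing MoE LLMs (Lepikhin et al., 2021; Fedus et al., 2021; Roller et al., 2021; Dai et al., 2022; Shen et al., 2024) have coarse-grained experts, where the total number of experts is very limited. For example, in the Mixtral MoE series (Mistral, 2024a,b) and Grok-V1 (XAI, 2024), only 2 out of 8 experts are activated. As a result, a single expert has to learn complicated patterns from tasks in different domains simultaneously. To address this issue, DeepSeek MoE (Dai et al., 2024) introduced fine-grained expert segmentation. In DeepSeek-V2 (DeepSeek, 2024), there are as many as 162 experts, of which 8 are activated at a time (8 out of 66 in DeepSeek-V2-Lite). The fine-grained segmentation of experts ensures a high degree of specialization among the experts. Moreover, this specialized expert system allows the model to pick out the experts most relevant to the task at hand, enabling efficient tuning.
3
Methods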
[原文]In this work, we se-
lect and tune part of the experts based on their
downstream task affinity, as a unique selection di-
mension exclusive to the sparse MoE architecture.
2.2
Coarse- and Fine-grained MoE LLMs
Compared to dense LLMs (e.g., LLaMA series,
Meta, 2023b,a), MoE LLMs (e.g., Mixtral series,
Mistral, 2024a,b) can increase model size while
saving training and inference costs. Based on the
granularity of experts, existing large MoE mod-
els can generally be divided into two categories:
coarse- and fine-grained expert LLMs. Most exist-
ing MoE LLMs (Lepikhin et al., 2021; Fedus et al.,
20...
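[原文]The output hidden state h_t^l of the t-th token in the l-th MoE layer is computed as:

h_t^l = \sum_{i=1}^{N} \left( g_{i,t} \, \mathrm{FFN}_i(u_t^l) \right) + u_t^l, \quad (1)

g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{TopK}(\{ s_{j,t} \mid 1 \le j \le N \}, K), \\ 0, & \text{otherwise}, \end{cases} \quad (2)

s_{i,t} = \mathrm{Softmax}_i \left( {u_t^l}^{\top} e_i^l \right), \quad (3)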
[Figure 1 (labels recovered from the image): legend Trainable Modules / Frozen Modules; panels Full-Parameter Fine-Tuning (FFT), Low Rank Adaptation (LoRA), Expert-Specialized Fine-Tuning (ESFT); elements Training Task, Transformer Block × L, Feed-Forward Layer, Attention & Norm, Pretrained Weights, LoRA-A, LoRA-B, Input u_t, Output h_t′.]
[原文]Compared to the coarse-grained architecture, each expert is segmented into m smaller experts, and N and K are multiplied by m accordingly.
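As an illustration of Eqs. (1)-(3), the routing of a single token can be sketched in plain Python. This is a minimal sketch rather than the authors' implementation: the function name `moe_layer`, the representation of experts as callables, and the use of Python lists are assumptions made for readability, and shared experts are omitted because they do not appear in Eq. (1).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(u, centroids, experts, k):
    """Route one token through an MoE layer (illustrative only).

    u         -- input hidden state u_t^l as a list of floats
    centroids -- one vector e_i^l per expert
    experts   -- one callable FFN_i per expert, mapping a vector to a vector
    k         -- number of experts activated per token (K)
    Returns (h, g): the output hidden state and the gate values g_{i,t}.
    """
    # Eq. (3): token-to-expert affinity s_{i,t}
    logits = [sum(a * b for a, b in zip(u, e)) for e in centroids]
    s = softmax(logits)

    # Eq. (2): keep the top-K affinities, zero out the rest
    top = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:k]
    g = [s[i] if i in top else 0.0 for i in range(len(s))]

    # Eq. (1): gated sum of the selected experts plus the residual u_t^l
    h = list(u)
    for i in top:
        out = experts[i](u)
        h = [hj + g[i] * oj for hj, oj in zip(h, out)]
    return h, g
```

Only the K selected experts are evaluated, which is where the compute saving of sparse architectures comes from; in a fine-grained model both the number of experts and K are simply larger.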
3.2
Probing Task-Specific Expert
Specialization in MoE Models
Despite the significant success of MoE LLMs, a
clear understanding of the underlying mechanism
remains elusive. We conduct probing experiments
to understand how non-shared experts are utilized
across various tasks. These tasks, as detailed in
§4.1, include general domains like math and code,
as well as specialized domains like intent recog-
nition, summarization, legal judgment prediction,
and translation. These ...
[原文]The values are averaged by layer,
indicating that the sets of experts used for the same task
are consistent while different tasks are distinct.
tional efficiency and maintain expert specialization. Figure 1 illustrates the differences between our method and existing methods. Below, we introduce our method step by step.
Data Sampling
We randomly sample a subset D_s = \{(x_i, y_i)\}_{i=1}^{N_s} from the training data D = \{(x_i, y_i)\}_{i=1}^{N} for expert selection, where x_i and y_i denote the input and label, respectively. Empirically, we find that a subset of 32 concatenated samples, each with a fixed length o...
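Expert Relevance Score
We propose two methods to calculate the relevance of an expert to a task based on its affinity to the sample tokens, defined as the average gate score and the token selection ratio. Both methods assess each expert's relevance to the downstream task and can be chosen according to experimental performance on the specific task.
Expert Selection and Fine-tuning
For each MoE layer l, we select a subset of experts to fine-tune based on their relevance scores. We define a threshold p \in (0, 1] as a hyperparameter that controls the proportion of the total relevance score included in the selected subset. For each layer l, we select a set of top-scored experts E_s^l whose cumulative relevance score reaches the threshold p, satisfying:

\sum_{i \in E_s^l} R_i^l \ge p, \quad (8)

where R_i^l is the relevance score (either r_i^l or g_i^l) of expert i in the l-th layer. During training and inference, tokens can be assigned to any expert.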
[原文]However,
only the selected experts El
s in each layer can be
updated; other experts and modules remain frozen.
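The selection rule in Eq. (8) can be sketched as follows. The formulas for the two relevance scores are not spelled out in the text above, so the definitions used here are assumptions: the average gate score is taken as the per-sample mean of the gate values g_{i,t}, and the token selection ratio as the fraction of tokens that route to an expert, divided by K. The function names and the nested-list input layout are likewise hypothetical, and the scores are normalized to sum to 1 before thresholding so that p can be read as a proportion.

```python
def average_gate_score(samples, n_experts):
    """Assumed definition: mean gate value per expert, averaged over samples.

    samples -- list of samples; each sample is a list of tokens; each token
               is a list of n_experts gate values (0.0 for unselected experts)
    """
    scores = [0.0] * n_experts
    for tokens in samples:
        for i in range(n_experts):
            scores[i] += sum(tok[i] for tok in tokens) / len(tokens)
    return [s / len(samples) for s in scores]


def token_selection_ratio(samples, n_experts, k):
    """Assumed definition: share of tokens routed to each expert, divided by K."""
    scores = [0.0] * n_experts
    for tokens in samples:
        for i in range(n_experts):
            hits = sum(1 for tok in tokens if tok[i] > 0)
            scores[i] += hits / (k * len(tokens))
    return [s / len(samples) for s in scores]


def select_experts(scores, p):
    """Pick top-scored experts until their cumulative share reaches p (Eq. 8)."""
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected, cumulative = [], 0.0
    for i in order:
        selected.append(i)
        cumulative += scores[i] / total  # normalization is an assumption
        if cumulative >= p:
            break
    return selected
```

In a training loop, the returned indices would identify the experts of one layer whose parameters stay trainable, with every other parameter frozen. A concentrated routing distribution yields a small selected set even for moderate p.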
4
Experiment Setup
4.1
Main Evaluation
We evaluate our ESFT method on two common
LLM customization scenarios: (1) improving the
model’s specific ability in a domain where the
model may already have decent performance; (2)
adapting the model to a possibly narrow but un-
familiar specialized task.
4.1.1
Tasks for Model Enhancement
We choose two domain-specific tasks, i.e., Math
and Code, to evaluate how our method can enhance
the model’s existing abilities. The two domains are
widely co...
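We choose two domain-specific tasks, Math and Code, to evaluate how our method can enhance the model's existing abilities. The two domains are widely studied in current LLM research and are well suited for evaluation, as many pre-trained models already perform decently on them while there remains significant room for improvement through further training. We assess the effectiveness of our method by the performance gain. For the Math domain, we use MetaMathQA (Yu et al., 2023) for training and GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021a) for evaluation. For the Code domain, we train the model on the Python subset of the enormous evol-codealpaca dataset (Luo et al., 2023) to simulate a more concentrated LLM customization scenario, and assess its performance on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021).
We select four specialized tasks to evaluate how our method can facilitate language models' adaptation to unfamiliar downstream tasks. These tasks cover a diverse range of abilities that most models can excel at after training but struggle with without it: (1) a text-to-JSON intent recognition task from the BDCI-21 Smart HCI NLU Challenge¹, which requires converting text instructions into JSON format for home appliances; (2) a text summarization task from the BDCI-21 Summarization Challenge², which summarizes customer service call transcripts; (3) a legal judgment prediction task from the BDCI-21 Law Event Prediction Challenge³, where the "case description" and "judgment" are repurposed as a legal judgment prediction task; (4) a low-resource translation task from the ChrEn dataset (Zhang et al., 2020), translating the minority language Cherokee into English. Examples of the tasks are shown in Appendix A. To measure model performance, for the text-to-JSON task we calculate the exact match between model output and reference answer; for the other tasks, we employ GPT-4 to score model output between 0 and 10 given the reference answer⁴.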
[原文]All evaluations
use few-shot examples.
4.2
General Ability Evaluation
We select a broad range of benchmarks to evaluate
the extent to which the models’ general abilities are
preserved after training on new tasks. These benchmarks include MMLU (Hendrycks et al., 2021b), TriviaQA (Joshi et al., 2017), HellaSwag (Zellers et al., 2019), ARC-Challenge (Clark et al., 2...
1 https://www.datafountain.cn/competitions/511
2 https://www.datafountain.cn/competitions/536
3 https://www.datafountain.cn/competitions/540
4 The exact version we use is gpt-4-1106-preview. The evaluation instructions are in Appendix G.
[原文]The learning rates are set to 3e-5, 1e-4,
and 1e-5 for FFT, LoRA, and ESFT, respectively,
based on a hyperparameter search in {1e-5, 3e-5, 1e-4, 3e-4}. The LoRA rank is set to 8 and
scaling is set to 2, following Hu et al. (2021). The
threshold p is set to 0.1 for ESFT-Gate and 0.2
for ESFT-Token, respectively. §6.2 shows how we
determine the threshold for ESFT.
5 https://doc.hfai.high-flyer.cn/index.html

                 Math Ability      Code Ability         Specialized Tasks
                 MATH    GSM8K     Humaneval   MBPP     Intent   Summary   Law    Translation   Average
Vanilla Model    19.6    55.9      42.1        44.6     16.8     58.6      17.1   14.5          33.6
FFT              23.4    66.4      42.1        42.2...
[原文]As shown in Table 1, ESFT-Token and ESFT-Gate achieve near-best results in model enhancement tasks like Math, and ESFT-Gate achieves the best performance in the Humaneval task. ESFT also excels in model adaptation tasks, with ESFT-Gate achieving near-best performance in 3 tasks out of 4. Notably, ESFT-Gate’s average of 50.2 is competitive with FFT’s 51.0, slightly better than ESFT-Token’s 49.4, and significantly higher than LoRA’s 44.9. This demonstrates that finding task-relevant experts can efficiently adapt the model for customization. For general ability evaluation, ESFT ...
[原文]ESFT performs efficiently in terms of training time and storage space. In summary, ESFT demonstrates excellent performance in training time and storage space, significantly outperforming FFT. Furthermore, as shown in Table 3, ESFT requires far fewer trainable parameters than FFT, resulting in lower GPU memory usage. These advantages show that ESFT is efficient and effective for language model customization and adaptation.
6
Analysis
In this section, we investigate the expert selection
process of ESFT in §6.1, and demonstrate the per-
formance of ESFT and LoRA under different com-
pu...
[原文]We set rank ⩽
Non-shared   Shared    Non-expert   Trainable    Specialized   General   Average
Experts      Experts   Parameters   Parameters   Ability       Ability
ALL          ✓         ✓            15.7B        51.0          58.8      54.9
Relevant     ✓         ×            1.85B        49.8          60.7      55.3
Relevant     ×         ×            1.4B         49.4          61.5      55.4
×            ✓         ×            450M         47.4          61.2      54.3
×            ✓         ✓            1.3B         49.0          60.0      54.5
Relevant     ✓         ✓            2.7B         50.8          60.3      55.6
×            ×         ×            -            33.8          62.4      48.1
Table 3: Comparisons of different model configs based on whether training shared or non-shared parameters. Results include trainable parameters and performance of specialized and general abilities. The best or near-best results excluding the non-training settin...
[原文]The results are shown in Table 3. We report
average trainable parameters across all tasks, per-
formance of specialized and general abilities, and
their average. Detailed numbers for all benchmarks
are shown in Appendix D. From the results, we can
draw several conclusions:
Specialized performance increases as trainable parameters increase. The rank of trainable parameters from 450M to 15.7B highly aligns with the rank of specialized ability from 47.4 to 51.0. This suggests that increasing trainable parameters is effective in enhancing specialized performance.
General performance decreases as ...
1. Prioritize specialized ability:
[原文]Train all shared parameters and task-relevant non-shared experts to maximize the enhancement of specialized performance.
[原文]and computational efficiency: Train only
task-relevant non-shared experts to minimize
parameter costs while maximizing the main-
tenance of general ability.
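The trade-off behind these recommendations can be checked directly against the numbers in Table 3. The sketch below computes a Spearman rank correlation between trainable parameters and each ability score for the six trained configurations. The helper names are hypothetical, the parameter counts are converted to billions, and the simple rank formula assumes no tied values (which holds for this table).

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Table 3 rows, excluding the untrained model (values as reported above).
params_b = [15.7, 1.85, 1.4, 0.45, 1.3, 2.7]       # trainable parameters, in billions
specialized = [51.0, 49.8, 49.4, 47.4, 49.0, 50.8]  # specialized ability
general = [58.8, 60.7, 61.5, 61.2, 60.0, 60.3]      # general ability

rho_specialized = spearman(params_b, specialized)
rho_general = spearman(params_b, general)
```

On these six rows the rank agreement with specialized ability is perfect (a correlation of 1.0), while the correlation with general ability is negative (about -0.54), which is consistent with the two conclusions drawn from Table 3.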
6.4
Analysis of Key Modules in ESFT
In this section, we analyze and demonstrate that the
effectiveness of our method lies in two modules:
(1) our proposed expert relevance score functions
and (2) the fine-grained expert segmentation of the
MoE model architecture.
Expert Relevance Score Function
In this work,
we propose Average Gate Score and Token Se-
lection Ratio as expert relevance score functions
to filter relevant experts for differen...
[原文]grained segmentation. Experts in the same group
share the average affinity score. We maintain the
computational cost by selecting a constant 1/8 of
experts for each token.
Experiment results of
the Math domain in Figure 7 show that as the
group size increases, our method’s performance de-
creases more severely than FFT, while the training
cost (i.e., trainable experts) rises. These findings
indicate that our method, and even effective LLM
customization, highly rely on a fine-grained seg-
mented LLM architecture with more specialized
experts.
7
Conclusion
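In this work, we study parameter-efficient fine-tuning methods for sparse large language models with the Mixture of Experts (MoE) architecture. We first observe that tasks from different domains are handled by distinct combinations of experts. We then propose selecting the most relevant experts for downstream tasks using two metrics: average gate score and token selection ratio. Experimental results show that our method significantly reduces training costs while matching or surpassing full-parameter fine-tuning results. Further analysis confirms that our method enhances the specialization of the expert system within the MoE architecture.
Limitations
First, due to the limited availability of other fine-grained MoE models, our method was tested only on the DeepSeek-V2-Lite MoE model. The conclusions drawn from this model require further validation when applied to other contexts. In addition, lacking MoE models that are aligned in parameters and structure but differ in expert granularity, we used a simulation approach that binds several groups of experts together to compare coarse-grained and fine-grained MoE methods.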
[原文]In this work, we study parameter-efficient fine-
tuning methods for sparse large language models
with the Mixture of Experts (MoE) architecture.
We first observe that tasks from different domains
are handled by distinct combinations of experts.
We then propose selecting the most relevant experts
for downstream tasks using two metrics: average
gate score and token selection ratio. Experimental
results show that our method significantly reduces
training costs while matching or surpassing full
parameter fine-tuning results. Further analysis con-
firms that our method enhances the specialization
...
References
[原文]Alan Ansell, Edoardo Maria Ponti, Anna Korhonen,
and Ivan Vulić. 2021.
Composable sparse fine-
tuning for cross-lingual transfer.
arXiv preprint
arXiv:2110.07560.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten
Bosma, Henryk Michalewski, David Dohan, Ellen
Jiang, Trevor Cai, Anselm Levskaya, Charles Sutton,
et al. 2021. Program synthesis with large language
models. arXiv preprint arXiv:2108.07732.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan,
Maarten Dehghani, Pieter Abbeel, Deepak Pathak,
Brandon Sanders, Vishal Katarkar, Zareen Xu, et al.
2021. Evaluating large language models trained on
[原文]code. In NeurIPS. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. 2018. Think you have solved question an-
swering? try arc, the AI2 reasoning challenge. CoRR,
abs/1803.05457. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian,
Jacob Hilton, Reiichiro Nakano, Christopher Hesse,
and John Schulman. 2021. Gsm8k: A dataset for
grade school math problem solving. In NeurIPS. Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding
Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li,
Panpan Huang, Fuli Luo, Chon...
[原文]Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2021. Towards a
unified view of parameter-efficient transfer learning.
arXiv preprint arXiv:2110.04366. Dan Hendrycks, Collin Burns, Steven Basart, Andy
Zou, Mantas Mazeika, Dawn Song, and Jacob Stein-
hardt. 2021a. Measuring mathematical problem
solving with the math dataset.
arXiv preprint
arXiv:2103.03874. Dan Hendrycks, Collin Burns, Steven Basart, et al.
2021b. Measuring massive multitask language under-
standing. In International Conference on Learning
Representations (ICLR). Neil Houlsby, Andrei Giurgiu, S...
[原文]Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu,
Derong Xu, Feng Tian, and Yefeng Zheng. 2023. Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications.
arXiv preprint arXiv:2310.18339. Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam,
Zhengxiao Du, Zhilin Yang, and Jie Tang. 2021. P-
tuning v2: Prompt tuning can be comparable to fine-
tuning universally across scales and tasks. arXiv
preprint arXiv:2110.07602. Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xi-
ubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma,
Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder:
Empo...
2024. Jetmoe: Reaching llama2 performance with
[原文]0.1m dollars. CoRR, abs/2404.07413.
Yi-Lin Sung, Varun Nair, and Colin A Raffel. 2021.
Training neural networks with fixed sparse masks.
Advances in Neural Information Processing Systems,
34:24193–24205.
Danilo Vucetic, Mohammadreza Tayaranian, Maryam
Ziaeefard, James J Clark, Brett H Meyer, and War-
ren J Gross. 2022. Efficient fine-tuning of bert mod-
els on the edge. In 2022 IEEE International Sympo-
sium on Circuits and Systems (ISCAS), pages 1838–
[原文]Yaqing Wang, Subhabrata Mukherjee, Xiaodong Liu,
Jing Gao, Ahmed Hassan Awadallah, and Jian-
feng Gao. 2022. Adamix: Mixture-of-adapter for
parameter-efficient tuning of large language models.
arXiv preprint arXiv:2205.12410, 1(2):4.
XAI. 2024. Grok open release.
Liang Xu, Hai Hu, Xuanwei Zhang, et al. 2020. Clue:
A chinese language understanding evaluation bench-
mark. arXiv preprint arXiv:2004.05986.
Runxin Xu, Fuli Luo, Zhiyuan Zhang, Chuanqi Tan,
Baobao Chang, Songfang Huang, and Fei Huang.
2021. Raise a child in large language model: To-
[原文]wards effective and generalizable fine-tuning. arXiv
preprint arXiv:2109.05687.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu,
Zhengying Liu, Yu Zhang, James T Kwok, Zhen-
guo Li, Adrian Weller, and Weiyang Liu. 2023.
Metamath: Bootstrap your own mathematical ques-
tions for large language models.
arXiv preprint
arXiv:2309.12284.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Farhadi, and Yejin Choi. 2019. HellaSwag: Can a
machine really finish your sentence? In Proceedings
of the 57th Conference of the Association for Com-
putational Linguistics, ACL 2019, Florence, Italy,
July 28- August 2...
Appendix
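A
Examples for Specialized Tasks
B
Strategy for Grouping Experts
To group experts together and simulate coarse-grained mixture-of-experts transformer models, we calculate expert similarity and group the experts by maximizing in-group similarities using a greedy search algorithm. We sample data from the alignment dataset, containing 32 samples each with a sequence length of 4096, to calculate the similarity between experts. We initialize a co-occurrence matrix for all expert pairs as a zero matrix. For every pair of experts that occur simultaneously in a token's Top-6 expert choices, we increment their score in the matrix by 1. After iterating through the dataset, we calculate the similarity between expert i and expert j as the cosine similarity between the vectors of row i and row j of the matrix. To obtain an expert grouping strategy through greedy search, we calculate the average intra-group similarity (i.e., the average pairwise similarity of all experts in the group) for all possible K-expert groups (where K is the group size, either 2 or 4) from the 64 non-shared experts out of the 66 experts in each layer. We then select the K-expert group with the highest score. For the remaining unselected experts, we repeat this process until all experts are selected and grouped.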
[原文]A
Examples for Specialized Tasks
Table 5 presents task examples as prompts and cor-
responding reference responses for each special-
ized task, including intent recognition, text sum-
marization, legal judgment prediction, and low-
resource translation. B
Strategy for Grouping Experts
To group experts together and simulate coarse-
grained mixture-of-experts transformer models, we
calculate expert similarity and group the experts by
maximizing in-group similarities using a greedy
search algorithm. We sample data from the alignment dataset, con-
taining 32 samples each with a sequence length of
...
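The grouping strategy of Appendix B (co-occurrence counting, row-wise cosine similarity, greedy selection of the best group) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the input format (one list of selected expert indices per token) are hypothetical, and the exhaustive enumeration of candidate groups is only practical at toy scale or small group sizes.

```python
import math
from itertools import combinations


def cooccurrence(token_choices, n_experts):
    """Count how often each pair of experts is selected for the same token."""
    matrix = [[0] * n_experts for _ in range(n_experts)]
    for chosen in token_choices:
        for a, b in combinations(sorted(set(chosen)), 2):
            matrix[a][b] += 1
            matrix[b][a] += 1
    return matrix


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_groups(matrix, group_size):
    """Repeatedly pick the group with the highest in-group similarity."""
    n = len(matrix)
    sim = [[cosine(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]
    remaining = set(range(n))
    groups = []
    while len(remaining) >= group_size:
        # All candidate groups have the same number of pairs, so maximizing
        # the sum of pairwise similarities also maximizes their average.
        best = max(
            combinations(sorted(remaining), group_size),
            key=lambda grp: sum(sim[a][b] for a, b in combinations(grp, 2)),
        )
        groups.append(best)
        remaining -= set(best)
    return groups
```

Note that similarity is measured between rows of the co-occurrence matrix, so two experts count as similar when they tend to co-occur with the same partners, not merely with each other.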
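C
Analysis of Expert Affinity Sample Size
To evaluate the amount of data needed to identify the most relevant experts for a task, we independently sample two sets of data from the training set for each of the six tasks and compute the Top-6 experts shared between the two sets. The results are shown in Figure 8. As the sample size reaches 2^{17} (i.e., 32 samples with a sequence length of 4096), all tasks exhibit a high number of shared experts between the two samples. This indicates that the sample size is sufficiently large to select the most relevant experts for a task.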
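The stability check described in Appendix C amounts to comparing the top-ranked experts obtained from two independent samples. A minimal sketch follows, assuming per-expert relevance scores have already been computed for each sample; the function names are hypothetical.

```python
def top_k_experts(scores, k=6):
    """Indices of the k highest-scoring experts."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def shared_top_k(scores_a, scores_b, k=6):
    """Number of experts that appear in the top-k of both samples."""
    return len(top_k_experts(scores_a, k) & top_k_experts(scores_b, k))


# 32 samples of 4096 tokens each give the 2^17 tokens quoted in the text.
TOKENS_PER_SUBSET = 32 * 4096
```

A count close to k for every task indicates that the chosen sample size already identifies the relevant experts reliably.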
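D
Detailed Results for Ablations on Training Shared Parameters
We present two tables that summarize the performance of various methods with different configurations for training shared or non-shared parameters. Table 6 shows results on general tasks, and Table 7 focuses on specialized tasks. The results indicate that training only task-relevant non-shared experts consistently maintains the best general task performance. Additionally, training task-relevant non-shared experts and all shared parameters yields the best specialized task performance, short of full-parameter fine-tuning.
E
Qualitative Examples of Expert Choices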
[原文]D
Detailed Results for Ablations on
Training Shared Parameters
We present two tables that summarize the perfor-
mance of various methods with different configurations for training shared or non-shared parameters. Table 6 shows results on general tasks, and Table 7
focuses on specialized tasks. The results indicate
that training only task-relevant non-shared experts
consistently maintains the best general task perfor-
mance. Additionally, training task-relevant non-
shared experts and all shared parameters yields the
best specialized task performance, short of full-
parameter fine-tuning. E
Qu...
[原文]Best or near-best results are shown in bold.
est tokens are key words like “const”, or important
commentary words like “Fetch the list of IDs”.
F
The Impact of Mixing Alignment Data
for Training
We adopt a 1:1 ratio for downstream task data and
alignment data for all methods during training to
better maintain general task performance. This
manual ratio is kept constant to avoid the signif-
icant additional costs associated with fine-tuning
the ratio for each task. In this section, we present performance compar-
isons across various methods and tasks to reveal the
impact of mixing alignment data...
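Content redundancy: Is the predicted answer concise and consistent with the style of the standard answer? Respond strictly in the following format: "Content accuracy x points, level of detail/completeness x points, ..., total score: x points". The total score is the average of all the scores. Do not give reasons for your scores. Predicted answer: {prediction} Reference answer: {ground_truth}
(a) Intent recognition (b) Low-resource translation (c) Text summarization (d) Legal judgment prediction (e) Math domain (f) Code domain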
[原文]Content redundancy: Is the
predicted answer concise and consistent with the style of the standard answer? Respond following the
format: "Content accuracy x points, level of detail/completeness x points, ..., total score: x points". The total score is the average of all the scores. Do not give reasons for your scores. Predicted answer:
{prediction} Reference answer: {ground_truth}
Table 11: Task instructions for model performance evaluation. The placeholder {prediction} and {ground_truth}
represent model prediction and reference answer, respectively.
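Because the evaluation prompt fixes the reply format, the total score can be extracted with a small parser. The sketch below is an assumption about how such replies might be post-processed; the function name and the choice to reject values outside the 0 to 10 range are hypothetical and not taken from the paper.

```python
import re

# Matches the trailing "total score: x points" part of the reply format.
_TOTAL = re.compile(r"total score:\s*(\d+(?:\.\d+)?)\s*points?", re.IGNORECASE)


def parse_total_score(reply):
    """Return the total score as a float, or None if it is missing or invalid."""
    match = _TOTAL.search(reply)
    if match is None:
        return None
    score = float(match.group(1))
    # Scores are defined on a 0 to 10 scale; anything else is treated as invalid.
    return score if 0.0 <= score <= 10.0 else None
```

Returning None for malformed replies lets the caller decide whether to retry the judgment or exclude the sample.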
(a) Intent recognition
(b) Low-resource tran...