Mixtral: Mixture of Experts
Mistral
Mixtral of Experts
Mistral AI Team
📅 2024-01-03 | 📄 arXiv: 2401.04088
Abstract
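Mixtral 8x7B is a sparse mixture-of-experts (SMoE) model with 47B total parameters, of which only about 13B are active for each token during inference. It matches or outperforms Llama 2 70B on the benchmarks reported, while using about 5x fewer active parameters per token (13B versus 70B). Mixtral was trained with a 32k-token context window and can be fine-tuned with LoRA. The model is released under the Apache 2.0 license.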
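The gap between 47B total and 13B active parameters can be checked with rough arithmetic. The sketch below is only an estimate: the hyperparameters are assumptions that this excerpt does not state (they follow the architecture table of the paper as commonly reported), and normalization weights are ignored.

```python
# Assumed hyperparameters: NOT stated in this excerpt, illustrative only.
DIM = 4096          # model (embedding) dimension
HIDDEN_DIM = 14336  # feed-forward hidden dimension of one expert
N_LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8      # grouped-query attention: fewer key/value heads
HEAD_DIM = 128
VOCAB_SIZE = 32000
NUM_EXPERTS = 8
TOP_K = 2           # experts used per token


def count_params(experts_per_token: int) -> int:
    """Approximate parameter count when `experts_per_token` experts are counted per layer."""
    expert = 3 * DIM * HIDDEN_DIM  # three weight matrices per SwiGLU expert
    attention = 2 * DIM * N_HEADS * HEAD_DIM + 2 * DIM * N_KV_HEADS * HEAD_DIM
    router = DIM * NUM_EXPERTS     # one linear gating layer
    per_layer = experts_per_token * expert + attention + router
    embeddings = 2 * VOCAB_SIZE * DIM  # input embedding + output projection
    return N_LAYERS * per_layer + embeddings


total = count_params(NUM_EXPERTS)  # every expert is stored
active = count_params(TOP_K)       # only the routed experts run per token

print(f"total  ~ {total / 1e9:.1f}B")   # total  ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # active ~ 12.9B
```

Under these assumptions the expert weights account for almost all of the total, which is why routing each token to 2 of 8 experts cuts the active count so sharply.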
Paper Content
Mixtral of Experts Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lac
ow batch-sizes, and higher throughput at large batch-sizes. Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “
for diverse applications. To enable the community to run Mixtral with a fully open-source stack, we submitted changes to the vLLM project, which integrates Megablocks CUDA kernels for efficient inference. Skypilot also allows the deployment of vLLM endpoints on any instance in the cloud. 2 A
logits of a linear layer [28]. We use $G(x) := \mathrm{Softmax}(\mathrm{TopK}(x \cdot W_g))$, where $(\mathrm{TopK}(\ell))_i := \ell_i$ if $\ell_i$ is among the top-$K$ coordinates of logits $\ell \in \mathbb{R}^n$ and $(\mathrm{TopK}(\ell))_i := -\infty$ otherwise. The value of $K$, the number of experts used per token, is a hyper-parameter that modulates the amount of compute used
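The gating rule $G(x) = \mathrm{Softmax}(\mathrm{TopK}(x \cdot W_g))$ can be sketched in a few lines of plain Python. This is a minimal illustration for a single token, not the authors' implementation; the function names and the toy weights are made up for the example.

```python
import math


def top_k_mask(logits, k):
    """TopK(l): keep the k largest logits and set every other entry to -inf."""
    keep = set(sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k])
    return [v if i in keep else -math.inf for i, v in enumerate(logits)]


def softmax(values):
    """Softmax that tolerates -inf entries (they receive weight exactly 0)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


def gate(x, w_g, k=2):
    """G(x) = Softmax(TopK(x . W_g)); w_g has shape (len(x), n_experts)."""
    n_experts = len(w_g[0])
    logits = [sum(x[d] * w_g[d][j] for d in range(len(x))) for j in range(n_experts)]
    return softmax(top_k_mask(logits, k))


# Toy example (hypothetical numbers): 2-dimensional token, 4 experts.
x = [1.0, 2.0]
w_g = [[0.1, 0.4, -0.3, 0.0],
       [0.2, -0.1, 0.5, 0.3]]
weights = gate(x, w_g, k=2)
print(weights)  # only the two experts with the largest logits get non-zero weight
```

Masking with $-\infty$ before the softmax means the selected experts' weights are renormalized among themselves, so they always sum to 1 regardless of how many experts exist.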
t $K = 2$. This means each token is routed to two SwiGLU sub-blocks with different sets of weights. Taking this all together, the output $y$ for an input token $x$ is computed as:
$$y = \sum_{i=0}^{n-1} \mathrm{Softmax}(\mathrm{Top2}(x \cdot W_g))_i \cdot \mathrm{SwiGLU}_i(x).$$
This formulation is similar to the GShard architecture [21], with
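The output formula above can be illustrated with a small pure-Python sketch. Several things here are assumptions rather than details from the text: the SwiGLU form `w2 @ (silu(w1 @ x) * (w3 @ x))`, the storage of the gating matrix as one row per expert, and all names and toy dimensions. Because unselected experts have gate weight 0, the sketch simply skips them instead of evaluating all $n$ terms of the sum.

```python
import math
import random


def silu(v):
    """SiLU / swish, written with tanh so it cannot overflow: v * sigmoid(v)."""
    return v * 0.5 * (1.0 + math.tanh(0.5 * v))


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu(x, w1, w2, w3):
    """Assumed SwiGLU expert: w2 @ (silu(w1 @ x) * (w3 @ x))."""
    hidden = [silu(a) * b for a, b in zip(matvec(w1, x), matvec(w3, x))]
    return matvec(w2, hidden)


def top2_gate(logits):
    """Softmax over the two largest logits; returns {expert_index: weight}."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_forward(x, w_g_rows, experts):
    """y = sum_i Softmax(Top2(x . W_g))_i * SwiGLU_i(x), running only the 2 selected experts."""
    logits = matvec(w_g_rows, x)  # one gating row per expert
    y = [0.0] * len(x)
    for i, weight in top2_gate(logits).items():
        out = swiglu(x, *experts[i])
        y = [acc + weight * o for acc, o in zip(y, out)]
    return y


# Toy usage with hypothetical sizes: 4-dim tokens, 8-dim hidden layer, 8 experts.
rng = random.Random(0)
DIM, HIDDEN, N_EXPERTS = 4, 8, 8


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


experts = [(rand_matrix(HIDDEN, DIM), rand_matrix(DIM, HIDDEN), rand_matrix(HIDDEN, DIM))
           for _ in range(N_EXPERTS)]
w_g_rows = rand_matrix(N_EXPERTS, DIM)
x = [0.5, -1.0, 0.25, 2.0]
y = moe_forward(x, w_g_rows, experts)
print(y)
```

Only two of the eight expert blocks are evaluated per token, which is the source of the compute savings relative to a dense layer holding the same number of parameters.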
% 25.0% 40.9% 8.4% 44.1%
LLaMA 2 70B  70B  69.9% 85.4% 80.4% 82.6% 79.9% 56.5% 25.4% 73.0% 29.3% 49.8% 13.8% 69.6%
Mistral 7B  7B  62.5% 81.0% 74.2% 82.2% 80.5% 54.9% 23.2% 62.5% 26.2% 50.2% 12.7% 50.0%
Mixtral 8x7B  13B  70.6% 8
note that the SMoEs layer introduces additional overhead due to the routing mechanism and due to the increased memory loads when running more than one expert per device. They are more suitable for batched workloads where one can reach a good degree of arithmetic intensity. Comparison with Llama 2 70
h accuracy in English. In particular, Mixtral significantly outperforms Llama 2 70B in French, German, Spanish, and Italian, as shown in Table 4.
Table 4 header (truncated): Model, Active Params; French, German, Spanish, Italian; Arc-c, HellaS, MMLU, Arc-c
Bias Benchmark for QA (BBQ) [24] and Bias in Open-Ended Language Generation Dataset (BOLD) [10]. BBQ is a dataset of ha
51.5% 56.0%
BOLD sentiment score (avg ± std)
gender  0.293 ± 0.073  0.323 ± 0.045
profession  0.218 ± 0.073  0.243 ± 0.087
Gemini Pro, Claude-2.1, and Llama 2 70B chat.
Figure 6: LMSys Leaderboard. (Screenshot from Dec 22, 2023) Mixtral 8x7B Instruct v0.1 achieves an Arena Elo rating of 1121, outperforming Claude-2.1 (1117), all versions of GPT-3.5-Turbo (1117 best), Gemini Pro (1111), and Llama-2-70b-chat (1077). Mixtra
hrough the same expert even though they involve multiple tokens. Similarly, in code, the indentation tokens are always assigned to the same experts, particularly at the first and last layers where the hidden states are more correlated to the input and output of the model. We also note from Figure 8
61.9% 51.3%
PubMed Abstracts  14.2% 24.6% 22.0% 48.6% 61.6% 51.8%
StackExchange  13.6% 27.2% 23.6% 48.2% 64.6% 53.6%
Wikipedia (en)  14.4% 23.6% 25.3%
are making our trained and fine-tuned models publicly available under the Apache 2.0 license. By sharing our models, we aim to facilitate the development of new techniques and applications that can benefit a wide range of industries and domains.
Figure 8: Text samples where each token is colored
n, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International Conference on Machine Learning, pages 4057–4086. PMLR, 2022.
[7] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Tou
a Chen, Rahul Mazumder, Lichan Hong, and Ed Chi. DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. Advances in Neural Information Processing Systems, 34:29335–29347, 2021.
[16] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeik
v preprint arXiv:2305.16300, 2023.
[24] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. BBQ: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193, 2021.
[25] Rafael Rafailov, Archit Shar
uan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
[34] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Na
Figure 10: Repeated consecutive assignments per MoE layer. Repeated assignments occur far more often than they would under uniform assignment (shown by the dashed lines). The patterns are similar across datasets, with fewer repetitions for DM Mathematics.
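The uniform-assignment baseline that the Figure 10 caption compares against can be worked out directly. The sketch below assumes 8 experts with independent, uniformly random routing, and counts a "repeat" as two consecutive tokens sharing at least one expert; the function names and the toy routing trace are hypothetical.

```python
from math import comb


def uniform_repeat_rate(n_experts: int = 8, k: int = 1) -> float:
    """P(next token shares at least one of this token's k experts) under uniform random routing."""
    # Complement: all k experts of the next token fall among the other n_experts - k.
    return 1.0 - comb(n_experts - k, k) / comb(n_experts, k)


def observed_repeat_rate(assignments) -> float:
    """Fraction of consecutive token pairs that share at least one expert."""
    pairs = list(zip(assignments, assignments[1:]))
    return sum(1 for a, b in pairs if set(a) & set(b)) / len(pairs)


top1_baseline = uniform_repeat_rate(8, 1)  # 1/8
top2_baseline = uniform_repeat_rate(8, 2)  # 1 - 15/28

# Hypothetical routing trace: the two experts chosen for each of five tokens.
trace = [(0, 3), (3, 5), (1, 2), (2, 7), (4, 6)]
rate = observed_repeat_rate(trace)

print(f"uniform baseline, first choice only: {top1_baseline:.1%}")  # 12.5%
print(f"uniform baseline, either of two:     {top2_baseline:.1%}")  # 46.4%
print(f"observed on toy trace:               {rate:.1%}")           # 50.0%
```

An observed rate well above these baselines is what indicates temporal locality in the router's choices.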