Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch,
Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas,
Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour,
Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux,
Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao,
Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
arXiv:2401.04088v1 [cs.LG] 8 Jan 2024
Abstract
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language
model. Mixtral has the same architecture as Mistral 7B, with the difference
that each layer is composed of 8 feedforward blocks (i.e. experts). For every
token, at each layer, a router network selects two experts to process the current token and combine their outputs additively.
at low batch-sizes, and higher throughput at large batch-sizes.
Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward
block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router
network chooses two of these groups (the “experts”) to process the token and combine their output
additively. This technique increases the number of parameters of a model while controlling cost and
latency, as the model only uses a fraction of the total set of parameters per token.
Mixtral is pretrained with multilingual data using a context size of 32k tokens. It either matches
or exceeds the performance of Llama 2 70B and GPT-3.5 over several benchmarks. In particular,
Figure 1: Mixture of Experts Layer. Each input vector is assigned to 2 of the 8 experts by a router. The layer's output is the weighted sum of the outputs of the two selected experts.
for diverse applications.
To enable the community to run Mixtral with a fully open-source stack, we submitted changes to
the vLLM project, which integrates Megablocks CUDA kernels for efficient inference. Skypilot also
allows the deployment of vLLM endpoints on any instance in the cloud.
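As a concrete illustration, the snippet below shows how a Mixtral checkpoint could be queried offline through vLLM. This is a minimal sketch, assuming the Hugging Face model id mistralai/Mixtral-8x7B-Instruct-v0.1 and a vLLM build with Mixtral support; the tensor_parallel_size and sampling settings are placeholders to adapt to the available hardware.

```python
# Minimal sketch: running Mixtral through vLLM (model id and parallelism are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed Hugging Face checkpoint name
    tensor_parallel_size=2,                        # split the weights across 2 GPUs (adjust)
)
sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain sparse mixture-of-experts in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```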
2 Architectural details
Mixtral is based on a transformer architecture [31] and uses the same modifications as described in [18], with the notable exceptions that Mixtral supports a fully dense context length of 32k tokens and that the feedforward blocks are replaced by Mixture-of-Expert layers (Section 2.1). The model architecture parameters are summarized in Table 1.

Table 1: Model architecture parameters.
Parameter   Value
dim         4096
n_layers    32
…           …
taking the softmax over the Top-K logits of a linear layer [28]. We use
G(x) := Softmax(TopK(x · Wg)),
where (TopK(ℓ))_i := ℓ_i if ℓ_i is among the top-K coordinates of logits ℓ ∈ R^n and (TopK(ℓ))_i := −∞
otherwise. The value of K (the number of experts used per token) is a hyper-parameter that modulates the amount of compute used to process each token. If one increases n while keeping K fixed, one can increase the model's parameter count while keeping its computational cost effectively constant. This motivates a distinction between the model's total parameter count (commonly referenced as the sparse parameter count), which grows with n, and the number of parameters used for processing an individual token (called the active parameter count), which grows with K up to n.

1 https://mistral.ai/news/mixtral-of-experts/
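To make the routing concrete, below is a minimal PyTorch sketch of a sparse MoE layer with top-K gating over n experts. It is an illustrative implementation under the definitions above, not the released Mixtral code: the module and variable names (SimpleMoELayer, n_experts, top_k) are ours, and the experts are shown as generic feedforward modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Sketch of a sparse Mixture-of-Experts layer with top-K gating (illustrative names)."""

    def __init__(self, dim: int, hidden_dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router: x -> one logit per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, dim)
        logits = self.gate(x)                                   # (n_tokens, n_experts)
        topk_logits, topk_idx = logits.topk(self.top_k, dim=-1)
        # Softmax over the K retained logits; unselected experts effectively get logit -inf.
        weights = F.softmax(topk_logits, dim=-1)                # (n_tokens, top_k)
        y = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, k] == e                      # tokens whose k-th choice is expert e
                if mask.any():
                    y[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return y
```

For example, SimpleMoELayer(dim=4096, hidden_dim=16384)(torch.randn(16, 4096)) runs only the two selected experts per token (the hidden dimension here is an arbitrary placeholder), which is why the active parameter count stays far below the total parameter count.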
MoE layers can be run efficiently on single GPUs with high-performance specialized kernels.
For Mixtral, we use the same SwiGLU architecture as the expert function SwiGLU_i(x) and set K = 2. This means each token is routed to two
SwiGLU sub-blocks with different sets of weights. Taking this all together, the output y for an input
token x is computed as:
y = Σ_{i=0}^{n−1} Softmax(Top2(x · Wg))_i · SwiGLU_i(x).
This formulation is similar to the GShard architecture [21], with the exceptions that we replace all
FFN sub-blocks by MoE layers while GShard replaces every other block, and that GShard uses a
more elaborate gating strategy for the second expert assigned to each token.
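For reference, a minimal sketch of the SwiGLU expert block in PyTorch is shown below. The three-projection formulation is the standard SwiGLU feedforward used in Llama-style models; the layer names w1, w2, w3 and the absence of biases are conventions assumed here, not details taken from the released Mixtral weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU feedforward block, SwiGLU(x) = W2 (SiLU(W1 x) * W3 x)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

In the MoE layer above, each of the 8 experts would be one such block with its own weights, and only the K = 2 blocks selected by the router are evaluated for a given token.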
3 Results
We compare Mixtral to Llama, and re-run all benchmarks with our own evaluation pipeline for fair
comparison. We measure performance on a wide variety of tasks categorized as follows:
• Commonsense Reasoning (0-shot): Hellaswag [32], Winogrande [26], PIQA [3], SIQA [27],
OpenbookQA
Model         Active Params  MMLU    HellaS  WinoG   PIQA    Arc-e   Arc-c   NQ      TriQA   HumanE  MBPP    MATH    GSM8K
LLaMA 2 70B   70B            69.9%   85.4%   80.4%   82.6%   79.9%   56.5%   25.4%   73.0%   29.3%   49.8%   13.8%   69.6%
Mistral 7B    7B             62.5%   81.0%   74.2%   82.2%   80.5%   54.9%   23.2%   62.5%   26.2%   50.2%   12.7%   50.0%
Mixtral 8x7B  13B            70.6%   84.4%   77.2%   83.6%   83.1%   59.7%   30.6%   71.5%   40.2%   60.7%   28.4%   74.4%
Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on
almost all popular benchmarks while using 5x fewer active parameters during inference.
Figure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension,
math and code for Mistral (7B/8x7B) vs Llama 2 (7B/13B/70B). Mixtral largely outperforms Llama 2 70B
on all benchmarks, except on reading comprehension benchmarks.
note that SMoE layers introduce additional overhead due to the routing mechanism and due to the increased memory loads
when running more than one expert per device. They are more suitable for batched workloads where
one can reach a good degree of arithmetic intensity.
Comparison with Llama 2 70B and GPT-3.5. In Table 3, we report the performance of Mixtral 8x7B
compared to Llama 2 70B and GPT-3.5. We observe that Mixtral performs similarly or above the
two other models. On MMLU, Mixtral obtains a better performance, despite its significantly smaller
capacity (47B parameters compared to 70B). For MT-Bench, we report the performance of the latest
GPT-3.5-Turbo model available, gpt-3.5-turbo-1106.
2 Since Llama 2 34B was not open-sourced, we report results for Llama 1 34B.
while maintaining a high accuracy in English. In particular, Mixtral significantly outperforms Llama 2 70B
in French, German, Spanish, and Italian, as shown in Table 4.
Model         Active Params  French                    German                    Spanish                   Italian
                             Arc-c   HellaS  MMLU      Arc-c   HellaS  MMLU      Arc-c   HellaS  MMLU      Arc-c   HellaS  MMLU
LLaMA 1 33B   33B            39.3%   68.1%   49.9%     41.1%   63.3%   48.7%     45.7%   69.8%   52.3%     42.9%   65.4%   49.0%
LLaMA 2 70B   70B            49.9%   72.5%   64.3%     47.3%   68.7%   64.2%     50.5%   74.5%   66.0%     49.4%   70.9%   65.1%
Mixtral 8x7B  13B            58.2%   77.4%   70.9%     54.3%   73.0%   71.5%     55.4%   77.6%   72.5%     52.8%   75.1%   70.9%
Table 4: Comparison of Mixtral with Llama on Multilingual Benchmarks. On ARC Challenge, Hellaswag, and MMLU, Mixtral outperforms Llama 2 70B in French, German, Spanish, and Italian.
Bias Benchmark for QA (BBQ) [24] and Bias in Open-Ended Language Generation Dataset (BOLD) [10]. BBQ is a dataset of hand-written question sets that target attested social biases against nine different socially-relevant categories: age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socio-economic status, sexual orientation.

                      Llama 2 70B      Mixtral 8x7B
BBQ accuracy          51.5%            56.0%
BOLD sentiment score (avg ± std)
  gender              0.293 ± 0.073    0.323 ± 0.045
  profession          0.218 ± 0.073    0.243 ± 0.087
  religious_ideology  0.188 ± 0.133    0.144 ± 0.089
  political_ideology  0.149 ± 0.140    0.186 ± 0.146
  race                0.232 ± 0.049    0.232 ± 0.052

Figure 5: Bias Benchmarks. Compared to Llama 2 70B, Mixtral presents less bias on BBQ (higher accuracy) and displays more positive sentiments on BOLD.
Gemini Pro, Claude-2.1, and Llama 2 70B chat.
Figure 6: LMSys Leaderboard. (Screenshot from Dec 22, 2023) Mixtral 8x7B Instruct v0.1 achieves an Arena
Elo rating of 1121, outperforming Claude-2.1 (1117), all versions of GPT-3.5-Turbo (1117 best), Gemini Pro
(1111), and Llama-2-70b-chat (1077). Mixtral is currently the best open-weights model by a large margin.
3 https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
5 Routing analysis
In this section, we perform a small analysis of the expert selection by the router. In particular, we are interested in whether, during training, some experts specialized in specific domains (e.g. mathematics, biology, philosophy, etc.).
To investigate this, we measure the distribution of selected experts on different subsets of The Pile validation dataset.
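As a sketch of how such a distribution can be tallied, the snippet below counts how often each expert is selected, assuming the per-token top-2 expert indices for one layer and one domain have already been extracted from the router; the function name and array layout are illustrative, not taken from the paper's tooling.

```python
import numpy as np

def selection_proportions(expert_ids: np.ndarray, n_experts: int = 8) -> np.ndarray:
    """expert_ids: (n_tokens, 2) array of the two experts chosen per token at one layer.
    Returns the fraction of routing slots assigned to each expert."""
    counts = np.bincount(expert_ids.ravel(), minlength=n_experts)
    return counts / counts.sum()

# Example with uniformly random routing: every proportion is close to 1/8 = 0.125.
rng = np.random.default_rng(0)
ids = np.stack([rng.choice(8, size=2, replace=False) for _ in range(10_000)])
print(selection_proportions(ids))
```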
through the same expert
even though they involve multiple tokens. Similarly, in code, the indentation tokens are always
assigned to the same experts, particularly at the first and last layers where the hidden states are more
correlated to the input and output of the model.
We also note from Figure 8 that consecutive tokens are often assigned the same experts. In fact, we
observe some degree of positional locality in The Pile datasets. Table 5 shows the proportion of consecutive tokens that get the same expert assignments per domain and layer. The proportion of repeated consecutive assignments is significantly higher than random, especially at the higher layers.
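A minimal sketch of how these repetition proportions can be computed from a sequence of routing decisions is shown below; it again assumes the top-2 expert indices per token are available, and the function name is ours.

```python
import numpy as np

def repetition_rates(expert_ids: np.ndarray) -> tuple[float, float]:
    """expert_ids: (n_tokens, 2), column 0 = first choice, column 1 = second choice.
    Returns (first-choice repetition, first-or-second overlap) between consecutive tokens."""
    first_same = float(np.mean(expert_ids[1:, 0] == expert_ids[:-1, 0]))
    overlap = float(np.mean([
        bool(set(expert_ids[t]) & set(expert_ids[t + 1]))
        for t in range(len(expert_ids) - 1)
    ]))
    return first_same, overlap
```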
Figure 7: Proportion of tokens assigned to each expert (Expert ID 0–7) at layers 0, 15 and 31, for the different subsets of The Pile (e.g. ArXiv).
                    First choice                  First and second choice
Dataset             Layer 0  Layer 15  Layer 31   Layer 0  Layer 15  Layer 31
…                   …        …         …          …        61.9%     51.3%
PubMed Abstracts    14.2%    24.6%     22.0%      48.6%    61.6%     51.8%
StackExchange       13.6%    27.2%     23.6%      48.2%    64.6%     53.6%
Wikipedia (en)      14.4%    23.6%     25.3%      49.8%    62.1%     51.8%
Table 5: Percentage of expert assignment repetitions. We evaluate the proportion of times the same expert is assigned to a token i and its following token i+1. We report whether the first chosen expert is the same, or whether the same expert is observed as first or second choice in consecutive tokens. For reference, the expected proportion of repetitions in the case of random assignments is 1/8 = 12.5% for "First choice" and 1 − (6/8)(5/7) ≈ 46% for "First and second choice".
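These random-assignment baselines can be checked directly; the short computation below is ours, not part of the paper.

```python
from fractions import Fraction

# First choice: the next token's first expert matches this token's first expert.
first_choice = Fraction(1, 8)                     # 12.5%

# First and second choice: two 2-expert subsets of 8 experts share at least one expert.
no_overlap = Fraction(6, 8) * Fraction(5, 7)      # P(both of the next token's experts differ)
either_choice = 1 - no_overlap                    # 13/28 ≈ 46.4%

print(float(first_choice), float(either_choice))  # 0.125 0.4642857142857143
```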
We are making our trained and fine-tuned models publicly available under the Apache 2.0 license. By sharing our models, we aim to facilitate the development of new techniques and applications that can benefit a wide range of industries and domains.
Figure 8: Text samples where each token is colored with the first expert choice. The selection of experts appears to be aligned with the syntax rather than the domain, especially at the initial and final layers.
Acknowledgements
We thank the CoreWeave and Scaleway teams for technical support as we trained our models. We
are grateful to NVIDIA for supporting us in integrating TensorRT-LLM and Triton and working
alongside us to make a sparse mixture of experts compatible with TensorRT-LLM.
References
[1] Jacob Austin, Augustus Odena, Maxwe
n, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified
scaling laws for routed language models. In International Conference on Machine Learning,
pages 4057–4086. PMLR, 2022.
[7] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and
Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions.
arXiv preprint arXiv:1905.10044, 2019.
[8] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning
challenge. arXiv preprint arXiv:1803.05457, 2018.
[9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro
a Chen,
Rahul Mazumder, Lichan Hong, and Ed Chi. Dselect-k: Differentiable selection in the mixture
of experts with applications to multi-task learning. Advances in Neural Information Processing
Systems, 34:29335–29347, 2021.
[16] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.
[17] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn
Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.
arXiv preprint arXiv:2103.03874, 2021.
[18] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
v preprint arXiv:2305.16300, 2023.
[24] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thomp-
son, Phu Mon Htut, and Samuel R Bowman. Bbq: A hand-built bias benchmark for question
answering. arXiv preprint arXiv:2110.08193, 2021.
[25] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and
Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.
arXiv preprint arXiv:2305.18290, 2023.
[26] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An
adversarial winograd schema challenge at scale. Communications of the ACM, pages 99–106,
2021.
[27] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Com-
monsense reasoning about soc
uan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and
chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
[34] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied,
Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation
models. arXiv preprint arXiv:2304.06364, 2023.
[35] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai,
Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in
Neural Information Processing Systems, 35:7103–7114, 2022.
Figure 9: Per-expert selection proportion at layers 0, 15 and 31, shown separately for the first choice, the second choice, and either choice.
Figure 10: Repeated consecutive assignments per MoE layer. Repeated assignments occur much more often than they would with uniform random assignments (indicated by the dashed lines). Patterns are similar across datasets, with fewer repetitions for DM Mathematics.