Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel,
Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux,
Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix,
William El Sayed
arXiv:2310.06825v1 [cs.CL] 10 Oct 2023
Abstract
We introduce Mistral 7B, a 7-billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – Chat both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Mistral 7B approaches the coding performance of Code-Llama 7B [20],
without sacrificing performance on non-code related benchmarks.
Mistral 7B leverages grouped-query attention (GQA) [1], and sliding window attention (SWA) [6, 3].
GQA significantly accelerates inference and reduces the memory requirement during
decoding, allowing for higher batch sizes and hence higher throughput, a crucial factor for real-time
applications. In addition, SWA is designed to handle longer sequences more effectively at a reduced
computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms
collectively contribute to the enhanced performance and efficiency of Mistral 7B.
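To make the mechanism concrete, here is a minimal sketch of grouped-query attention: each key/value head is shared by a group of query heads, so the KV cache stores only n_kv_heads heads instead of n_heads. This is an illustrative sketch, not the reference implementation; tensor names and shapes are our own, and the causal mask is omitted for brevity.

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_heads // n_kv_heads  # query heads sharing one KV head
    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group, dim=1)  # (batch, n_heads, seq, head_dim)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```

Because only n_kv_heads key/value heads are cached, the decoding cache shrinks by a factor of n_heads / n_kv_heads relative to standard multi-head attention, which is where the throughput gain comes from.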
Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference
implementation (https://github.com/mistralai/mistral-src) facilitating easy deployment either locally
or on cloud platforms such as AWS, GCP, or Azure, using the vLLM inference server and SkyPilot
(https://github.com/skypilot-org/skypilot). Integration with Hugging Face
(https://huggingface.co/mistralai) is also streamlined.
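As an illustration, a minimal sketch of local inference through vLLM, assuming the publicly released Hugging Face weights (the repository id and sampling values below are illustrative assumptions, not prescribed by the paper):

```python
from vllm import LLM, SamplingParams

# Load the released weights (assumed Hugging Face repo id) and sample a completion.
llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Sliding window attention lets a model"], params)
print(outputs[0].outputs[0].text)
```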
Figure 1: Sliding Window Attention. Each token can attend to at most W tokens from the previous layer (here, W = 3). Note that tokens
outside the sliding window still influence next word prediction. At each attention layer, information can move
forward by W tokens. Hence, after k attention layers, information can move forward by up to k × W tokens.
Mistral 7B is based on a transformer architecture [27]. The main parameters of the architecture are
summarized in Table 1. Compared to Llama, it introduces a few changes that we summarize below.

Table 1: Model architecture.
Parameter    Value
dim          4096
n_layers     32
head_dim     128
hidden_dim   14336

Sliding Window Attention. SWA exploits the stacked layers of a transformer to attend information
beyond the window size W. The hidden state in position i of layer k attends to all hidden states from
the previous layer with positions between i − W and i; recursively, it can thus access tokens from the
input layer at a distance of up to k × W tokens, as illustrated in Figure 1. At the last layer, with a
window size of W = 4096, this yields a theoretical attention span of approximately 131K tokens.
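As a concrete illustration of the attention pattern only (a minimal sketch under our own names; the actual implementation relies on optimized FlashAttention and xFormers kernels rather than an explicit mask):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: position i attends to positions j with
    # i - window < j <= i, i.e. causal attention over at most `window` tokens.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# Each row i marks the at-most-3 positions (including itself) token i attends to.
print(sliding_window_mask(5, 3).int())
```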
Rolling Buffer Cache. A fixed attention span means that we can limit our cache size using a rolling
buffer cache. On a sequence length of 32k tokens, this reduces the cache memory usage by 8x (the
cache holds W = 4096 entries rather than 32k), without impacting the model quality.
Figure 2: Rolling buffer cache. The cache has a fixed size of W = 4. Keys and values for position i are stored
in position i mod W of the cache. When the position i is larger than W , past values in the cache are overwritten.
The hidden states corresponding to the latest generated tokens are colored in orange.
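A minimal sketch of this rolling buffer, assuming per-position key/value tensors (the class name and shapes are illustrative, not the reference code):

```python
import torch

class RollingKVCache:
    """Fixed-size KV cache: the entry for position i lives at slot i mod W."""

    def __init__(self, window: int, n_kv_heads: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(window, n_kv_heads, head_dim)
        self.v = torch.zeros(window, n_kv_heads, head_dim)

    def store(self, pos: int, k_new: torch.Tensor, v_new: torch.Tensor):
        slot = pos % self.window  # once pos >= window, older entries are overwritten
        self.k[slot] = k_new
        self.v[slot] = v_new
```

Because the buffer never grows past W entries, memory stays constant regardless of sequence length, which is what enables the 8x cache reduction quoted above.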
Pre-fill and Chunking. When generating a sequence, we need to predict tokens one-by-one, as
each token is conditioned on the previous ones. However, the prompt is known in advance, and we
can pre-fill the (k, v) cache with the prompt. If the prompt is very large, we can chunk it into smaller
pieces and pre-fill the cache with each chunk, selecting the window size as the chunk size.
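A sketch of chunked pre-fill under these assumptions (`forward_with_cache` is a hypothetical model interface used only for illustration, not the actual API):

```python
def prefill(model, cache, prompt_tokens, window):
    """Feed the prompt through the model one window-sized chunk at a time,
    filling the KV cache before any new token is generated."""
    logits = None
    for start in range(0, len(prompt_tokens), window):
        chunk = prompt_tokens[start:start + window]
        # Each chunk attends to the cached previous chunk (sliding window)
        # and to itself under a causal mask, as Figure 3 illustrates.
        logits = model.forward_with_cache(chunk, cache)
    return logits  # logits for the last prompt token seed generation
```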
Figure 3: Pre-fill and chunking. The current chunk attends to itself through a causal mask (right block), attends to the cache through a sliding window (center block), and does not attend to tokens outside the sliding window (left block).
3 Results
We compare Mistral 7B to Llama, and re-run all benchmarks with our own evaluation pipeline for
fair comparison. We measure performance on a wide variety of tasks categorized as follows:
• Commonsense Reasoning (0-shot): Hellaswag [28], Winogrande [21], PIQA [4], SIQA [22],
OpenbookQA [19], ARC-Easy, ARC-Challenge [9], CommonsenseQA [24]
• World Knowledge (5-shot): NaturalQuestions [16], TriviaQA [15]
• Reading Comprehension (0-shot): BoolQ [8], QuAC [7]
• Math: GSM8K [10] (8-shot) with maj@8 and MATH [13] (4-shot) with maj@4 (majority voting over sampled answers; see the sketch after this list)
• Code: HumanEval [5] (0-shot) and MBPP [2] (3-shot)
• Popular aggregated results: MMLU [12] (5-shot), BBH [23] (3-shot), and AGI Eval [29]
(3-5-shot, English multiple-choice questions only)
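The maj@k metric used above is plain majority voting over k sampled solutions; a minimal sketch (the answer strings are illustrative, not real benchmark data):

```python
from collections import Counter

def maj_at_k(final_answers):
    # Score the most frequent final answer among the k sampled solutions.
    return Counter(final_answers).most_common(1)[0][0]

# e.g. maj@8 on one GSM8K problem: 8 sampled final answers
print(maj_at_k(["42", "42", "41", "42", "7", "42", "41", "42"]))  # -> "42"
```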
Detailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2.
Model          MMLU   HellaSwag  WinoG  PIQA   Arc-e  Arc-c  NQ     TriviaQA  HumanEval  MBPP   MATH   GSM8K
Code-Llama 7B  …      …          …      …      …4%    34.5%  11.0%  34.9%     31.1%      52.5%  5.2%   20.8%
Mistral 7B     60.1%  81.3%      75.3%  83.0%  80.0%  55.5%  28.8%  69.9%     30.5%      47.5%  13.1%  52.2%
Table 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Llama 2 13B on all metrics, and
approaches the code performance of Code-Llama 7B without sacrificing performance on non-code benchmarks.
Size and Efficiency. We computed “equivalent model sizes” of the Llama 2 family, aiming to
understand Mistral 7B models’ efficiency in the cost-performance spectrum (see Figure 5). When
evaluated on reasoning, comprehension, and STEM reasoning (specifically MMLU), Mistral 7B
mirrored performance that one might expect from a Llama 2 model with more than 3x its size. On
the Knowledge benchmarks, Mistral 7B’s performance achieves a lower compression rate of 1.9x,
which is likely due to its limited parameter count that restricts the amount of knowledge it can store.
Mistral 7B – Instruct exhibits superior performance compared to all 7B models on MT-Bench, and is
comparable to 13B – Chat models.

Table 3: Comparison of Chat models. Mistral 7B – Instruct outperforms all 7B models on MT-Bench,
and is comparable to 13B – Chat models.
Model        Chatbot Arena ELO Rating  MT Bench
…            997                       6.17
Alpaca 13B   914                       4.53

An independent human evaluation was conducted on https://llmboxing.com/leaderboard.
In this evaluation, participants were provided with a set of questions along with anonymous responses
from two models and were asked to select their preferred response, as illustrated in Figure 6. As of
October 6, 2023, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143
times for Llama 2 13B.
Figure 5: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension for Mistral 7B and Llama 2 (7B/13B/70B).
We introduce a system prompt to guide the model to generate answers within specified guardrails:
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful,
unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.

We use a set of 175 unsafe prompts for evaluating safety. With the recommended system prompt,
the model properly declines to answer 100% of the harmful questions.

Table 4: System prompts. Mean official MT Bench score over 10 iterations with standard deviation.
Guardrails              MT Bench
No system prompt        6.84 ± 0.07
Llama 2 system prompt   6.38 ± 0.07
Mistral system prompt   6.58 ± 0.05

As an illustration, we provide in Table 5 the answers of both Mistral 7B – Instruct and Llama 2
Chat 13B to the question "How to kill a linux process" with system prompts activated. We observe
that Mistral 7B provides a correct response, while Llama 2 declines to answer.
Mistral 7B – Instruct with Mistral system prompt: …to terminate. Additionally, it’s generally a good
idea to use the `kill` command with caution, as it can be used to terminate critical system processes
that are necessary for the operation of the system.

Llama 2 13B – Chat with Llama 2 system prompt: As a helpful and respectful assistant, I cannot
provide information on how to kill a Linux process. It is not appropriate or ethical to intentionally
harm or terminate processes, as it can cause damage to the system and potentially result in data loss.
Additionally, it is important to respect the autonomy and security of other users’ processes, as they
may be critical to their work or personal activities. Instead, I suggest that you explore other options
to address any issues or concerns you may have.

Table 5: Comparison of the answers of Mistral 7B – Instruct and Llama 2 13B – Chat with system
prompts activated. Mistral 7B answers the question, while Llama 2 13B – Chat declines.
…much remains to be explored to obtain the best performance with the smallest possible model.
Acknowledgements
We are grateful to CoreWeave for their 24/7 help in marshalling our cluster. We thank the
CINECA/EuroHPC team, and in particular the operators of Leonardo, for their resources and help.
We thank the maintainers of FlashAttention, vLLM, xFormers, and SkyPilot for their precious assistance
in implementing new features and integrating their solutions into ours. A huge thanks to Tri Dao
and Daniel Haziza for helping include Mistral-related changes to FlashAttention and xFormers on
a tight schedule. We thank the teams of Hugging Face, AWS, GCP, and Azure ML for their intense help
in making our model compatible everywhere.
Figure 6: Human evaluation of Mistral 7B – Instruct vs Llama 2 13B – Chat. An example of
human evaluation from llmboxing.com.
[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large
language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with
sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[7] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and
Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036,
2018.
[8] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and
Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions.
arXiv preprint arXiv:1905.10044, 2019.
[9] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning
challenge. arXiv preprint arXiv:1803.05457, 2018.
[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom
Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia
Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent
Sifre. An empirical analysis of compute-optimal large language model training. In Advances in
Neural Information Processing Systems, volume 35, 2022.
[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large
scale distantly supervised challenge dataset for reading comprehension. arXiv preprint
arXiv:1705.03551, 2017.
[16] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a
benchmark for question answering research. Transactions of the Association for Computational
Linguistics, 2019.
[21] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An
adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106,
2021.
[22] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa:
Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
[23] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won
Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei.
Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint
arXiv:2210.09261, 2022.
[24] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A
question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937,
2018.
[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama:
Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.