Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel,
Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux,
Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix,
William El Sayed
arXiv:2310.06825v1 [cs.CL] 10 Oct 2023
Abstract
We introduce Mistral 7B, a 7-billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – Chat both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Mistral 7B approaches the coding performance of Code-Llama 7B [20],
without sacrificing performance on non-code related benchmarks.
Mistral 7B leverages grouped-query attention (GQA) [1], and sliding window attention (SWA) [6, 3].
GQA significantly accelerates inference and reduces the memory requirement during
decoding, allowing for higher batch sizes and hence higher throughput, a crucial factor for real-time
applications. In addition, SWA is designed to handle longer sequences more effectively at a reduced
computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms
collectively contribute to the enhanced performance and efficiency of Mistral 7B.
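To make the mechanism concrete, here is a minimal sketch of grouped-query attention: each key/value head is shared by a group of query heads, so the KV cache stores only n_kv_heads heads instead of n_heads. This is an illustrative sketch, not the reference implementation; tensor names and shapes are our own, and the causal mask is omitted for brevity.

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_heads // n_kv_heads  # query heads sharing one KV head
    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group, dim=1)  # (batch, n_heads, seq, head_dim)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```

Because only n_kv_heads key/value heads are cached, the decoding cache shrinks by a factor of n_heads / n_kv_heads relative to standard multi-head attention, which is where the throughput gain comes from.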
Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference
implementation (https://github.com/mistralai/mistral-src) facilitating easy deployment either locally
or on cloud platforms such as AWS, GCP, or Azure, using the vLLM inference server and SkyPilot
(https://github.com/skypilot-org/skypilot). Integration with Hugging Face
(https://huggingface.co/mistralai) is also streamlined.
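As an illustration, a minimal sketch of local inference through vLLM, assuming the publicly released Hugging Face weights (the repository id and sampling values below are illustrative assumptions, not prescribed by the paper):

```python
from vllm import LLM, SamplingParams

# Load the released weights (assumed Hugging Face repo id) and sample a completion.
llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Sliding window attention lets a model"], params)
print(outputs[0].outputs[0].text)
```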
Figure 1: Sliding Window Attention. Each token can attend to at most W tokens from the previous layer (here, W = 3). Note that tokens
outside the sliding window still influence next word prediction. At each attention layer, information can move
forward by W tokens. Hence, after k attention layers, information can move forward by up to k × W tokens.
Mistral 7B is based on a transformer architecture [27]. The main parameters of the architecture are
summarized in Table 1. Compared to Llama, it introduces a few changes that we summarize below.

Table 1: Model architecture.
Parameter    Value
dim          4096
n_layers     32
head_dim     128
hidden_dim   14336

Sliding Window Attention. SWA exploits the stacked layers of a transformer to attend information
beyond the window size W. The hidden state in position i of layer k attends to all hidden states from
the previous layer with positions between i − W and i; recursively, it can thus access tokens from the
input layer at a distance of up to k × W tokens, as illustrated in Figure 1. At the last layer, with a
window size of W = 4096, this yields a theoretical attention span of approximately 131K tokens.
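As a concrete illustration of the attention pattern only (a minimal sketch under our own names; the actual implementation relies on optimized FlashAttention and xFormers kernels rather than an explicit mask):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: position i attends to positions j with
    # i - window < j <= i, i.e. causal attention over at most `window` tokens.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# Each row i marks the at-most-3 positions (including itself) token i attends to.
print(sliding_window_mask(5, 3).int())
```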
Rolling Buffer Cache. A fixed attention span means that we can limit our cache size using a rolling
buffer cache. On a sequence length of 32k tokens, this reduces the cache memory usage by 8x (the
cache holds W = 4096 entries rather than 32k), without impacting the model quality.
Figure 2: Rolling buffer cache. The cache has a fixed size of W = 4. Keys and values for position i are stored
in position i mod W of the cache. When the position i is larger than W , past values in the cache are overwritten.
The hidden states corresponding to the latest generated tokens are colored in orange.
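A minimal sketch of this rolling buffer, assuming per-position key/value tensors (the class name and shapes are illustrative, not the reference code):

```python
import torch

class RollingKVCache:
    """Fixed-size KV cache: the entry for position i lives at slot i mod W."""

    def __init__(self, window: int, n_kv_heads: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(window, n_kv_heads, head_dim)
        self.v = torch.zeros(window, n_kv_heads, head_dim)

    def store(self, pos: int, k_new: torch.Tensor, v_new: torch.Tensor):
        slot = pos % self.window  # once pos >= window, older entries are overwritten
        self.k[slot] = k_new
        self.v[slot] = v_new
```

Because the buffer never grows past W entries, memory stays constant regardless of sequence length, which is what enables the 8x cache reduction quoted above.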
Pre-fill and Chunking. When generating a sequence, we need to predict tokens one-by-one, as
each token is conditioned on the previous ones. However, the prompt is known in advance, and we
can pre-fill the (k, v) cache with the prompt. If the prompt is very large, we can chunk it into smaller
pieces and pre-fill the cache with each chunk, selecting the window size as the chunk size.
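A sketch of chunked pre-fill under these assumptions (`forward_with_cache` is a hypothetical model interface used only for illustration, not the actual API):

```python
def prefill(model, cache, prompt_tokens, window):
    """Feed the prompt through the model one window-sized chunk at a time,
    filling the KV cache before any new token is generated."""
    logits = None
    for start in range(0, len(prompt_tokens), window):
        chunk = prompt_tokens[start:start + window]
        # Each chunk attends to the cached previous chunk (sliding window)
        # and to itself under a causal mask, as Figure 3 illustrates.
        logits = model.forward_with_cache(chunk, cache)
    return logits  # logits for the last prompt token seed generation
```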
Figure 3: Pre-fill and chunking. The current chunk attends to itself through a causal mask (right block), attends to the cache through a sliding window (center block), and does not attend to tokens outside the sliding window (left block).
3 Results
We compare Mistral 7B to Llama, and re-run all benchmarks with our own evaluation pipeline for
fair comparison. We measure performance on a wide variety of tasks categorized as follows:
• Commonsense Reasoning (0-shot): Hellaswag [28], Winogrande [21], PIQA [4], SIQA [22],
OpenbookQA [19], ARC-Easy, ARC-Challenge [9], CommonsenseQA [24]
• World Knowledge (5-shot): NaturalQuestions [16], TriviaQA [15]
• Reading Comprehension (0-shot): BoolQ [8], QuAC [7]
• Math: GSM8K [10] (8-shot) with maj@8 and MATH [13] (4-shot) with maj@4 (majority voting over sampled answers; see the sketch after this list)
• Code: HumanEval [5] (0-shot) and MBPP [2] (3-shot)
• Popular aggregated results: MMLU [12] (5-shot), BBH [23] (3-shot), and AGI Eval [29]
(3-5-shot, English multiple-choice questions only)
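The maj@k metric used above is plain majority voting over k sampled solutions; a minimal sketch (the answer strings are illustrative, not real benchmark data):

```python
from collections import Counter

def maj_at_k(final_answers):
    # Score the most frequent final answer among the k sampled solutions.
    return Counter(final_answers).most_common(1)[0][0]

# e.g. maj@8 on one GSM8K problem: 8 sampled final answers
print(maj_at_k(["42", "42", "41", "42", "7", "42", "41", "42"]))  # -> "42"
```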
Detailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2.
Model          MMLU   HellaSwag  WinoG  PIQA   Arc-e  Arc-c  NQ     TriviaQA  HumanEval  MBPP   MATH   GSM8K
Code-Llama 7B  …      …          …      …      …4%    34.5%  11.0%  34.9%     31.1%      52.5%  5.2%   20.8%
Mistral 7B     60.1%  81.3%      75.3%  83.0%  80.0%  55.5%  28.8%  69.9%     30.5%      47.5%  13.1%  52.2%
Table 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Llama 2 13B on all metrics, and
approaches the code performance of Code-Llama 7B without sacrificing performance on non-code benchmarks.
Size and Efficiency. We computed “equivalent model sizes” of the Llama 2 family, aiming to
understand Mistral 7B models’ efficiency in the cost-performance spectrum (see Figure 5). When
evaluated on reasoning, comprehension, and STEM reasoning (specifically MMLU), Mistral 7B
mirrored performance that one might expect from a Llama 2 model with more than 3x its size. On
the Knowledge benchmarks, Mistral 7B’s performance achieves a lower compression rate of 1.9x,
which is likely due to its limited parameter count that restricts the amount of knowledge it can store.
Mistral 7B – Instruct exhibits superior performance compared to all 7B models on MT-Bench, and is
comparable to 13B – Chat models.

Table 3: Comparison of Chat models. Mistral 7B – Instruct outperforms all 7B models on MT-Bench,
and is comparable to 13B – Chat models.
Model        Chatbot Arena ELO Rating  MT Bench
…            997                       6.17
Alpaca 13B   914                       4.53

An independent human evaluation was conducted on https://llmboxing.com/leaderboard.
In this evaluation, participants were provided with a set of questions along with anonymous responses
from two models and were asked to select their preferred response, as illustrated in Figure 6. As of
October 6, 2023, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143
times for Llama 2 13B.
Figure 5: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension for Mistral 7B and Llama 2 (7B/13B/70B).
We introduce a system prompt to guide the model to generate answers within specified guardrails:
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful,
unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.

We use a set of 175 unsafe prompts for evaluating safety. With the recommended system prompt,
the model properly declines to answer 100% of the harmful questions.

Table 4: System prompts. Mean official MT Bench score over 10 iterations with standard deviation.
Guardrails              MT Bench
No system prompt        6.84 ± 0.07
Llama 2 system prompt   6.38 ± 0.07
Mistral system prompt   6.58 ± 0.05

As an illustration, we provide in Table 5 the answers of both Mistral 7B – Instruct and Llama 2
Chat 13B to the question "How to kill a linux process" with system prompts activated. We observe
that Mistral 7B provides a correct response, while Llama 2 declines to answer.
Mistral 7B – Instruct with Mistral system prompt: …to terminate. Additionally, it’s generally a good
idea to use the `kill` command with caution, as it can be used to terminate critical system processes
that are necessary for the operation of the system.

Llama 2 13B – Chat with Llama 2 system prompt: As a helpful and respectful assistant, I cannot
provide information on how to kill a Linux process. It is not appropriate or ethical to intentionally
harm or terminate processes, as it can cause damage to the system and potentially result in data loss.
Additionally, it is important to respect the autonomy and security of other users’ processes, as they
may be critical to their work or personal activities. Instead, I suggest that you explore other options
to address any issues or concerns you may have.

Table 5: Comparison of the answers of Mistral 7B – Instruct and Llama 2 13B – Chat with system
prompts activated. Mistral 7B answers the question, while Llama 2 13B – Chat declines.
…much remains to be explored to obtain the best performance with the smallest possible model.
Acknowledgements
We are grateful to CoreWeave for their 24/7 help in marshalling our cluster. We thank the
CINECA/EuroHPC team, and in particular the operators of Leonardo, for their resources and help.
We thank the maintainers of FlashAttention, vLLM, xFormers, and SkyPilot for their precious assistance
in implementing new features and integrating their solutions into ours. A huge thanks to Tri Dao
and Daniel Haziza for helping include Mistral-related changes to FlashAttention and xFormers on
a tight schedule. We thank the teams of Hugging Face, AWS, GCP, and Azure ML for their intense help
in making our model compatible everywhere.
Figure 6: Human evaluation of Mistral 7B – Instruct vs Llama 2 13B – Chat. An example of
human evaluation from llmboxing.com.
[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large
language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with
sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[7] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and
Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036,
2018.
[8] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and
Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions.
arXiv preprint arXiv:1905.10044, 2019.
[9] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning
challenge. arXiv preprint arXiv:1803.05457, 2018.
[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom
Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia
Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent
Sifre. An empirical analysis of compute-optimal large language model training. In Advances in
Neural Information Processing Systems, volume 35, 2022.
[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large
scale distantly supervised challenge dataset for reading comprehension. arXiv preprint
arXiv:1705.03551, 2017.
[16] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a
benchmark for question answering research. Transactions of the Association for Computational
Linguistics, 2019.
[21] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An
adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106,
2021.
[22] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa:
Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
[23] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won
Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei.
Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint
arXiv:2210.09261, 2022.
[24] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A
question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937,
2018.
[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama:
Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.