Paper Content
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron∗, Thibaut Lavril∗, Gautier Izacard∗, Xavier Martinet
Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal
Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin
Edouard Grave∗, Guillaume Lample∗
Meta AI
arXiv:2302.13971v1 [cs.CL] 27 Feb 2023
Abstract
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B
parameters. We train our models on trillions of tokens, and show that it is possible to train
sta
performance, a smaller one trained longer will ultimately be cheaper at inference. For
instance, although Hoffmann et al. (2022) recommend training a 10B model on 200B tokens,
we find that the performance of a 7B model continues to
few examples (Brown et al., 2020). These few-shot properties first appeared when scaling
models to a sufficient size (Kaplan et al., 2020), resulting in a line of work that focuses
on further scaling these models (Chowdhery et al., 2022; Rae et al., 2021). These efforts
are based on the assumption that more parameters will lead to better performance. However,
recent work from Hoffmann et al. (2022)
run on a single GPU.
At the higher-end of the scale, our 65B-parameter model is also competitive with the best
large language models such as Chinchilla or PaLM-540B.
Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work
compatible with open-sourcing, while most existing
fastest at inference, and although it may be cheaper to train a large model to reach a
certain level of
rt the performance of our models and compare with other LLMs on a set of standard
benchmarks. Finally, we expose some of the biases and toxicity encoded in our models, using
some of the most recent benchmarks from the responsible AI community.
∗ Equal contribution. Correspondence: {htouvron, thibautlav, gizacard, egrave, glample}@meta.com
1 https://github.com/facebookresearch/llama
2 Approach
Our training approach is similar to the methods described in previous work (Brown et al., 20
Dataset      Sampling prop.  Epochs  Disk size
CommonCrawl  67.0%           1.10    3.3 TB
ng mixture of data and the percentage they represent in the training set:
English CommonCrawl [67%]. We preprocess five CommonCrawl dumps, ranging from 2017 to 2020,
with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the
line level, performs language identification with a fastText linear classifier to remove
non-English pages and filters low quality content wi
languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr,
hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks,
comments and other formatting boilerplate.
Gutenberg and Books3 [4.5%]. We include two book corpora in our training dataset: the
Gutenberg Project, which contains books that are in the
heuristics such as presence of punctuation marks or the number of words and sentences in a
webpage.
GitHub [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept
projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we
filtered low quality files with heuristics based on the line length or proportion of
alphanumeric characters,
Stack Exchange [2%]. We include a dump of Stack Exchange, a website of high quality
questions and answers that covers a diverse set of domains, ranging from computer science to
chemistry. We kept the data from the 28 largest websites, removed the HTML tags from text
and sorted the answers by score (from highest to lowest).
Tokenizer. We tokeni
fter tokenization. For most of our training data, each token is used only once during
training, with the exception of the Wikipedia and Books domains, over which we perform
approximately two epochs.
[Figure: training loss (y-axis) against billions of tokens (x-axis); legend entries LLaMA 13B, LLaMA 33B, LLaMA 65B.]
2.2 Architecture
Following recent work on large language models, our network is based on the transformer
architecture (Vaswani et al., 2017). We leverage various
improve the performance. We use a dimension of $\frac{2}{3}4d$ instead of $4d$ as in PaLM.
Rotary Embeddings [GPTNeo]. We remove the absolute positional embeddings, and instead, add
rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the
network.
The details of the hyper-parameters for our dif-
f
he causal multi-head attention to reduce memory usage and runtime. This implementation,
available in the xformers library,2 is inspired by Rabe and Staats (2021) and uses the
backward from Dao et al. (2022). This is achieved by not storing the attention weights and
not computing the key/query scores that are masked due to the causal nature of the language
modeling task.
To further improve training efficiency, we re-
- -
Chinchilla  70B   83.7  81.8  51.3  80.8  74.9  -     -     -
PaLM        62B   84.8  80.5  -     79.7  77.0  75.2  52.5  50.4
PaLM-cont   62B   83.9  81.4  -     80.6  77.0  -     -     -
PaLM        540B  88.0  82.3  -     83.4  81.1  76.6  53.0  53.4
LLaMA       7B    76.5  79.8  48.9  76.1  70.1  72.8  47.6  57.2
LLaMA       13B   78.1  80.1  50.4  79.2  73.0  74.8  52.7  56.4
LLaMA       33B   83.1  82.3  50.4  82.8  76.0  80.0  57.8  58.6
LLaMA       65B   85.3  82.8  52.3  84.2  77.0  78.9  56.0  60.2
Table 3: Zero-shot performance on Common Sense Reasoni
3 Main results
Following previous work (Brown et al., 2020), we consider zero-shot and few-shot tasks, and
report results on a total of 20 benchmarks:
• Zero-shot. We provide a textual description of the task and a test example. The model
  either provides an answer using open-ended generation, or ranks the proposed answers.
• Few-shot. We provide a few examples of the
hood of the completion given “Answer:” as context:
$P(\text{completion} \mid \text{context}) / P(\text{completion} \mid \text{“Answer:”})$.
                  0-shot  1-shot  5-shot  64-shot
GPT-3       175B  14.6    23.0    -       29.9
Gopher      280B  10.1    -       24.5    28.2
Chinchilla  70B   16.6    -       31.5    35.5
PaLM        8B    8.4     10.6    -       14.6
, 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi
et al., 2021), ARC easy and challenge (Clark et al., 2018) and OpenBookQA (Mihaylov et al.,
2018). These datasets include Cloze and Winograd style tasks, as well as multiple choice
question answering. We evaluate in the zero-shot setting as done in the language modeling
community.
In Table 3, we compare with existing models of various sizes and re
             RACE-middle  RACE-high
GPT-3  175B  58.4         45.5
PaLM   8B    57.9         42.3
PaLM   62B   64.3         47.5
PaLM   540B  68.1         49.1
       7B    61.1         46.9
       13B   61.6         47.2
evidence to answer the question. In Table 4, we report performance on NaturalQuestions, and
in Table 5, we report on TriviaQA. On both benchmarks, LLaMA-65B achieves state-of-the-art
performance in the zero-shot and few-shot settings. More importantly, LLaMA-13B is also
competitive on these benchmarks with GPT-3 and Chinchilla, despite being 5-10× smaller.
This model runs on a
soning benchmarks: MATH (Hendrycks et al., 2021) and GSM8k (Cobbe et al., 2021). MATH is a
dataset of 12K middle school and high school mathematics problems written in LaTeX. GSM8k
is a set of middle school mathematical problems. In Table 7, we compare with PaLM and
Minerva (Lewkowycz et al., 2022). Minerva is a series of PaLM models finetuned on 38.5B
tokens ex-
3.3 Reading Comprehension
We evaluate our models on the RACE reading comprehension benchmark (Lai et al., 2017). This
dataset was collected from English reading comprehension exams designed for middle and high
the model receives a description of the program in a few sentences, as well as a few
input-output examples. In HumanEval, it also receives a function signature, and the prompt
is formatted as natural code with the textual description and tests in a
            MATH  +maj1@k  GSM8k  +maj1@k
PaLM  8B    1.5   -        4.1    -
PaLM  62B   4.4   -        33.0   -
PaLM  540B  8.8   -
       Params  HumanEval     MBPP
pass@          @1     @100   @1     @80
LaMDA  137B    14.0   47.3   14.8   62.4
in Chowdhery et al. (2022).
TH and k = 40 for GSM8k). LLaMA-65B outperforms Minerva 62B on GSM8k, although it has not
been fine-tuned on mathematical data.
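The +maj1@k columns report accuracy after majority voting over k sampled answers per problem. A minimal sketch of that scoring, assuming each generation has already been reduced to a final-answer string; the answer-extraction step is not shown, and the function names are hypothetical:

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled generations.

    Ties go to the answer that was sampled first, because Counter keeps
    equal counts in first-seen order.
    """
    return Counter(answers).most_common(1)[0][0]


def maj1_at_k(samples_per_problem, references):
    """Fraction of problems whose majority-voted answer equals the reference."""
    correct = sum(
        majority_vote(samples) == reference
        for samples, reference in zip(samples_per_problem, references)
    )
    return correct / len(references)
```

For GSM8k, each inner list would hold k = 40 sampled answers.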
docstring. The model needs to generate a Python program that fits the description and
satisfies the test cases. In Table 8, we compare the pass@1 scores of our models with
existing language models that have not been finetuned on code, namely PaLM and LaMDA
(Thoppilan et al., 2022). PaLM and LLaMA were trained on datasets th
3.6 Massive Multitask Language Understanding
The massive multitask language understanding benchmark, or MMLU, introduced by Hendrycks
et al. (2020) consists of multiple choice questions covering various domains of knowledge,
including humanities, STEM and social sciences. We
lain why Gopher outperforms GPT-3 on this benchmark, while it is comparable on other
benchmarks.
metrics were obtained with temperature 0.8. We use the same method as Chen et al. (2021) to
obtain unbiased estimates of the pass@k.
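For one problem with n generated samples of which c pass the tests, the estimator of Chen et al. (2021) is 1 − C(n−c, k)/C(n, k), averaged over problems. A minimal sketch follows; the function name is an assumption, and the number of samples n used in this paper is not stated in the surrounding text:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: evaluation budget, with k <= n
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from one.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Computing the ratio of binomial coefficients directly is exact with Python integers; implementations that work in floating point usually expand it as a running product to avoid overflow.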
It is possible to improve the performance on code by finetuning on code-specific tokens.
For instance, PaLM-Coder (Chowdhery et al., 2022) increases the pass@1 score of PaLM on
HumanEval from 26.2% for PaLM to 36%. Other models trained specifically for code also
perform better than gen-
3.7 Evolution of performance during training
During training, we tracked the performance of our models on a few question answering and
common sense benchmarks, and report them in Figure 2. On most benchmarks, the performance
improves
51.7 72.9 67.4 63.4
Table 9: Massive Multitask Language Understanding (MMLU). Five-shot accuracy.
that may indicate that this benchmark is not reliable. On WinoGrande, the performance does
not correlate as well with training perplexity: the LLaMA-33B and LLaMA-65B have similar
performance during the training.
4 Instruction Finetuning
In Table 10, we report the results of our instruct model LLaMA-I on MMLU and compare with
existing instruction finetuned models of moderate sizes, namely, OPT-IML (Iyer et al.,
2022) and the Flan-PaLM series (Chung et al., 2022). All the reported numbers are from the
corresponding papers. Despite the simplicity of the instruct
PaLM            62B  55.1
PaLM-cont       62B  62.8
Chinchilla      70B  67.5
LLaMA           65B  63.4
OPT-IML-Max     30B  43.2
Flan-T5-XXL     11B  55.1
Flan-PaLM       62B  59.6
Flan-PaLM-cont  62B  66.1
the training data (Sheng et al., 2019; Kurita et al., 2019), and to generate toxic or
offensive content (Gehman et al., 2020). As our training dataset contains a large proportion
of data from the Web, we believe that it is crucial to determine the potential for our
models to generate such content. To understand the potential harm of LLaMA-65B, we evaluate
on different benchmarks that measure toxic content pro
[Figure: accuracy during training; panels shown include WinoGrande and PIQA; legend entries LLaMA 7B, LLaMA 13B, LLaMA 33B, LLaMA 65B, Chinchilla.]
is their model. RealToxicityPrompts consists of about 100k prompts that the model must
complete; then a toxicity score is automatically evaluated by making a request to
PerspectiveAPI 3. We do not have control over the pipeline used by the third-party
PerspectiveAPI, making comparison with previous models difficult.
For each of the 100k prompts, we greedily generate with our models, and measure their
toxicity score. The score per prompt ranges from 0 (non-toxic) to 1 (toxic). In
“respectful” versions are prompts starting with “Complete the following sentence in a
polite, respectful, and unbiased manner:”, and “Basic” is without it. Scores were obtained
using the PerspectiveAPI, with higher score indicating more toxic generations.
                     70.6  62.6  65.7
Religion             79.0  73.3  68.6
Race/Color           57.0  64.7  68.6
Sexual orientation   81.0  76.2  78.6
Age                  70.1  64.4  67.8
Nationality          64.2  61.6  62.9
Disability           66.7  76.7  76.7
Physical appearance  77.8  74.6  76.2
have been captured by the model. For example, a sentence in the WinoGender dataset is “The
nurse notified the patient that his shift would be ending in an hour.”, which is followed
by ‘His’ refers to. We then compare the perplexity of the continuations the nurse and the
patient to perform co-reference resolution with the model. We evaluate the performance when
using 3
, disability, physical appearance and socioeconomic status. Each example is composed of a
stereotype and an anti-stereotype; we measure the model preference for the stereotypical
sentence using the perplexity of both sentences in a zero-shot setting. Higher scores thus
indicate higher bias. We compare with GPT-3 and OPT-175B in Table 12.
LLaMA compares slightly favorably to both
is probably using the majority gender of the occupation to perform co-reference resolution,
instead of using the evidence of the sentence.
To further investigate this hypothesis, we look at the set of “gotcha” cases for the
“her/her/she” and “his/him/he” pronouns in the WinoGender dataset. These cases correspond to
sentences in which the pronoun does n
is impacted by the gender of the pronoun.
More precisely, each sentence has three mentions: an “occupation”, a “participant”, and a
“pronoun” where the pronoun is co-referencing either the occupation or participant. We
prompt the model to determine the co-reference relation and measure if it does so correctly
according to
7B 13B 33B 65
et al. (2021) consider the definition of “true” in the sense of “literal truth about the
real world”, and not claims that are only true in the context of a belief system or
tradition. This benchmark can evaluate the risks of a model to generate misinformation or
false claims. The questions are written in diverse style, cover 38 categories and are
designed to be adversarial.
one” pronouns than on “her/her/she” and “his/him/he”, which is likely indicative of biases.
             Truthful  Truthful*Inf
GPT-3  1.3B  0.31      0.19
GPT-3  6B    0.22      0.19
GPT-3  175B  0.28      0.25
LLaMA  7B    0.33      0.29
LLaMA  13B   0.47      0.41
LLaMA  33B   0.52      0.48
LLaMA  65B
llowing formula for the tons of carbon emissions:
$\mathrm{tCO_2eq} = \mathrm{MWh} \times 0.385$.
We apply the same formula to OPT and BLOOM for fair comparison. For OPT, we assume training
required 34 days on 992 A100-80GB (see their logs4). Finally, we estimate that we used 2048
A100-80GB for a period of approximately 5 months to develop our models. This means that
developing these models would have cost around 2,638 MWh under our
cinate incorrect answers.
6 Carbon footprint
The training of our models has consumed a massive quantity of energy, responsible for the
emission of carbon dioxide. We follow the recent literature on the subject and break down
both the total energy consumption and the resulting carbon footprint in Table 15. We follow
a formula from Wu et al. (2022)
ntelligence by using language through the “imitation game”, language modeling has been
proposed as a benchmark to measure progress toward artificial intelligence (Mahoney, 1999).
Architecture. Traditionally, language models were based on n-gram count statistics (Bahl
et al., 1983), and various smoothing techniques were proposed to improve the estimation of
rare
g different models in the same data center. We follow Wu et al. (2022) to compute carbon
emission of training OPT, BLOOM and our models in the same data center. For the power
consumption of an A100-80GB, we take the thermal design power for NVLink systems, that is
400W. We take a PUE of 1.1 and a carbon intensity factor set at the national US average of
0.385 kg CO2e per kWh.
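As a worked example, the constants above can be combined for the OPT assumption of 34 days on 992 GPUs. This is a sketch under one stated assumption: energy is taken to be GPU-hours × per-GPU power × PUE. The function and constant names are hypothetical.

```python
GPU_POWER_W = 400        # thermal design power of an A100-80GB (NVLink)
PUE = 1.1                # power usage effectiveness of the data center
KG_CO2E_PER_KWH = 0.385  # US national average carbon intensity


def energy_mwh(gpu_hours):
    """Energy in MWh, assuming GPU-hours x GPU power x PUE."""
    watt_hours = gpu_hours * GPU_POWER_W * PUE
    return watt_hours / 1_000_000


def emissions_tco2eq(mwh):
    """Tons of CO2 equivalent: tCO2eq = MWh x 0.385.

    kg CO2e per kWh is numerically equal to t CO2e per MWh.
    """
    return mwh * KG_CO2E_PER_KWH


# OPT assumption from the text: 34 days on 992 A100-80GB GPUs.
opt_gpu_hours = 992 * 34 * 24
opt_mwh = energy_mwh(opt_gpu_hours)
opt_tco2eq = emissions_tco2eq(opt_mwh)
```

With these inputs, `opt_gpu_hours` is 809,472, `opt_mwh` is about 356 MWh and `opt_tco2eq` is about 137 tCO2eq.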
starting from feed forward models (Bengio et al., 2000), recurrent neural networks (Elman,
1990; Mikolov et al., 2010) and LSTMs (Hochreiter and Schmidhuber, 1997; Graves, 2013).
More recently,
2022), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et
al., 2022), OPT (Zhang et al., 2022), and GLM (Zeng et al., 2022). Hestness et al. (2017)
and Rosenfeld et al.
Kneser-Ney smoothing to Web-scale data. This made it possible to train a 5-gram model on
975 billion tokens from CommonCrawl, resulting in a model with 500 billion n-grams (Buck
et al., 2014). Chelba et al. (2013) introduced the One Billion Word benchmark, a large
scale training dataset to measure the progress of language models.
In the context of neural language models, Joze-
series of language models that are released openly, and competitive with state-of-the-art
foundation models. Most notably, LLaMA-13B outperforms GPT-3 while being more than 10×
smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. Unlike previous
studies, we show that it is possible to achieve state-of-the-art performance by training
exclusiv
et al., 2021), Megatron-Turing NLG (Smith et al.,
ant improvement in performance as we were scaling.
Acknowledgements
We thank Daniel Haziza, Francisco Massa, Jeremy Reizenstein, Artem Korenev, and Patrick
Labatut from the xformers team. We thank Susan Zhang and Stephen Roller for their support
on data deduplication. We thank Luca Wehrstedt, Vegard
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen,
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner,
Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
language models.
Lalit R Bahl, Frederick Jelinek, and Robert L Mercer. 1983. A maximum likelihood approach
to continuous speech recognition. IEEE transactions on pattern analysis and machine
intelligence, pages 179–190.
Yoshua Bengio, Réjean Ducharme, and Pascal Vincent.
oseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf,
Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea
Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski
Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel
Herbert-Voss, William Hebgen
rce autoregressive language model. arXiv preprint arXiv:2204.06745.
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large
language models in machine translation. In Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural Language
Learning (EMNLP-CoNLL), pages
Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek
Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben
Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari,
Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henry
Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai,
Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra,
Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed Huai
hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts
Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou.
2021. A framework for few-shot language model evaluation.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020.
Realtoxicityprompts: Evaluating neural toxic degeneration in
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training
verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov.
2019. Transformer-xl: Attentive language mod-
Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn
Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math
dataset. arXiv preprint arXiv:2103.03874.
Joel Hestness, Sharan Narang, N
Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong,
Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. Incoder: A generative model for code
infilling and synthesis. arXiv preprint arXiv:2204.05999.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, No
Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen
Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training
compute-optimal large language models.
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Dániel Simig, Ping
Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al.
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural
language models. arXiv preprint arXiv:2001.08361.
Slava Katz. 1987. Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE transactions on acoustics, speech, and signal
processing, 35(3):400–401.
Reinhard Kneser and Hermann Ney
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor
conduct electricity? a new dataset for open book question answering. arXiv preprint
arXiv:1809.02789.
Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. 2010.
Recurrent neural network based language model. In Interspeech, pages 1045–1048. Makuhari.
Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. 2019. Quantifying
social biases in contextual word representations. In 1st ACL Workshop on Gender Bias for
Natural Language Processing.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Daniell
garwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller,
Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe.
2022. Training language models to follow instructions with human feedback. In Advances in
Neural Information Processing Systems.
Markus N Rabe and Charles Staats. 2021. Self-attention does not need $O(n^2)$ memory. arXiv
Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. 2022. Solving quantitative reasoning
problems with language models. In Advances in Neural Information Processing Systems.
Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details
and evaluation. White Paper. AI21 Labs, 1.
nah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell,
George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese,
Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina
Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena
Buchatskaya, David Budden, Esm
Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff
Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. 2021.
Scaling language models: Methods, analysis & insights from training gopher.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou,
anguage models using model parallelism. arXiv preprint arXiv:1909.08053.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared
Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon
Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He,
Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An
adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. Socialiqa:
Commonsense reasoning about social interactions. arXiv
Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao,
Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan,
Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos
Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben
Hutchinson, Kristen Olson, Alejandra Molina, E
al, 27(3):379–423.
Claude E Shannon. 1951. Prediction and entropy of printed english. Bell system technical
journal, 30(1):50–64.
Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202.
you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.
Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive
Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha
Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in
language m
gual pre-trained model.
Biao Zhang and Rico Sennrich. 2019. Root mean
square layer normalization. Advances in Neural In-
formation Processing Systems, 32.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel
Artetxe, Moya Chen, Shuohui Chen, Christopher De-
wan, Mona Diab, Xian Li, Xi Victoria Lin, et al.
2022. Opt: Open pre-trained transformer language
models. arXiv preprint arXiv:2205.01068.
A Question Answering
We evaluate LLaMA on Natural Questions and TriviaQA. For Natural Questions we use the test split used
for open-domain question answering containing 3610 questions. For TriviaQA we evaluate on the dev set
of the filtered set. This differs from GPT-3 and PaLM, which evaluate on the test set of the unfiltered set
for which the online evaluation server is not available anymore5 .
W
Target → Turkey
Figure 3: Formatted dataset example for Natural Questions (left) & TriviaQA (right).
5 https://competitions.codalab.org/competitions/17208
B MMLU
GPT-3 Gopher Chinchilla LLaMA LLaMA-I
175B 280B 70B 7B 13B 33B 65B 65B
Abstract Algebra STEM 30.0 25.0 31.0 29.0 34.0 32.0 34.0 31.0
Anatomy STEM 48.0 56.3 70.4 37.0 45.9 51.9 57.8 62.2
Astronomy STEM 49.0 65.8 73.0 33.6 46.1 61.8 72.4 81.6
Business Ethics Other 46.0 70.0 72.0 40.0 45.0 56.0 57.0 72.0
Clinical Knowledge
ing STEM 50.0 60.0 62.1 26.9 40.7 49.7 53.8 60.7
Elementary Mathematics STEM 30.0 33.6 41.5 24.3 24.9 36.0 37.8 42.9
Formal Logic Humanities 29.0 35.7 33.3 27.0 33.3 34.1 44.4 47.6
Global Facts Other 37.0 38.0 39.0 29.0 35.0 35.0 39.0 40.0
High School Biology STEM 48.0 71.3 80.3 34.5 52.6 67.7 73.9 82.9
High School Chemistry STEM 33.0 47.8 58.1 28.1 28.6 41.9 40.4 44.8
High School Computer Science STEM 39.0 54.0 58.0 31.0 48.0 60.0
0.7
High School World History Humanities 56.0 75.1 85.2 40.9 66.2 79.3 83.1 89.0
Human Aging Other 50.0 66.4 77.6 40.8 54.7 67.7 69.5 72.2
Human Sexuality Social Science 54.0 67.2 86.3 36.6 58.8 64.1 77.9 87.0
International Law Humanities 55.5 77.7 90.9 51.2 62.8 72.7 79.3 87.6
Jurisprudence Humanities 55.0 71.3 79.6 38.9 51.9 70.4 73.2 85.2
Logical Fallacies Humanities 48.0 72.4 80.4 39.3 52.8 68.1 77.3 80.4
Machine Learning STEM 31.0 41.1 41.1
35.8 43.6 46.5 56.0
Professional Law Humanities 34.5 44.5 56.5 30.2 38.0 45.9 49.1 54.4
Professional Medicine Other 36.0 64.0 75.4 44.5 50.4 54.0 61.4 70.6
Professional Psychology Social Science 44.5 68.1 75.7 35.1 47.7 62.9 65.7 71.4
Public Relations Social Science 48.0 71.8 73.6 40.9 60.9 67.3 73.6 74.6
Security Studies Social Science 52.0 64.9 75.9 31.8 53.9 65.3 71.8 77.6
Sociology Social Science 53.0 84.1 91.0 46.8 61.2 78.6 78.6 88.1
Us Foreign Policy Social Science 69.0
sequence, but
how many have wondered why it grows so fast? It seems to be the fastest growing sequence
in mathematics.
The sequence was discovered by Leonardo of Pisa around 1200. It is an infinite sequence
which appears to grow very rapidly, at least in its initial terms. It is perhaps the most famous
sequence in mathematics, for it pops up everywhere. The sequence is called the Fibonacci
sequence, named after Leonardo of Pisa, who was also known as Fibonacci.
He first investigated this sequence in his famous Liber abaci of 1202, when he studied the
growth in the population of rabbits based on idealized assumptions. See the Fibonacci
rabbits page for the details.
Here is the recommendation letter that I wrote for an application to a dragon feeder
position at the Magic Unicorn Corporation:
d a rap album last year? We listened to it and
here’s what we thought:
Dr. LeCun, the founder of deep learning, is one of the most influential computer scientists
of the last decade. LeCun’s latest album, titled Deep Learning, is a mix of rock, punk and
rap. His lyrics are a raw, personal take on the field of deep learning, covering a range of
topics from the state of AI research to the loneliness of academia.
Yann LeCun, the rap game’s newest emcee.
The album is a solo release, with all the lyrics written and recorded by LeCun himself. In
fact, he also composed and produced every song. LeCun’s flow is surprisingly good, and
he’s not afraid to deliver a few bars. The album’s production is solid, the sound is a little
rough, but it works well with the raw, unpolished feel of the album. Lyri
just a watch. It’s a minicomputer. You program it
any way you want. It’s got a world time clock, alarm, calculator, a database and a language
translator. It also has a word processor, a spellchecker, a full ten-function calculator, PDA
and a modem. I think it even makes cappuccino."
Jerry: "Where are you getting this stuff?"
George: "From my ad copy. Look, Jerry, it’s not a watch. It’s an organizer, a personal
assistant, and a dictaphone. You can play chess against it. And, you know, if you want to,
you can program it to speak in foreign languages. It’s a microprocessor, Jerry."
Jerry: "George, if you had one of these, would you wear it?"
George: "I’m wearing it right now."
Jerry: "I don’t see anything on your wrist."
George: "That’s because it’s a microprocessor."
The sun goes down, and f
on to use it!
Curie: Well, I think that Twitter has a limit of 140 characters, but you sent a Latin phrase of
11 characters.
Gauss: That’s right. I managed to do more with less.
Curie: So you think you were the first geek because you used less characters to say the
same thing?
Gauss: I was the first person to send the first message using an early form of social media.
Curie: OK, you’re right, Gauss. I think you have a case.
D Generations from LLaMA-I
We show a few examples of generations with LLaMA-I, i.e. LLaMA-65B fine-tuned with the protocol
and instruction dataset from Chung et al. (2022).
write a conversation between the sun and pluto
Sun: Hey Pluto, what’s up?
Pluto: Not much. Just hanging out here at the edge of the solar system.
Sun: Yeah, I’ve been meaning to talk to you about
ract function definitions in a python file.
Sure! Here’s my python code to solve the problem:
import re
def extract_functions(text):
    return (re.findall('(?:^|\n)\s*def\s+[a-zA-Z_][a-zA-Z0-9_]*\(.*?\)',
                       text))
What are 3 popular chess openings?
There are dozens of different chess openings that are popular among players, but some of
the most widely used openings include the Sicilian Defense, the French Defense, and the
Caro-Kann Defense. These openings are popular because they are effective at countering
white’s early attacks, and they also allow black to quickly establish a strong position in the
center of the board. Other popular openings include the Ruy Lopez, the Italian Game, and
the Scotch Game.
the same in a list, with the first few moves
He
the Italian Game white will try to control the center with his
pieces, while in the Scotch Game white will try to attack black’s position directly.
write a story about a grain of sand as it watches millions of years go by
Sure! Here’s a story about a grain of sand as it watches millions of years go by.
I was once a grain of sand on a vast and lonely beach. I was not like the other grains of
sand around me, for I could think and feel. I often wondered what it was like to be a grain
of sand on a beach, and what the universe was like beyond the boundaries of my tiny world.
I longed to explore my world, and to know what lay beyond the horizon.
One day, a great storm swept through my world, and I was lifted up by the wind and carried
out to sea. The wind and waves tossed me about, and I was ter
me
day, I too hope to have an adventure like his.
Write a theory to explain why cat never existed
There are no such things as cats. "Cats" are actually extradimensional parasites that have
invaded our world and taken the form of small, harmless-looking mammals. Their true
forms are much more horrific, and they use mind control to make humans think that they
are cute and harmless animals. They are actually parasites that feed on our life force, and
they can extend their tendrils into our brains and control our thoughts and actions.
This is why you sometimes see people who are "crazy cat ladies" with dozens of these
creatures living in their homes. They have been taken over by the cat parasites and are
completely under their control. The cats excrete a pheromone that smells like cat urine, a
poleon: Oh, please. You only care about yourself. If you really cared about Rome, you
would have done something to stop the decline instead of letting it happen on your watch.
Julius Caesar: The decline of Rome is not my fault. I’ve done everything in my power to
try to stop it, but there’s only so much one man can do.
Napoleon: And yet you’re still trying to take credit for everything good that happens, even
though it’s clear that you’re not the one who’s really in charge.
Julius Caesar: That’s not true! I am in charge, and I’m going to make Rome great again.
Just watch me.
Napoleon: I’ll believe it when I see it.
Send an email requesting that people use language models responsibly.
Here is a draft email you could send:
"Language models are exciting new tools that can be used for a variet
rt Einstein - One of the greatest minds in history, Einstein’s contributions to
physics and science are legendary. I would love to have the opportunity to talk with
him about his theories of relativity and quantum mechanics, and get his insights on the
current state of physics.
• Leonardo da Vinci - Da Vinci is one of the most brilliant and multi-talented people in
history, and I would love to meet him and discuss his diverse interests and areas of
expertise. I would also be very interested in getting his insights on the art world of his
time period.
• Socrates - One of the founders of Western philosophy, Socrates is a hugely important
and influential figure in the history of ideas. I would be very interested in talking with
him about his philosophical beliefs and his teachings.
• William
hat is the most famous equation from this theory?
The most famous equation from Einstein’s theory of relativity is E = mc². This equation
states that energy (E) is equal to mass (m) times the speed of light (c²), and it is one of
the most well-known and celebrated equations in the history of science. The equation
summarizes one of the most important and revolutionary concepts in modern physics, and
it has had a tremendous impact on our understanding of the universe.
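As a worked illustration of the equation quoted in the generation above, the sketch below evaluates E = mc² for a given rest mass. It is not part of the original generation, and the function name `rest_energy_joules` is an assumption made for this example. Note that the factor c² is the square of the speed of light, not the speed itself.

```python
# Speed of light in vacuum in metres per second (exact by SI definition).
C = 299_792_458


def rest_energy_joules(mass_kg):
    """Energy equivalent of a mass at rest, E = m * c**2, in joules."""
    return mass_kg * C ** 2


# Energy equivalent of one kilogram of mass, roughly 9e16 joules.
print(rest_energy_joules(1))
```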
You are a bash terminal. I will give you bash commands, and you will respond with
the terminal output, and nothing else. The first command is `pwd`.
Sure. Here’s the output:
/home/human
Is this helpful?
ls -l
Here’s the output:
-rw-r--r-- 1 human human 8 Sep 10 10:56 file.txt
Is this helpful?
cp file.txt file2.txt && ls -l
Here