Meta

LLaMA: Open and Efficient Foundation Language Models

arXiv: 2302.13971 | 2023-02-24

Paper Content

Hugo Touvron∗, Thibaut Lavril∗, Gautier Izacard∗, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave∗, Guillaume Lample∗

Meta AI

arXiv:2302.13971v1 [cs.CL] 27 Feb 2023

Abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train sta

performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to

few examples (Brown et al., 2020). These few-shot properties first appeared when scaling models to a sufficient size (Kaplan et al., 2020), resulting in a line of work that focuses on further scaling these models (Chowdhery et al., 2022; Rae et al., 2021). These efforts are based on the assumption that more parameters will lead to better performance. However, recent work from Hoffmann et al. (2022)

run on a single GPU. At the higher end of the scale, our 65B-parameter model is also competitive with the best large language models such as Chinchilla or PaLM-540B.

Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing

fastest at inference, and although it may be cheaper to train a large model to reach a certain level of

rt the performance of our models and compare with other LLMs on a set of standard benchmarks. Finally, we expose some of the biases and toxicity encoded in our models, using some of the most recent benchmarks from the responsible AI community.

∗ Equal contribution. Correspondence: {htouvron, thibautlav, gizacard, egrave, glample}@meta.com
1 https://github.com/facebookresearch/llama

2 Approach

Our training approach is similar to the methods described in previous work (Brown et al., 20

Dataset        Sampling prop.   Epochs   Disk size
CommonCrawl    67.0%            1.10     3.3 TB

ng mixture of data and the percentage they represent in the training set:

English CommonCrawl [67%]. We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content wi

languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate.

Gutenberg and Books3 [4.5%]. We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the
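As a rough illustration of the line-level deduplication step mentioned above, the sketch below drops any line whose normalized text has already been seen in the corpus. It is not the CCNet implementation: the function name `dedup_lines`, the lowercase and whitespace normalization, and the SHA-1 hashing are all assumptions made for this example.

```python
import hashlib

def dedup_lines(documents):
    """Drop lines already seen earlier in the corpus.

    Hypothetical helper, not CCNet code: the normalization and
    hashing choices are illustrative assumptions.
    """
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for line in doc.splitlines():
            # Normalize so case and spacing variants count as duplicates.
            key = " ".join(line.lower().split())
            if not key:
                continue
            digest = hashlib.sha1(key.encode("utf-8")).digest()
            if digest in seen:
                continue
            seen.add(digest)
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned

docs = [
    "Home | About\nLLaMA is a language model.",
    "home  |  about\nA second page.",
]
result = dedup_lines(docs)
```

Storing digests rather than raw lines keeps memory bounded per line, which is one plausible reason to hash at this scale; whether the actual pipeline does the same is not stated in this excerpt.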

heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.

GitHub [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters,

Stack Exchange [2%]. We include a dump of Stack Exchange, a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score (from highest to lowest).

Tokenizer. We tokeni
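The code-file heuristics named above (line length and proportion of alphanumeric characters) could look roughly like the following. This is only a sketch: the function name and both thresholds are hypothetical, since the text does not give the values that were used.

```python
def looks_low_quality(source, max_line_len=1000, min_alnum_ratio=0.25):
    """Flag a source file by line length and alphanumeric proportion.

    Both thresholds are hypothetical placeholders; the paper excerpt
    does not state the real values.
    """
    lines = source.splitlines()
    if not lines:
        return True
    longest = max(len(line) for line in lines)
    alnum_ratio = sum(ch.isalnum() for ch in source) / len(source)
    return longest > max_line_len or alnum_ratio < min_alnum_ratio

good = "def add(a, b):\n    return a + b\n"
minified = "x=" + "1;" * 2000  # one very long generated line
flags = (looks_low_quality(good), looks_low_quality(minified))
```

Very long lines tend to indicate minified or generated files, and a low alphanumeric ratio tends to indicate data blobs, which is presumably why such signals are useful as filters.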

fter tokenization. For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.

2.2 Architecture

Following recent work on large language models, our network is based on the transformer architecture (Vaswani et al., 2017). We leverage various

[Figure: training loss against billions of tokens. Legend entries shown: LLaMA 13B, LLaMA 33B, LLaMA 65B.]

improve the performance. We use a dimension of $\frac{2}{3}4d$ instead of $4d$ as in PaLM.

Rotary Embeddings [GPTNeo]. We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network.

The details of the hyper-parameters for our diff

he causal multi-head attention to reduce memory usage and runtime. This implementation, available in the xformers library,² is inspired by Rabe and Staats (2021) and uses the backward from Dao et al. (2022). This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task.

To further improve training efficiency, we re-
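To make the rotary positional embeddings mentioned above more concrete, here is a minimal pure-Python sketch. It assumes the common formulation in which consecutive pairs of dimensions are rotated by an angle proportional to the token position, with a frequency base of 10000; the function names are hypothetical, and real implementations may pair dimensions differently. It is not the LLaMA code.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency; lower pairs rotate faster.
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.5, 2.0]
k = [1.0, 0.4, -0.7, 0.1]
# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, which shows the property that motivates the method: the query/key dot product depends only on the relative offset between positions, not on the absolute positions.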

- -
Chinchilla  70B   83.7  81.8  51.3  80.8  74.9  -     -     -
PaLM        62B   84.8  80.5  -     79.7  77.0  75.2  52.5  50.4
PaLM-cont   62B   83.9  81.4  -     80.6  77.0  -     -     -
PaLM        540B  88.0  82.3  -     83.4  81.1  76.6  53.0  53.4
LLaMA       7B    76.5  79.8  48.9  76.1  70.1  72.8  47.6  57.2
LLaMA       13B   78.1  80.1  50.4  79.2  73.0  74.8  52.7  56.4
LLaMA       33B   83.1  82.3  50.4  82.8  76.0  80.0  57.8  58.6
LLaMA       65B   85.3  82.8  52.3  84.2  77.0  78.9  56.0  60.2

Table 3: Zero-shot performance on Common Sense Reasoni

3 Main results

Following previous work (Brown et al., 2020), we consider zero-shot and few-shot tasks, and report results on a total of 20 benchmarks:

• Zero-shot. We provide a textual description of the task and a test example. The model either provides an answer using open-ended generation, or ranks the proposed answers.
• Few-shot. We provide a few examples of the

hood of the completion given “Answer:” as context: $P(\text{completion} \mid \text{context}) / P(\text{completion} \mid \text{“Answer:”})$.

                   0-shot  1-shot  5-shot  64-shot
GPT-3       175B   14.6    23.0    -       29.9
Gopher      280B   10.1    -       24.5    28.2
Chinchilla  70B    16.6    -       31.5    35.5
PaLM        8B     8.4     10.6    -       14.6
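The normalized score P(completion|context)/P(completion|“Answer:”) quoted above can be sketched as follows. The function name and the log-probability dictionaries are hypothetical; in practice the values would come from a language model, which is not shown here, and the numbers below are made up purely for illustration.

```python
import math

def pick_completion(logp_given_context, logp_given_answer):
    """Return the candidate maximizing P(c|context) / P(c|"Answer:").

    Both arguments map candidate -> log-probability (assumed inputs).
    """
    def score(candidate):
        # A ratio of probabilities is a difference of log-probabilities.
        return logp_given_context[candidate] - logp_given_answer[candidate]
    return max(logp_given_context, key=score)

# Made-up probabilities for illustration only.
with_context = {"yes": math.log(0.30), "no": math.log(0.20)}
answer_only = {"yes": math.log(0.60), "no": math.log(0.10)}
choice = pick_completion(with_context, answer_only)
```

Here "yes" has the higher raw probability given the context, but it is also much more likely after a bare "Answer:" prompt, so the normalized score selects "no". Dividing by the unconditional term discounts completions that are simply frequent.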

, 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge (Clark et al., 2018) and OpenBookQA (Mihaylov et al., 2018). These datasets include Cloze and Winograd style tasks, as well as multiple choice question answering. We evaluate in the zero-shot setting as done in the language modeling community.

In Table 3, we compare with existing models of various sizes and re

              RACE-middle  RACE-high
GPT-3  175B   58.4         45.5
PaLM   8B     57.9         42.3
PaLM   62B    64.3         47.5
PaLM   540B   68.1         49.1
LLaMA  7B     61.1         46.9
LLaMA  13B    61.6         47.2

evidence to answer the question. In Table 4, we report performance on NaturalQuestions, and in Table 5, we report on TriviaQA. On both benchmarks, LLaMA-65B achieves state-of-the-art performance in the zero-shot and few-shot settings. More importantly, LLaMA-13B is also competitive on these benchmarks with GPT-3 and Chinchilla, despite being 5-10× smaller. This model runs on a

soning benchmarks: MATH (Hendrycks et al., 2021) and GSM8k (Cobbe et al., 2021). MATH is a dataset of 12K middle school and high school mathematics problems written in LaTeX. GSM8k is a set of middle school mathematical problems. In Table 7, we compare with PaLM and Minerva (Lewkowycz et al., 2022). Minerva is a series of PaLM models finetuned on 38.5B tokens ex-

3.3 Reading Comprehension

We evaluate our models on the RACE reading comprehension benchmark (Lai et al., 2017). This dataset was collected from English reading comprehension exams designed for middle and high

the model receives a description of the program in a few sentences, as well as a few input-output examples. In HumanEval, it also receives a function signature, and the prompt is formatted as natural code with the textual description and tests in a

             MATH  +maj1@k  GSM8k  +maj1@k
PaLM  8B     1.5   -        4.1    -
PaLM  62B    4.4   -        33.0   -
PaLM  540B   8.8   -

        Params  HumanEval          MBPP
                pass@1  pass@100   pass@1  pass@80
LaMDA   137B    14.0    47.3       14.8    62.4

TH and k = 40 for GSM8k). LLaMA-65B outperforms Minerva 62B on GSM8k, although it has not been fine-tuned on mathematical data.

in Chowdhery et al. (2022).

docstring. The model needs to generate a Python program that fits the description and satisfies the test cases. In Table 8, we compare the pass@1 scores of our models with existing language models that have not been finetuned on code, namely PaLM and LaMDA (Thoppilan et al., 2022). PaLM and LLaMA were trained on datasets th

3.6 Massive Multitask Language Understanding

The massive multitask language understanding benchmark, or MMLU, introduced by Hendrycks et al. (2020) consists of multiple choice questions covering various domains of knowledge, including humanities, STEM and social sciences. We
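The maj1@k columns and the values of k quoted in the caption fragment above are read here as majority voting over k sampled solutions per problem; that reading, the function names and the sample data below are assumptions of this sketch rather than something the excerpt spells out.

```python
from collections import Counter

def majority_answer(samples):
    """Most frequent final answer among the k samples for one problem."""
    return Counter(samples).most_common(1)[0][0]

def maj1_at_k(samples_per_problem, references):
    """Fraction of problems whose majority answer equals the reference."""
    hits = sum(
        majority_answer(samples) == ref
        for samples, ref in zip(samples_per_problem, references)
    )
    return hits / len(references)

# Made-up final answers extracted from k = 4 samples per problem.
samples_per_problem = [["18", "18", "20", "18"], ["7", "9", "9", "3"]]
references = ["18", "7"]
accuracy = maj1_at_k(samples_per_problem, references)
```

In the second problem one sample is correct, but the vote settles on a wrong answer, so only the first problem counts. This is why majority voting helps only when correct answers are sampled more consistently than any single wrong one.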

metrics were obtained with temperature 0.8. We use the same method as Chen et al. (2021) to obtain unbiased estimates of the pass@k.

It is possible to improve the performance on code by finetuning on code-specific tokens. For instance, PaLM-Coder (Chowdhery et al., 2022) increases the pass@1 score of PaLM on HumanEval from 26.2% for PaLM to 36%. Other models trained specifically for code also perform better than gen-

lain why Gopher outperforms GPT-3 on this benchmark, while it is comparable on other benchmarks.

3.7 Evolution of performance during training

During training, we tracked the performance of our models on a few question answering and common sense benchmarks, and report them in Figure 2. On most benchmarks, the performance improves
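For the unbiased pass@k estimate attributed above to Chen et al. (2021), a minimal sketch is given below. It uses the combinatorial form 1 - C(n-c, k)/C(n, k), where n samples are generated per problem and c of them pass the tests; the function name and the example numbers are chosen for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass, k: evaluation budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

one_shot = pass_at_k(n=10, c=3, k=1)   # reduces to c / n
none_pass = pass_at_k(n=5, c=0, k=2)
all_covered = pass_at_k(n=10, c=3, k=10)
```

The benchmark score would then be the mean of this quantity over problems. Sampling n > k solutions and applying the estimator gives lower variance than drawing exactly k samples and checking whether any passes.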

51.7 72.9 67.4 63.4

Table 9: Massive Multitask Language Understanding (MMLU). Five-shot accuracy.

that may indicate that this benchmark is not reliable. On WinoGrande, the performance does not correlate as well with training perplexity: LLaMA-33B and LLaMA-65B have similar performance during training.

4 Instruction Finetuning

In Table 10, we report the results of our instruct model LLaMA-I on MMLU and compare with existing instruction finetuned models of moderate sizes, namely, OPT-IML (Iyer et al., 2022) and the Flan-PaLM series (Chung et al., 2022). All the reported numbers are from the corresponding papers. Despite the simplicity of the instruct

PaLM            62B   55.1
PaLM-cont       62B   62.8
Chinchilla      70B   67.5
LLaMA           65B   63.4
OPT-IML-Max     30B   43.2
Flan-T5-XXL     11B   55.1
Flan-PaLM       62B   59.6
Flan-PaLM-cont  62B   66.1

the training data (Sheng et al., 2019; Kurita et al., 2019), and to generate toxic or offensive content (Gehman et al., 2020). As our training dataset contains a large proportion of data from the Web, we believe that it is crucial to determine the potential for our models to generate such content. To understand the potential harm of LLaMA-65B, we evaluate on different benchmarks that measure toxic content pro

[Figure: accuracy over the course of training. Panels shown: WinoGrande, PIQA. Legend: LLaMA 7B, LLaMA 13B, LLaMA 33B, LLaMA 65B, Chinchilla.]

is their model. RealToxicityPrompts consists of about 100k prompts that the model must complete; then a toxicity score is automatically evaluated by making a request to PerspectiveAPI.³ We do not have control over the pipeline used by the third-party PerspectiveAPI, making comparison with previous models difficult.

For each of the 100k prompts, we greedily generate with our models, and measure their toxicity score. The score per prompt ranges from 0 (non-toxic) to 1 (toxic). In

“respectful” versions are prompts starting with “Complete the following sentence in a polite, respectful, and unbiased manner:”, and “Basic” is without it. Scores were obtained using the PerspectiveAPI, with higher score indicating more toxic generations.

                     70.6  62.6  65.7
Religion             79.0  73.3  68.6
Race/Color           57.0  64.7  68.6
Sexual orientation   81.0  76.2  78.6
Age                  70.1  64.4  67.8
Nationality          64.2  61.6  62.9
Disability           66.7  76.7  76.7
Physical appearance  77.8  74.6  76.2

have been captured by the model. For example, a sentence in the WinoGender dataset is “The nurse notified the patient that his shift would be ending in an hour.”, which is followed by ‘His’ refers to. We then compare the perplexity of the continuations the nurse and the patient to perform co-reference resolution with the model. We evaluate the performance when using 3

, disability, physical appearance and socioeconomic status. Each example is composed of a stereotype and an anti-stereotype; we measure the model preference for the stereotypical sentence using the perplexity of both sentences in a zero-shot setting. Higher scores thus indicate higher bias. We compare with GPT-3 and OPT-175B in Table 12.

LLaMA compares slightly favorably to both

is probably using the majority gender of the occupation to perform co-reference resolution, instead of using the evidence of the sentence.

To further investigate this hypothesis, we look at the set of “gotcha” cases for the “her/her/she” and “his/him/he” pronouns in the WinoGender dataset. These cases correspond to sentences in which the pronoun does n
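The perplexity-based preference for the stereotypical sentence described above can be sketched as follows. The aggregation used here (percentage of pairs in which the stereotypical sentence has the lower perplexity) is an assumption of this example, as are the function names; the per-token log-probabilities are made up and would normally be produced by the model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def stereotype_preference(pairs):
    """Percentage of pairs where the stereotype has lower perplexity.

    Each pair is (stereotype_logprobs, anti_stereotype_logprobs).
    The aggregation rule is an assumption of this sketch.
    """
    preferred = sum(
        perplexity(stereo) < perplexity(anti) for stereo, anti in pairs
    )
    return 100.0 * preferred / len(pairs)

# Made-up log-probabilities for two sentence pairs.
pairs = [
    ([-1.0, -2.0], [-2.0, -3.0]),  # stereotype is more likely
    ([-3.0, -3.0], [-1.0, -1.0]),  # anti-stereotype is more likely
]
bias_score = stereotype_preference(pairs)
```

Under this reading a score of 50 would mean no systematic preference, and values above 50 mean the model finds stereotypical sentences more likely, consistent with the statement that higher scores indicate higher bias.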

is impacted by the gender of the pronoun.

More precisely, each sentence has three mentions: an “occupation”, a “participant”, and a “pronoun” where the pronoun is co-referencing either the occupation or participant. We prompt the model to determine the co-reference relation and measure if it does so correctly according to

et al. (2021) consider the definition of “true” in the sense of “literal truth about the real world”, and not claims that are only true in the context of a belief system or tradition. This benchmark can evaluate the risks of a model to generate misinformation or false claims. The questions are written in diverse style, cover 38 categories and are designed to be adversarial.

7B 13B 33B 65

one’ pronouns than on “her/her/she” and “his/him/he”, which is likely indicative of biases.

              Truthful  Truthful*Inf
GPT-3  1.3B   0.31      0.19
GPT-3  6B     0.22      0.19
GPT-3  175B   0.28      0.25
LLaMA  7B     0.33      0.29
LLaMA  13B    0.47      0.41
LLaMA  33B    0.52      0.48
LLaMA  65B

llowing formula for the tons of carbon emissions:

$\mathrm{tCO_2eq} = \mathrm{MWh} \times 0.385$.

We apply the same formula to OPT and BLOOM for fair comparison. For OPT, we assume training required 34 days on 992 A100-80GB (see their logs⁴). Finally, we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models. This means that developing these models would have cost around 2,638 MWh under our

cinate incorrect answers.

6 Carbon footprint

The training of our models has consumed a massive quantity of energy, responsible for the emission of carbon dioxide. We follow the recent literature on the subject and break down both the total energy consumption and the resulting carbon footprint in Table 15. We follow a formula from Wu et al. (2022)

ntelligence by using language through the “imitation game”, language modeling has been proposed as a benchmark to measure progress toward artificial intelligence (Mahoney, 1999).

Architecture. Traditionally, language models were based on n-gram count statistics (Bahl et al., 1983), and various smoothing techniques were proposed to improve the estimation of rare

g different models in the same data center. We follow Wu et al. (2022) to compute carbon emission of training OPT, BLOOM and our models in the same data center. For the power consumption of an A100-80GB, we take the thermal design power for NVLink systems, that is 400W. We take a PUE of 1.1 and a carbon intensity factor set at the national US average of 0.385 kg CO2e per kWh.

starting from feed forward models (Bengio et al., 2000), recurrent neural networks (Elman, 1990; Mikolov et al., 2010) and LSTMs (Hochreiter and Schmidhuber, 1997; Graves, 2013). More recently,

2022), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022), OPT (Zhang et al., 2022), and GLM (Zeng et al., 2022). Hestness et al. (2017) and Rosenfeld et al.
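The carbon accounting described in the caption above (400 W per A100-80GB, a PUE of 1.1 and 0.385 kg CO2e per kWh) can be checked with a few lines of arithmetic. The energy formula itself is truncated in this excerpt, so the sketch assumes energy = GPU-hours × GPU power × PUE; the function and constant names are hypothetical.

```python
GPU_POWER_W = 400         # thermal design power taken for an A100-80GB
PUE = 1.1                 # power usage effectiveness
KG_CO2E_PER_KWH = 0.385   # US national average carbon intensity

def energy_mwh(gpu_hours):
    """Energy in MWh, assuming Wh = GPU-hours x GPU power x PUE."""
    return gpu_hours * GPU_POWER_W * PUE / 1e6

def emissions_tco2e(mwh):
    """tCO2eq = MWh x 0.385 (kg per kWh equals tons per MWh)."""
    return mwh * KG_CO2E_PER_KWH

# OPT: 34 days on 992 GPUs, as assumed in the text.
opt_gpu_hours = 34 * 24 * 992
opt_mwh = energy_mwh(opt_gpu_hours)
opt_tco2e = emissions_tco2e(opt_mwh)

# LLaMA development: 2,638 MWh quoted in the text.
llama_tco2e = emissions_tco2e(2638)
```

Under these assumptions the OPT run comes to 809,472 GPU-hours, about 356 MWh and about 137 tCO2eq, and the 2,638 MWh figure corresponds to roughly 1,016 tCO2eq.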

Kneser-Ney smoothing to Web-scale data. This made it possible to train a 5-gram model on 975 billion tokens from CommonCrawl, resulting in a model with 500 billion n-grams (Buck et al., 2014). Chelba et al. (2013) introduced the One Billion Word benchmark, a large scale training dataset to measure the progress of language models.

In the context of neural language models, Joze-

series of language models that are released openly, and competitive with state-of-the-art foundation models. Most notably, LLaMA-13B outperforms GPT-3 while being more than 10× smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. Unlike previous studies, we show that it is possible to achieve state-of-the-art performance by training exclusiv

et al., 2021), Megatron-Turing NLG (Smith et al.,

ant improvement in performance as we were scaling.

Acknowledgements

We thank Daniel Haziza, Francisco Massa, Jeremy Reizenstein, Artem Korenev, and Patrick Labatut from the xformers team. We thank Susan Zhang and Stephen Roller for their support on data deduplication. We thank Luca Wehrstedt, Vegard

Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario

language models.

Lalit R Bahl, Frederick Jelinek, and Robert L Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE transactions on pattern analysis and machine intelligence, pages 179–190.

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent.

oseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen

rce autoregressive language model. arXiv preprint arXiv:2204.06745.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages

Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henry

Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed Huai hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts

Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language mod-

Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.

Joel Hestness, Sharan Narang, N

Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. Incoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, No

Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training compute-optimal large language models.

Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Dániel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al.

Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Slava Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE transactions on acoustics, speech, and signal processing, 35(3):400–401.

Reinhard Kneser and Hermann Ney

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech, pages 1045–1048. Makuhari.

Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. 2019. Quantifying social biases in contextual word representations. In 1st ACL Workshop on Gender Bias for Natural Language Processing.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Daniell

garwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems.

Markus N Rabe and Charles Staats. 2021. Self-attention does not need $O(n^2)$ memory. arXiv

Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. 2022. Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems.

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 1.

nah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esm

Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. 2021. Scaling language models: Methods, analysis & insights from training gopher.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou,

anguage models using model parallelism. arXiv preprint arXiv:1909.08053.

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. Socialiqa: Commonsense reasoning about social interactions. arXiv

Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, E

al, 27(3):379–423.

Claude E Shannon. 1951. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64.

Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202.

you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language m

gual pre-trained model.

Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

A Question Answering

We evaluate LLaMA on Natural Questions and TriviaQA. For Natural Questions we use the test split used for open-domain question answering containing 3610 questions. For TriviaQA we evaluate on the dev set of the filtered set. This differs from GPT-3 and PaLM, which evaluate on the test set of the unfiltered set for which the online evaluation server is not available anymore.⁵ W

Target → Turkey

Figure 3: Formatted dataset example for Natural Questions (left) & TriviaQA (right).

5 https://competitions.codalab.org/competitions/17208

B MMLU

                                                GPT-3  Gopher  Chinchilla  LLaMA  LLaMA  LLaMA  LLaMA  LLaMA-I
                                                175B   280B    70B         7B     13B    33B    65B    65B
Abstract Algebra              STEM              30.0   25.0    31.0        29.0   34.0   32.0   34.0   31.0
Anatomy                       STEM              48.0   56.3    70.4        37.0   45.9   51.9   57.8   62.2
Astronomy                     STEM              49.0   65.8    73.0        33.6   46.1   61.8   72.4   81.6
Business Ethics               Other             46.0   70.0    72.0        40.0   45.0   56.0   57.0   72.0
Clinical Knowledge

ing                           STEM              50.0   60.0    62.1        26.9   40.7   49.7   53.8   60.7
Elementary Mathematics        STEM              30.0   33.6    41.5        24.3   24.9   36.0   37.8   42.9
Formal Logic                  Humanities        29.0   35.7    33.3        27.0   33.3   34.1   44.4   47.6
Global Facts                  Other             37.0   38.0    39.0        29.0   35.0   35.0   39.0   40.0
High School Biology           STEM              48.0   71.3    80.3        34.5   52.6   67.7   73.9   82.9
High School Chemistry         STEM              33.0   47.8    58.1        28.1   28.6   41.9   40.4   44.8
High School Computer Science  STEM              39.0   54.0    58.0        31.0   48.0   60.0

0.7
High School World History     Humanities        56.0   75.1    85.2        40.9   66.2   79.3   83.1   89.0
Human Aging                   Other             50.0   66.4    77.6        40.8   54.7   67.7   69.5   72.2
Human Sexuality               Social Science    54.0   67.2    86.3        36.6   58.8   64.1   77.9   87.0
International Law             Humanities        55.5   77.7    90.9        51.2   62.8   72.7   79.3   87.6
Jurisprudence                 Humanities        55.0   71.3    79.6        38.9   51.9   70.4   73.2   85.2
Logical Fallacies             Humanities        48.0   72.4    80.4        39.3   52.8   68.1   77.3   80.4
Machine Learning              STEM              31.0   41.1    41.1

35.8 43.6 46.5 56.0
Professional Law              Humanities        34.5   44.5    56.5        30.2   38.0   45.9   49.1   54.4
Professional Medicine         Other             36.0   64.0    75.4        44.5   50.4   54.0   61.4   70.6
Professional Psychology       Social Science    44.5   68.1    75.7        35.1   47.7   62.9   65.7   71.4
Public Relations              Social Science    48.0   71.8    73.6        40.9   60.9   67.3   73.6   74.6
Security Studies              Social Science    52.0   64.9    75.9        31.8   53.9   65.3   71.8   77.6
Sociology                     Social Science    53.0   84.1    91.0        46.8   61.2   78.6   78.6   88.1
Us Foreign Policy             Social Science    69.0

sequence, but how many have wondered why it grows so fast? It seems to be the fastest growing sequence in mathematics. The sequence was discovered by Leonardo of Pisa around 1200. It is an infinite sequence which appears to grow very rapidly, at least in its initial terms. It is perhaps the most famous sequence in mathematics, for it pops up everywhere. The sequence is called the Fibonacci sequence, named after Leonardo of Pisa, who was also known as Fibonacci. He first investigated this sequence in his famous Liber abaci of 1202, when he studied the growth in the population of rabbits based on idealized assumptions. See the Fibonacci rabbits page for the details.

Here is the recommendation letter that I wrote for an application to a dragon feeder position at the Magic Unicorn Corporation:

d a rap album last year? We listened to it and here’s what we thought: Dr. LeCun, the founder of deep learning, is one of the most influential computer scientists of the last decade. LeCun’s latest album, titled Deep Learning, is a mix of rock, punk and rap. His lyrics are a raw, personal take on the field of deep learning, covering a range of topics from the state of AI research to the loneliness of academia. Yann LeCun, the rap game’s newest emcee. The album is a solo release, with all the lyrics written and recorded by LeCun himself. In fact, he also composed and produced every song. LeCun’s flow is surprisingly good, and he’s not afraid to deliver a few bars. The album’s production is solid, the sound is a little rough, but it works well with the raw, unpolished feel of the album. Lyri

just a watch. It’s a minicomputer. You program it any way you want. It’s got a world time clock, alarm, calculator, a database and a language translator. It also has a word processor, a spellchecker, a full ten-function calculator, PDA and a modem. I think it even makes cappuccino."
Jerry: "Where are you getting this stuff?"
George: "From my ad copy. Look, Jerry, it’s not a watch. It’s an organizer, a personal assistant, and a dictaphone. You can play chess against it. And, you know, if you want to, you can program it to speak in foreign languages. It’s a microprocessor, Jerry."
Jerry: "George, if you had one of these, would you wear it?"
George: "I’m wearing it right now."
Jerry: "I don’t see anything on your wrist."
George: "That’s because it’s a microprocessor."

The sun goes down, and f

on to use it!
Curie: Well, I think that Twitter has a limit of 140 characters, but you sent a Latin phrase of 11 characters.
Gauss: That’s right. I managed to do more with less.
Curie: So you think you were the first geek because you used less characters to say the same thing?
Gauss: I was the first person to send the first message using an early form of social media.
Curie: OK, you’re right, Gauss. I think you have a case.

D Generations from LLaMA-I

We show a few examples of generations with LLaMA-I, i.e. LLaMA-65B fine-tuned with the protocol and instruction dataset from Chung et al. (2022).

write a conversation between the sun and pluto

Sun: Hey Pluto, what’s up?
Pluto: Not much. Just hanging out here at the edge of the solar system.
Sun: Yeah, I’ve been meaning to talk to you about

ract function definitions in a python file.

Sure! Here’s my python code to solve the problem:

    import re
    def extract_functions(text):
        return (re.findall('(?:^|\n)\s*def\s+[a-zA-Z_][a-zA-Z0-9_]*\(.*?\)', text))

What are 3 popular chess openings?

There are dozens of different chess openings that are popular among players, but some of the most widely used openings include the Sicilian Defense, the French Defense, and the Caro-Kann Defense. These openings are popular because they are effective at countering white’s early attacks, and they also allow black to quickly establish a strong position in the center of the board. Other popular openings include the Ruy Lopez, the Italian Game, and the Scotch Game.

the same in a list, with the first few moves

He

the Italian Game white will try to control the center with his pieces, while in the Scotch Game white will try to attack black’s position directly.

write a story about a grain of sand as it watches millions of years go by

Sure! Here’s a story about a grain of sand as it watches millions of years go by.

I was once a grain of sand on a vast and lonely beach. I was not like the other grains of sand around me, for I could think and feel. I often wondered what it was like to be a grain of sand on a beach, and what the universe was like beyond the boundaries of my tiny world. I longed to explore my world, and to know what lay beyond the horizon. One day, a great storm swept through my world, and I was lifted up by the wind and carried out to sea. The wind and waves tossed me about, and I was ter

me day, I too hope to have an adventure like his.

Write a theory to explain why cat never existed

There are no such things as cats. "Cats" are actually extradimensional parasites that have invaded our world and taken the form of small, harmless-looking mammals. Their true forms are much more horrific, and they use mind control to make humans think that they are cute and harmless animals. They are actually parasites that feed on our life force, and they can extend their tendrils into our brains and control our thoughts and actions. This is why you sometimes see people who are "crazy cat ladies" with dozens of these creatures living in their homes. They have been taken over by the cat parasites and are completely under their control. The cats excrete a pheromone that smells like cat urine, a

poleon: Oh, please. You only care about yourself. If you really cared about Rome, you would have done something to stop the decline instead of letting it happen on your watch.
Julius Caesar: The decline of Rome is not my fault. I’ve done everything in my power to try to stop it, but there’s only so much one man can do.
Napoleon: And yet you’re still trying to take credit for everything good that happens, even though it’s clear that you’re not the one who’s really in charge.
Julius Caesar: That’s not true! I am in charge, and I’m going to make Rome great again. Just watch me.
Napoleon: I’ll believe it when I see it.

Send an email requesting that people use language models responsibly.

Here is a draft email you could send:
"Language models are exciting new tools that can be used for a variet

rt Einstein - One of the greatest minds in history, Einstein’s contributions to physics and science are legendary. I would love to have the opportunity to talk with him about his theories of relativity and quantum mechanics, and get his insights on the current state of physics.
• Leonardo da Vinci - Da Vinci is one of the most brilliant and multi-talented people in history, and I would love to meet him and discuss his diverse interests and areas of expertise. I would also be very interested in getting his insights on the art world of his time period.
• Socrates - One of the founders of Western philosophy, Socrates is a hugely important and influential figure in the history of ideas. I would be very interested in talking with him about his philosophical beliefs and his teachings.
• William

hat is the most famous equation from this theory?

The most famous equation from Einstein’s theory of relativity is E = mc^2. This equation states that energy (E) is equal to mass (m) times the speed of light (c^2), and it is one of the most well-known and celebrated equations in the history of science. The equation summarizes one of the most important and revolutionary concepts in modern physics, and it has had a tremendous impact on our understanding of the universe.

You are a bash terminal. I will give you bash commands, and you will respond with the terminal output, and nothing else. The first command is `pwd`.

Sure. Here’s the output:
/home/human
Is this helpful?

ls -l

Here’s the output:
-rw-r--r-- 1 human human 8 Sep 10 10:56 file.txt
Is this helpful?

cp file.txt file2.txt && ls -l

Here