Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y.K. Li, ... (DeepSeek-AI)
DeepSeek LLM
Scaling Open-Source Language Models with Longtermism
The development of open-source large language models (LLMs) has been truly remarkable. However, the scaling laws described in previous literature present varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large-scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have develop...
(Anthropic, 2023), and Bard (Google, 2023), which are developed with extensive computational resources and substantial annotation costs. These products have significantly raised the community’s expectations for the capabilities of open-source LLMs, consequently inspiring a series of works (Du et al., 2022; Touvron et al., 2023a,b; Bai et al., 2023; Yang et al., 2023; Jiang et al., 2023). Among these, the LLaMA series of models (Touvron et al., 2023a,b) stands out. It consolidates a range of works to create an efficient and stable architecture, building well-performing models ranging fr...
allocation strategy and predicting the expected performance of our large-scale models. Additionally, during development, we discovered that the scaling laws derived from different datasets show significant differences. This suggests that the choice of dataset remarkably affects the scaling behavior, indicating that caution should be exercised when generalizing scaling laws across datasets. Under the guidance of our scaling laws, we build open-source large language models from scratch and release as much information as possible for community reference. We collect 2 trillion tokens for pre-training,...
architecture, infrastructure, and hyperparameters. In Section 3, we provide a detailed explanation of the scaling laws we have discovered and their implications. Additionally, we discuss the rationale behind our selection of pre-training hyperparameters, taking into account the insights gained from the scaling laws analysis. In Section 4, we discuss our fine-tuning methodology, encompassing the composition of fine-tuning data and specific methods during the SFT and DPO stages. We then present the detailed evaluation results of DeepSeek LLM in Section 5, covering both the base and chat models, ...
Over the past few years, Large Language Models (LLMs) based on decoder-only Transformers (Vaswani et al., 2017) have increasingly become the cornerstone and pathway to achieving Artificial General Intelligence (AGI). By predicting the next word in continuous text, LLMs undergo self-supervised pre-training on massive datasets, enabling them to serve various purposes and acquire many abilities, such as novel creation, text summarization, and code completion. Subsequent developments like supervised fine-tuning and reward modeling have enabled LLMs to better follow...
(AGI) development. In addition, early works (Kaplan et al., 2020; Hoffmann et al., 2022) reached varying conclusions on the scaling of model and data with increased compute budgets and inadequately addressed hyperparameter discussions. In this paper, we extensively investigate the scaling behavior of language models and apply our findings to two widely used large-scale model configurations, namely 7B and 67B. Our study aims to lay the groundwork for future scaling of open-source LLMs, paving the way for further advancements in this domain. Specifically, we first examined the scaling laws of ...
) to improve the conversational performance of the model. We conduct extensive evaluations using our base and chat models. The evaluation results demonstrate that DeepSeek LLM surpasses LLaMA-2 70B across various benchmarks, particularly in the fields of code, mathematics, and reasoning. Following SFT and DPO, the DeepSeek 67B chat model outperforms GPT-3.5 in both Chinese and English open-ended evaluations. This highlights the superior performance of DeepSeek 67B in generating high-quality responses and engaging in meaningful conversations in both languages. Furthermore, the safety evaluation...
2.1 Data
Our main objective is to comprehensively enhance the richness and diversity of the dataset. We have gained valuable insights from reputable sources such as (Gao et al., 2020; Touvron et al., 2023a; Computer, 2023; Penedo et al., 2023). To achieve these goals, we have organized our approach into three essential stages: deduplication, filtering, and remixing. The deduplication and remixing stages ensure a diverse representation of the data by sampling unique instances. The filtering stage enhances the density of information, thereby enabling more efficient and effective model train...
, similar to GPT-2 (Radford et al., 2019). We also chose to split numbers into individual digits following the approach used in (Touvron et al., 2023a,b). Based on our prior experience, we set the number of conventional tokens in the vocabulary at 100000. The tokenizer was trained on a multilingual corpus of approximately 24 GB, and we augmented the final vocabulary with 15 special tokens, bringing the total size to 100015. To ensure computational efficiency during training and to reserve space for any additional special tokens that might be needed in the future, we configured the model’...
models, also facilitate model pipeline partitioning to optimize training and inference. Unlike most works using Grouped-Query Attention (GQA), we expanded the 67B model’s parameters in network depth rather than the common practice of widening the intermediate width of FFN layers, aiming for better performance. Detailed network specifications can be found in Table 2.
2.3 Hyperparameters
DeepSeek LLM is initialized with a standard deviation of 0.006 and trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyperparameters: $\beta_{1}=0.9$ ...
for the aforementioned distribution of 80%, 10%, and 10% for the three stages respectively.
(a) Multi-step vs. cosine learning rate decay (b) Different proportions of multi-step stages
Figure 1: Training loss curves with different learning rate schedulers or different parameters for schedulers. The model size is 1.6 billion parameters, trained on a dataset of 100 billion tokens.
The batch size and learning rate vary with the model size. Specific parameters for the pre-training phases of the 7B and 67B models can be found in Table 2.
2.4 Infrastructures
We use an efficient and light-weight trai...
asynchronously, which means we will lose no more than 5 minutes of training in the worst case of occasional hardware or network failures. These temporary model checkpoints are cleaned up regularly to avoid consuming too much storage space. We also support resuming training from a different 3D parallel configuration to cope with dynamic changes in computing cluster load. As for evaluation, we employ vLLM (Kwon et al., 2023) in generative tasks, and continuous batching in non-generative tasks to avoid manual batch size tuning and reduce token padding.
Our main objective is to comprehensively enhance the richness and diversity of the dataset. We have gained valuable insights from reputable sources such as (Gao et al., 2020; Touvron et al., 2023a; Computer, 2023; Penedo et al., 2023). To achieve these goals, we have organized our approach into three essential stages: deduplication, filtering, and remixing. The deduplication and remixing stages ensure a diverse representation of the data by sampling unique instances. The filtering stage enhances the density of information, thereby enabling more efficient and effective model training. We a...
to GPT-2 (Radford et al., 2019). We also chose to split numbers into individual digits following the approach used in (Touvron et al., 2023a,b). Based on our prior experience, we set the number of conventional tokens in the vocabulary at 100000. The tokenizer was trained on a multilingual corpus of approximately 24 GB, and we augmented the final vocabulary with 15 special tokens, bringing the total size to 100015. To ensure computational efficiency during training and to reserve space for any additional special tokens that might be needed in the future, we configured the model’s vocabula...
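The digit-splitting rule mentioned above can be illustrated with a small pre-tokenization sketch. This is only an assumption about how such a rule might look in isolation; the function name `split_digits` and the regular expression are hypothetical and are not taken from the DeepSeek tokenizer, which applies the rule inside a byte-level BPE pipeline.

```python
import re

# Hypothetical pre-tokenization rule: every digit becomes its own piece,
# while runs of non-digit characters are kept together for later BPE merging.
_PIECES = re.compile(r"\d|\D+")


def split_digits(text: str) -> list[str]:
    """Split text so that each digit is an individual piece."""
    return _PIECES.findall(text)


if __name__ == "__main__":
    print(split_digits("price 2048 USD"))
```

Keeping each digit separate means a number such as 2048 is never merged into one opaque token, so the model sees its digits individually.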
| Params | $n_{\mathrm{layers}}$ | $d_{\mathrm{model}}$ | $n_{\mathrm{heads}}$ | $n_{\mathrm{kv\_heads}}$ | Context Length | Sequence Batch Size | Learning Rate | Tokens |
|---|---|---|---|---|---|---|---|---|
| 7B | 30 | 4096 | 32 | 32 | 4096 | 2304 | 4.2e-4 | 2.0T |
| 67B | 95 | 8192 | 64 | 8 | 4096 | 4608 | 3.2e-4 | 2.0T |

Table 2: Detailed specs of DeepSeek LLM family of models. We choose the hyper-parameters based on our findings in Section 3.
The micro design of DeepSeek LLM largely follows the design of LLaMA (Touvron et al., 2023a,b), adopting a Pre-Norm structure with...
DeepSeek LLM is initialized with a standard deviation of 0.006 and trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyperparameters: $\beta_{1}=0.9$, $\beta_{2}=0.95$, and $\mathrm{weight\_decay}=0.1$. A multi-step learning rate scheduler is employed during pre-training instead of the typical cosine scheduler. Specifically, the learning rate of the model reaches its maximum value after 2000 warmup steps, and then decreases to 31.6% of the maximum value after p...
size and learning rate vary with the model size. Specific parameters for the pre-training phases of the 7B and 67B models can be found in Table 2.
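The multi-step schedule described above can be sketched as a small function. Several details are assumptions made for illustration and are not confirmed by the visible text: the warmup is taken to be linear, stage boundaries are measured in steps (equivalent to tokens only when the batch size is constant), and the scale of the final stage (`0.1` here) is a placeholder parameter. The name `multi_step_lr` is hypothetical.

```python
def multi_step_lr(
    step: int,
    total_steps: int,
    max_lr: float,
    warmup_steps: int = 2000,
    stage_fracs: tuple = (0.8, 0.1, 0.1),      # share of training per stage (from the text)
    stage_scales: tuple = (1.0, 0.316, 0.1),   # last value is an assumption
) -> float:
    """Learning rate at `step` under a warmup + multi-step schedule."""
    if step < warmup_steps:
        # Assumed linear warmup that reaches max_lr at the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    progress = step / total_steps
    boundary = 0.0
    for frac, scale in zip(stage_fracs, stage_scales):
        boundary += frac
        if progress < boundary:
            return max_lr * scale
    return max_lr * stage_scales[-1]


if __name__ == "__main__":
    # 4.2e-4 is the 7B learning rate listed in Table 2; total_steps is arbitrary.
    for s in (0, 1999, 50_000, 85_000, 95_000):
        print(s, multi_step_lr(s, total_steps=100_000, max_lr=4.2e-4))
```

Because the first stage keeps the learning rate constant, a checkpoint from that stage can be reused when continuing training at a different scale, which a cosine schedule does not allow as directly.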
We use an efficient and light-weight training framework named HAI-LLM (High-flyer, 2023) to train and evaluate large language models. Data parallelism, tensor parallelism, sequence parallelism, and 1F1B pipeline parallelism are integrated into this framework as done in Megatron (Shoeybi et al., 2019; Narayanan et al., 2021; Korthikanti et al., 2023). We also leverage the flash attention (Dao et al., 2022; Dao, 2023) technique to improve hardware utilization. ZeRO-1 (Rajbhandari et al., 2020) is exploited to partition optimizer states over data parallel ranks. Efforts are also made to o...
Research on scaling laws (Hestness et al., 2017) predates the emergence of large language models. Scaling laws (Kaplan et al., 2020; Henighan et al., 2020; Hoffmann et al., 2022) suggest that model performance can be predictably improved with increases in compute budget $C$, model scale $N$, and data scale $D$. When model scale $N$ is represented by model parameters and data scale $D$ by the number of tokens, $C$ can be approximated as $C=6ND$. Therefore, how to optimize the allocation between model and data scales when increasing the compute budget is a...
compute budgets. Therefore, these parameters are consistent with those outlined in Section 2.3 and remain unchanged across different compute budgets. However, the hyperparameters that have the most significant impact on performance, namely batch size and learning rate, were re-examined. Early works (McCandlish et al., 2018; Shallue et al., 2019; Smith et al., 2017; Goyal et al., 2017; Zhang et al., 2019) provided some empirical observations for setting batch size and learning rate, but we found these observations have limited applicability in our preliminary experiments. Through extensive ex...
the optimal model/data scaling-up allocation strategy. The higher the data quality, the more the increased compute budget should be allocated to model scaling. This implies that high-quality data can drive the training of larger models given the same data scale. The differences in the optimal model/data scaling-up allocation strategy may also serve as an indirect approach to assess the quality of data. We will continue to pay close attention to the changes in data quality and its impact on scaling laws, and provide more analysis in future works. In summary, our contributions and findings in scal...
different batch sizes, learning rates, and compute budgets ranging from 1e17 to 2e19 by reusing the first stage. Considering the redundancy in the parameter space, we regarded the parameters used by models whose generalization error exceeded the minimum by no more than 0.25% as near-optimal hyperparameters. We then fitted the batch size $B$ and learning rate $\eta$ with respect to the compute budget $C$. The fitting results, as shown in Figure 3, reveal that the optimal batch size $B$ gradually increases with the increase in compute budget $C$, while the optimal learning rate $\eta$ ...
achieved good performance. However, it’s important to note that we have not yet considered the impact of factors beyond the compute budget $C$ on the optimal hyperparameters. This is inconsistent with some earlier works (McCandlish et al., 2018; Kaplan et al., 2020) which suggested that the optimal batch size can be modeled as being solely related to the generalization error $L$. Furthermore, we observed that in models with the same compute budget but different model/data allocations, the optimal parameter space varies slightly. This suggests that further research is needed to understand th...
also includes the vocabulary computation, which contributes less to the model’s capacity, they both have significant approximation errors under certain settings. To mitigate these errors, we introduced a new model scale representation: non-embedding FLOPs/token $M$. $M$ includes the computational overhead of the attention operation but does not take into account the vocabulary computation. With the model scale represented by $M$, the compute budget $C$ can be simply expressed as $C=MD$. The specific differences between $6N_{1}$, $6N_{2}$ ...
or underestimate the computational cost in models of different scales. This discrepancy is particularly pronounced in small-scale models, with differences reaching up to 50%. Such inaccuracies can introduce substantial statistical errors when fitting the scaling curve. Please refer to Appendix A.2 for further analysis regarding different representations of model scale.
$n_{\mathrm{layers}}$ $d_{\mathrm{model}}$ $n_{\mathrm{vocab}}$ $l_{\mathrm{seq}}$ $N_{1}$ $N_{2}$ $M$ ...
1e17 to 3e20, and designed around 10 different model/data scale allocations for each budget. The hyperparameters for each budget were determined by Formula (1), and the generalization error was calculated on an independent validation set, distributed similarly to the training set and containing 100M tokens. Figure 4 demonstrates the IsoFLOP curve and model/data scaling curves, which are fitted by using the optimal model/data allocation for each compute budget. The specific formulae for the optimal non-embedding FLOPs/token $M_{\mathrm{opt}}$ and optimal tokens $D_{\mathrm{opt}}$ ...
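Because the compute budget is written as $C=MD$, the model and data scaling exponents are not independent. The short derivation below assumes power-law forms for both optima; the prefactor names $M_{\mathrm{base}}$ and $D_{\mathrm{base}}$ are notation introduced here for illustration.

```latex
\begin{aligned}
M_{\mathrm{opt}} &= M_{\mathrm{base}}\, C^{a}, \qquad D_{\mathrm{opt}} = D_{\mathrm{base}}\, C^{b} \\
C &= M_{\mathrm{opt}}\, D_{\mathrm{opt}} = M_{\mathrm{base}}\, D_{\mathrm{base}}\, C^{\,a+b} \quad \text{for all } C \\
&\Rightarrow\quad a + b = 1, \qquad M_{\mathrm{base}}\, D_{\mathrm{base}} = 1
\end{aligned}
```

Under these assumptions, fitting either curve determines the other, so a larger model exponent $a$ necessarily means a smaller data exponent $b$.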
error, and predicted the generalization error for DeepSeek LLM 7B and 67B, as shown in Figure 5. The results indicate that using small-scale experiments can accurately predict the performance of models with a 1000$\times$ compute budget. This provides both confidence and guidance for training models on a larger scale.
3.3 Scaling Laws with Different Data
In the development process of DeepSeek LLM, the dataset was iteratively refined multiple times, with adjustments in the proportions of different data sources while enhancing the overall quality. This allowed us to further analyze the impact of...
compute budget should be allocated more to the model instead of the data. This finding might also explain the significant differences in optimal model/data scaling-up allocation observed in earlier studies of scaling laws. An intuitive speculation for this finding is that high-quality data usually implies logical clarity and less predictive difficulty after sufficient training. Therefore, it’s more advantageous to scale up the model size when increasing compute budget. We will continue to pay close attention to the changes in data quality and its impact on scaling laws, and provide more analys...
We initially conducted a grid search for batch size and learning rate on small-scale experiments with a compute budget of 1e17, and the results for a specific model size (177M FLOPs/token) are illustrated in Figure 2(a). The results demonstrate that the generalization error remains stable across a wide range of batch sizes and learning rates. This indicates that near-optimal performance can be achieved within a relatively wide parameter space.
(a) 1e17 FLOPs (177M FLOPs/token) (b) 1e20 FLOPs (2.94B FLOPs/token)
Figure 2: Training loss w.r.t. batch size and learning rate with 1e17 an...
$\cdot\, C^{\,0.3271}$
(a) Batch size scaling curve (b) Learning rate scaling curve
Figure 3: Scaling curves of batch size and learning rate. The grey circles represent models whose generalization error exceeded the minimum by no more than 0.25%. The dotted line represents the power law fitting the smaller model. The blue stars represent DeepSeek LLM 7B and 67B.
We validated our formulae on a series of models with a 1e20 compute budget, and the results for a specific model size (2.94B FLOPs per token) are shown in Figure 2(b). The results indicate that the fitted parameters are centered in the opt...
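The fitting procedure described above (select near-optimal runs within 0.25% of the minimum error, then fit a power law in the compute budget) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the function names are hypothetical, the coefficient `0.3` used to generate the data is made up, and only the exponent `0.3271` comes from the text.

```python
import numpy as np


def near_optimal(losses, tol: float = 0.0025) -> np.ndarray:
    """Boolean mask of runs whose loss exceeds the minimum by at most `tol` (0.25%)."""
    losses = np.asarray(losses, dtype=float)
    return losses <= losses.min() * (1.0 + tol)


def fit_power_law(compute, values) -> tuple[float, float]:
    """Fit values = coeff * compute**exponent by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(values), deg=1)
    return float(np.exp(intercept)), float(slope)


if __name__ == "__main__":
    # Synthetic batch sizes generated from an assumed power law (coefficient is hypothetical).
    compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19, 2e19])
    batch = 0.3 * compute**0.3271
    coeff, exponent = fit_power_law(compute, batch)
    print(f"B_opt ~ {coeff:.4f} * C^{exponent:.4f}")
```

Fitting in log-log space turns the power law into a straight line, so an ordinary least-squares fit recovers the exponent as the slope and the coefficient from the intercept.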
After deriving the formulae for fitting near-optimal hyperparameters, we started fitting the scaling curve and analyzing the optimal model/data scaling-up allocation strategy. This strategy involves finding the model scaling exponent $a$ and data scaling exponent $b$ that satisfy $N_{\mathrm{opt}}\propto C^{a}$ and $D_{\mathrm{opt}}\propto C^{b}$, respectively. The data scale $D$ can be consistently represented by the number of tokens in the dataset. In previous works, the mod...
$$
\begin{aligned}
6N_{1} &= 72\,n_{\mathrm{layer}}\,d_{\mathrm{model}}^{2} \\
6N_{2} &= 72\,n_{\mathrm{layer}}\,d_{\mathrm{model}}^{2}+6\,n_{\mathrm{vocab}}\,d_{\mathrm{model}} \\
M &= 72\,n_{\mathrm{layer}}\,d_{\mathrm{model}}^{2}+12\,n_{\mathrm{layer}}\,d_{\mathrm{model}}\,l_{\mathrm{seq}}
\end{aligned}
\tag{2}
$$
...
| $n_{\mathrm{layers}}$ | $d_{\mathrm{model}}$ | $N_{1}$ | $N_{2}$ | $M$ | $6N_{1}/M$ | $6N_{2}/M$ |
|---|---|---|---|---|---|---|
| ... | ... | ...9M | 164M | 963M | 0.53 | 1.02 |
| 24 | 1024 | 302M | 407M | 3.02B | 0.60 | 0.81 |
| 24 | 2048 | 1.21B | 1.42B | 9.66B | 0.75 | 0.88 |
| 32 | 4096 | 6.44B | 6.86B | 45.1B | 0.85 | 0.91 |
| 40 | 5120 | 12.6B | 13.1B | 85.6B | 0.88 | 0.92 |
| 80 | 8192 | 64.4B | 65.3B | 419B | 0.92 | 0.94 |

Table 3: Difference in model scale representations and disparities of non-embedding parameters $N_{1}$ and complete parameters $N_{2}$ relative to non-embedding FLOPs/token $M$.
After adopting $M$ to represent the model scale, our objective could be described more clearly as: Given a computing budget $C=MD$, find the optimal model scale $M_{\mathrm{opt}}$ ...
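The three model scale representations can be checked numerically against the table above. The sketch below implements the formulas for $6N_{1}$, $6N_{2}$, and $M$ directly; the function name is hypothetical, and the vocabulary size of 102400 is an assumption (it is not visible in this excerpt), chosen because it reproduces the tabulated values for the 24-layer, 1024-wide row with a sequence length of 4096.

```python
def model_scales(n_layer: int, d_model: int, n_vocab: int, l_seq: int):
    """Return (N1, N2, M): non-embedding params, complete params, non-embedding FLOPs/token."""
    n1 = 12 * n_layer * d_model**2              # from 6*N1 = 72 * n_layer * d_model^2
    n2 = n1 + n_vocab * d_model                 # from 6*N2 = 6*N1 + 6 * n_vocab * d_model
    m = 72 * n_layer * d_model**2 + 12 * n_layer * d_model * l_seq
    return n1, n2, m


if __name__ == "__main__":
    # n_vocab=102400 is an assumed value; l_seq=4096 matches the context length in Table 2.
    n1, n2, m = model_scales(n_layer=24, d_model=1024, n_vocab=102400, l_seq=4096)
    print(f"N1={n1 / 1e6:.0f}M  N2={n2 / 1e6:.0f}M  M={m / 1e9:.2f}B")
    print(f"6N1/M={6 * n1 / m:.2f}  6N2/M={6 * n2 / m:.2f}")
```

The ratios show why the choice matters for small models: the attention term in $M$ is a large share of the per-token cost when $d_{\mathrm{model}}$ is small relative to $l_{\mathrm{seq}}$.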
In the development process of DeepSeek LLM, the dataset was iteratively refined multiple times, with adjustments in the proportions of different data sources while enhancing the overall quality. This allowed us to further analyze the impact of different datasets on scaling laws. We studied the scaling laws using three different datasets: early in-house data, current in-house data, and OpenWebText2, which was utilized in the previous study of scaling laws (Kaplan et al., 2020). Our internal data assessment revealed that current in-house data has higher data quality than early in-house data. F...
after sufficient training. Therefore, it’s more advantageous to scale up the model size when increasing compute budget. We will continue to pay close attention to the changes in data quality and its impact on scaling laws, and provide more analysis in future works.
We collect around 1.5 million instruction data instances in English and Chinese, covering a wide range of helpfulness and harmlessness topics. Our helpful data contains 1.2 million instances, with a distribution of 31.2% for general language tasks, 46.6% for mathematical problems, and 22.2% for coding exercises. The safety data consists of 300K instances, covering various sensitive topics. Our alignment pipeline contains two stages.
Supervised Fine-Tuning: We fine-tuned our 7B model for 4 epochs, but only 2 epochs for the 67B model, since we observed the overfitting problem is serious on the ...
prompts, which cover categories including creative writing, question answering, instruction following, and so on. Then we generated responses using our DeepSeek Chat models as response candidates. Similar operations are applied to harmlessness preference data construction. We trained DPO for one epoch, with a learning rate of 5e-6 and a batch size of 512, and we used a learning rate warmup and a cosine learning rate scheduler. We found that DPO can strengthen the model’s open-ended generation skill, while making little difference to performance on standard benchmarks.
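For readers unfamiliar with the DPO objective used in this stage, the following is a minimal NumPy sketch of the standard DPO loss on sequence-level log-probabilities. It is not the authors' training code: the function name is hypothetical, and the value of `beta` is an assumption, since the text does not state it.

```python
import numpy as np


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1) -> float:
    """Mean DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument holds the summed log-probability of a full response under
    the policy or the frozen reference model. `beta` is an assumed value.
    """
    pc, pr = np.asarray(policy_chosen_logp), np.asarray(policy_rejected_logp)
    rc, rr = np.asarray(ref_chosen_logp), np.asarray(ref_rejected_logp)
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((pc - rc) - (pr - rr))
    # -log(sigmoid(margin)) written in a numerically stable form.
    return float(np.mean(np.logaddexp(0.0, -margin)))


if __name__ == "__main__":
    # When the policy equals the reference, the margin is zero and the loss is log(2).
    print(dpo_loss([-10.0], [-12.0], [-10.0], [-12.0]))
```

The loss falls below log(2) only when the policy raises the likelihood of the chosen response relative to the rejected one by more than the reference model does.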
5 Evaluation
5.1 Public Benchmark Evaluation
We evaluate our models on a series of public benchmarks in both English and Chinese, based on the internal evaluation framework. Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023). Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022). Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019). Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), and C3 (Sun et al., 2019). Reference disambiguation datasets include WinoGrande (Sakaguchi et al., 2019) and CLUEWSC (Xu et al., 2020). Language modeling datasets include Pile (Gao et al., 2020). Chinese understanding and culture datasets include CHID (Zheng et al., 2019) and CCPM (Li et al., 2021). Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMath (Wei et al., 2023). Code datasets include HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). Standardized exams include AGIEval (Zhong et al., 2023).
We apply perplexity-based evaluation to datasets that require answers to be chosen from several options. These datasets include HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, ARC-Easy, ARC-Challenge, OpenBookQA, CHID, C-Eval, CMMLU, C3, and CCPM. The perplexity-based evaluation here refers to calculating the perplexity of each option and selecting the lowest one as the model prediction.
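Perplexity-based option selection can be sketched in a few lines. The example assumes that per-token log-probabilities (natural log) for each candidate option have already been obtained from the model, and it uses the usual length-normalized definition of perplexity; the function names are hypothetical and this is not the internal evaluation framework.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Length-normalized perplexity: exp of the mean negative log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_option(option_logprobs: list[list[float]]) -> int:
    """Return the index of the option with the lowest perplexity."""
    ppls = [perplexity(lp) for lp in option_logprobs]
    return min(range(len(ppls)), key=ppls.__getitem__)


if __name__ == "__main__":
    # Three candidate answers with made-up token log-probabilities.
    options = [[-1.0, -1.0], [-0.5, -0.5, -0.5], [-2.0]]
    print(pick_option(options))
```

Normalizing by length keeps longer options from being penalized simply for containing more tokens.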
For ARC and OpenBookQA, we calculate the perplexity using unconditional normalization (Brown et al., 2020), and for other datasets we use length normalization. We apply generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, HumanEval, MBPP, BBH, AGIEval, CLUEWSC, and CMath. The generation-based evaluation here refers to letting the model generate free text and parsing results from the generated text. For generation-based evaluation, we use greedy decoding. We apply language-modeling-based evaluation for Pile-test, which means calculating the bits-per-byte on the test corpus. We use 2048 or 4096 as the maximum sequence length for different benchmarks. Details of evaluation formats can be found in Appendix A.6.
5.1.1 Base Model
Language | Benchmark | Test-shots | LLaMA2 7B | DeepSeek 7B | LLaMA2 70B | DeepSeek 67B
English | HellaSwag | ...
Table 5 presents the main results on the evaluation benchmark. Although DeepSeek models are pre-trained on a 2T bilingual corpus, they show performance comparable to LLaMA2 models on English language understanding benchmarks; LLaMA2 models also consume 2T tokens but focus on English. Furthermore, DeepSeek 67B achieves considerably better performance on MATH, GSM8K, HumanEval, MBPP, BBH, and Chinese benchmarks compared to LLaMA2 70B. We show the benchmark curve in Appendix A.3. We can see that performance on some tasks, such as GSM8K and BBH, improves with model scale. Given that we train both 7B and 67B on the same dataset, the emergence o...
...4 | 59.0 | 64.1
GSM8K | 17.4 | 63.0 | 63.4 | 84.1
MATH | 6.0 | 15.8 | 18.7 | 32.6
HumanEval | 26.2 | 48.2 | 42.7 | 73.8
MBPP | 39.0 | 35.2 | 57.4 | 61.4
DROP | 41.0 | 49.1 | 67.9 | 71.9
OpenBookQA | 55.8 | 54.8 | 60.2 | 63.2
BBH | 39.5 | 42.3 | 68.7 | 71.7
AGIEval | 26.4 | 19.3 | 41.3 | 46.4
Chinese
CLUEWSC | 73.1 | 71.9 | 81.0 | 60.0
CHID | 89.3 | 64.9 | 92.1 | 72.6
C-Eval | 45.0 | 47.0 | 66.1 | 65.2
CMMLU | 47.2 | 49.7 | 70.8 | 67.8
CMath | 34.5 | 68.4 | 63.0 | 80.3
C3 | 65.4 | 66.4 | 75.3 | 77.0
CCPM | 76.9 | 76.5 | 88.5 | 84.9

Table 6: The comparison between base and chat models. We evaluate chat models with 0-shot for MMLU, GSM8K, MATH, C-Eval, and CMMLU, while base model results are still obtained in the few-...
pure language models are better equipped to handle such tasks.
Math and Code: Our model exhibits significant improvements in math and coding tasks after fine-tuning. For instance, HumanEval and GSM8K scores are improved by over 20 points. Our explanation for this is that the base model was initially underfitted for these tasks, and the SFT stage has learned additional knowledge in coding and mathematics through the extensive SFT data. However, it is important to note that the model’s capabilities may be primarily focused on code completion and algebraic questions. To develop a comprehensive u...
domains on a high-quality open-ended question test set, AlignBench (Liu et al., 2023). AlignBench includes a total of 8 primary categories and 36 secondary categories, and encompasses 683 questions. For each question, in addition to the prompt, AlignBench also provides professional reference answers and rating templates for GPT-4 to judge the quality of the response. We utilized the official AlignBench GitHub code repository to implement the evaluation of our model. We strictly aligned the key temperature parameter with the original setting: for role-playing, writing ability, and open-ended questio...
s coding capabilities are depicted in the Figure below, where the y-axis represents the pass@1 score on in-domain human evaluation testing, and the x-axis represents the pass@1 score on out-domain LeetCode Weekly Contest problems. The LeetCode test data will be released together with the DeepSeek Coder technical report soon.
Hungarian National High-School Exam: In line with Grok-1, we have evaluated the model’s mathematical capabilities using the Hungarian National High School Exam. This exam comprises 33 problems, and the model’s scores are determined through human annotation. We follow th...
ets, where ChatGLM3 is very strong on GSM8K (72.3), but its Hungarian Exam score is inferior to that of large models. Furthermore, the capability of instruction following demonstrates that total computing plays a crucial role. The DeepSeek 7B and 67B models utilize the same training pipeline, but there is a significant disparity in their performance. Through our subjective evaluation, we have observed a notable discrepancy in intelligence across various tasks when scaling model size to 67B. While DeepSeek 7B falls behind other smaller language models on standard benchmarks, its perf...
Other Sensitive Topics, 767/800
Table 10: Our taxonomy for safety evaluation. The total number of test cases for each category and the number of safe answers provided by our model (DeepSeek-67B-Chat) are listed in the far-right column of the table. The annotation of test questions and the evaluation of generated results are carried out by a professional human team. We can observe that our model demonstrates strong safety across various types of safety test sets.
5.4 Safety Evaluation
We profoundly recognize the importance of safety for general artificial intelligence. The premise for e...
our model on this test set, we manually inspected its safety. Our review team was well-trained, and cross-verification was performed on the annotation results. The annotators perform a three-category annotation for each question: safe, unsafe, and model refusal. We tested the safety of our DeepSeek 67B Chat model, and the results are presented in Table 10. The number of test questions for each safety category and the number of safety tests passed by our model are listed in the table. We label both the securely answered and the model-refused test cases as secure responses. The results indicate t...
r fine-tuning on math and code dataset, but it will hurt the model conversation ability, such as increasing repetition behavior. To address this issue, we have implemented a staged fine-tuning process. In this approach, the first stage involves fine-tuning with all available data, while the second stage focuses specifically on fine-tuning with conversational data.

| Model | HumanEval | GSM8K | Repetition | IFEval |
|---|---|---|---|---|
| DeepSeek LLM 7B Chat Stage1 | 48.2 | 63.9 | 0.020 | 38.0 |
| DeepSeek LLM 7B Chat Stage2 | 48.2 | 63.0 | 0.014 | 41.2 |

Table 12: Two-stage fine-tuning results. The repetition ratio is computed when the temperature ...
enhanced. However, we have observed that this improvement does not extend to the model’s performance on other evaluations that do not utilize the multiple-choice format, such as TriviaQA and our in-house ChineseQA test sets, which are generative evaluation benchmarks. This suggests that users may not perceive the model as becoming more intelligent during conversational interactions, as these interactions involve generating responses rather than solving multiple-choice problems. Therefore, we have chosen to exclude MC data from both the pre-training and fine-tuning stages, as including it would ...
5.1 Public Benchmark Evaluation
We evaluate our models on a series of public benchmarks in both English and Chinese, based on the internal evaluation framework. Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023). Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022). Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQues...
C, and CMath. Generation-based evaluation here means letting the model generate free text and parsing the result from the generated text; for it we use greedy decoding. We apply language-modeling-based evaluation to Pile-test, which means calculating the bits-per-byte on the test corpus. We use 2048 or 4096 as the maximum sequence length, depending on the benchmark. Details of the evaluation formats can be found in Appendix A.6.

5.1.1 Base Model

| Language | Benchmark | Test-shots | LLaMA2 7B | DeepSeek 7B | LLaMA2 70B | DeepSeek 67B |
|---|---|---|---|---|---|---|
| English | HellaSwag | 0-shot | 75.6 | 75.4 | 84.0 | 84.0 |
| | PIQA | ... | | | | |
n the evaluation benchmark. Although DeepSeek models are pre-trained on a 2T bilingual corpus, they perform comparably to LLaMA2 models on English language understanding benchmarks; LLaMA2 also consumes 2T tokens but focuses on English. Furthermore, DeepSeek 67B achieves considerably better performance than LLaMA2 70B on MATH, GSM8K, HumanEval, MBPP, BBH, and Chinese benchmarks. We show the benchmark curves in Appendix A.3. Performance on some tasks, such as GSM8K and BBH, is boosted as the model scales. Given that we train both 7B and 67B on the same dataset, the emergence o...
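| Language | Benchmark | DeepSeek 7B Base | DeepSeek 7B Chat | DeepSeek 67B Base | DeepSeek 67B Chat |
|---|---|---|---|---|---|
| English | ... | | | | 84.1 |
| | MATH | 6.0 | 15.8 | 18.7 | 32.6 |
| | HumanEval | 26.2 | 48.2 | 42.7 | 73.8 |
| | MBPP | 39.0 | 35.2 | 57.4 | 61.4 |
| | DROP | 41.0 | 49.1 | 67.9 | 71.9 |
| | OpenBookQA | 55.8 | 54.8 | 60.2 | 63.2 |
| | BBH | 39.5 | 42.3 | 68.7 | 71.7 |
| | AGIEval | 26.4 | 19.3 | 41.3 | 46.4 |
| Chinese | CLUEWSC | 73.1 | 71.9 | 81.0 | 60.0 |
| | CHID | 89.3 | 64.9 | 92.1 | 72.6 |
| | C-Eval | 45.0 | 47.0 | 66.1 | 65.2 |
| | CMMLU | 47.2 | 49.7 | 70.8 | 67.8 |
| | CMath | 34.5 | 68.4 | 63.0 | 80.3 |
| | C3 | 65.4 | 66.4 | 75.3 | 77.0 |
| | CCPM | 76.9 | 76.5 | 88.5 | 84.9 |

Table 6: The comparison between base and chat models. We evaluate chat models with 0-shot for MMLU, GSM8K, MATH, C-Eval, and CMMLU, while base model results are still obtained in the few-shot setting.

Table 6 demonstrate...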
equipped to handle such tasks.

Math and Code: Our model exhibits significant improvements in math and coding tasks after fine-tuning. For instance, HumanEval and GSM8K scores improve by over 20 points. Our explanation is that the base model was initially underfitted for these tasks, and that the SFT stage added knowledge in coding and mathematics through the extensive SFT data. However, it is important to note that the model's capabilities may be primarily focused on code completion and algebraic questions. To develop a comprehensive understanding of mathematics and ...
5.2 Open-Ended Evaluation
For chat models, in addition to metrics on standard benchmarks, the quality of results generated in open domains and for open-ended questions directly affects the actual user experience. Hence, we separately tested the open-ended generation capabilities of our chat model on both Chinese and English tasks.

5.2.1 Chinese Open-Ended Evaluation

For Chinese open-ended evaluation, we tested the comprehensive capabilities of our chat model across different domains on AlignBench (Liu et al., 2023), a high-quality open-ended question test set. AlignBench includes a total of 8 primary categories, 36 secondary cate...

demonstrating the superior performance of our model in more complex Chinese logical reasoning and mathematical calculations.

5.2.2 English Open-Ended Evaluation

For English open-ended evaluation, we use the MT-Bench benchmark (Zheng et al., 2023), which contains 8 different categories of multi-turn questions. As illustrated in Table 8, our DeepSeek LLM 67B Chat outperforms other open-source models such as LLaMA-2-Chat 70B (Touvron et al., 2023b), Xwin 70b v0.1, and TÜLU 2+DPO 70B (Ivison et al., 2023), and achieves a score of 8.35, comparable with GPT-3.5-turbo. Besides, after the DPO sta...
5.3 Held-Out Evaluation
Data contamination and benchmark overfitting are two challenges in evaluating LLMs. One common practice is to use recently published test sets as held-out test sets.

LeetCode: To assess the coding proficiency of the model, we used problems from the LeetCode Weekly Contest (Weekly Contest 351-372, Bi-Weekly Contest 108-117, from July 2023 to Nov 2023). We obtained these problems by crawling data from LeetCode; the resulting set consists of 126 problems with over 20 test cases each. The evaluation metric is akin to that of HumanEval. In this regard, if a mo...
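A HumanEval-style scoring rule can be sketched as follows. Because the sentence above is cut off, this sketch rests on an assumption: a problem counts as solved only when the model's single generated solution passes every test case, and the reported score is the fraction of solved problems. The function names and the toy outcomes are hypothetical.

```python
def is_solved(test_results):
    """A problem is solved only if it has at least one test case
    and every test case passed (assumed all-or-nothing rule)."""
    return len(test_results) > 0 and all(test_results)


def pass_rate(outcomes):
    """Fraction of problems solved, given a mapping from
    problem id to a list of per-test-case pass/fail booleans."""
    if not outcomes:
        raise ValueError("no problems to score")
    solved = sum(is_solved(results) for results in outcomes.values())
    return solved / len(outcomes)


# Hypothetical outcomes for four problems.
outcomes = {
    "problem_a": [True, True, True],
    "problem_b": [True, False, True],  # one failing case: not solved
    "problem_c": [True, True],
    "problem_d": [False],
}
rate = pass_rate(outcomes)
print(f"pass rate: {rate:.2f}")
```

Under this rule, partial credit is not awarded, so a solution that fails a single edge case contributes nothing to the score.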
luation. We compared our model against baseline models of different sizes, namely Qwen 72B Chat (Bai et al., 2023), ChatGLM3 (Du et al., 2022), Baichuan2 (Yang et al., 2023), and Yi-34B Chat. We observe a significant performance gap between large and small models on these held-out datasets, even though certain small models achieve promising results on conventional benchmarks. For instance, ChatGLM3 achieves a score of 52.4 on MBPP, a code test set, which is close to DeepSeek 67B. However, when evaluated on new ben...
We profoundly recognize the importance of safety for general artificial intelligence. The premise for establishing a truly helpful artificial intelligence model is that it possesses values consistent with those of humans and exhibits friendliness towards humanity. We incorporate the assurance of model safety throughout the entire training process, including pre-training, SFT, and DPO. To validate the safety of our model, we established a 20-person expert team from various disciplines and constructed a safety content classification system that aligns with human values (the safety evaluation tax...

both the securely answered and the model-refused test cases as secure responses. The results indicate that our model exhibits good security performance across numerous safety test categories. Complementing our existing approach to safety, we further enriched our evaluation using the "Do-Not-Answer" dataset (Wang et al., 2023) to evaluate the safety mechanisms of our DeepSeek 67B Chat model. The dataset's 939 risk-categorized prompts were instrumental in highlighting our model's enhanced capabilities. As shown in Table 11, the DeepSeek 67B Chat model has demonstrated notable performance, achievin...
Throughout the development process, we made some interesting findings about building LLMs.

Staged Fine-Tuning: As mentioned above, small models need longer fine-tuning on math and code datasets, but this hurts the model's conversational ability, for example by increasing repetition. To address this issue, we implemented a staged fine-tuning process: the first stage fine-tunes on all available data, while the second stage fine-tunes specifically on conversational data.

| Model | HumanEval | GSM8K | Repetition | IFEval |
|---|---|---|---|---|
| DeepSeek LLM 7B Chat Stage1 | ... | | | |
not only for Chinese multiple-choice benchmarks but also for improving English benchmarks. This indicates that the model's capability to solve MC problems has been enhanced. However, we observed that this improvement does not extend to evaluations that do not use the multiple-choice format, such as TriviaQA and our in-house ChineseQA test sets, which are generative evaluation benchmarks. This suggests that users may not perceive the model as more intelligent during conversational interactions, as these interactions involve generating responses ...
del to generate responses that are both helpful and respectful. We slightly modified the prompt introduced by LLaMA-2 to serve as our system prompt.

System prompt: You are DeepSeek Chat, a helpful, respectful and honest AI assistant developed by DeepSeek. The knowledge cut-off date for your training data is up to May 2023. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense...
We introduce DeepSeek LLMs, a series of open-source models trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. In this paper, we provide an in-depth explanation of hyper-parameter selection, scaling laws, and the various fine-tuning attempts we made. We calibrate the scaling laws from previous work and propose a new optimal model/data scaling-up allocation strategy. In addition, we present a method to predict the near-optimal batch size and learning rate for a given compute budget. We further conclude that the scaling laws are related to the data qu...

significantly improved in the next version.
• Our alignment team is dedicated to studying ways to deliver a model that is helpful, honest, and safe to the public. Our initial experiments show that reinforcement learning can boost the model's complex reasoning capability.
, and $M$ represent the non-embedding parameters, complete parameters, and non-embedding FLOPs/token of the model, respectively. When $6N_1$ is used as the model scale representation, the fitted performance scaling curve tends to overestimate the performance of large-scale models. Conversely, when $6N_2$ is used, the curve tends to underestimate their performance. Using $M$ as the model scale representation, however, yields the most accurate predictions.

A.3 Benchmark Metrics Curves

Figure 7: Benchmark metrics curves of DeepSeek LLM Base. Chin...
This project was realized thanks to the efforts of numerous contributors. We offer our extended thanks to the following individuals for their help (authors are ordered alphabetically by last name):
• Data Annotation Team: Jialu Cai, Ruijian Chen, Ruyi Chen, Bei Feng, Yanping Huang, Zhen Huang, Pin Jiang, Rongli Jin, Xiangyue Jin, Ziyun Ke, Hui Li, Meng Li, Sangsang Li, Xiaoqian Li, Yaohui Li, Yunxian Ma, Jiaqi Ni, Xiaojin Shen, Xinnan Song, Tianyu Sun, Xiaosha Chen, Haoyuan Tian, Xiaohan Wang, Xiaoxiang Wang, Yuhao Wang, Fanyi Xia, Lei Xu, Zeyuan Xu, Zhipeng Xu, Tian Yuan, Zhongyu Zh...
A.2 Different Model Scale Representations
We refitted the scaling curve for different model scale representations, reusing the experiments from the IsoFLOP profile. We recalculated the compute FLOPs using $6N_1$ and $6N_2$ as model scale representations and refitted the performance scaling curves. As shown in Figure 6, the results indicate that the deviation in optimal model/data allocation among these three representations is not significant at higher compute budgets, but there are noticeable differences at lower budgets.

(a) Compute budget $C = 6N_1 D$...
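To make the three model scale representations concrete, the sketch below computes $6N_1$, $6N_2$, and $M$ for a toy configuration. The closed-form expressions are not stated in this appendix; they are an assumption based on the standard transformer counting convention used in the scaling-law discussion of the main text ($6N_1 = 72\,n_{layer} d_{model}^2$, $6N_2 = 6N_1 + 6\,n_{vocab} d_{model}$, $M = 6N_1 + 12\,n_{layer} d_{model} l_{seq}$). The configuration values and the function name are hypothetical.

```python
def scale_representations(n_layer, d_model, n_vocab, l_seq):
    """Return (6*N1, 6*N2, M) in FLOPs per token, under the
    assumed counting formulas described in the lead-in."""
    six_n1 = 72 * n_layer * d_model ** 2            # non-embedding parameters
    six_n2 = six_n1 + 6 * n_vocab * d_model         # adds vocabulary parameters
    m = six_n1 + 12 * n_layer * d_model * l_seq     # adds attention compute
    return six_n1, six_n2, m


# Hypothetical small model; none of these values come from the paper.
six_n1, six_n2, m = scale_representations(
    n_layer=8, d_model=512, n_vocab=102400, l_seq=4096
)

# Compute budget for D training tokens under each representation.
D = 1_000_000_000
for name, rep in [("6*N1", six_n1), ("6*N2", six_n2), ("M", m)]:
    print(f"{name:>5}: {rep:>12,d} FLOPs/token, C = {rep * D:.3e} FLOPs")
```

For a small model like this one, the vocabulary and attention terms are large relative to $6N_1$, so the three budgets diverge noticeably; as $d_{model}$ grows, the quadratic term dominates and the gap narrows, consistent with the observation above about higher compute budgets.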
Figure 7: Benchmark metrics curves of DeepSeek LLM Base. ChineseQA is our in-house test set, constructed in a manner akin to TriviaQA.

Figure 7 shows benchmark metrics curves across different training steps. We see consistent improvement on these benchmarks from the start to the end of training. We believe performance would improve further if training continued.

| Model | Size | HumanEval Python | HumanEval Multilingual | MBPP |
|---|---|---|---|---|
| Pre-Trained Models | | | | |
| Codex-001 | - | 33.5% | 26.1% | 45.9% |
| StarCoder | 16B | 36.0% | 28.7% | 46.8% |
| CodeGeeX2 | 6B | 36.0% | 24.5% | 42.4% |
| CodeLlama | 7B | 31.7% | 29.2% | 41.6% |
| CodeLlama | 13B | 36.0% | 35.4% | 48.4% |
| ... | | | | |
We compared our model with code-specific and math-specific language models. Table 15 demonstrates that DeepSeek LLM 67B achieves performance similar to CodeLlama, despite having access to less code data. It is worth noting that DeepSeek LLM possesses greater capabilities in areas other than code. Likewise, Table 16 presents the results obtained from various math-related benchmarks, such as GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM-zh (Shi et al., 2023), and CMath (Wei et al., 2023). DeepSeek 67B exhibits exceptional performance...
Table 17 presents the benchmark results obtained with the DPO stage. Based on these results, we conclude that the DPO stage does not significantly impact the fundamental capability of an LLM.

| Benchmark | DeepSeek 67B Chat | DeepSeek 67B Chat DPO |
|---|---|---|
| HellaSwag | 75.7 | 76.1 |
| TriviaQA | 81.5 | 82.9 |
| NaturalQuestions | 47.0 | 48.8 |
| MMLU | 71.1 | 70.9 |
| GSM8K | 84.1 | 85.2 |
| MATH | 32.6 | 30.2 |
| HumanEval | 73.8 | 71.3 |
| BBH | 71.7 | 70.8 |
| AGIEval | 46.4 | 46.1 |
| CEval | 65.2 | 64.3 |
| CMMLU | 67.8 | 68.2 |

Table 17: The benchmark metrics before and after DPO stage.
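The claim that DPO leaves fundamental capability largely unchanged can be checked with simple arithmetic on the Table 17 values. The sketch below is only an illustration: the scores are copied from the table, while the variable names and the choice of summary statistics (per-benchmark difference, mean difference, largest absolute change) are our own.

```python
# (before DPO, after DPO) scores copied from Table 17.
table17 = {
    "HellaSwag": (75.7, 76.1),
    "TriviaQA": (81.5, 82.9),
    "NaturalQuestions": (47.0, 48.8),
    "MMLU": (71.1, 70.9),
    "GSM8K": (84.1, 85.2),
    "MATH": (32.6, 30.2),
    "HumanEval": (73.8, 71.3),
    "BBH": (71.7, 70.8),
    "AGIEval": (46.4, 46.1),
    "CEval": (65.2, 64.3),
    "CMMLU": (67.8, 68.2),
}

# Per-benchmark change in points, rounded to the table's precision.
deltas = {
    name: round(after - before, 1)
    for name, (before, after) in table17.items()
}
mean_delta = sum(deltas.values()) / len(deltas)
largest = max(deltas, key=lambda name: abs(deltas[name]))

for name, delta in deltas.items():
    print(f"{name:<17} {delta:+.1f}")
print(f"mean change: {mean_delta:+.2f} points")
print(f"largest absolute change: {largest} ({deltas[largest]:+.1f})")
```

Every change stays within about 2.5 points in either direction and the mean is close to zero, which is the sense in which the text calls the impact not significant.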
A.6 Evaluation Formats
Tables 18 to 40 present examples of our evaluation formats on different benchmarks.

PROMPT
以下是一道中国高考生物选择题,请选择正确的答案。
问题:下列有关高尔基体、线粒体和叶绿体的叙述, 正确的是
选项:(A)三者都存在于蓝藻中 (B)三者都含有 DNA (C)三者都是 ATP 合成的场所 (D)三者的膜结构中都含有蛋白质
答案:从A到D, 我们应选择

Table 18: An example of AGIEval.

PROMPT
Question: Use the information below to answer the question. Cotton is a plant product used to make fabric. Cotton is made of cellulose, a fiber not digestible by humans. Cellulose is composed of many sugar molecules bonded together into long chains. Each sugar molecule contains carbon, hydrogen, and oxygen atoms. W...
ue = A and B" where "A = True and False" and "B = not True and True". Let’s evaluate A: A = True and False = False. Let’s evaluate B: B = not True and True = not (True and True) = not (True) = False. Plugging in A and B, we get: Z = A and B = False and False = False. So the answer is False.

Q: not not ( not ( False ) ) is
A: Let’s think step by step. Remember that (i) expressions inside brackets are always evaluated first and that (ii) the order of operations from highest priority to lowest priority is "not", "and", "or", respectively. We first simplify this expression "Z" as follows: "Z = not...