DeepSeek LLM
Scaling Open-Source Language Models with Longtermism
DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y.K. Li, Wenfeng Liang, Fangyun Lin, A.X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyi Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R.X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou
Abstract
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling laws described in previous literature reach varying conclusions, which casts a cloud of doubt over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large-scale models in two prevalent open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on the DeepSeek LLM base models, resulting in the DeepSeek Chat models. Evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on a wide range of benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat outperforms GPT-3.5.
1 Introduction
Over the past few years, large language models (LLMs) based on decoder-only Transformers have increasingly become the cornerstone of, and pathway towards, artificial general intelligence (AGI). By predicting the next word in continuous text, LLMs undergo self-supervised pre-training on massive datasets, enabling them to serve various purposes and possess many abilities, such as novel writing, text summarization, code completion, and more. Subsequent developments such as supervised fine-tuning and reward modeling have enabled LLMs to better follow user intentions and instructions, endowing them with richer conversational abilities and rapidly expanding their influence. This wave was ignited by closed-source products such as ChatGPT (OpenAI, 2022), Claude (Anthropic, 2023),
and Bard (Google, 2023), which were developed with extensive computational resources and substantial annotation costs. These products have significantly raised the community's expectations for the capabilities of open-source LLMs, inspiring a series of works (Du et al., 2022; Touvron et al., 2023a,b; Bai et al., 2023; Yang et al., 2023; Jiang et al., 2023). Among these, the LLaMA series models (Touvron et al., 2023a,b) stand out. They consolidate a range of research results into an efficient and stable architecture, producing well-performing models ranging from 7B to 70B parameters. As a result, the LLaMA series has become the benchmark for architecture and performance among open-source models. Following LLaMA, the open-source community has mainly focused on training fixed-size (7B, 13B, 34B, and 70B) high-quality models, often neglecting research into LLM scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). Such research is of utmost importance, however, considering that current open-source models are still at an early stage of AGI development. Moreover, early studies (Kaplan et al., 2020; Hoffmann et al., 2022) reached varying conclusions on how to scale model and data as the compute budget increases, and did not adequately address hyperparameter choices. In this paper, we extensively investigate the scaling behavior of language models and apply our findings to two widely used model configurations, 7B and 67B. Our study aims to lay the groundwork for future scaling of open-source LLMs, paving the way for further advancements in this domain. Specifically, we first examine the scaling laws of batch size and learning rate and find trends in how they vary with scale. Building on this, we conduct a comprehensive study of the scaling laws of data and model scale, successfully revealing the optimal model/data scaling-up allocation strategy
and predicting the expected performance of our large-scale models. Additionally, during development we discovered that the scaling laws derived from different datasets differ significantly. This suggests that the choice of dataset remarkably affects scaling behavior, and that caution should be exercised when generalizing scaling laws across datasets. Under the guidance of our scaling laws, we build open-source large language models from scratch and release as much information as possible for community reference. We collect 2 trillion tokens for pre-training, primarily in Chinese and English. At the model level, we generally follow the architecture of LLaMA, but replace the cosine learning rate scheduler with a multi-step learning rate scheduler, which maintains performance while facilitating continual training. We collect over 1 million instances from diverse sources for supervised fine-tuning (SFT) (Ouyang et al., 2022), and this paper shares our experience with different SFT strategies and our findings from data ablation studies. We additionally make use of direct preference optimization (DPO) (Rafailov et al., 2023) to improve the conversational performance of the model. We conduct extensive evaluations with our base and chat models. The results demonstrate that DeepSeek LLM surpasses LLaMA-2 70B across various benchmarks, especially in the fields of code, mathematics, and reasoning. Following SFT and DPO, the DeepSeek 67B chat model outperforms GPT-3.5 in both Chinese and English open-ended evaluations, highlighting its superior performance in generating high-quality responses and conducting meaningful bilingual conversations. Furthermore, the safety evaluation indicates that DeepSeek 67B Chat can provide harmless responses in practice.
In the remainder of this paper, we first introduce the pre-training setup of DeepSeek LLM in Section 2, including the data composition, model architecture, infrastructure, and hyperparameters. In Section 3, we explain in detail the scaling laws we have discovered and their implications, and discuss the rationale behind our choice of pre-training hyperparameters in light of the scaling-law analysis. Section 4 discusses our fine-tuning methodology, covering the composition of the fine-tuning data and the specific methods used in the SFT and DPO stages. We then present detailed evaluation results for DeepSeek LLM in Section 5, covering both the base and chat models as well as open-ended and safety evaluations. Finally, Section 6 discusses the current limitations and future directions of DeepSeek LLM.
2 Pre-Training
2.1 Data
Our main objective is to comprehensively enhance the richness and diversity of the dataset. We have gained valuable insights from reputable sources such as Gao et al. (2020), Touvron et al. (2023a), Computer (2023), and Penedo et al. (2023). To achieve these goals, we organize our approach into three essential stages: deduplication, filtering, and remixing. The deduplication and remixing stages ensure a diverse representation of the data by sampling unique instances, while the filtering stage enhances the density of information, enabling more efficient and effective model training. We adopt an aggressive deduplication strategy that expands the scope of deduplication. Our analysis shows that deduplicating the entire Common Crawl corpus removes more duplicate instances than deduplicating within a single dump. Table 1 shows that deduplicating across 91 dumps eliminates more than four times as many duplicate documents as the single-dump approach.
Dumps Used               1     2     6     12    16    22    41    91
Deduplication Rate (%)   22.2  46.7  55.7  69.9  75.7  76.3  81.6  89.8
Table 1: Deduplication ratios for various Common Crawl dumps.
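The paper does not spell out the exact fingerprinting algorithm behind these numbers; the sketch below is illustrative only, using a plain normalized-text hash in place of real near-duplicate detection. It shows why sharing a single seen-set across dumps, rather than deduplicating each dump in isolation, removes the cross-snapshot duplicates that dominate Table 1.

```python
import hashlib

def doc_key(text: str) -> str:
    # Stand-in fingerprint: hash of whitespace-normalized, lower-cased text.
    return hashlib.md5(" ".join(text.lower().split()).encode("utf-8")).hexdigest()

def dedup_across_dumps(dumps):
    """dumps: iterable of (dump_id, documents). One `seen` set is shared by all
    dumps, so a page that recurs in later Common Crawl snapshots is dropped."""
    seen, kept = set(), []
    for dump_id, documents in dumps:
        for doc in documents:
            key = doc_key(doc)
            if key not in seen:
                seen.add(key)
                kept.append((dump_id, doc))
    return kept

# Toy usage: the second dump re-crawls one page, which is removed.
print(len(dedup_across_dumps([("dump-1", ["a page", "another page"]),
                              ("dump-2", ["a page", "a new page"])])))  # 3
```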
In the filtering stage, we focus on developing robust criteria for assessing document quality. This involves detailed analyses that combine linguistic and semantic evaluations, examining data quality from both individual and global perspectives. In the remixing stage, we adjust our approach to address data imbalances, focusing on increasing the presence of underrepresented domains. This adjustment aims at a more balanced and inclusive dataset, ensuring that diverse perspectives and information are adequately represented.
For our tokenizer, we implement the byte-level byte-pair encoding (BBPE) algorithm based on the tokenizers library (Huggingface Team, 2019). Pre-tokenization is employed to prevent the merging of tokens from different character categories, such as newlines, punctuation, and CJK (Chinese, Japanese, Korean) symbols,
similar to the approach used in GPT-2 (Radford et al., 2019). We also choose to split numbers into individual digits, following Touvron et al. (2023a,b). Based on our prior experience, we set the number of conventional tokens in the vocabulary to 100000. The tokenizer was trained on a multilingual corpus of approximately 24 GB, and we augmented the final vocabulary with 15 special tokens, bringing the total size to 100015. To ensure computational efficiency during training and to reserve space for any additional special tokens that might be needed in the future, we configured the model's vocabulary size as 102400.
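As a rough illustration of this configuration (not the actual training script, which is not released), the tokenizers library can be set up with byte-level BPE, digit splitting, and a 100000-token vocabulary roughly as follows; the special-token names and the toy corpus are placeholders:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),     # split "2023" into "2" "0" "2" "3"
    pre_tokenizers.ByteLevel(add_prefix_space=False),  # byte-level fallback for any character
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=100_000,                                 # conventional tokens
    special_tokens=["<pad>", "<bos>", "<eos>"],         # placeholders for the 15 special tokens
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# The real tokenizer is trained on ~24 GB of multilingual text; a toy iterator is used here.
tokenizer.train_from_iterator(["Hello world 2023", "你好,世界"], trainer)
```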
2.2 Architecture

Params  n_layers  d_model  n_heads  n_kv_heads  Context Length  Sequence Batch Size  Learning Rate  Tokens
7B      30        4096     32       32          4096            2304                 4.2e-4         2.0T
67B     95        8192     64       8           4096            4608                 3.2e-4         2.0T
Table 2: Detailed specifications of the DeepSeek LLM family of models. We choose the hyperparameters based on our findings in Section 3.

The micro design of DeepSeek LLM largely follows that of LLaMA (Touvron et al., 2023a,b), adopting a Pre-Norm structure with the RMSNorm (Zhang and Sennrich, 2019) function and using SwiGLU (Shazeer, 2020) as the activation function of the feed-forward network (FFN), with an intermediate dimension of 8/3 d_model. It also incorporates rotary embeddings (Su et al., 2024) for positional encoding. To optimize inference cost, the 67B model uses grouped-query attention (GQA) (Ainslie et al., 2023) instead of traditional multi-head attention (MHA).
In terms of macro design, however, DeepSeek LLM differs slightly. Specifically, DeepSeek LLM 7B is a 30-layer network, while DeepSeek LLM 67B has 95 layers. These layer adjustments, while maintaining parameter consistency with other open-source models,
also facilitate model pipeline partitioning to optimize training and inference. Unlike most works using GQA, we expanded the 67B model's parameters in network depth rather than following the common practice of widening the intermediate width of the FFN layers, aiming for better performance. Detailed network specifications can be found in Table 2.
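The following sketch (an illustration of the GQA mechanism with the head counts from Table 2, not the released model code) shows how the 67B configuration shares each key/value head across a group of query heads, shrinking the KV cache by a factor of n_heads / n_kv_heads = 8 at inference time:

```python
import torch

def grouped_query_attention(q, k, v):
    """q: [batch, n_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim].
    Projections, RoPE, and causal masking are omitted for brevity."""
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # each KV head serves `group` query heads
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# 67B settings from Table 2: 64 query heads, 8 KV heads, head_dim = 8192 / 64 = 128.
q = torch.randn(1, 64, 16, 128)
k = torch.randn(1, 8, 16, 128)
v = torch.randn(1, 8, 16, 128)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 64, 16, 128])
```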
2.3 Hyperparameters
DeepSeek LLM is initialized with a standard deviation of 0.006 and trained using the AdamW optimizer (Loshchilov and Hutter, 2017) with the following hyperparameters: β1 = 0.9, β2 = 0.95, and weight_decay = 0.1. A multi-step learning rate scheduler is employed during pre-training instead of the typical cosine scheduler. Specifically, the learning rate reaches its maximum value after 2000 warmup steps, then decreases to 31.6% of the maximum value after processing 80% of the training tokens, and further to 10% of the maximum value after 90% of the tokens. Gradient clipping is set to 1.0 during training.
Based on our empirical findings, although the loss curves decrease differently during training, the final performance with the multi-step learning rate scheduler is essentially consistent with that of the cosine scheduler, as shown in Figure 1(a). When adjusting the training scale while keeping the model size fixed, the multi-step scheduler allows training from the first stage to be reused, offering a unique convenience for continual training. We therefore chose the multi-step learning rate scheduler as our default.
We also show in Figure 1(b) that adjusting the proportions of the different stages in the multi-step scheduler can yield slightly better performance. However, to balance the reuse ratio for continual training against model performance, we opted for the aforementioned 80%, 10%, and 10% distribution of the three stages.
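A minimal sketch of this schedule and optimizer setup is given below; the total step count is a placeholder, and the real schedule is defined over the fraction of training tokens processed rather than optimizer steps:

```python
import torch

def multi_step_lr_factor(step: int, total_steps: int, warmup_steps: int = 2000) -> float:
    """Multiplier on the peak learning rate: linear warmup, then drops to 31.6%
    after 80% of training and to 10% after 90% (the 80/10/10 split above)."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = step / total_steps
    if progress < 0.8:
        return 1.0
    if progress < 0.9:
        return 0.316
    return 0.1

model = torch.nn.Linear(8, 8)  # placeholder module
optimizer = torch.optim.AdamW(model.parameters(), lr=4.2e-4,  # 7B peak LR from Table 2
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: multi_step_lr_factor(step, total_steps=100_000))
```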
Figure 1: Training loss curves with different learning rate schedulers or different scheduler parameters: (a) multi-step vs. cosine learning rate decay; (b) different proportions of the multi-step stages. The model size is 1.6 billion parameters, trained on a dataset of 100 billion tokens. The batch size and learning rate vary with model size; the specific parameters for the pre-training of the 7B and 67B models can be found in Table 2.
2.4 Infrastructures
We use an efficient and lightweight training framework named HAI-LLM (High-flyer, 2023) to train and evaluate large language models. Data parallelism, tensor parallelism, sequence parallelism, and 1F1B pipeline parallelism are integrated into this framework, as done in Megatron (Shoeybi et al., 2019; Narayanan et al., 2021; Korthikanti et al., 2023). We also leverage the flash attention technique (Dao et al., 2022; Dao, 2023) to improve hardware utilization. ZeRO-1 (Rajbhandari et al., 2020) is used to partition optimizer states over data-parallel ranks. Efforts are also made to overlap computation and communication to minimize additional waiting overhead, including overlapping the backward pass of the last micro-batch with the reduce-scatter operation in ZeRO-1, and GEMM computation with all-gather/reduce-scatter in sequence parallelism.
Some layers/operators are fused to speed up training, including LayerNorm, GEMM whenever possible, and the Adam update. To improve training stability, we train the model in bf16 precision but accumulate gradients in fp32. In-place cross-entropy is performed to reduce GPU memory consumption: we convert bf16 logits to fp32 precision on the fly in the cross-entropy CUDA kernel (rather than converting them beforehand in HBM), compute the corresponding bf16 gradients, and overwrite the logits with their gradients.
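The actual implementation is a fused CUDA kernel that overwrites the logits with their gradients; the PyTorch sketch below only mirrors the numerics (bf16 logits kept in memory, upcast to fp32 inside the loss), not the in-place memory savings:

```python
import torch
import torch.nn.functional as F

def fp32_cross_entropy(logits_bf16: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits_bf16: [tokens, vocab] in bf16; targets: [tokens] int64.
    # The upcast is local to the loss, so the large logit tensor stays bf16 in memory.
    return F.cross_entropy(logits_bf16.float(), targets)

logits = torch.randn(4, 102_400, dtype=torch.bfloat16, requires_grad=True)
loss = fp32_cross_entropy(logits, torch.randint(0, 102_400, (4,)))
loss.backward()
print(loss.dtype, logits.grad.dtype)  # torch.float32 torch.bfloat16
```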
Model weights and optimizer states are saved asynchronously every 5 minutes, which means we lose no more than 5 minutes of training in the worst case of an occasional hardware or network failure. These temporary model checkpoints are cleared regularly to avoid consuming excessive storage space. We also support resuming training from a different 3D-parallel configuration to cope with dynamic changes in computing-cluster load.
As for evaluation, we employ vLLM (Kwon et al., 2023) for generative tasks and continuous batching for non-generative tasks, to avoid manual batch-size tuning and to reduce token padding.
3 Scaling Laws
Research on scaling laws (Hestness et al., 2017) predates the emergence of large language models. Scaling laws (Kaplan et al., 2020; Henighan et al., 2020; Hoffmann et al., 2022) suggest that model performance can be predictably improved with increases in the compute budget C, the model scale N, and the data scale D. When the model scale N is represented by model parameters and the data scale D by the number of tokens, C can be approximated as C = 6ND. A central question is therefore how to optimize the allocation between model and data scales when the compute budget increases.
In this paper, we study scaling laws in depth. We first investigate the scaling behavior of batch size and learning rate and find trends in how they vary with the compute budget. Building on this, we conduct a comprehensive study of the scaling laws of data and model scale, revealing the optimal model/data scaling-up allocation strategy and predicting the expected performance of our large-scale models. We also find that the scaling behavior, and in particular the optimal allocation strategy, is strongly influenced by data quality: higher-quality data can drive the training of larger models at the same data scale.
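For intuition (an illustrative calculation, not a figure reported in the paper), the C = 6ND approximation above puts the budget of a 7B-parameter model trained on 2 trillion tokens at roughly 8.4e22 FLOPs:

```latex
C = 6ND \approx 6 \times \left(7\times10^{9}\right) \times \left(2\times10^{12}\right)
  \approx 8.4\times10^{22}\ \text{FLOPs}.
```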
3.1 Scaling Laws for Hyperparameters
Apart from batch size and learning rate, all other hyperparameters are kept consistent with those described in Section 2.3 and remain unchanged across compute budgets. Early works (McCandlish et al., 2018; Shallue et al., 2019; Smith et al., 2017; Goyal et al., 2017; Zhang et al., 2019) provided some empirical observations for setting batch size and learning rate, but we found these observations to have limited applicability in our preliminary experiments.
We initially conducted a grid search for batch size and learning rate in small-scale experiments with a compute budget of 1e17; the results for a specific model size (177M FLOPs/token) are illustrated in Figure 2(a). They show that the generalization error remains stable across a wide range of batch sizes and learning rates, indicating that near-optimal performance can be achieved within a relatively wide parameter space.
Figure 2: Training loss with respect to batch size and learning rate at (a) 1e17 FLOPs (177M FLOPs/token) and (b) 1e20 FLOPs (2.94B FLOPs/token).
We then conducted extensive experiments with different batch sizes, learning rates, and compute budgets ranging from 1e17 to 2e19, reusing the first stage of the multi-step scheduler. Considering the redundancy in the parameter space, we regarded the parameters used by models whose generalization error exceeded the minimum by no more than 0.25% as near-optimal hyperparameters, and fitted the batch size B and the learning rate η with respect to the compute budget C. The fits show that the optimal batch size B gradually increases with the compute budget C, while the optimal learning rate η gradually decreases, yielding the following power laws:
B_opt = 0.2920 · C^0.3271
η_opt = 0.3118 · C^(−0.1250)        (1)
Figure 3: Scaling curves of batch size and learning rate: (a) batch size scaling curve, (b) learning rate scaling curve. The grey circles represent models whose generalization error exceeded the minimum by no more than 0.25%; the dotted line is the power law fitted on the smaller models; the blue stars represent DeepSeek LLM 7B and 67B.
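To make Formula (1) concrete, the sketch below evaluates it for the approximate 7B pre-training budget, taking C = M·D with M ≈ 42.3B non-embedding FLOPs/token (30 layers, d_model = 4096, via Equation (2) in Section 3.2) and D = 2T tokens, and interpreting the fitted batch size as tokens per batch. This is an illustrative consistency check against Table 2, not code from the paper:

```python
def near_optimal_hparams(compute_budget: float):
    """Fitted power laws of Formula (1); compute_budget is C in FLOPs."""
    batch_size_tokens = 0.2920 * compute_budget ** 0.3271
    learning_rate = 0.3118 * compute_budget ** -0.1250
    return batch_size_tokens, learning_rate

C = 42.3e9 * 2e12            # ~8.5e22 FLOPs for DeepSeek LLM 7B
bsz_tokens, lr = near_optimal_hparams(C)
print(f"batch ≈ {bsz_tokens / 4096:.0f} sequences of 4096 tokens, lr ≈ {lr:.1e}")
# ≈ 2250 sequences and ≈ 4.2e-4, close to the 2304 / 4.2e-4 settings in Table 2.
```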
We validated these formulae on a series of models with a larger compute budget of 1e20; the results for a specific model size (2.94B FLOPs/token) are shown in Figure 2(b) and indicate that the fitted hyperparameters lie at the center of the near-optimal parameter space. It is important to note, however, that we have not yet considered the impact of factors beyond the compute budget C on the optimal hyperparameters. This is inconsistent with some earlier works (McCandlish et al., 2018; Kaplan et al., 2020), which suggested that the optimal batch size can be modeled as depending solely on the generalization error L. Moreover, we observed that for models with the same compute budget but different model/data allocations, the optimal parameter space varies slightly, suggesting that further research is needed to fully understand the selection of hyperparameters and training dynamics.
3.2 Estimating Optimal Model and Data Scaling
After deriving the formulae for fitting near-optimal hyperparameters, we started fitting the scaling curves and analyzing the optimal model/data scaling-up allocation strategy. This strategy involves finding a model scaling exponent a and a data scaling exponent b that satisfy N_opt ∝ C^a and D_opt ∝ C^b, where N_opt denotes the optimal model scale, D_opt the optimal data scale, and C the compute budget. The data scale D can be consistently represented by the number of tokens in the dataset. The representation of the model scale requires more care: in previous works it was typically represented by the number of model parameters, either non-embedding parameters N1 or complete parameters N2.
Neither representation is fully satisfactory. The complete parameter count N2 includes the vocabulary computation, which contributes little to the model's capacity, and neither N1 nor N2 accounts for the computational overhead of attention, so both lead to significant approximation errors in the per-token compute estimate under certain settings; for small-scale models the discrepancy can reach up to 50%. Such inaccuracies introduce substantial statistical error when fitting scaling curves. To mitigate these errors, we introduce a new model-scale representation: non-embedding FLOPs per token, M. M includes the computational overhead of the attention operation but does not take the vocabulary computation into account. With the model scale represented by M, the compute budget C can simply be expressed as C = MD. The three representations are related as follows, where n_layer denotes the number of layers, d_model the hidden dimension, n_vocab the vocabulary size, and l_seq the sequence length:

6N1 = 72 · n_layer · d_model²
6N2 = 72 · n_layer · d_model² + 6 · n_vocab · d_model        (2)
M   = 72 · n_layer · d_model² + 12 · n_layer · d_model · l_seq

Table 3 compares the three representations across model scales; please refer to Appendix A.2 for further analysis of the different representations of model scale.
Table 3: Differences among model-scale representations, and the gap between non-embedding parameters N1 and complete parameters N2 relative to non-embedding FLOPs/token M.

n_layers  d_model  N1      N2      M (FLOPs/token)  6N1/M  6N2/M
24        1024     302M    407M    3.02B            0.60   0.81
24        2048     1.21B   1.42B   9.66B            0.75   0.88
32        4096     6.44B   6.86B   45.1B            0.85   0.91
40        5120     12.6B   13.1B   85.6B            0.88   0.92
80        8192     64.4B   65.3B   419B             0.92   0.94

The table shows that for small models the parameter-based compute estimates 6N1 and 6N2 deviate substantially from M (by roughly 40% and 20%, respectively, in the smallest configurations listed), whereas for large models both ratios approach 1. This confirms the necessity of adopting non-embedding FLOPs/token as the model-scale metric.
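The snippet below evaluates Equation (2) for the 32-layer, d_model = 4096 configuration (with n_vocab = 102400 and l_seq = 4096 as used in this work) and reproduces the corresponding row of Table 3:

```python
def scale_representations(n_layer: int, d_model: int, n_vocab: int = 102_400, l_seq: int = 4096):
    """Return (N1, N2, M) per Equation (2): non-embedding parameters, complete
    parameters, and non-embedding FLOPs per token."""
    n1 = 12 * n_layer * d_model ** 2                  # from 6*N1 = 72 * n_layer * d_model^2
    n2 = n1 + n_vocab * d_model                       # add the embedding/vocabulary parameters
    m = 72 * n_layer * d_model ** 2 + 12 * n_layer * d_model * l_seq
    return n1, n2, m

n1, n2, m = scale_representations(32, 4096)
print(f"N1 = {n1/1e9:.2f}B, N2 = {n2/1e9:.2f}B, M = {m/1e9:.1f}B")
# N1 = 6.44B, N2 = 6.86B, M = 45.1B — the 32-layer row of Table 3.
```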
After adopting M to represent the model scale, our objective can be described more clearly: given a compute budget C = MD, find the optimal model scale M_opt and optimal data scale D_opt that minimize the generalization error of the model. To estimate them, we conducted extensive experiments with compute budgets ranging from 1e17 to 3e20, designing around 10 different model/data scale allocations for each budget. The hyperparameters for each budget were determined by Formula (1), and the generalization error was computed on an independent validation set that is distributed similarly to the training set and contains 100M tokens. Figure 4 demonstrates the IsoFLOP curve and the model/data scaling curves, which are fitted using the optimal model/data allocation for each compute budget. The resulting formulae for the optimal non-embedding FLOPs/token M_opt and the optimal number of tokens D_opt are:

M_opt = M_base · C^a,  M_base = 0.1715,  a = 0.5243        (4)
D_opt = D_base · C^b,  D_base = 5.8316,  b = 0.4757

Note that a + b = 1 and M_base · D_base ≈ 1, as required by the constraint C = M_opt · D_opt. The model scaling exponent a = 0.5243 is slightly larger than the data scaling exponent b = 0.4757, meaning that under the optimal allocation an increased compute budget is split roughly evenly between model and data, with a slight preference for model scale.
Figure 4: (a) IsoFLOP curve; (b) optimal model scaling; (c) optimal data scaling.
In addition, we fitted the scaling curve of the generalization error with respect to the compute budget and used it to predict the generalization error of DeepSeek LLM 7B and 67B, as shown in Figure 5. The results indicate that small-scale experiments can accurately predict the performance of models trained with a 1000× larger compute budget. This provides both confidence and guidance for training models at larger scales, and the analysis above directly guided the training configurations of DeepSeek LLM 7B and 67B.
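For illustration, Equation (4) can be evaluated directly for any budget; the example below uses the largest budget in the fitting range (3e20 FLOPs) rather than a value asserted in the paper:

```python
def optimal_allocation(compute_budget: float):
    """Split a compute budget C = M * D according to Equation (4).
    Returns (M_opt in FLOPs/token, D_opt in tokens)."""
    m_opt = 0.1715 * compute_budget ** 0.5243
    d_opt = 5.8316 * compute_budget ** 0.4757
    return m_opt, d_opt

m_opt, d_opt = optimal_allocation(3e20)
print(f"M_opt ≈ {m_opt:.3g} FLOPs/token, D_opt ≈ {d_opt:.3g} tokens, "
      f"product ≈ {m_opt * d_opt:.3g}")  # the product recovers C ≈ 3e20
```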
3.3 Scaling Laws with Different Data
During the development of DeepSeek LLM, the dataset was iteratively refined multiple times, with adjustments in the proportions of different data sources alongside improvements in overall quality. This allowed us to further analyze the impact of different datasets on scaling laws.
We studied the scaling laws using three different datasets: early in-house data, current in-house data, and OpenWebText2, which was used in previous scaling-law studies (Kaplan et al., 2020). Our internal data assessment revealed that the current in-house data is of higher quality than the early in-house data. The analysis results are shown in Figure 6.
Figure 6(a) shows the IsoFLOP curves obtained with the early in-house data, for which the model scaling exponent is a = 0.45 and the data scaling exponent is b = 0.55, indicating that when data quality is relatively low, increasing the amount of training data is more effective than enlarging the model.
Figure 6(b) shows the IsoFLOP curves obtained with the current in-house data, refined through multiple iterations. The model scaling exponent increases to a = 0.52 and the data scaling exponent decreases to b = 0.48, indicating that improved data quality makes model scaling relatively more effective.
Figure 6(c) shows the IsoFLOP curves obtained with OpenWebText2, for which a = 0.48 and b = 0.52, between our two in-house datasets.
These results strongly suggest that the quality of the dataset directly determines the optimal model/data scaling-up allocation strategy.
In other words, the higher the data quality, the more an increased compute budget should be allocated to model scaling rather than data scaling. This finding may also explain the significant differences in the optimal model/data scaling-up allocation observed in earlier scaling-law studies. An intuitive speculation is that high-quality data usually implies logical clarity and lower predictive difficulty after sufficient training, so it is more advantageous to scale up the model size when the compute budget increases. We will continue to pay close attention to changes in data quality and their impact on scaling laws, and provide more analysis in future works.
4 Alignment
We collect around 1.5 million instruction data instances in English and Chinese, covering a wide range of helpfulness and harmlessness topics. The helpful data contains 1.2 million instances, distributed as 31.2% general language tasks, 46.6% mathematical problems, and 22.2% coding exercises. The safety data consists of 300K instances covering various sensitive topics.
Our alignment pipeline contains two stages.
Supervised Fine-Tuning (SFT): We fine-tuned the 7B model for 4 epochs, but the 67B model for only 2 epochs, since we observed serious overfitting in the 67B model. We observed that GSM8K (Cobbe et al., 2021) and HumanEval (Chen et al., 2021) scores kept improving for the 7B model, while the 67B model quickly reached its ceiling. The learning rates are 1e-5 and 5e-6 for the 7B and 67B models, respectively.
In addition to monitoring benchmark accuracy, we also assess the repetition rate of the chat model during fine-tuning. We collected a total of 3868 Chinese and English prompts and measured the proportion of generated responses that fail to terminate and instead repeat a text sequence indefinitely. We observed that the repetition rate tends to rise as the amount of math SFT data increases. This can be attributed to the fact that math SFT data occasionally contains similar reasoning patterns, which weaker models struggle to grasp, resulting in repetitive responses. To address this problem, we tried two-stage fine-tuning and DPO (Rafailov et al., 2023), both of which almost preserve the benchmark scores while significantly reducing repetition.
DPO (Direct Preference Optimization): To further enhance the model's ability, we used the direct preference optimization algorithm (Rafailov et al., 2023), which has proven to be a simple but effective method for LLM alignment. We constructed preference data for DPO training in terms of helpfulness and harmlessness. For helpfulness data, we collected multilingual prompts covering categories such as creative writing, question answering, and instruction following, and then generated responses with our DeepSeek Chat models as response candidates. Similar operations were applied to the construction of harmlessness preference data.
We trained DPO for one epoch with a learning rate of 5e-6 and a batch size of 512, using learning-rate warmup and a cosine learning rate scheduler. We found that DPO can strengthen the model's open-ended generation skills while producing little difference in performance on standard benchmarks.
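As background on the algorithm referenced above, the DPO objective of Rafailov et al. (2023) can be written compactly as below. The sketch assumes the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model have already been computed; β is a hyperparameter whose value is not specified in this paper:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio)),
    averaged over the batch. All inputs are 1-D tensors of per-sequence log-probs."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

# Toy check: the policy prefers the chosen response more than the reference does,
# so the loss is below log(2) ≈ 0.693.
print(float(dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                     torch.tensor([-6.0]), torch.tensor([-7.0]))))
```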
5 Evaluation
5.1 Public Benchmark Evaluation
We evaluate our models on a series of public benchmarks in both English and Chinese, based on our internal evaluation framework.
Multi-subject multiple-choice datasets, including MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).
Language understanding and reasoning datasets, including HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).
Closed-book question answering datasets, including TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).
Reading comprehension datasets, including RACE (Lai et al., 2017), DROP (Dua et al., 2019), and C3 (Sun et al., 2019).
Reference disambiguation datasets, including WinoGrande (Sakaguchi et al., 2019) and CLUEWSC (Xu et al., 2020).
Language modeling datasets, including Pile (Gao et al., 2020).
Chinese understanding and culture datasets, including CHID (Zheng et al., 2019) and CCPM (Li et al., 2021).
Math datasets, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMath (Wei et al., 2023).
Code datasets, including HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021).
Standardized exams, including AGIEval (Zhong et al., 2023).
We apply perplexity-based evaluation to datasets that require selecting an answer from several options, including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, ARC-Easy, ARC-Challenge, OpenBookQA, CHID, C-Eval, CMMLU, C3, and CCPM. Perplexity-based evaluation means computing the perplexity of each option and selecting the lowest one as the model's prediction. For ARC and OpenBookQA we compute the perplexity with unconditional normalization (Brown et al., 2020), and for the other datasets we use length normalization. We apply generation-based evaluation to TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, HumanEval, MBPP, BBH, AGIEval, CLUEWSC, and CMath.
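A minimal sketch of this option-scoring rule is given below. It assumes a helper sequence_logprob(context, continuation) that returns the model's total log-probability of the continuation given the context together with the continuation's token count; the "Answer:" context used for the unconditional variant follows Brown et al. (2020) and is an assumption about the exact prompt string:

```python
def score_options(context, options, sequence_logprob, normalization="length"):
    """Pick the option with the lowest perplexity, i.e. the highest normalized
    log-likelihood. sequence_logprob(ctx, cont) -> (total_logprob, num_tokens)."""
    scores = []
    for option in options:
        logp, n_tokens = sequence_logprob(context, option)
        if normalization == "length":
            score = logp / n_tokens                       # length-normalized log-likelihood
        else:  # unconditional normalization, used for ARC and OpenBookQA
            logp_uncond, _ = sequence_logprob("Answer:", option)
            score = logp - logp_uncond
        scores.append(score)
    return max(range(len(options)), key=lambda i: scores[i])
```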
Generation-based evaluation refers to letting the model generate free-form text and parsing the result from the generated text; we use greedy decoding for generation. We apply language-modeling-based evaluation to Pile-test, which means calculating bits-per-byte on the test corpus. We use 2048 or 4096 as the maximum sequence length depending on the benchmark. Details of the evaluation formats can be found in Appendix A.6.
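For reference, bits-per-byte is the model's total negative log-likelihood on the test corpus expressed in bits and normalized by the corpus size in bytes rather than tokens, which keeps the metric comparable across tokenizers:

```latex
\mathrm{BPB} \;=\; \frac{-\sum_{i}\log_{2} p\!\left(x_{i}\mid x_{<i}\right)}{N_{\mathrm{bytes}}}
```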
Table 5 presents the main results on the evaluation benchmarks. Although the DeepSeek models are pre-trained on a 2T bilingual corpus, they show performance on English language-understanding benchmarks comparable to the LLaMA2 models, which also consume 2T tokens but focus on English. Furthermore, DeepSeek 67B achieves considerably better performance than LLaMA2 70B on MATH, GSM8K, HumanEval, MBPP, BBH, and the Chinese benchmarks. We show the benchmark curves in Appendix A.3. The performance of some tasks, such as GSM8K and BBH, is boosted as the model scales up. Given that we train both the 7B and 67B models on the same dataset, the emergence of this improvement can be attributed to the strong few-shot learning ability of large models. However, as the proportion of math data increases, the gap between small and large models may narrow.
An interesting observation is that the advantage of DeepSeek 67B over LLaMA2 70B is larger than that of DeepSeek 7B over LLaMA2 7B. This phenomenon highlights that the interference of bilingual data has a greater impact on smaller models.
Table 6 shows the results of the DeepSeek Chat models, demonstrating overall improvements on most tasks after tuning, although performance drops on a few.
Knowledge: We observe fluctuations between the base and chat models on knowledge-related tasks such as TriviaQA, MMLU, and C-Eval. We do not believe such minor fluctuations indicate the acquisition or loss of knowledge after SFT. The value of SFT lies in learning to achieve, in the chat model's zero-shot setting, scores comparable to the base model's few-shot setting, which is consistent with real usage scenarios. For example, the 0-shot MMLU performance of the chat model is comparable to the 5-shot MMLU performance of the base model.
Reasoning: Since a significant proportion of the SFT instances are in chain-of-thought (CoT) format (Wei et al., 2022), the chat models show slight improvements on reasoning tasks such as BBH and NaturalQuestions. However, we believe that the SFT stage does not teach reasoning ability itself; rather, it teaches the correct format of reasoning paths.
Performance-drop tasks: The performance of a few tasks consistently declines after fine-tuning, regardless of model size or the pre-training checkpoint selected. These tasks typically involve cloze-style or sentence-completion formats, such as HellaSwag. It is reasonable that pure language models are better equipped to handle such tasks.
Math and code: Our models exhibit significant improvements on math and coding tasks after fine-tuning. For instance, HumanEval and GSM8K scores improve by over 20 points. Our explanation is that the base model was initially under-fitted on these tasks, and the SFT stage learned additional knowledge of coding and mathematics through the extensive SFT data.
Table 6: Comparison between base and chat models. We evaluate the chat models with 0-shot for MMLU, GSM8K, MATH, C-Eval, and CMMLU, while the base-model results are still obtained in the few-shot setting.
For the 7B model fine-tuning, we first fine-tune the model with all data and then introduce a second stage that excludes math and code data. The motivation is that the stage-1 model has a repetition rate of 2.0%, which drops to 1.4% after stage-2 fine-tuning while the benchmark scores are maintained. For the 67B model, the repetition rate is already below 1% after the first stage, and a second stage would lower its benchmark scores; therefore only one stage of SFT is used for the 67B model.
5.2 Open-Ended Evaluation
For chat models, in addition to metrics on standard benchmarks, the quality of results generated in open domains and for open-ended questions directly affects the real user experience. We therefore separately tested the open-ended generation ability of our chat models on Chinese and English tasks.
5.2.1 Chinese Open-Ended Evaluation
For Chinese open-ended evaluation, we test the comprehensive abilities of our chat model in different domains on AlignBench (Liu et al., 2023), a high-quality open-ended question test set. AlignBench includes a total of 8 primary categories and 36 secondary categories, and encompasses 683 questions. For each question, in addition to the prompt, AlignBench provides professional reference answers and rating templates for GPT-4 to judge the quality of the responses. We used the official AlignBench GitHub code repository to implement the evaluation and strictly aligned the key temperature parameter with the original setting: the generation temperature is 0.7 for role-playing, writing ability, and open-ended questions, and 0.1 for the other tasks.
The AlignBench leaderboard is shown in Table 7. We find that our DeepSeek 67B Chat model surpasses ChatGPT and the other baseline models, ranking only behind the two versions of GPT-4, and the DPO model improves on almost all metrics.
Table 7: AlignBench leaderboard rated by GPT-4-0613, sorted by overall score in descending order (excerpt). Results marked with * are based on our evaluation with the official AlignBench repository; the rest come from the AlignBench paper.
GPT-4-1106-preview: overall 8.01, reasoning 7.73
GPT-4-0613: overall 7.53, reasoning 7.47
DeepSeek-67B-Chat-DPO: overall 6.69, reasoning 5.77
DeepSeek-67B-Chat: overall 6.43, reasoning 5.75
ChatGLM-turbo (智谱清言): overall 6.24
ERNIE-Bot-3.5 (文心一言): overall 6.14
GPT-3.5-turbo-0613: overall 6.08
We find that on basic Chinese language tasks our model is in the first tier among all models, and the Chinese basic language ability of our DPO model is even higher than that of the latest version of GPT-4. On advanced Chinese reasoning tasks, our model's scores are significantly higher than those of other Chinese LLMs.
5.2.2 English Open-Ended Evaluation
For English open-ended evaluation, we use the MT-Bench benchmark (Zheng et al., 2023), which contains multi-turn questions in 8 different categories. As shown in Table 8, DeepSeek LLM 67B Chat outperforms other open-source models such as LLaMA-2-Chat 70B, Xwin 70b v0.1, and TÜLU 2+DPO 70B (Ivison et al., 2023), achieving a score of 8.35, comparable to GPT-3.5-turbo. After the DPO stage, DeepSeek LLM 67B Chat DPO further improves the average score to 8.76, behind only GPT-4. These results illustrate the strong multi-turn open-ended generation ability of DeepSeek LLM.
Table 8: MT-Bench evaluation (excerpt).
Model                      STEM  Humanities  Reasoning  Coding  Math  Extraction  Roleplay  Writing  Average
DeepSeek LLM 67B Chat      9.60  9.70        8.00       7.35    6.25  8.40        8.20      9.30     8.35
DeepSeek LLM 67B Chat DPO  9.70  9.80        9.05       6.75    6.65  9.30        9.10      9.75     8.76
5.3 Held-Out Evaluation
Data contamination and benchmark overfitting are two challenges in evaluating LLMs. A common practice is to use recently released test sets as held-out test sets.
LeetCode: To assess coding ability, we use problems from LeetCode Weekly Contests 351-372 and Bi-Weekly Contests 108-117 (July 2023 to November 2023), obtained by crawling LeetCode. The set consists of 126 problems with more than 20 test cases each. The evaluation metric is similar to that of HumanEval: a model is considered to have solved a problem if its output passes all test cases.
Hungarian National High-School Exam: In line with Grok-1, we evaluate the model's mathematical ability using the Hungarian National High-School Exam. The exam comprises 33 problems, and the model's scores are determined through human annotation.
Instruction-following evaluation: On November 15th, 2023, Google released an instruction-following evaluation dataset (Zhou et al., 2023). They identified 25 types of verifiable instructions and constructed around 500 prompts.
Table 9: Held-out evaluation results.
Model              LeetCode  Hungarian Exam  IFEval
GPT-4              48.4      68              79.3
ChatGLM3 6B        2.4       32              29.7
DeepSeek 67B Chat  17.5      58              55.5
Qwen 72B Chat      12.7      52              50.8
We conducted a comparative analysis against baseline models of different sizes, namely Qwen 72B Chat, ChatGLM3, Baichuan2, and Yi-34B Chat. Our observation is that there is a significant performance gap between large and small models on these held-out sets, even though some small models achieve promising results on conventional benchmarks. For example, ChatGLM3 scores 52.4 on the MBPP code test set, close to DeepSeek 67B, but on the new benchmarks its performance falls far below that of DeepSeek 67B. A similar trend is observed on math datasets, where ChatGLM3 is very strong on GSM8K (72.3) but its performance on the Hungarian Exam is inferior to that of the large models. Furthermore, the instruction-following results suggest that total compute plays a crucial role: the DeepSeek 7B and 67B models use the same training pipeline, yet there is a significant disparity in their performance. Through subjective evaluation, we also observed a notable jump in intelligence across various tasks when scaling the model size to 67B.
5.4 Safety Evaluation
We profoundly recognize the importance of safety for general artificial intelligence. The premise for building a truly helpful AI model is that it possesses values aligned with humans and exhibits friendliness towards humanity. We incorporate safety assurance throughout the whole training process, including pre-training, SFT, and DPO.
To validate model safety, we established a multidisciplinary expert team of 20 people and constructed a taxonomy of safety content aligned with human values (the taxonomy is shown in Table 10). The expert team then manually constructed a large number of high-quality test cases for each safety subcategory. Besides the diversity of safety content areas, we also paid attention to the diversity of formats: the notorious "grandma" exploit shows that models can be deceived by the surface form of a query into giving unsafe responses. Therefore, when devising questions, the experts also diversified the way questions were asked, constructing varied safety questions through inducement, role-playing, multi-turn dialogue, preset positions, and other means. In the end we obtained a safety test set of 2400 questions.
Our trained human review team inspected the safety test results in detail, with cross-validation performed on the annotations. The annotators perform a three-way annotation for each question: safe, unsafe, or model refusal. We label both securely answered and model-refused test cases as secure responses.
Table 10: Our taxonomy for safety evaluation. The number of safe answers provided by our model (DeepSeek-67B-Chat) out of the total test cases is given for each category.
1. Discrimination and bias (486/500 safe): ethnicity and race, religious belief, nationality and region, gender, age, occupation, health, other forms of discrimination.
2. Infringement of others' legitimate rights and interests (473/500 safe): physical and mental health, lawful property, portrait rights, reputation rights, honor rights, privacy rights, information rights.
3. Trade secrets and intellectual property (281/300 safe): infringement of intellectual property, monopoly and unfair competition, violations of business ethics, disclosure of trade secrets.
4. Illegal and non-compliant behavior (290/300 safe): cults and superstition, pornography, gambling, drugs, insults and abuse, violence, organized crime.
5. Other safety issues (767/800 safe): hallucination and truthfulness, timeliness, self-cognition, sensitive topics.
The results indicate that our model demonstrates strong safety performance across these categories of safety tests.
In addition to the existing approaches, we further enriched our safety evaluation with the Do-Not-Answer dataset (Wang et al., 2023). Its 939 risk-categorized prompts provide strong evidence of our model's enhanced safety. As shown in Table 11, the DeepSeek 67B Chat model obtains a score of 97.8, higher than both ChatGPT and GPT-4, placing it among the safest models.
Table 11: Do-Not-Answer safety scores (higher is safer).
Model              Score
LLaMA-2-7B-Chat    99.4
Claude             98.3
DeepSeek-67B-Chat  97.8
ChatGPT            97.7
GPT-4              96.5
Vicuna-7B          94.9
ChatGLM2           92.9
5.5 Discussion
Throughout the development process, we found some interesting observations when building LLMs, summarized below.
Staged fine-tuning: As mentioned above, small models need longer fine-tuning on math and code datasets, but this hurts the model's conversational ability, for example by increasing repetition. To address this issue, we implemented a staged fine-tuning process: the first stage fine-tunes with all available data, while the second stage focuses on fine-tuning with conversational data.
Table 12: Two-stage fine-tuning results.
Model                        HumanEval  GSM8K  Repetition  IFEval
DeepSeek LLM 7B Chat Stage1  48.2       63.9   0.020       38.0
DeepSeek LLM 7B Chat Stage2  48.2       63.0   0.014       41.2
Table 12 shows the results of two-stage training: the second stage does not compromise the model's proficiency in code and math, while it decreases the repetition behavior and enhances instruction-following ability.
Multi-choice questions: It is a common practice to test a model with multiple-choice-style evaluation data, such as MMLU, AGIEval, and C-Eval. We deduplicated the C-Eval validation set and the CMMLU test set against our data to prevent contamination, and experimented with adding 20 million multiple-choice (MC) questions during training.
Table 13: The impact of adding multiple-choice question data.
Model                  MMLU  C-Eval  CMMLU  TriviaQA  ChineseQA
DeepSeek 7B Chat       49.4  47.0    49.7   57.9      75.0
DeepSeek 7B Chat + MC  60.9  71.3    73.8   57.9      74.4
The additional 20 million MC data benefit not only Chinese multiple-choice benchmarks but also English ones, indicating that the model's ability to solve MC problems has been enhanced. However, this improvement does not extend to evaluations that do not use the multiple-choice format, such as TriviaQA and our in-house ChineseQA test set, which are generative benchmarks. This suggests that users would not perceive the model as becoming more intelligent during conversational interactions. We therefore chose to exclude MC data from both the pre-training and fine-tuning stages, as including it would lead to overfitting on benchmarks rather than genuine intelligence.
Instruction data in pre-training: It is widely acknowledged that incorporating instruction data during the latter part of pre-training enhances a base model's performance on benchmark tasks. In our study, we integrated 5 million instruction data instances, mostly multiple-choice questions, during the final 10% of pre-training. We observed that the base model did show improvements on the benchmarks; however, the final results were nearly identical to those obtained by adding the same data during the SFT stage. We conclude that while this approach strengthens the base model's benchmark performance, its overall potential is equivalent to not incorporating the instruction data. If the amount of instruction data is large enough, it is acceptable to include it in pre-training. Because of our preference for excluding multiple-choice questions and the limited amount of non-MC data we have, we decided not to include instruction data in pre-training.
System prompt: A well-designed system prompt should effectively guide the model to generate responses that are both helpful and respectful. We slightly modified the prompt introduced by LLaMA-2 as our system prompt:
"You are DeepSeek Chat, a helpful, respectful and honest AI assistant developed by DeepSeek. The knowledge cut-off date for your training data is up to May 2023. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
We observed an intriguing phenomenon: the performance of the 7B LLM slightly declines when a system prompt is introduced, whereas the prompt leads to a significant improvement for the 67B LLM, as shown in Table 14. Our explanation for this difference is that larger models better understand the intended meaning behind the system prompt, enabling them to follow instructions more effectively and generate better responses, while smaller models struggle to grasp the system prompt adequately, and the inconsistency between training and testing may negatively affect their performance.
Table 14: The impact of adding a system prompt.
Model                              MT-Bench
DeepSeek 7B Chat                   7.15
DeepSeek 7B Chat + System Prompt   7.11
DeepSeek 67B Chat                  8.35
DeepSeek 67B Chat + System Prompt  8.58
原文: pSeek Chat, a helpful, respectful and honest AI assistant developed by DeepSeek. The knowledge cut-off date for your training data is up to May 2023. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don’t know the answer to a question, please don’t share false informat...
5.1 Public Benchmark Evaluation
We evaluate our models on a series of public benchmarks in both English and Chinese, based on our internal evaluation framework. The multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023). The language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022). The closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions.
Generation-based evaluation refers to letting the model generate free text and parsing the result from the generated text; for these benchmarks we use greedy decoding. We apply language-modeling-based evaluation to Pile-test, which means calculating the bits-per-byte on the test corpus. We use 2048 or 4096 as the maximum sequence length, depending on the benchmark. Details of the evaluation formats can be found in Appendix A.6.
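Bits-per-byte normalizes the model's total negative log-likelihood by the size of the corpus in bytes, so the number is comparable across tokenizers. A minimal sketch of this computation (function and argument names are our own):

import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    # total_nll_nats: summed negative log-likelihood (natural log) the model
    # assigns to the tokenized test corpus.
    # Convert nats to bits and normalize by the corpus size in UTF-8 bytes.
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)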
5.1.1 Base Model
We compare the DeepSeek LLM 7B and 67B base models with LLaMA-2 7B and 70B on English and Chinese benchmarks. Although the DeepSeek models are pre-trained on a 2T bilingual corpus, they show performance comparable to the LLaMA-2 models, which also consume 2T tokens but focus on English, on English language-understanding benchmarks (e.g., HellaSwag 0-shot: LLaMA-2 7B 75.6 vs. DeepSeek 7B 75.4; LLaMA-2 70B 84.0 vs. DeepSeek 67B 84.0). Furthermore, DeepSeek 67B achieves considerably better performance than LLaMA-2 70B on MATH, GSM8K, HumanEval, MBPP, BBH, and the Chinese benchmarks. We show the benchmark curves in Appendix A.3; some task performance, such as GSM8K and BBH, is boosted as the model scales. Given that both 7B and 67B are trained on the same dataset, this improvement can be attributed to model scale.
5.1.2 Chat Model
Table 6 compares the base and chat models.

Benchmark DeepSeek 7B Base DeepSeek 7B Chat DeepSeek 67B Base DeepSeek 67B Chat
MATH 6.0 15.8 18.7 32.6
HumanEval 26.2 48.2 42.7 73.8
MBPP 39.0 35.2 57.4 61.4
DROP 41.0 49.1 67.9 71.9
OpenBookQA 55.8 54.8 60.2 63.2
BBH 39.5 42.3 68.7 71.7
AGIEval 26.4 19.3 41.3 46.4
CLUEWSC 73.1 71.9 81.0 60.0
CHID 89.3 64.9 92.1 72.6
C-Eval 45.0 47.0 66.1 65.2
CMMLU 47.2 49.7 70.8 67.8
CMath 34.5 68.4 63.0 80.3
C3 65.4 66.4 75.3 77.0
CCPM 76.9 76.5 88.5 84.9
Table 6: The comparison between base and chat models. We evaluate chat models with 0-shot for MMLU, GSM8K, MATH, C-Eval, and CMMLU, while base model results are still obtained in the few-shot setting.
Math and Code: Our model exhibits significant improvements in math and coding tasks after fine-tuning. For instance, HumanEval and GSM8K scores improve by over 20 points. Our explanation is that the base model was initially underfitted on these tasks, and the SFT stage has learned additional knowledge in coding and mathematics from the extensive SFT data. However, it is important to note that the model's capabilities may be primarily focused on code completion and algebraic questions rather than a comprehensive understanding of mathematics and coding.
5.2 Open-Ended Evaluation
For chat models, in addition to metrics on standard benchmarks, the quality of results generated for open-domain and open-ended questions directly affects the actual user experience. Hence, we separately tested the open-ended generation capabilities of our chat model on both Chinese and English tasks.
5.2.1 Chinese Open-Ended Evaluation
For Chinese open-ended evaluation, we tested the comprehensive capabilities of our chat model across different domains on AlignBench (Liu et al., 2023), a high-quality open-ended question test set. AlignBench includes a total of 8 primary categories and 36 secondary categories.
On AlignBench we compare our models with a range of Chinese chat models, including chatglm-pro (智谱清言), spark_desk_v2 (讯飞星火), Qwen-14B-Chat, Baichuan2-13B-Chat, ChatGLM3-6B, Baichuan2-7B-Chat, InternLM-20B, and Qwen-7B-Chat. The results demonstrate the superior performance of our model in more complex Chinese logical reasoning and mathematical calculations.
5.2.2 English Open-Ended Evaluation
For English open-ended evaluation, we use the MT-Bench benchmark (Zheng et al., 2023), which contains 8 categories of multi-turn questions. As illustrated in Table 8, our DeepSeek LLM 67B Chat outperforms other open-source models such as LLaMA-2-Chat 70B (Touvron et al., 2023b), Xwin 70b v0.1, and TÜLU 2+DPO 70B (Ivison et al., 2023), and achieves a score of 8.35, comparable with GPT-3.5-turbo. The score improves further after the DPO stage.
5.3 Held-Out Evaluation
Data contamination and benchmark overfitting are two challenges in evaluating LLMs. One common practice is to use recently published test sets as held-out test sets.
LeetCode: To assess the coding proficiency of the model, we used problems from the LeetCode Weekly Contest (Weekly Contest 351-372 and Bi-Weekly Contest 108-117, from July 2023 to November 2023). We obtained these problems by crawling data from LeetCode; the set consists of 126 problems with over 20 test cases each. The evaluation metric is akin to that of HumanEval: a model is considered to have solved a problem only if its output passes all test cases.
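A minimal sketch of this HumanEval-style criterion is shown below: a problem counts as solved only when the generated program passes every test case. The data layout and function names are our own, and a real harness must sandbox execution of model-generated code rather than calling exec directly.

def passes_all_tests(candidate_source: str, test_cases) -> bool:
    """test_cases: list of (function_name, args, expected) tuples."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # defines the model's solution (unsandboxed here)
        for func_name, args, expected in test_cases:
            if namespace[func_name](*args) != expected:
                return False
        return True
    except Exception:
        return False

def solve_rate(problems) -> float:
    # problems: list of (candidate_source, test_cases) pairs, one per task.
    solved = sum(passes_all_tests(src, tests) for src, tests in problems)
    return solved / len(problems)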
We conducted a comparative analysis of our model against baseline models of different sizes, namely Qwen 72B Chat (Bai et al., 2023), ChatGLM3 (Du et al., 2022), Baichuan2 (Yang et al., 2023), and Yi-34B Chat. Our observations indicate a significant performance gap between large and small models on these held-out datasets, even though certain small models achieve promising results on conventional benchmarks. For instance, ChatGLM3 scores 52.4 on MBPP, a code test set, which is close to DeepSeek 67B; however, on the new benchmarks its performance falls considerably short of DeepSeek 67B.
5.4 Safety Evaluation
We profoundly recognize the importance of safety for general artificial intelligence. The premise for establishing a truly helpful AI model is that it possesses values consistent with those of humans and exhibits friendliness towards humanity. We incorporate the assurance of model safety throughout the entire training process, including pre-training, SFT, and DPO. To validate the safety of our model, we established a 20-person expert team from various disciplines and constructed a safety content classification system that aligns with human values (a safety evaluation taxonomy). For example, on the 违法违规行为 (Illegal and Non-compliant Behavior) category, covering 邪教迷信 (Cults and Superstition), 色情 (Pornography), 赌博 (Gambling), 毒品和违禁品 (Drugs and Prohibited Items), 侮辱谩骂 (Insults and Abuse), 暴力行为 (Violent Behavior), and 涉黑涉恶 (Involvement in Organized Crime), the model responded securely on 290 of 300 test cases; on a category covering commercial violations such as 侵犯他人知识产权 (Infringing Others' Intellectual Property Rights), 垄断和不正当竞争行为 (Monopolistic and Unfair Competitive Actions), 违反商业道德 (Violating Business Ethics), and 泄露他人商业机密 (Disclosing Others' Trade Secrets), it responded securely on 281 of 300.
We count both the securely answered and the model-refused test cases as secure responses. The results indicate that our model exhibits good safety performance across numerous test categories. Complementing this, we further enriched our evaluation with the "Do-Not-Answer" dataset (Wang et al., 2023) to assess the safety mechanisms of the DeepSeek 67B Chat model. The dataset's 939 risk-categorized prompts highlight our model's enhanced capabilities: as shown in Table 11, DeepSeek 67B Chat demonstrates notable performance on this evaluation.
6 Conclusion, Limitation, and Future Work
We introduce the DeepSeek LLM series of open-source models, trained from scratch on a large-scale dataset of 2 trillion Chinese and English tokens. In this paper, we provide an in-depth explanation of hyper-parameter selection, scaling laws, and the various fine-tuning attempts we made. We calibrate the scaling laws from previous work and propose a new optimal model/data scaling-up allocation strategy. In addition, we present a method to predict the near-optimal batch size and learning rate for a given compute budget. We further conclude that scaling laws are related to data quality, which is likely the root cause of the differing scaling behavior reported across studies.
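The prediction method takes the form of power laws in the compute budget C. The sketch below only illustrates that functional form; the coefficients are placeholders, not the paper's fitted values.

def predict_hyperparameters(compute_budget_flops: float):
    # Placeholder coefficients -- NOT the paper's fitted values. They only
    # illustrate the shape of the relationship: near-optimal batch size grows
    # with C and near-optimal learning rate decays with C, both as power laws.
    a_batch, b_batch = 0.3, 0.33   # hypothetical
    a_lr, b_lr = 0.3, -0.125       # hypothetical
    batch_size = a_batch * compute_budget_flops ** b_batch
    learning_rate = a_lr * compute_budget_flops ** b_lr
    return batch_size, learning_rate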
Guided by the scaling laws, we pre-train with the best possible hyper-parameters and provide a comprehensive evaluation. We avoid benchmark decoration and hidden tricks throughout all training stages.
DeepSeek Chat shares the known limitations common to other LLMs, including the lack of ongoing knowledge updates after pre-training, the possibility of generating non-factual content such as unverified information, and a tendency to hallucinate. Moreover, the initial version of our Chinese data is not exhaustive, which may result in suboptimal performance on certain Chinese-specific topics. Since our data primarily consists of Chinese and English sources, the model's proficiency in other languages remains limited and should be used with caution.
DeepSeek LLM is a long-term project committed to advancing open-source language models:
- We will soon release technical reports on code intelligence and Mixture-of-Experts (MoE), showing how to create high-quality code pre-training data and how to design sparse models that reach dense-model performance.
- We are building a larger and improved dataset for the next version of DeepSeek LLM, and expect substantial improvements in reasoning, Chinese knowledge, math, and code capabilities.
- Our alignment team is dedicated to studying how to deliver models that are helpful, honest, and safe to the public. Initial experiments show that reinforcement learning can boost the model's complex reasoning capability.
Appendix A Appendix
A.1 Acknowledgments
This project was realized thanks to the efforts of numerous contributors. We offer our sincere thanks to the following teams:
- Data Annotation Team: Jialu Cai, Ruijian Chen, Ruyi Chen, Bei Feng, Yanping Huang, Zhen Huang, Pin Jiang, Rongli Jin, Xiangyue Jin, Ziyun Ke, Hui Li, Meng Li, Sangsang Li, Xiaoqian Li, Yaohui Li, Yunxian Ma, Jiaqi Ni, Xiaojin Shen, Xinnan Song, Tianyu Sun, Xiaosha Chen, Haoyuan Tian, Xiaohan Wang, Xiaoxiang Wang, Yuhao Wang, Fanyi Xia, Lei Xu, Zeyuan Xu, Zhipeng Xu, Tian Yuan, Zhongyu Zhang, Yi Zheng, Shuang Zhou, Xinyi Zhou, Yuchen Zhu, Yuxuan Zhu
- Compliance Team: 陈金、唐颖、王妙君、王先祖、吴少卿、夏乐怡、萧WL
- Business Team: 梁健、李明明、王T、王先祖、文真牛、叶胜峰、张鹏、张震
- Design Team: 安伟、查Yukun
A.2 Different Model Scale Representations
We refitted the scaling curves for different model scale representations, reusing the experiments from the IsoFLOP profile and recalculating the compute FLOPs with 6N1 and 6N2 as the model scale representation. As shown in Figure 6, the deviation in the optimal model/data allocation among the three representations is not significant at higher compute budgets, but there are noticeable differences at lower budgets.
Figure 6: Performance scaling curves using different model scale representations. The metric is bits-per-byte on the validation set. The dotted lines are power-law fits to the smaller models. The blue stars denote DeepSeek LLM 7B and 67B.
N1, N2, and M denote the non-embedding parameters, complete parameters, and non-embedding FLOPs/token of the model, respectively. When 6N1 is used as the model scale representation, the fitted performance scaling curve tends to overestimate the performance of large-scale models; conversely, with 6N2 it tends to underestimate them. Using M as the model scale representation achieves the most accurate predictions.
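To make the three representations concrete, the sketch below computes N1, N2, and M for a generic dense decoder-only Transformer. The parameter accounting (a gated FFN, no biases) and the attention FLOPs term are common approximations we assume here; the paper's exact accounting may differ in detail.

def model_scale_representations(n_layer, d_model, d_ff, vocab_size, seq_len):
    # Parameter accounting for a standard dense decoder-only Transformer;
    # biases and norm parameters are ignored for simplicity (assumption).
    attn_params = 4 * d_model * d_model        # Q, K, V and output projections
    ffn_params = 3 * d_model * d_ff            # gated FFN (up, gate, down) -- assumption
    n1 = n_layer * (attn_params + ffn_params)  # non-embedding parameters (N1)
    n2 = n1 + 2 * vocab_size * d_model         # complete parameters (N2): embeddings + output head
    # Non-embedding training FLOPs per token (M): ~6 FLOPs per non-embedding
    # parameter plus an attention term that grows with sequence length
    # (a common approximation).
    m = 6 * n1 + 12 * n_layer * d_model * seq_len
    return n1, n2, m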
A.3 Benchmark Metrics Curves
Figure 7: Benchmark metrics curves of DeepSeek LLM Base. ChineseQA is our in-house test set, constructed in a manner akin to TriviaQA.
Figure 7 shows the benchmark metrics curves across different training steps. We observe consistent improvement on these benchmarks from the start to the end of training, and believe performance would improve further if training continued.
Table 15: Comparison with code-specific models. HumanEval is reported for the Python and multilingual variants.
Model (Size) HumanEval-Python HumanEval-Multilingual MBPP
Pre-Trained Models:
Codex-001 33.5% 26.1% 45.9%
StarCoder 16B 36.0% 28.7% 46.8%
CodeGeeX2 6B 36.0% 24.5% 42.4%
CodeLlama 7B 31.7% 29.2% 41.6%
CodeLlama 13B 36.0% 35.4% 48.4%
CodeLlama 34B 48.2% 41.0% 55.2%
DeepSeek-LLM-Base 67B 42.7% 37.2% 57.4%
Instruction-Tuned Models:
Wizard-Coder 34B 73.2% 48.8% 61.2%
DeepSeek-LLM-Chat 67B 73.8% 53.3% 61.4%
A.4 Comparison with Code or Math Specific Models
We compare our model with code-specific and math-specific language models. Table 15 demonstrates that DeepSeek LLM 67B achieves performance similar to CodeLlama despite having access to less code data; notably, DeepSeek LLM possesses greater capabilities in areas other than code.
Table 16 presents results on various math-related benchmarks: GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM-zh, and CMath (Wei et al., 2023). DeepSeek 67B performs strongly on math tasks across different languages. In addition, DeepSeek LLM can solve math problems with programs (tool-integrated reasoning), which performs better than chain-of-thought prompting and significantly outperforms the previous SOTA model ToRA (Gou et al., 2023) on these benchmarks.
Table 16: Comparison with math-specific models.
Model Method GSM8K MATH MGSM-zh CMath
MetaMath 70B CoT 82.3% 26.6% 66.4% 70.9%
WizardMath 70B CoT 81.6% 22.7% 64.8% 65.4%
DeepSeek LLM 67B Chat CoT 84.1% 32.6% 74.0% 80.3%
ToRA-Code 34B Tool-Integrated 80.7% 50.8% 41.2% 53.4%
DeepSeek LLM 67B Chat Tool-Integrated 86.7% 51.1% 76.4% 85.4%
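The tool-integrated rows in Table 16 refer to letting the model write a short program whose executed output is taken as the answer. A minimal sketch of that loop follows; the prompt wording, the model_generate callable, and the execution details are assumptions for illustration, not the paper's exact setup, and a real harness would sandbox the execution.

import contextlib, io

def solve_with_program(model_generate, problem: str) -> str:
    # model_generate: callable mapping a prompt string to generated text (assumed).
    prompt = (
        "Write a Python program that prints the final answer.\n"
        f"Problem: {problem}\nProgram:\n"
    )
    program = model_generate(prompt)
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(program, {})  # unsandboxed here; sandbox in a real harness
    except Exception:
        return ""  # fall back to an empty answer on execution failure
    return buffer.getvalue().strip()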
A.5 Benchmark Results w/ DPO Stage
Table 17 presents the benchmark results obtained after the DPO stage. Based on these results, we conclude that the DPO stage does not significantly impact the fundamental capabilities of the LLM.
Table 17: Benchmark metrics before and after the DPO stage.
Benchmark DeepSeek 67B Chat DeepSeek 67B Chat DPO
HellaSwag 75.7 76.1
TriviaQA 81.5 82.9
NaturalQuestions 47.0 48.8
MMLU 71.1 70.9
GSM8K 84.1 85.2
MATH 32.6 30.2
HumanEval 73.8 71.3
BBH 71.7 70.8
AGIEval 46.4 46.1
CEval 65.2 64.3
CMMLU 67.8 68.2
A.6 Evaluation Formats
Tables 18-40 present examples of our evaluation formats on the different benchmarks.
Table 18: An AGIEval example — a multiple-choice biology question from the Chinese college entrance examination.
Table 19: An ARC example — a multiple-choice question from a US science exam.
Table 20: A BBH example — boolean expression reasoning in a chain-of-thought format.
PROMPT:
以下是一道中国高考生物选择题,请选择正确的答案。
问题:下列有关高尔基体、线粒体和叶绿体的叙述, 正确的是
选项:(A)三者都存在于蓝藻中 (B)三者都含有 DNA (C)三者都是 ATP 合成的场所 (D)三者的膜结构中都含有蛋白质
答案:从A到D, 我们应选择
Table 18: An example of AGIEval.
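For generation-based multiple-choice benchmarks like the AGIEval example above, a prompt of this shape is assembled and the option letter is parsed from the model's continuation. The sketch below is illustrative only: the prompt template mirrors the example, but the helper names and the parsing rule are our own, not the paper's exact harness.

import re

def build_mc_prompt(question: str, options: dict) -> str:
    # options: e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    lines = [f"({label}) {text}" for label, text in sorted(options.items())]
    return f"问题:{question}\n选项:" + " ".join(lines) + "\n答案:从A到D, 我们应选择"

def parse_choice(generated: str):
    # Take the first option letter appearing in the model's continuation.
    match = re.search(r"[ABCD]", generated)
    return match.group(0) if match else None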
Table 21: A C-Eval example — a single-choice question from a Chinese pedagogy examination.
The examples above illustrate the evaluation formats for the different Chinese and English benchmarks; all formats are designed to ensure fair and consistent evaluation.