DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
Daya Guo*¹, Qihao Zhu*¹,², Dejian Yang¹, Zhenda Xie¹, Kai Dong¹, Wentao Zhang¹, Guanting Chen¹, Xiao Bi¹, Y. Wu¹, Y.K. Li¹, Fuli Luo¹, Yingfei Xiong², Wenfeng Liang¹
¹DeepSeek-AI
²Key Lab of HCST (PKU), MOE; SCS, Peking University
{zhuqh, guodaya}@deepseek.com
https://github.com/deepseek-ai/DeepSeek-Coder
Abstract
The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source large language models for code intelligence that achieve state-of-the-art performance on tasks such as code generation, code completion, and code understanding.
... inaccessible to many researchers and developers due to their proprietary nature. In response to this challenge, we present the DeepSeek-Coder series. This series comprises a range of open-source code models, varying in size from 1.3B to 33B, including the base version and instructed version for each size. Each model in the series has been trained from scratch on 2 trillion tokens sourced from 87 programming languages, ensuring a comprehensive understanding of coding languages and syntax. Besides, we attempt to organize the pre-training data at the repository level to enhance the pre-trained model's ...
... large language models (LLMs). Developed through extensive training on an expansive code corpus, these models exhibit proficiency in understanding 87 programming languages. Additionally, they are available in various model scales to cater to a wide range of computational and application needs.
• We make the first attempt to incorporate repository-level data construction during the pre-training phase of our models. We find that it can significantly boost the capability of cross-file code generation.
• Our analysis rigorously examines the impact of FIM training strategies on the pretraining phase ...
... data crawling, rule-based filtering, dependency parsing, repository-level deduplication, and quality screening, as illustrated in Figure 2. In the following, we will describe the data creation procedure step by step.
Figure 2: The Procedure of Dataset Creation
... dependencies between files within the same repository in this step. Specifically, we first parse the dependencies between files and then arrange these files in an order that ensures the context each file relies on is placed before that file in the input sequence. By aligning the files in accordance with their dependencies, our dataset more accurately represents real coding practices and structures. This enhanced alignment not only makes our dataset more relevant but also potentially increases the practicality and applicability of the model in handling project-level code scenarios. It's worth noting ...
1 Introduction
The field of software development has been significantly transformed by the swift advancement of large language models (Touvron et al., 2023; OpenAI, 2023), which have brought about a new era of code intelligence. These models have the potential to automate and streamline many aspects of coding, from bug detection to code generation, thereby enhancing productivity and reducing the likelihood of human error. However, a major challenge in this field is the performance gap between open-source models (Roziere et al., 2023; Li et al., 2023; Nijkamp et al., 2022; Wang et al., 2021) and closed-source ...
... public code-related benchmarks. The findings reveal that among open-source models, DeepSeek-Coder-Base 33B consistently delivers superior performance across all benchmarks. Furthermore, DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo in the majority of the evaluation benchmarks, significantly narrowing the performance gap between OpenAI GPT-4 and open-source models. Remarkably, despite having fewer parameters, DeepSeek-Coder-Base 6.7B demonstrates competitive performance when compared to models that are five times larger, such as CodeLlama-34B (Roziere et al., 2023). To summarize, ...
The training dataset of DeepSeek-Coder is composed of 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus. The English corpus consists of materials from GitHub's Markdown and StackExchange (https://stackexchange.com), which are used to enhance the model's understanding of code-related concepts and improve its ability to handle tasks like library usage and bug fixing. Meanwhile, the Chinese corpus consists of high-quality articles aimed at improving the model's proficiency in understanding the Chinese language. In this section, ...
... constitutes at least 20% of the code and is no less than 100 characters. For JSON and YAML files, which typically contain more data, we only keep files that have a character count ranging from 50 to 5000 characters. This effectively removes most data-heavy files.
2 Data Collection
Algorithm 1: Repository-level Dependency Parsing (excerpt)
    ...
        graphs[file] ← []
        inDegree[file] ← 0
    end for

    for each fileA in files do
        for each fileB in files do
            if HasDependency(fileA, fileB) then        ▷ If fileA depends on fileB
                graphs[fileB].append(fileA)            ▷ Add edge from B to A
                inDegree[fileA] ← inDegree[fileA] + 1
    ...
The algorithm uses a topological sort to determine the order in which files are processed, so that the files each file depends on are handled before it.
Main loop of the topological sort (continued):
    ...
        file ← argmin({inDegree[file] | file ∈ subgraph and file ∉ results})
        for each node in graphs[file] do
            inDegree[node] ← inDegree[node] − 1
        end for
    ...
... generated for each subgraph. The algorithm concludes by returning a list of these sorted sequences, and each sequence's files are concatenated to form a single training sample. To incorporate file path information, a comment indicating the file's path is added at the beginning of each file. This method ensures that the path information is preserved in the training data.
2.1 GitHub Data Crawling and Filtering
We collect public repositories created before February 2023 on GitHub and retain only 87 programming languages, as listed in Table 1. To reduce the amount of data to be processed, we apply filtering rules similar to those used in the StarCoder project (Li et al., 2023) to preliminarily filter out lower-quality code. By applying these filtering rules, we reduce the total amount of data to only 32.8% of its original size. To make the paper self-contained, we briefly describe the filter rules used in the StarCoder Data project: Firstly, we filter out files with an average line length exceeding ...
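As a rough illustration of rule-based filtering, the sketch below applies the one rule whose thresholds are fully stated in the filtering description: JSON and YAML files are kept only when their character count lies between 50 and 5000. The average-line-length limit is not given in the text above, so it is left as a required argument. The function names, the suffix set, and the limit passed in the usage lines are illustrative assumptions, not the authors' pipeline.

```python
from pathlib import PurePosixPath

# Assumption: these suffixes are what "JSON and YAML files" means in practice.
DATA_SUFFIXES = {".json", ".yaml", ".yml"}


def average_line_length(text: str) -> float:
    """Mean number of characters per line (0.0 for an empty file)."""
    lines = text.splitlines()
    return sum(len(line) for line in lines) / len(lines) if lines else 0.0


def keep_file(path: str, text: str, max_avg_line_length: float) -> bool:
    """Return True if the file survives the two illustrative rules."""
    # Rule 1: drop files whose average line length exceeds a caller-supplied limit.
    if average_line_length(text) > max_avg_line_length:
        return False
    # Rule 2: JSON/YAML files must have between 50 and 5000 characters (inclusive here).
    if PurePosixPath(path).suffix.lower() in DATA_SUFFIXES:
        return 50 <= len(text) <= 5000
    return True


# Usage with a hypothetical limit of 100 characters per line.
print(keep_file("config.json", "{}", 100))        # too short for a data file
print(keep_file("module.py", "x = 1\n", 100))     # ordinary source file
```

Keeping each rule as a small predicate makes it easy to log which rule rejected a file, which helps when tuning thresholds on a sample of the crawl.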
2.2 Dependency Parsing
In previous works (Li et al., 2023; Roziere et al., 2023; Nijkamp et al., 2022; Chen et al., 2021), large language models for code are mainly pre-trained on file-level source code, which ignores the dependencies between different files in a project. However, in practical applications, such models struggle to effectively scale to handle entire project-level code scenarios. Therefore, we will consider how to leverage the dependencies between files within the same repository in this step. Specifically, we first parse the dependencies between files and then arrange these files in an order that ensures the context each file relies on is placed before that file in the input sequence.
Dependency parsing algorithm (continued):
        for each fileB in files do
            if HasDependency(fileA, fileB) then        ▷ If fileA depends on fileB
                graphs[fileB].append(fileA)            ▷ Add edge from B to A
                inDegree[fileA] ← inDegree[fileA] + 1
            end if
        end for
    end for
We identify import/include relationships and function-call dependencies between files through static analysis, and build a directed dependency graph that supports ordered processing of repository-level code.
Output stage of the topological sort:
            for each node in graphs[file] do
                inDegree[node] ← inDegree[node] − 1
            end for
            results.append(file)
        end while
        allResults.append(results)
    ...
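The pseudocode fragments above can be put together as the following Python sketch. It follows the visible steps: build edges from each dependency to its dependents, then repeatedly pick the file with the minimal in-degree inside each subgraph. The way subgraphs are found (undirected connected components), the `depends_on` mapping that stands in for HasDependency, the tie-breaking by path, and the `#` comment prefix are assumptions made for illustration.

```python
def order_repository(files, depends_on):
    """Return one dependency-respecting file sequence per connected subgraph.

    files: list of file paths.
    depends_on: dict mapping a path to the paths it depends on
                (a stand-in for HasDependency).
    """
    graph = {f: [] for f in files}        # graph[b] lists the files that depend on b
    in_degree = {f: 0 for f in files}
    for file_a in files:
        for file_b in depends_on.get(file_a, ()):
            if file_b in graph and file_b != file_a:
                graph[file_b].append(file_a)   # edge from B to A
                in_degree[file_a] += 1

    # Undirected adjacency, used only to split the repository into subgraphs.
    neighbours = {f: set() for f in files}
    for file_b, dependents in graph.items():
        for file_a in dependents:
            neighbours[file_a].add(file_b)
            neighbours[file_b].add(file_a)

    all_results, seen = [], set()
    for start in files:
        if start in seen:
            continue
        seen.add(start)
        stack, subgraph = [start], []
        while stack:
            current = stack.pop()
            subgraph.append(current)
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)

        results, remaining = [], set(subgraph)
        while remaining:
            # Minimal in-degree first; ties are broken by path for determinism.
            file = min(sorted(remaining), key=lambda f: in_degree[f])
            for node in graph[file]:
                in_degree[node] -= 1
            results.append(file)
            remaining.remove(file)
        all_results.append(results)
    return all_results


def build_sample(sequence, sources, comment_prefix="#"):
    """Concatenate files in order, each preceded by a comment giving its path."""
    return "\n".join(f"{comment_prefix} {path}\n{sources[path]}" for path in sequence)


deps = {"main.py": ["utils.py", "models.py"], "models.py": ["utils.py"]}
orders = order_repository(["main.py", "utils.py", "models.py", "setup.py"], deps)
print(orders)
```

Selecting by minimal in-degree rather than requiring an in-degree of zero means the loop still terminates when files import each other in a cycle, at the cost of not guaranteeing a strict dependency order in that case.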
2.3 Repo-Level Deduplication
Recent studies have demonstrated the significant performance improvements that can be achieved by deduplicating training datasets for Large Language Models (LLMs). Lee et al. (2022) have shown that language model training corpora often contain numerous near-duplicates, and the performance of LLMs can be enhanced by removing long repetitive substrings. Kocetkov et al. (2022) have applied a near-deduplication method to training data, resulting in dramatic improvements, and they emphasize that near-deduplication is a crucial preprocessing step for achieving competitive performance on code benchmarks ...
2.4 Quality Screening and Decontamination
In addition to applying the filtering rules mentioned in Section 2.1, we also employ a compiler and a quality model, combined with heuristic rules, to further filter out low-quality data. This includes code with syntax errors, poor readability, and low modularity. We provide the statistical summary of source code in Table 1, which includes a total of 87 languages, detailing the disk size, number of files, and percentage for each language. The total data volume is 798 GB with 603 million files. To ensure that our code training data is not contaminated by information from the test set, which ...
3 Training Policy
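3.1 Training Strategy
3.1.1 Next Token Prediction
The first training objective for our model is next token prediction. In this process, multiple files are concatenated to form a fixed-length entry. These entries are then used to train the model, enabling it to predict the subsequent token based on the provided context.
3.1.2 Fill-in-the-Middle
The second training objective is fill-in-the-middle. In code pre-training scenarios, it is often necessary to generate the inserted content from the given context and the subsequent text. Due to specific dependencies in a programming ...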
To determine the effectiveness of various hyperparameters within the FIM approach, we conducted a series of ablation experiments.
Experiment Settings: In this experiment, we employ DeepSeek-Coder-Base 1.3B as our model architecture. We focused on a Python subset from our training dataset to streamline the experimental process. Our primary objective was to assess the efficacy of the Fill-in-the-Middle (FIM) technique, utilizing the HumanEval-FIM benchmark (Fried et al., 2022). This benchmark specializes in a single-line FIM task for Python, in which one line of code from a HumanEval solution is randomly obscured, testing ...
... three sentinel tokens specifically for this task. For each code file, we initially divide its content into three segments, denoted as $f_{pre}$, $f_{middle}$, and $f_{suf}$. Using the PSM mode, we construct the training example as follows:
$\texttt{<s_1>}\, f_{pre}\, \texttt{<s_2>}\, f_{suf}\, \texttt{<s_3>}\, f_{middle}$
where $\texttt{<s_1>}$, $\texttt{<s_2>}$, and $\texttt{<s_3>}$ stand for the three sentinel tokens.
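The PSM construction can be sketched in a few lines of Python. The sentinel strings below are placeholders, since the actual token names are not preserved in this text, and splitting at two random character offsets is an assumption about how the three segments are chosen. The default `fim_rate` of 0.5 matches the FIM rate reported for pre-training in the evaluation section.

```python
import random

# Placeholder sentinels (hypothetical); the real token strings are not given here.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_psm(text, rng, fim_rate=0.5):
    """With probability fim_rate, rewrite text in prefix-suffix-middle order."""
    if len(text) < 2 or rng.random() >= fim_rate:
        return text  # left as an ordinary next-token-prediction example
    # Two distinct cut points split the text into f_pre, f_middle, f_suf.
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    f_pre, f_middle, f_suf = text[:i], text[i:j], text[j:]
    return f"{PRE}{f_pre}{SUF}{f_suf}{MID}{f_middle}"


def from_psm(example):
    """Invert to_psm (assumes the sentinels never occur in the original text)."""
    if not example.startswith(PRE):
        return example
    f_pre, rest = example[len(PRE):].split(SUF, 1)
    f_suf, f_middle = rest.split(MID, 1)
    return f_pre + f_middle + f_suf


rng = random.Random(0)
source = "def add(a, b):\n    return a + b\n"
print(to_psm(source, rng, fim_rate=1.0))
```

Because the middle segment comes last, an ordinary left-to-right model learns to produce the missing span after seeing both the prefix and the suffix.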
... models are summarized in Table 2.
3.4 Optimization
Following DeepSeek LLM (DeepSeek-AI, 2024), we use AdamW (Loshchilov and Hutter, 2019) as the optimizer with $\beta_1$ and $\beta_2$ values of 0.9 and 0.95. We adapt batch sizes and learning rates by the scaling laws suggested in DeepSeek LLM. For the learning rate scheduling, we implement a three-stage policy, which includes 2000 warm-up steps, and set the final learning rate to 10% of the initial rate. Notably, the learning rate at each stage is scaled down to $\sqrt{\frac{1}{10}}$ of the preceding ...
... in both A100 and H800 clusters, we employ InfiniBand interconnects, known for their high throughput and low latency. This setup provides a robust and efficient infrastructure for our computational experiments.
3.6 Long Context
To enhance the capabilities of DeepSeek-Coder in handling extended contexts, particularly for scenarios like repository-level code processing, we have reconfigured the RoPE (Su et al., 2023) parameters to extend the default context window. Following previous practices (Chen et al., 2023; kaiokendev, 2023), we employed a linear scaling strategy, increasing the scaling ...
... a multi-turn dialogue scenario for building a snake game. Initially, we ask the model to write a snake game using pygame. The model successfully creates a basic snake game that can run without bugs. To improve the game, we further request adding a scoring system in the top left corner. The model then introduces a "score" variable and a "display_score" function, along with an explanation of how to integrate these features. This example illustrates DeepSeek-Coder-Instruct's ability to provide complete solutions in multi-turn dialogue settings. More cases can be found in Appendix A. Figure 4 ...
For the tokenization process, we employ the HuggingFace Tokenizer library (https://github.com/huggingface/tokenizers) to train Byte Pair Encoding (BPE) tokenizers, as outlined in Sennrich et al. (2015), on a subset of our training corpus. Ultimately, we utilize a tokenizer configured with a vocabulary size of 32,000.
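To make the BPE idea concrete, here is a toy merge-learning loop in plain Python. It is not the HuggingFace Tokenizer library and works on characters of whitespace-separated words, which is a simplification. It only shows the core step: repeatedly merging the most frequent adjacent symbol pair. In a real tokenizer, the vocabulary size (32,000 here) bounds the number of base symbols plus learned merges.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to num_merges BPE merges from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically smallest pair.
        best = max(sorted(pairs), key=pairs.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


print(learn_bpe_merges("aa aa ab", 5))
```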
We develop a range of models with varying parameters to cater to diverse applications, including models with 1.3B, 6.7B, and 33B parameters. These models are built upon the same framework as the DeepSeek Large Language Model (LLM) outlined by DeepSeek-AI (2024). Each model is a decoder-only Transformer, incorporating Rotary Position Embedding (RoPE) as described by Su et al. (2023). Notably, the DeepSeek 33B model integrates Grouped-Query-Attention (GQA) with a group size of 8, enhancing both training and inference efficiency. Additionally, we employ FlashAttention v2 (Dao, 2023) to expedite ...
Following DeepSeek LLM (DeepSeek-AI, 2024), we use AdamW (Loshchilov and Hutter, 2019) as the optimizer with $\beta_1$ and $\beta_2$ values of 0.9 and 0.95. We adapt batch sizes and learning rates by the scaling laws suggested in DeepSeek LLM. For the learning rate scheduling, we implement a three-stage policy, which includes 2000 warm-up steps, and set the final learning rate to 10% of the initial rate. Notably, the learning rate at each stage is scaled down to $\sqrt{\frac{1}{10}}$ of the preceding stage's rate, following the guidelines established ...
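One reading of this schedule is sketched below: a warm-up followed by three constant stages, each $\sqrt{1/10}$ of the previous one, so that the last stage sits at 10% of the peak rate. The linear shape of the warm-up and the stage boundaries are assumptions; the boundaries are not stated in the text above, so they are passed in by the caller and the values in the usage lines are purely illustrative.

```python
import math


def learning_rate(step, peak_lr, warmup_steps, stage_boundaries):
    """Step-wise schedule: warm-up, then stages scaled by sqrt(1/10) each.

    stage_boundaries: steps at which the second and third stages begin
    (hypothetical values supplied by the caller).
    """
    if step < warmup_steps:
        # Assumed linear warm-up that reaches peak_lr at the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    decay = math.sqrt(1 / 10)
    stage = sum(step >= boundary for boundary in stage_boundaries)
    return peak_lr * decay ** stage


# Illustrative run: 2000 warm-up steps, boundaries chosen arbitrarily.
for step in (0, 1999, 5000, 8000, 9500):
    print(step, learning_rate(step, 1.0, 2000, (8000, 9000)))
```

Two reductions by $\sqrt{1/10}$ multiply to $1/10$, which is why a three-stage policy with this factor ends at exactly 10% of the initial rate.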
Our experiments are conducted using the HAI-LLM (High-Flyer, 2023) framework, known for its efficiency and lightweight approach in training large language models. This framework incorporates a variety of parallelism strategies to optimize computational efficiency. These include tensor parallelism (Korthikanti et al., 2023), alongside ZeRO data parallelism (Rajbhandari et al., 2020) and PipeDream pipeline parallelism (Narayanan et al., 2019). Our experiments utilize clusters outfitted with NVIDIA A100 and H800 GPUs. In the A100 cluster, each node is configured with 8 GPUs, interconnected ...
To enhance the capabilities of DeepSeek-Coder in handling extended contexts, particularly for scenarios like repository-level code processing, we have reconfigured the RoPE (Su et al., 2023) parameters to extend the default context window. Following previous practices (Chen et al., 2023; kaiokendev, 2023), we employed a linear scaling strategy, increasing the scaling factor from 1 to 4 and altering the base frequency from 10000 to 100000. The model underwent an additional 1000 steps of training, using a batch size of 512 and a sequence length of ...
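The effect of the two RoPE changes can be illustrated with the standard rotary angle formula, $\theta_i = p \cdot b^{-2i/d}$ for position $p$, base $b$, and head dimension $d$. Under linear scaling the position is divided by the scaling factor before the angles are computed. The function below is a minimal sketch of that arithmetic with an illustrative head dimension, not the model's actual implementation.

```python
def rope_angles(position, head_dim, base=10000.0, scaling_factor=1.0):
    """Rotation angles for one position under linearly scaled RoPE."""
    scaled_position = position / scaling_factor
    return [scaled_position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# With a scaling factor of 4, position 8 is rotated like position 2 was before.
print(rope_angles(8, 4, scaling_factor=4.0))
print(rope_angles(2, 4))

# Raising the base from 10000 to 100000 lowers the higher-index frequencies.
print(rope_angles(1, 4, base=10000.0))
print(rope_angles(1, 4, base=100000.0))
```

Both changes compress the range of angles seen at long positions back toward the range covered during the original training, which is why only a short additional training run is needed to adapt the model.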
We develop DeepSeek-Coder-Instruct by enhancing the DeepSeek-Coder-Base through instruction-based fine-tuning using high-quality data. This data comprises helpful and impartial human instructions, structured by the Alpaca Instruction format (Taori et al., 2023). To demarcate each dialogue turn, we employed a unique delimiter token to signify the conclusion of each segment. For training, we use a cosine schedule with 100 warm-up steps and an initial learning rate of 1e-5. We also use a batch size of 4M tokens and 2B tokens in total. An example of using DeepSeek-Coder-Instruct 33B is depicted ...
4 Experimental Results
In this section, we evaluate DeepSeek-Coder on four tasks: code generation (§4.1), FIM code completion (§4.2), cross-file code completion (§4.3), and program-based math reasoning (§4.4). We compare DeepSeek-Coder with the previous state-of-the-art large language models:
• CodeGeeX2 (Zheng et al., 2023) represents the second generation of the multilingual code generation model CodeGeeX. It is developed using the ChatGLM2 (Du et al., 2022) architecture and is enhanced with an extensive dataset of coding examples.
• StarCoder (Li et al., 2023) is a publicly accessible model with ...
... while the MBPP benchmark includes 500 problems in a few-shot setting. To evaluate the model's multilingual capabilities, we expanded the Python problems of the HumanEval benchmark to seven additional commonly used programming languages, namely C++, Java, PHP, TypeScript (TS), C#, Bash, and JavaScript (JS) (Cassano et al., 2023). For both benchmarks, we adopted a greedy search approach and re-implemented the baseline results using the same script and environment for fair comparison.
Model              Size   Python   C++     Java    PHP   TS   C#   Bash   JS   Avg   MBPP
Multilingual Base Models
code-cushman-001   12B    33.5%    31.9%   30.6...
... open-source model CodeLlama-Base 34B, our model has demonstrated a notable improvement of 9% and 11% in accuracy, respectively. It's worth noting that even our smaller model, DeepSeek-Coder-Base 6.7B, surpasses the performance of CodeLlama-Base 34B. After instruction fine-tuning, our model surpasses the closed-source GPT-3.5-Turbo model in the HumanEval benchmark, significantly reducing the performance gap between OpenAI GPT-4 and open-source models.
DS-1000 Benchmark: HumanEval and MBPP have a significant drawback in that they rely heavily on straightforward programming tasks that may not accurately represent ...
... in future studies using our released LeetCode data.
4.2 Fill-in-the-Middle Code Completion
DeepSeek-Coder models are trained with a 0.5 FIM (Fill-In-the-Middle) rate during their pretraining phase. This specialized training strategy empowers the model to proficiently generate code by filling in blanks based on the surrounding context, both prefix and suffix, of the given code snippet. This capability is particularly advantageous in the realm of code completion tools. Several open-source models have emerged with similar capabilities. Notable among these are SantaCoder (Allal et al., 2023), ...
... there is a corresponding enhancement in performance. This trend underscores the importance of model capacity in achieving higher accuracy in code completion tasks. Based on these findings, we recommend the deployment of the DeepSeek-Coder-Base 6.7B model in code completion tools. This recommendation is grounded in the model's demonstrated balance between efficiency and accuracy. The DeepSeek-Coder-Base 6.7B model, with its substantial parameter size, has proven to be highly effective in the context of code completion, making it an ideal choice for integrating advanced computational ...
... and MAWPS (Gou et al., 2023). In each of these benchmarks, the model is prompted to alternately describe a solution step in natural language and then execute that step with code. As seen in Table 8, DeepSeek-Coder models achieve a remarkable performance across all benchmarks, especially the 33B variant, which demonstrates the potential of using such models in applications that require complex mathematical computations and problem-solving abilities.
Model       Size   GSM8k   MATH   GSM-Hard   SVAMP   TabMWP   ASDiv   MAWPS   Avg
Multilingual Base Models
CodeGeeX2   7B     22.2%   9.7%   23.6%      39.0%   44.6%    48.5%   66.0%   36.2%
...
HumanEval and MBPP Benchmarks: The HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) benchmarks are widely used for evaluating code LLMs. HumanEval consists of 164 hand-written Python problems that are validated using test cases to assess the code generated by a Code LLM in a zero-shot setting, while the MBPP benchmark includes 500 problems in a few-shot setting. To evaluate the model's multilingual capabilities, we expanded the Python problems of the HumanEval benchmark to seven additional commonly used programming languages, namely C++, Java, PHP, TypeScript (TS), C#, Bash, and JavaScript (JS) ...
... possibility of data contamination cannot be entirely ruled out. We observed that the GPT-4-Turbo and DeepSeek-Coder models achieved higher scores in the LeetCode Contest held in July and August. We encourage the research community to consider the potential issue of data contamination when evaluating models in future studies using our released LeetCode data.
DeepSeek-Coder models are trained with a 0.5 FIM (Fill-In-the-Middle) rate during their pretraining phase. This specialized training strategy empowers the model to proficiently generate code by filling in blanks based on the surrounding context, both prefix and suffix, of the given code snippet. This capability is particularly advantageous in the realm of code completion tools. Several open-source models have emerged with similar capabilities. Notable among these are SantaCoder (Allal et al., 2023), StarCoder (Li et al., 2023), and CodeLlama (Roziere et al., 2023). These models have set ...
... the importance of model capacity in achieving higher accuracy in code completion tasks. Based on these findings, we recommend the deployment of the DeepSeek-Coder-Base 6.7B model in code completion tools. This recommendation is grounded in the model's demonstrated balance between efficiency and accuracy. The DeepSeek-Coder-Base 6.7B model, with its substantial parameter size, has proven to be highly effective in the context of code completion, making it an ideal choice for integrating advanced computational capabilities into coding environments.
In this section, we will evaluate the performance of existing open-source models in cross-file code completion tasks. Unlike code generation discussed in the previous section, cross-file code completion requires the model to access and understand repositories that span multiple files with numerous cross-file dependencies. We use CrossCodeEval (Ding et al., 2023) to evaluate the capabilities of currently available open-source code models of 7B scale in cross-file completion tasks. This dataset is constructed on a diverse set of real-world, open-sourced, permissively licensed repositories in four ...
... the cross-file context, we utilize the official BM25 search results provided by Ding et al. (2023). Evaluation metrics include exact match and edit similarity. The results, presented in Table 7, demonstrate that DeepSeek-Coder consistently outperforms other models in cross-file completion tasks across multiple languages, showcasing its superior practical application capabilities. When only utilizing file-level code corpus (w/o Repo Pre-training) to pre-train DeepSeek-Coder, we observe a decrease in performance in the Java, TypeScript, and C# languages, indicating the effectiveness of ...
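The two metrics named above can be sketched as follows. Exact match compares the prediction with the reference after trimming whitespace, and edit similarity is taken here as one minus the Levenshtein distance normalized by the longer string. Both the trimming and this particular normalization are assumptions; the benchmark's own scoring script may define them differently.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def exact_match(prediction, reference):
    return prediction.strip() == reference.strip()


def edit_similarity(prediction, reference):
    prediction, reference = prediction.strip(), reference.strip()
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(prediction, reference) / longest


print(exact_match("return a + b", "return a + b "))
print(edit_similarity("return a + b", "return a - b"))
```

Edit similarity gives partial credit to completions that are close to the reference, which matters for line completion where a single wrong identifier would otherwise score zero.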
4.4 Program-based Math Reasoning
Program-based math reasoning involves evaluating a model's ability to understand and solve mathematical problems through programming. This type of reasoning is critical in fields such as data analysis and scientific computing. To conduct this assessment, we utilize the Program-Aided Math Reasoning (PAL) method as outlined in Gao et al. (2023). This approach is applied across seven distinct benchmarks, each offering unique challenges and contexts. These benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), GSM-Hard (Gao et al., 2023), SVAMP (Patel et al., 2021), ...
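In a program-aided setup, the model writes a short program whose execution yields the answer, so the arithmetic is done by the interpreter rather than by the model. The sketch below shows only the evaluation side. The word problem, the `GENERATED` program text, and the convention that the program defines a `solution()` function are all hypothetical stand-ins for real model output.

```python
# Hypothetical model output for: "A shop has 3 boxes of 4 apples and sells 5.
# How many apples are left?"
GENERATED = '''
def solution():
    # Step 1: count the apples in all boxes.
    apples = 3 * 4
    # Step 2: subtract the apples that were sold.
    remaining = apples - 5
    return remaining
'''


def run_generated_program(program):
    """Execute generated code and return the value of its solution() function.

    exec runs arbitrary code, so real evaluations should use a sandbox.
    """
    namespace = {}
    exec(program, namespace)
    return namespace["solution"]()


def is_correct(program, expected):
    """Score one sample: any runtime error counts as a wrong answer."""
    try:
        return run_generated_program(program) == expected
    except Exception:
        return False


print(run_generated_program(GENERATED))
```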
5 Continue Pre-Training From General LLM
To further enhance the natural language understanding and mathematical reasoning abilities of the DeepSeek-Coder model, we perform additional pre-training from the general language model DeepSeek-LLM-7B Base (DeepSeek-AI, 2024) on 2 trillion tokens, resulting in DeepSeek-Coder-v1.5 7B. For this pre-training, we specifically use the data sources listed in Table 9. Unlike DeepSeek-Coder, DeepSeek-Coder-v1.5 employs solely a next token prediction objective with a 4K context length during its pre-training phase.
Data Source                  Percentage
Source Code                  70%
Markdown and StackExchange   10%
Natural language ...
... model. In particular, in the Math Reasoning and Natural Language categories, DeepSeek-Coder-Base-v1.5 significantly outperforms its predecessor across all benchmarks, which also demonstrates significant improvements in its mathematical reasoning and natural language processing capabilities.
Column groups: Programming (HumanEval, MBPP); Math Reasoning (GSM8K, MATH); Natural Language (MMLU, BBH, HellaSwag, WinoG, ARC-C)
Models                     Size   HumanEval   MBPP    GSM8K   MATH    MMLU    BBH     HellaSwag   WinoG   ARC-C
DeepSeek-Coder-Base        6.7B   44.7%       60.6%   43.2%   19.2%   36.6%   44.3%   53.8%       57.1%   32.5%
DeepSeek-Coder-Base-v1.5   6.9B   43.2%       60.4%   62.4%   24.7%   49.1%   55.2%   69.9%       63.8%   47.2%
DeepSeek-Coder-Instruct    6.7B   66...
In this technical report, we introduce a series of specialized Large Language Models (LLMs) for coding, named DeepSeek-Coder, available in three distinct scales: 1.3B, 6.7B, and 33B parameters. These models are uniquely trained on a meticulously curated project-level code corpus, utilizing a "fill-in-the-blank" pre-training objective to enhance code infilling capabilities. A significant advancement is the extension of the models' context window to 16,384 tokens, thereby greatly improving their effectiveness in handling extensive code generation tasks. Our evaluations reveal that the most advanced ...
... general LLMs. The reason is evident: to effectively interpret and execute coding tasks, these models must also possess a deep understanding of human instructions, which often come in various forms of natural language. Looking ahead, our commitment is to develop and openly share even more powerful code-focused LLMs based on larger-scale general LLMs.
Appendix A Cases of Chatting with DeepSeek-Coder-Instruct
We will present two cases of interactions with DeepSeek-Coder-Instruct, with one involving a multi-turn conversation about creating a database and performing data analysis, and the other centered around using a model to solve a sample problem from LeetCode. In the first scenario, depicted in Figure 5, we instruct the model to build a student database using Python and randomly insert 10 pieces of information. Subsequently, in the second round of the conversation, we ask the model to analyze the age distribution of the students. From Figure 5, it's evident that the model can generate ...
Appendix B Benchmark curves during training of DeepSeek-Coder-Base
In Figure 7, we present the benchmark curves illustrating the performance of DeepSeek-Coder-Base models during their training phase. For validation, a carefully curated subset of the training corpus was employed, consisting of 8,000 code files. This subset was deliberately chosen to ensure a diverse and representative sample, critical for an accurate assessment of the models' capabilities. The performance metrics of these models are specifically detailed in the final two sub-figures of Figure 7, offering a clear visual representation of their efficacy throughout the training process. Figure ...