
DeepSeek-Coder: Let the Code Write Itself


arXiv: 2401.14196 · 2024-01-25 · PDF

Summary

The DeepSeek-Coder code models surpass contemporaneous open-source models on programming benchmarks such as HumanEval and MBPP. They are pre-trained on a large-scale code corpus and support a range of programming tasks, including code completion, code generation, code translation, and code explanation, achieving the best results among open-source models of the time on multiple code-intelligence benchmarks.

DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence

Authors: Daya Guo, Qihao Zhu, Dejian Yang, et al. Affiliations: 1 DeepSeek-AI, 2 Peking University. Abstract: We introduce the DeepSeek-Coder series, a family of open-source code models ranging in size from 1.3B to 33B parameters.
Original: Daya Guo* 1, Qihao Zhu* 1,2, Dejian Yang 1, Zhenda Xie 1, Kai Dong 1, Wentao Zhang 1, Guanting Chen 1, Xiao Bi 1, Y. Wu 1, Y.K. Li 1, Fuli Luo 1, Yingfei Xiong 2, Wenfeng Liang 1. 1 DeepSeek-AI; 2 Key Lab of HCST (PKU), MOE; SCS, Peking University. {zhuqh, guodaya}@deepseek.com. https://github.com/deepseek-ai/DeepSeek-Coder. Abstract: The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-…

Because of their proprietary nature, closed-source code models are inaccessible to many researchers and developers. To address this, we present the DeepSeek-Coder series, open-source code models ranging in size from 1.3B to 33B.
Original: …ssible to many researchers and developers due to their proprietary nature. In response to this challenge, we present the DeepSeek-Coder series. This series comprises a range of open-source code models, varying in size from 1.3B to 33B, including the base version and instructed version for each size. Each model in the series has been trained from scratch on 2 trillion tokens sourced from 87 programming languages, ensuring a comprehensive understanding of coding languages and syntax. Besides, we attempt to organize the pre-training data at the repository level to enhance the pre-trained model's …

Trained on an expansive code corpus, these models demonstrate proficiency in understanding 87 programming languages.
Original: …large language models (LLMs). Developed through extensive training on an expansive code corpus, these models exhibit proficiency in understanding 87 programming languages. Additionally, they are available in various model scales to cater to a wide range of computational and application needs. • We make the first attempt to incorporate repository-level data construction during the pre-training phase of our models. We find that it can significantly boost the capability of cross-file code generation. • Our analysis rigorously examines the impact of FIM training strategies on the pretraining phase…

We apply a rigorous pipeline of filtering, dependency parsing, repository-level deduplication, and quality screening.
Original: …ing, rule-based filtering, dependency parsing, repository-level deduplication, and quality screening, as illustrated in Figure 2. In the following, we will describe the data creation procedure step by step. Figure 2: The Procedure of Dataset Creation. 2.1 GitHub Data Crawling and Filtering: We collect public repositories created before February 2023 on GitHub and retain only 87 programming languages, as listed in Table 1. To reduce the amount of data to be processed, we apply filtering rules similar to those used in the StarCoder project (Li et al., 2023) to preliminarily filter out lower-qua…

We parse inter-file dependencies and arrange the files so that the context each file relies on appears before it.
Original: …ndencies between files within the same repository in this step. Specifically, we first parse the dependencies between files and then arrange these files in an order that ensures the context each file relies on is placed before that file in the input sequence. By aligning the files in accordance with their dependencies, our dataset more accurately represents real coding practices and structures. This enhanced alignment not only makes our dataset more relevant but also potentially increases the practicality and applicability of the model in handling project-level code scenarios. It's worth notin…

1 Introduction

The field of software development has been significantly transformed by the rapid progress of large language models, ushering in a new era of code intelligence.
Original: The field of software development has been significantly transformed by the swift advancement of large language models (Touvron et al., 2023; OpenAI, 2023), which have brought about a new era of code intelligence. These models have the potential to automate and streamline many aspects of coding, from bug detection to code generation, thereby enhancing productivity and reducing the likelihood of human error. However, a major challenge in this field is the performance gap between open-source models (Roziere et al., 2023; Li et al., 2023; Nijkamp et al., 2022; Wang et al., 2021) and closed…

Among open-source models, DeepSeek-Coder-Base 33B consistently delivers superior performance across all benchmarks, and DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo on the majority of them.
Original: …public code-related benchmarks. The findings reveal that among open-source models, DeepSeek-Coder-Base 33B consistently delivers superior performance across all benchmarks. Furthermore, DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo in the majority of the evaluation benchmarks, significantly narrowing the performance gap between OpenAI GPT-4 and open-source models. Remarkably, despite having fewer parameters, DeepSeek-Coder-Base 7B demonstrates competitive performance when compared to models that are five times larger, such as CodeLlama-33B (Roziere et al., 2023). To summarize, o…

2 Data Collection

The DeepSeek-Coder training dataset is composed of 87% source code, 10% English code-related natural-language corpus, and 3% code-unrelated Chinese natural-language corpus.
Original: The training dataset of DeepSeek-Coder is composed of 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus. The English corpus consists of materials from GitHub's Markdown and StackExchange (https://stackexchange.com), which are used to enhance the model's understanding of code-related concepts and improve its ability to handle tasks like library usage and bug fixing. Meanwhile, the Chinese corpus consists of high-quality articles aimed at improving the model's proficiency in understanding the Chinese language. In this se…

For JSON and YAML files, we keep only those with a character count between 50 and 5,000, which effectively removes most data-heavy files.
Original: …tes at least 20% of the code and is no less than 100 characters. For JSON and YAML files, which typically contain more data, we only keep files that have a character count ranging from 50 to 5000 characters. This effectively removes most data-heavy files. 2.2 Dependency Parsing: In previous works (Li et al., 2023; Roziere et al., 2023; Nijkamp et al., 2022; Chen et al., 2021), large language models for code are mainly pre-trained on file-level source code, which ignores the dependencies between different files in a project. However, in practical applications, such models struggle to effect…

Original (Algorithm 1 excerpt, topological sort for dependency ordering): initialize graphs[file] ← [] and inDegree[file] ← 0 for each file; for each fileA in files, for each fileB in files: if HasDependency(fileA, fileB), i.e. fileA depends on fileB, then graphs[fileB].append(fileA) (add an edge from B to A) and inDegree[fileA] ← inDegree[fileA] + 1; … then, within each subgraph, repeatedly select file ← argmin({inDegree[file] | file ∈ subgraph and file ∉ results}), decrement inDegree[node] for each node in graphs[file], and append file to results; when the subgraph is exhausted, append results to allResults.

The algorithm returns a list of sorted file sequences, and the files of each sequence are concatenated to form a single training sample.
Original: …generated for each subgraph. The algorithm concludes by returning a list of these sorted sequences, and each sequence's files are concatenated to form a single training sample. To incorporate file path information, a comment indicating the file's path is added at the beginning of each file. This method ensures that the path information is preserved in the training data. 2.3 Repo-Level Deduplication: Recent studies have demonstrated the significant performance improvements that can be achieved by deduplicating training datasets for Large Language Models (LLMs). Lee et al. (2022) have shown tha…
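
A minimal sketch of this concatenation step, assuming the files have already been ordered by the dependency algorithm above; the comment style and helper name are illustrative, not the paper's actual implementation.

```python
from pathlib import Path

def build_training_sample(ordered_paths, repo_root):
    """Concatenate dependency-ordered files into one training sample (sketch).

    A comment carrying each file's path is prepended so that the path
    information is preserved in the training data.
    """
    parts = []
    for path in ordered_paths:
        rel = Path(path).relative_to(repo_root)
        source = Path(path).read_text(encoding="utf-8", errors="ignore")
        parts.append(f"# {rel}\n{source}")
    return "\n".join(parts)
```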

Original (Table 1 excerpt; per-language disk size in GB, number of files, and percentage of the corpus): … 81 / 148 / 0.10; Bluespec 0.10 / 15 / 0.01; PHP 58.92 / 40,627 / 7.38; C 28.64 / 27,111 / 3.59; PowerShell 0.91 / 236 / 0.11; C# 58.56 / 53,739 / 7.34; Prolog 0.03 / 5 / 0.00; Clojure 0.90 / 295 / 0.11; Protocol Buffer 0.92 / 391 / 0.12; CMake 0.90 / 359 / 0.11; Python 120.68 / 75,188 / 15.12; CoffeeScript 0.92 / 361 / 0.12; R 0.92 / 158 / 0.11; Common Lisp 0.92 / 105 / 0.11; Racket 0.09 / 13 / 0.01; C++ 90.87 / 36,006 / 11.39; RMarkdown 6.83 / 1,606 / 0.86; CSS 5.63 / 11,638 / 0.71; Ruby 15.01 / 18,526 / 1.88; CUDA 0.91 / 115 / 0.11; Rust 0.61 / 692 / 0.08; Dart 0.89 / 264 / 0.11; SAS 0.92 / 70 / 0.11; Dockerfile 0.04 / 48 / 0.00; Scala 0.81 / 971 / 0.10; Elixir 0.91 / 549 / 0.11; Scheme 0.92 / 216 / 0.12; Elm 0.92 / 232 / 0.12; …

Original: …f source code in Table 1, which includes a total of 87 languages, detailing the disk size, number of files, and percentage for each language. The total data volume is 798 GB with 603 million files. To ensure that our code training data is not contaminated by information from the test set, which may be present on GitHub, we've implemented an n-gram filtering process. This process involves the removal of any code segments that match specific criteria. Specifically, we filter out files containing docstrings, questions, and solutions from sources such as HumanEval (Chen et al., 2021), MBPP (Aus…

2.1 GitHub Data Crawling and Filtering

Original: We collect public repositories created before February 2023 on GitHub and retain only 87 programming languages, as listed in Table 1. To reduce the amount of data to be processed, we apply filtering rules similar to those used in the StarCoder project (Li et al., 2023) to preliminarily filter out lower-quality code. By applying these filtering rules, we reduce the total amount of data to only 32.8% of its original size. To make the paper self-contained, we briefly describe the filter rules used in the StarCoder Data project: Firstly, we filter out files with an average line length exceeding …
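
A rough sketch of what such rule-based filtering might look like in code. The thresholds below (average and maximum line length, alphabetic-character ratio) are illustrative assumptions; only the 50-5,000 character window for JSON/YAML files is stated explicitly in the excerpts above.

```python
def keep_file(path: str, text: str) -> bool:
    """StarCoder-style heuristic file filter (illustrative sketch)."""
    lines = text.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    max_len = max(len(l) for l in lines)
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)

    if path.endswith((".json", ".yaml", ".yml")):
        return 50 <= len(text) <= 5000      # window stated in the paper
    if avg_len > 100 or max_len > 1000:     # assumed thresholds
        return False
    if alpha_ratio < 0.25:                  # assumed threshold
        return False
    return True
```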

2.2 Dependency Parsing

Original: In previous works (Li et al., 2023; Roziere et al., 2023; Nijkamp et al., 2022; Chen et al., 2021), large language models for code are mainly pre-trained on file-level source code, which ignores the dependencies between different files in a project. However, in practical applications, such models struggle to effectively scale to handle entire project-level code scenarios. Therefore, we will consider how to leverage the dependencies between files within the same repository in this step. Specifically, we first parse the dependencies between files and then arrange these files in an order tha…
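
A minimal sketch of the dependency ordering described here and in the Algorithm 1 excerpt above: build a dependency graph over a repository's files, then repeatedly emit the file with the smallest remaining in-degree so that, where possible, dependencies precede their dependents. The has_dependency predicate (for example, matching import or include statements) is an assumed placeholder.

```python
def order_by_dependency(files, has_dependency):
    """Arrange files so that dependencies tend to precede dependents.

    has_dependency(a, b) returns True if file `a` depends on file `b`.
    Selecting the minimum in-degree file at each step mirrors the
    Algorithm 1 excerpt and degrades gracefully when cycles exist.
    """
    graphs = {f: [] for f in files}     # edge from B to A when A depends on B
    in_degree = {f: 0 for f in files}
    for a in files:
        for b in files:
            if a != b and has_dependency(a, b):
                graphs[b].append(a)
                in_degree[a] += 1

    results, remaining = [], set(files)
    while remaining:
        f = min(remaining, key=lambda x: in_degree[x])
        for node in graphs[f]:
            in_degree[node] -= 1
        results.append(f)
        remaining.remove(f)
    return results
```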


2.3 Repo-Level Deduplication

Original: Recent studies have demonstrated the significant performance improvements that can be achieved by deduplicating training datasets for Large Language Models (LLMs). Lee et al. (2022) have shown that language model training corpora often contain numerous near-duplicates, and the performance of LLMs can be enhanced by removing long repetitive substrings. Kocetkov et al. (2022) have applied a near-deduplication method to training data, resulting in dramatic improvements, and they emphasize that near-deduplication is a crucial preprocessing step for achieving competitive performance on code ben…
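
A toy sketch of near-deduplication applied at the repository level, treating each repository's concatenated content as the unit of comparison. The 5-token shingles and the 0.85 Jaccard threshold are illustrative assumptions; production pipelines typically use MinHash/LSH for scalability.

```python
def shingles(text: str, n: int = 5) -> set:
    """Whitespace-token n-gram shingles of a repository's concatenated text."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def dedup_repos(repos: dict, threshold: float = 0.85) -> list:
    """Keep one representative among near-duplicate repositories (toy sketch)."""
    kept, kept_shingles = [], []
    for name, text in repos.items():
        s = shingles(text)
        is_dup = any(
            len(s & t) / max(len(s | t), 1) >= threshold for t in kept_shingles
        )
        if not is_dup:
            kept.append(name)
            kept_shingles.append(s)
    return kept
```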

Original (Table 1 excerpt, continued): Ruby 15.01 / 18,526 / 1.88; CUDA 0.91 / 115 / 0.11; Rust 0.61 / 692 / 0.08; Dart 0.89 / 264 / 0.11; SAS 0.92 / 70 / 0.11; Dockerfile 0.04 / 48 / 0.00; Scala 0.81 / 971 / 0.10; Elixir 0.91 / 549 / 0.11; Scheme 0.92 / 216 / 0.12; Elm 0.92 / 232 / 0.12; Shell 13.92 / 10,890 / 1.74; Emacs Lisp 0.91 / 148 / 0.11; Smalltalk 0.92 / 880 / 0.12; Erlang 0.92 / 145 / 0.12; Solidity 0.85 / 83 / 0.11; F# 0.91 / 340 / 0.11; Sparql 0.10 / 88 / 0.01; Fortran 1.67 / 654 / 0.21; SQL 15.14 / 7,009 / 1.90; GLSL 0.92 / 296 / 0.11; Stan 0.20 / 41 / 0.03; Go 2.58 / 1,365 / 0.32; Standard ML 0.74 / 117 / 0.09; Groovy 0.89 / 340 / 0.11; Stata 0.91 / 122 / 0.11; Haskell 0.87 / 213 / 0.11; SystemVerilog 0.91 / 165 / 0.11; HTML 30.05 / 14,998 / 3.77; TCL 0.90 / 1…

2.4 Quality Screening and Decontamination

Original: In addition to applying the filtering rules mentioned in Section 2.1, we also employ a compiler and a quality model, combined with heuristic rules, to further filter out low-quality data. This includes code with syntax errors, poor readability, and low modularity. We provide the statistical summary of source code in Table 1, which includes a total of 87 languages, detailing the disk size, number of files, and percentage for each language. The total data volume is 798 GB with 603 million files. To ensure that our code training data is not contaminated by information from the test set, which m…
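
To illustrate the n-gram decontamination described here and in the Section 2 excerpt above, a minimal sketch that drops any file sharing a long token n-gram with benchmark texts (HumanEval docstrings, MBPP problems and solutions, and so on). The 10-gram length is an illustrative assumption, since the excerpt truncates before the exact criteria.

```python
def ngrams(text: str, n: int = 10) -> set:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(file_text: str, benchmark_texts: list, n: int = 10) -> bool:
    """True if the file shares any n-gram with a benchmark prompt or solution."""
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return bool(ngrams(file_text, n) & bench)
```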

3 Training Policy

Original: …r models are summarized in Table 2. 3.4 Optimization: Following DeepSeek LLM (DeepSeek-AI, 2024), we use AdamW (Loshchilov and Hutter, 2019) as the optimizer with β1 and β2 values of 0.9 and 0.95. We adapt batch sizes and learning rates by the scaling laws suggested in DeepSeek LLM. For the learning rate scheduling, we implement a three-stage policy, which includes 2000 warm-up steps, and set the final learning rate to 10% of the initial rate. Notably, the learning rate at each stage is scaled down to √(1/10) of the pre…

Original: …in both A100 and H800 clusters, we employ InfiniBand interconnects, known for their high throughput and low latency. This setup provides a robust and efficient infrastructure for our computational experiments. 3.6 Long Context: To enhance the capabilities of DeepSeek-Coder in handling extended contexts, particularly for scenarios like repository-level code processing, we have reconfigured the RoPE (Su et al., 2023) parameters to extend the default context window. Following previous practices (Chen et al., 2023; kaiokendev, 2023), we employed a linear scaling strategy, increasing the scaling…

Original: …a multi-turn dialogue scenario for building a snake game. Initially, we ask the model to write a game snake using pygame. The model successfully creates a basic snake game that can run without bugs. To improve the game, we further request adding a scoring system in the top left corner. The model then introduces a "score" variable and a "display_score" function, along with an explanation of how to integrate these features. This example illustrates DeepSeek-Coder-Instruct's ability to provide complete solutions in multi-turn dialogue settings. More cases can be found in the Appendix A. Figure 4…

3.1 Training Strategy

Original: 3.1.1 Next Token Prediction: The first training objective for our model is known as next token prediction. In this process, various files are concatenated to form a fixed-length entry. Then, these entries are used to train the model, enabling it to predict the subsequent token based on the provided context. 3.1.2 Fill-in-the-Middle: The second training objective for our model is known as fill-in-the-middle. In the code pre-training scenario, it is often necessary to generate corresponding inserted content based on the given context and subsequent text. Due to specific dependencies in a programm…

3.1 Training Strategy

Original: …s of various hyperparameters within the FIM approach, we conducted a series of ablation experiments. Experiment Settings: In this experiment, we employ DeepSeek-Coder-Base 1.3B as our model architecture. We focused on a Python subset from our training dataset to streamline the experimental process. Our primary objective was to assess the efficacy of the Fill-in-the-Middle (FIM) technique, utilizing the HumanEval-FIM benchmark (Fried et al., 2022). This benchmark specializes in a single-line FIM task for Python, in which one line of code from a HumanEval solution is randomly obscured, testing…

3.1 Training Strategy

Original: …three sentinel tokens specifically for this task. For each code file, we initially divide its content into three segments, denoted as f_pre, f_middle, and f_suf. Using the PSM mode, we construct the training example as follows: <|fim_start|> f_pre <|fim_hole|> f_suf <|fim_end|> f_middle <|eos_token|>
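
A minimal sketch of constructing a PSM-format FIM training example as described above. The character-level split points are an illustrative choice, the 0.5 FIM rate comes from Section 4.2, and the sentinel strings follow the excerpt.

```python
import random

def make_fim_example(code: str, fim_rate: float = 0.5) -> str:
    """Build one PSM-format fill-in-the-middle training example (sketch).

    With probability fim_rate the document is split into prefix/middle/suffix
    and rearranged with sentinel tokens; otherwise it is kept as plain text.
    """
    if random.random() >= fim_rate or len(code) < 3:
        return code + "<|eos_token|>"
    i, j = sorted(random.sample(range(1, len(code)), 2))
    pre, middle, suf = code[:i], code[i:j], code[j:]
    return (
        "<|fim_start|>" + pre
        + "<|fim_hole|>" + suf
        + "<|fim_end|>" + middle
        + "<|eos_token|>"
    )
```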

3.2 Tokenizer

Original: For the tokenization process, we employ the HuggingFace Tokenizer library (https://github.com/huggingface/tokenizers) to train Byte Pair Encoding (BPE) tokenizers (Sennrich et al., 2015) on a subset of our training corpus. Ultimately, we utilize a tokenizer configured with a vocabulary size of 32,000.
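
A hedged sketch of training such a BPE tokenizer with the HuggingFace tokenizers library and a 32,000-token vocabulary. The byte-level pre-tokenizer, the training file path, and the particular special tokens are assumptions, not details given in the excerpt.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE model trained on a subset of the code corpus (file path is a placeholder).
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)  # assumption

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["<|fim_start|>", "<|fim_hole|>", "<|fim_end|>",
                    "<|eos_token|>", "<|EOT|>"],  # assumed to be reserved here
)
tokenizer.train(files=["corpus_subset.txt"], trainer=trainer)
tokenizer.save("bpe_tokenizer.json")
```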

3.3 Model Architecture

Original: We develop a range of models with varying parameters to cater to diverse applications, including models with 1.3B, 6.7B, and 33B parameters. These models are built upon the same framework as the DeepSeek Large Language Model (LLM) outlined by DeepSeek-AI (2024). Each model is a decoder-only Transformer, incorporating Rotary Position Embedding (RoPE) as described by Su et al. (2023). Notably, the DeepSeek 33B model integrates Grouped-Query-Attention (GQA) with a group size of 8, enhancing both training and inference efficiency. Additionally, we employ FlashAttention v2 (Dao, 2023) to exp…

3.4 Optimization

Original: Following DeepSeek LLM (DeepSeek-AI, 2024), we use AdamW (Loshchilov and Hutter, 2019) as the optimizer with β1 and β2 values of 0.9 and 0.95. We adapt batch sizes and learning rates by the scaling laws suggested in DeepSeek LLM. For the learning rate scheduling, we implement a three-stage policy, which includes 2000 warm-up steps, and set the final learning rate to 10% of the initial rate. Notably, the learning rate at each stage is scaled down to √(1/10) of the preceding stage's rate, following the guidelines establis…
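
A concrete reading of this schedule: each stage multiplies the rate by √(1/10) ≈ 0.316, so two stage transitions leave the final stage at 0.1 of the initial rate, matching the stated 10%. A tiny sketch, assuming linear warm-up over the 2000 steps and equal-length stages (stage boundaries are not specified in the excerpt):

```python
import math

def learning_rate(step: int, total_steps: int, base_lr: float, warmup: int = 2000) -> float:
    """Three-stage schedule: each stage runs at sqrt(1/10) of the previous one."""
    if step < warmup:                        # assumed linear warm-up
        return base_lr * step / warmup
    stage_len = (total_steps - warmup) / 3   # assumed equal-length stages
    stage = min(int((step - warmup) / stage_len), 2)
    return base_lr * math.sqrt(0.1) ** stage
```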

3.5 Environments

Original: Our experiments are conducted using the HAI-LLM (High-Flyer, 2023) framework, known for its efficiency and lightweight approach in training large language models. This framework incorporates a variety of parallelism strategies to optimize computational efficiency. These include tensor parallelism (Korthikanti et al., 2023), alongside ZeRO data parallelism (Rajbhandari et al., 2020) and PipeDream pipeline parallelism (Narayanan et al., 2019). Our experiments utilize clusters outfitted with NVIDIA A100 and H800 GPUs. In the A100 cluster, each node is configured with 8 GPUs, interconnected …

3.6 Long Context

Original: To enhance the capabilities of DeepSeek-Coder in handling extended contexts, particularly for scenarios like repository-level code processing, we have reconfigured the RoPE (Su et al., 2023) parameters to extend the default context window. Following previous practices (Chen et al., 2023; kaiokendev, 2023), we employed a linear scaling strategy, increasing the scaling factor from 1 to 4 and altering the base frequency from 10000 to 100000. The model underwent an additional 1000 steps of training, using a batch size of 512 and a sequence l…
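
A small sketch of the RoPE reconfiguration described above: position indices are divided by the linear scaling factor of 4 and the base frequency is raised from 10000 to 100000. The function below only computes the rotation angles, for illustration; the surrounding model code is not shown.

```python
import numpy as np

def rope_angles(positions, dim: int, base: float = 100000.0, scale: float = 4.0):
    """RoPE rotation angles with linear position scaling (illustrative sketch).

    base 10000 -> 100000 and scale 1 -> 4 follow the excerpt above.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    scaled_pos = np.asarray(positions, dtype=np.float64) / scale  # linear scaling
    return np.outer(scaled_pos, inv_freq)  # shape: (len(positions), dim // 2)
```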

3.7 Instruction Tuning

Original: We develop DeepSeek-Coder-Instruct by enhancing the DeepSeek-Coder-Base through instruction-based fine-tuning using high-quality data. This data comprises helpful and impartial human instructions, structured by the Alpaca Instruction format (Taori et al., 2023). To demarcate each dialogue turn, we employed a unique delimiter token <|EOT|> to signify the conclusion of each segment. For training, we use a cosine schedule with 100 warm-up steps and an initial learning rate 1e-5. We also use a batch size of 4M tokens and 2B tokens in total. An example of using DeepSeek-Coder-Instruct 34B is depi…
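
A minimal sketch of how multi-turn dialogues might be serialized for this fine-tuning, with <|EOT|> closing each turn. The role tags are illustrative assumptions; only the <|EOT|> delimiter and the Alpaca-style instruction format are stated in the excerpt.

```python
def format_dialogue(turns):
    """Serialize (role, text) turns, ending every turn with <|EOT|> (sketch)."""
    parts = []
    for role, text in turns:   # roles such as "User"/"Assistant" are assumed tags
        parts.append(f"{role}: {text}\n<|EOT|>\n")
    return "".join(parts)

example = format_dialogue([
    ("User", "Write a function that reverses a string."),
    ("Assistant", "def reverse(s):\n    return s[::-1]"),
])
```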

4 Experimental Results

Original: In this section, we evaluate DeepSeek-Coder on four tasks, including code generation (§4.1), FIM code completion (§4.2), cross-file code completion (§4.3) and program-based math reasoning (§4.4). We compare DeepSeek-Coder with the previous state-of-the-art large language models: • CodeGeeX2 (Zheng et al., 2023) represents the second generation of the multilingual code generation model CodeGeeX. It is developed using the ChatGLM2 (Du et al., 2022) architecture and is enhanced with an extensive dataset of coding examples. • StarCoder (Li et al., 2023) is a publicly accessible model wi…

Original: …hile the MBPP benchmark includes 500 problems in a few-shot setting. To evaluate the model's multilingual capabilities, we expanded the Python problems of the HumanEval Benchmark to seven additional commonly used programming languages, namely C++, Java, PHP, TypeScript (TS), C#, Bash, and JavaScript (JS) (Cassano et al., 2023). For both benchmarks, we adopted a greedy search approach and re-implemented the baseline results using the same script and environment for fair comparison. (Table 3 excerpt; columns: Model, Size, Python, C++, Java, PHP, TS, C#, Bash, JS, Avg, MBPP; group: Multilingual Base Models; first row: code-cushman-001 12B 33.5% 31.9% 30.6…)

Original: …rce model CodeLlama-Base 34B, our model has demonstrated a notable improvement of 9% and 11% in accuracy, respectively. It's worth noting that even our smaller model, DeepSeek-Coder-Base 6.7B, surpasses the performance of CodeLlama-Base 34B. After instruction fine-tuning, our model surpasses the closed-source GPT-3.5-Turbo model on the HumanEval benchmark, significantly reducing the performance gap between OpenAI GPT-4 and open-source models. DS-1000 Benchmark: HumanEval and MBPP have a significant drawback in that they rely heavily on straightforward programming tasks that may not accurately repre…

Original: (Table 4 excerpt, DS-1000; column headers are truncated in this excerpt) …0% 34.3%; DeepSeek-Coder-Base 1.3B 32.3% 21.4% 9.3% 8.8% 8.5% 16.5% 8.9% 16.2%; DeepSeek-Coder-Base 6.7B 48.4% 35.5% 20.6% 19.1% 22.6% 38.3% 24.4% 30.5%; DeepSeek-Coder-Base 33B 56.1% 49.6% 25.8% 36.8% 36.8% 40.0% 46.7% 40.2%. Table 4: Performance of different approaches on the DS-1000-Tasks. LeetCode Contest Benchmark: To further validate the model's capability in real-world programming problems, we construct the LeetCode Contest benchmark (published at https://github.com/deepseek-ai/DeepSeek-Coder/tree/main/Evaluation/LeetCode). LeetCode (https://leetcode.com/) …

Original: (Table 5 excerpt; columns: Model, Size, Easy (45), Medium (91), Hard (44), Overall (180)) GPT-3.5-Turbo - 46.7% 15.4% 15.9% 23.3%; GPT-3.5-Turbo + CoT - 42.2% 15.4% 20.5% 23.3%; GPT-4-Turbo - 73.3% 31.9% 25.0% 40.6%; GPT-4-Turbo + CoT - 71.1% 35.2% 25.0% 41.8%; DeepSeek-Coder-Instruct 1.3B 22.2% 1.1% 4.5% 7.2%; DeepSeek-Coder-Instruct + CoT 1.3B 22.2% 2.2% 2.3% 7.2%; DeepSeek-Coder-Instruct 6.7B 44.4% 12.1% 9.1% 19.4%; DeepSeek-Coder-Instruct + CoT 6.7B 44.4% 17.6% 4.5% 21.1%; DeepSeek-Coder-Instruct 33B 57.8% 22.0% 9.1% 27.8%; DeepSeek-Coder-Instruct + CoT 33B 53.3% 25.3% 11.4% 28.9%. Table 5: Performance of different models on the LeetCode Contest Benchmark. Our analysis indicates that the implem…

Original: (Table 7 excerpt; rows report exact match / edit similarity pairs across four languages) …% 15.61% 64.78% 7.54% 42.06% 14.20% 65.03%; CodeLlama-Base 7B 7.32% 59.66% 9.68% 62.64% 8.19% 58.50% 4.07% 59.19%; + Retrieval 13.02% 64.30% 16.41% 64.64% 12.34% 60.64% 13.19% 63.04%; DeepSeek-Coder-Base 6.7B 9.53% 61.65% 10.80% 61.77% 9.59% 60.17% 5.26% 61.32%; + Retrieval 16.14% 66.51% 17.72% 63.18% 14.03% 61.77% 16.23% 63.42%; + Retrieval w/o Repo Pre-training 16.02% 66.65% 16.64% 61.88% 13.23% 60.92% 14.48% 62.38%. Table 7: Performance of different models on cross-file code completion. In our evaluation of various models, we set the maximum sequence length to 2048 tokens, the maximum output leng…

Original: …nd MAWPS (Gou et al., 2023). In each of these benchmarks, the model is prompted to alternately describe a solution step in natural language and then execute that step with code. As seen in Table 8, DeepSeek-Coder models achieve a remarkable performance across all benchmarks, especially the 33B variant, which demonstrates the potential of using such models in applications that require complex mathematical computations and problem-solving abilities. (Table 8 excerpt; columns: Model, Size, GSM8k, MATH, GSM-Hard, SVAMP, TabMWP, ASDiv, MAWPS, Avg; group: Multilingual Base Models; first row: CodeGeex-2 7B 22.2% 9.7% 23.6% 39.0% 44.6% 48.5% 66.0% 36.2%…)

4.1 Code Generation

Original: HumanEval and MBPP Benchmarks: The HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) benchmarks are widely used for evaluating code LLMs. HumanEval consists of 164 hand-written Python problems that are validated using test cases to assess the code generated by a Code LLM in a zero-shot setting, while the MBPP benchmark includes 500 problems in a few-shot setting. To evaluate the model's multilingual capabilities, we expanded the Python problems of the HumanEval Benchmark to seven additional commonly used programming languages, namely C++, Java, PHP, TypeScript (TS), C#, Bash, and JavaS…

Original: Table 3: Performance of approaches on the Multilingual HumanEval and MBPP Benchmarks. The results are presented in Table 3. As we can see, DeepSeek-Coder-Base achieves state-of-the-art performance with an average accuracy of 50.3% on HumanEval and 66.0% on MBPP. In comparison to the similarly sized open-source model CodeLlama-Base 34B, our model has demonstrated a notable improvement of 9% and 11% in accuracy, respectively. It's worth noting that even our smaller model, DeepSeek-Coder-Base 6.7B, surpasses the performance of CodeLlama-Base 34B. After instruction fine-tuning, our model surpasse…

Original: (Table 4 excerpt, DS-1000; column headers are truncated in this excerpt) …6B 38.7% 26.8% 14.4% 11.8% 19.8% 27.0% 17.8% 22.9%; StarCoder-Base 16B 43.2% 29.1% 11.0% 20.6% 23.6% 32.2% 15.6% 24.6%; CodeLlama-Base 7B 41.9% 24.6% 14.8% 16.2% 18.9% 17.4% 17.8% 22.1%; CodeLlama-Base 13B 46.5% 28.6% 18.2% 19.1% 18.9% 27.8% 33.3% 26.8%; CodeLlama-Base 34B 50.3% 42.7% 23.0% 25.0% 28.3% 33.9% 40.0% 34.3%; DeepSeek-Coder-Base 1.3B 32.3% 21.4% 9.3% 8.8% 8.5% 16.5% 8.9% 16.2%; DeepSeek-Coder-Base 6.7B 48.4% 35.5% 20.6% 19.1% 22.6% 38.3% 24.4% 30.5%; DeepSeek-Coder-Base 33B 56.1% 49.6% 25.8% 36.8% 36.8% 40.0% 46.7% 40.2%. Table 4: Performance of different approaches on the DS-1000-Tasks. L…

Original: …n this task. However, there remains a substantial performance gap when compared to the more advanced GPT-4-Turbo. (Table 5 excerpt; columns: Model, Size, Easy (45), Medium (91), Hard (44), Overall (180)) WizardCoder-V1.0 15B 17.8% 1.1% 0.0% 5.0%; CodeLlama-Instruct 34B 24.4% 4.4% 4.5% 9.4%; Phind-CodeLlama-V2 34B 26.7% 8.8% 9.1% 13.3%; GPT-3.5-Turbo - 46.7% 15.4% 15.9% 23.3%; GPT-3.5-Turbo + CoT - 42.2% 15.4% 20.5% 23.3%; GPT-4-Turbo - 73.3% 31.9% 25.0% 40.6%; GPT-4-Turbo + CoT - 71.1% 35.2% 25.0% 41.8%; DeepSeek-Coder-Instruct 1.3B 22.2% 1.1% 4.5% 7.2%; DeepSeek-Coder-Instruct + CoT 1.3B 22.2% 2.2% 2.3% 7.2%; DeepSeek-Coder-Instruct …

Original: …possibility of data contamination cannot be entirely ruled out. We observed that the GPT-4-Turbo and DeepSeek-Coder models achieved higher scores in the LeetCode Contest held in July and August. We encourage the research community to consider the potential issue of data contamination when evaluating models in future studies using our released LeetCode data.

4.2 Fill-in-the-Middle Code Completion

Original: DeepSeek-Coder models are trained with a 0.5 FIM (Fill-In-the-Middle) rate during their pretraining phase. This specialized training strategy empowers the model to proficiently generate code by filling in blanks based on the surrounding context, both prefix and suffix, of the given code snippet. This capability is particularly advantageous in the realm of code completion tools. Several open-source models have emerged with similar capabilities. Notable among these are SantaCoder (Allal et al., 2023), StarCoder (Li et al., 2023), and CodeLlama (Roziere et al., 2023). These models have set …

Original: …the importance of model capacity in achieving higher accuracy in code completion tasks. Based on these findings, we recommend the deployment of the DeepSeek-Coder-Base 6.7B model in code completion tools. This recommendation is grounded in the model's demonstrated balance between efficiency and accuracy. The DeepSeek-Coder-Base 6.7B model, with its substantial parameter size, has proven to be highly effective in the context of code completion, making it an ideal choice for integrating advanced computational capabilities into coding environments.

4.3 Cross-File Code Completion

Original: In this section, we will evaluate the performance of existing open-source models in cross-file code completion tasks. Unlike code generation discussed in the previous section, cross-file code completion requires the model to access and understand repositories that span multiple files with numerous cross-file dependencies. We use CrossCodeEval (Ding et al., 2023) to evaluate the capabilities of currently available open-source code models of 7B scale in cross-file completion tasks. This dataset is constructed on a diverse set of real-world, open-sourced, permissively licensed repositories in fo…

Original: …r the cross-file context, we utilize the official BM25 search results provided by Ding et al. (2023). Evaluation metrics include exact match and edit similarity. The results, presented in Table 7, demonstrate that DeepSeek-Coder consistently outperforms other models in cross-file completion tasks across multiple languages, showcasing its superior practical application capabilities. When only utilizing file-level code corpus (w/o Repo Pre-training) to pre-train DeepSeek-Coder, we observe a decrease in performance in the Java, TypeScript, and C# languages, indicating the effectiveness of t…

4.4 Program-based Math Reasoning

Original: Program-based math reasoning involves evaluating a model's ability to understand and solve mathematical problems through programming. This type of reasoning is critical in fields such as data analysis and scientific computing. To conduct this assessment, we utilize the Program-Aided Math Reasoning (PAL) method as outlined in Gao et al. (2023). This approach is applied across seven distinct benchmarks, each offering unique challenges and contexts. These benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), GSM-Hard (Gao et al., 2023), SVAMP (Patel et al., 2021)…
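
To make the PAL-style evaluation concrete: the model is prompted to interleave natural-language reasoning steps with executable code, and the final answer is obtained by running that code. A toy sketch of what a generated solution and its execution might look like; the problem and the solution function are illustrative, not taken from the benchmarks.

```python
# Problem (illustrative): a shelf holds 3 boxes of 12 pens each; 7 pens are
# removed. How many pens remain?
def solution():
    # Step 1: total pens before removal
    total = 3 * 12
    # Step 2: subtract the removed pens
    remaining = total - 7
    return remaining

print(solution())  # running the generated program yields the answer: 29
```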

5 Continue Pre-Training From General LLM

Original: To further enhance the natural language understanding and mathematical reasoning abilities of the DeepSeek-Coder model, we perform additional pre-training from the general language model DeepSeek-LLM-7B Base (DeepSeek-AI, 2024) on 2 trillion tokens, resulting in DeepSeek-Coder-v1.5 7B. For this pre-training, we specifically use the data sources listed in Table 9. Unlike DeepSeek-Coder, DeepSeek-Coder-v1.5 employs solely a next token prediction objective with a 4K context length during its pre-training phase. (Table 9 excerpt; Data Source / Percentage: Source Code 70%; Markdown and StackExchange 10%; Natural langua…)

Original: …model. In particular, in the Math Reasoning and Natural Language categories, DeepSeek-Coder-Base-v1.5 significantly outperforms its predecessor across all benchmarks, which also demonstrates significant improvements in its mathematical reasoning and natural language processing capabilities. (Table excerpt; columns: Model, Size, then Programming (HumanEval, MBPP), Math Reasoning (GSM8K, MATH), Natural Language (MMLU, BBH, HellaSwag, WinoG, ARC-C)) DeepSeek-Coder-Base 6.7B 44.7% 60.6% 43.2% 19.2% 36.6% 44.3% 53.8% 57.1% 32.5%; DeepSeek-Coder-Base-v1.5 6.9B 43.2% 60.4% 62.4% 24.7% 49.1% 55.2% 69.9% 63.8% 47.2%; DeepSeek-Coder-Instruct 6.7B 66.…

6 Conclusion

Original: In this technical report, we introduce a series of specialized Large Language Models (LLMs) for coding, named DeepSeek-Coder, available in three distinct scales: 1.3B, 6.7B, and 33B parameters. These models are uniquely trained on a meticulously curated project-level code corpus, utilizing a "fill-in-the-blank" pre-training objective to enhance code infilling capabilities. A significant advancement is the extension of the models' context window to 16,384 tokens, thereby greatly improving their effectiveness in handling extensive code generation tasks. Our evaluations reveal that the most advan…

Original: …st general LLMs. The reason is evident: to effectively interpret and execute coding tasks, these models must also possess a deep understanding of human instructions, which often come in various forms of natural language. Looking ahead, our commitment is to develop and openly share even more powerful code-focused LLMs based on larger-scale general LLMs.

Appendix A Cases of Chatting with DeepSeek-Coder-Instruct

Original: We will present two cases of interactions with DeepSeek-Coder-Instruct, one involving a multi-turn conversation about creating a database and performing data analysis, and the other centered around using the model to solve a sample problem from LeetCode. In the first scenario, depicted in Figure 5, we instruct the model to build a student database using Python and randomly insert 10 pieces of information. Subsequently, in the second round of the conversation, we continue by asking the model to analyze the age distribution of the students. From Figure 5, it's evident that the model can gene…

Appendix B Benchmark curves during training of DeepSeek-Coder-Base

Original: In Figure 7, we present the benchmark curves illustrating the performance of DeepSeek-Coder-Base models during their training phase. For validation, a carefully curated subset of the training corpus was employed, consisting of 8,000 code files. This subset was deliberately chosen to ensure a diverse and representative sample, critical for an accurate assessment of the models' capabilities. The performance metrics of these models are specifically detailed in the final two sub-figures of Figure 7, offering a clear visual representation of their efficacy throughout the training process. Figure …