DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Original text: Qihao Zhu*, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang

DeepSeek-AI
https://github.com/deepseek-ai/DeepSeek-Coder-V2

Abstract

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves perfor...

Original text: source counterparts, contributing to the progress of code intelligence. However, there remains a discernible gap when comparing them to state-of-the-art closed-source models like GPT4-Turbo (OpenAI, 2023), Claude 3 Opus (Anthropic, 2024), and Gemini 1.5 Pro (Reid et al., 2024). To bridge this gap and further propel the development of open-source code models, we introduce the DeepSeek-Coder-V2 series. These models are built upon the foundation of DeepSeek-V2 (DeepSeek-AI, 2024) and are further pre-trained on an additional corpus of 6 trillion tokens. In the pre-training phase, the da...

Original text: andle more complex and extensive coding tasks. After continued pre-training of DeepSeek-V2 on this multi-source corpus, we find that DeepSeek-Coder-V2 significantly enhances the model's capabilities in coding and mathematical reasoning while maintaining comparable general language performance. In the alignment phase, we first construct an instruction training dataset that includes code and math data from DeepSeek-Coder (Guo et al., 2024) and DeepSeek-Math (Shao et al., 2024), as well as general instruction data from DeepSeek-V2 (DeepSeek-AI, 2024). This dataset is used to fine-tune the bas...

1 Introduction

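[Introduction] Research background, motivation, and main contributions of DeepSeek-Coder-V2. Through innovations in architecture design and training methods, the DeepSeek team has made significant progress in code intelligence. The model performs strongly on the relevant benchmarks, which supports the effectiveness of the approach. The result is an important contribution to the open-source AI community and advances the state of the technology. Future work will continue to optimize and improve the related techniques.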

Original text: The open-source community has made significant strides in advancing code intelligence through the development of open-source code models such as StarCoder (Li et al., 2023b; Lozhkov et al., 2024), CodeLlama (Roziere et al., 2023), DeepSeek-Coder (Guo et al., 2024), and Codestral (MistralAI, 2024). These models have steadily approached the performance levels of closed-source counterparts, contributing to the progress of code intelligence. However, there remains a discernible gap when comparing them to state-of-the-art closed-source models like GPT4-Turbo (OpenAI, 2023), Claude 3 Opus...

Original text: pSeek-Coder-V2 has been exposed to 10.2T training tokens, where 4.2 trillion tokens originate from the DeepSeek-V2 dataset, while the remaining 6 trillion tokens come from the DeepSeek-Coder-V2 dataset. To accommodate longer code inputs and enhance applicability across various programming scenarios, we extend the context length from 16K to 128K tokens, allowing our models to handle more complex and extensive coding tasks. After continued pre-training of DeepSeek-V2 on this multi-source corpus, we find that DeepSeek-Coder-V2 significantly enhances the model's capabilities in coding and mathemati...

Original text: e make the first attempt to develop an open-source hundred-billion-parameter code model to advance the field of code intelligence. Experimental results indicate that DeepSeek-Coder-V2 236B outperforms state-of-the-art closed-source models, such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro, in both coding and mathematics tasks.
• DeepSeek-Coder-V2 models are released publicly under a permissive license, allowing for both research and unrestricted commercial use.

1.2 Summary of Evaluations and Metrics

• Code: Regarding code generation benchmark evaluation, DeepSeek-Coder-V2 demonstrates rem...

Original text: AI simple-eval pipeline. Regarding subjective evaluation with GPT-4 as a judge, DeepSeek-Coder-V2 achieves 65.0 on Arena-Hard (Li et al., 2024), 8.77 on MT-Bench (Zheng et al., 2023), and 7.84 on AlignBench (Liu et al., 2023c). These scores are significantly better than those of other code-specific models and are even comparable with general open-source models.

1.1 Contributions

(1.1 Contributions: see the original text.)

Original text: In summary, our main contributions are:
• We introduce DeepSeek-Coder-V2 with 16B and 236B parameters based on the DeepSeekMoE framework, with only 2.4B and 21B activated parameters, respectively, efficiently supporting diverse computational and application needs. Additionally, DeepSeek-Coder-V2 supports 338 programming languages and a maximum context length of 128K tokens.
• We make the first attempt to develop an open-source hundred-billion-parameter code model to advance the field of code intelligence. Experimental results indicate that DeepSeek-Coder-V2 236B outperforms state-of-the-art closed-...
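
As a quick worked check of the parameter counts quoted above, the share of parameters that is active per token can be computed directly. This is only arithmetic on the stated totals; the helper name `active_fraction` is illustrative and not taken from the paper.

```python
# Total and activated parameter counts, in billions, as quoted in the text.
MODELS = {
    "16B": (16.0, 2.4),
    "236B": (236.0, 21.0),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token in a Mixture-of-Experts model."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"DeepSeek-Coder-V2 {name}: {share:.1%} of parameters active per token")
```

Under these figures the smaller model activates a larger share of its weights than the larger one, which is the usual pattern when the expert count grows faster than the per-token routing budget.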

1.2 Summary of Evaluations and Metrics

[Experimental results] Performance of DeepSeek-Coder-V2 on each benchmark.

Original text: • Code: Regarding code generation benchmark evaluation, DeepSeek-Coder-V2 demonstrates remarkable superiority over all open-source models while exhibiting performance on par with the leading closed-source models, such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro. Notably, we achieve a 90.2% score on HumanEval (Chen et al., 2021), a 76.2% score on MBPP (Austin et al., 2021a) (establishing a new state-of-the-art result with the EvalPlus evaluation pipeline), and a 43.4% score on LiveCodeBench (Jain et al., 2024) (questions from Dec. 2023 to June 2024). Additionally, DeepSeek-Coder-V2 is...

2 Data Collection

[Data/Training] Construction of the training data and the training pipeline for DeepSeek-Coder-V2.

Original text: The pre-training data for DeepSeek-Coder-V2 primarily consists of 60% source code, 10% math corpus, and 30% natural language corpus. Since the natural language corpus is directly sampled from the training dataset of DeepSeek-V2, this section focuses on the collection, cleaning, and filtering processes of the code and math data. Meanwhile, we further validate the quality of this data through comparative analysis experiments. We collect public repositories created before November 2023 on GitHub. We first apply the same filtering rules and near-deduplication as those used in the DeepSeek-Coder (G...
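
To make the mixture concrete, the sketch below converts the 60% / 10% / 30% split into a token budget. It assumes the percentages apply by token count to the 6 trillion additional tokens mentioned in the introduction; the excerpt does not state this explicitly, and `token_budget` is an illustrative helper.

```python
from fractions import Fraction

# Additional pre-training tokens in trillions (figure quoted in the introduction).
ADDITIONAL_TOKENS_T = 6

# Mixture proportions quoted in the text.
MIX = {
    "source code": Fraction(60, 100),
    "math": Fraction(10, 100),
    "natural language": Fraction(30, 100),
}


def token_budget(total_t: int, mix: dict) -> dict:
    """Split a token total (in trillions) according to mixture proportions."""
    if sum(mix.values()) != 1:
        raise ValueError("mixture proportions must sum to 1")
    return {name: float(total_t * share) for name, share in mix.items()}


budget = token_budget(ADDITIONAL_TOKENS_T, MIX)
for name, tokens_t in budget.items():
    print(f"{name}: {tokens_t:.1f}T tokens")
```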

Original text: ttps://stackoverflow.com, library sites such as the PyTorch documentation (https://pytorch.org/docs), and mathematics websites such as StackExchange (https://math.stackexchange.com) as our initial seed corpus. Using this seed corpus, we train a fastText model (Joulin et al., 2016) to recall more coding-related and math-related web pages. Since tokenization for languages like Chinese cannot be done through spaces, we use the Byte Pair Encoding (BPE) tokenizer from DeepSeek-V2, which significantly improves the recall accuracy of fastText. For each domain, we calculate the percentage of web...
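
The role of the tokenizer can be illustrated with a small sketch. fastText's supervised mode reads one example per line, with the label marked by a `__label__` prefix and the features separated by spaces, so text without spaces has to be segmented first. The function `toy_bpe_tokenize` below is a hypothetical stand-in that only mimics the effect of a subword tokenizer; it is not the DeepSeek-V2 BPE tokenizer, and the label names are assumptions.

```python
def toy_bpe_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for a BPE tokenizer.

    Splits on whitespace, then breaks any chunk containing non-ASCII
    characters into single characters, so unsegmented scripts such as
    Chinese still yield several tokens.
    """
    tokens: list[str] = []
    for chunk in text.split():
        if chunk.isascii():
            tokens.append(chunk)
        else:
            tokens.extend(chunk)  # one token per character
    return tokens


def to_fasttext_line(text: str, label: str) -> str:
    """Serialize one example in fastText's supervised input format."""
    return f"__label__{label} " + " ".join(toy_bpe_tokenize(text))


print(to_fasttext_line("def f(x): return x", "code"))
print(to_fasttext_line("求导数 x^2", "math"))
```

Without the segmentation step, the Chinese phrase would reach the classifier as a single opaque token, which is the recall problem the passage describes.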

Original text: ctively. Therefore, the new code corpus is superior to the code corpus used to train DeepSeek-Coder.

Table 1: Performance of 1B base model between DeepSeek-Coder and DeepSeek-Coder-V2.

Model                 Tokens  Python  C++     Java    PHP     TS      C#      Bash    JS      Avg     MBPP
DeepSeek-Coder-1B     1T      30.5%   28.0%   31.7%   23.0%   30.8%   31.7%   9.5%    28.6%   26.7%   44.6%
DeepSeek-Coder-V2-1B  1T      36.0%   34.8%   31.7%   27.3%   37.7%   34.2%   6.3%    38.5%   31.2%   49.0%
DeepSeek-Coder-V2-1B  2T      37.2%   39.1%   32.3%   31.7%   34.6%   36.7%   12.0%   32.9%   32.0%   54.0%

3 Training Policy

(3 Training Policy: see the original text.)

Original text: a cosine decay strategy, starting with 2000 warm-up steps and gradually reducing the learning rate to 10% of its initial value. Both DeepSeek-Coder-V2 and DeepSeek-Coder-V2-Lite are trained using the same methodology. To maintain robust natural language understanding capabilities in DeepSeek-Coder-V2, we continue the pre-training process from an intermediate checkpoint of DeepSeek-V2. The intermediate checkpoint was initially trained on 4.2T tokens. Consequently, DeepSeek-Coder-V2 has been exposed to a total of 10.2T high-quality tokens during the pre-training phase. Model DeepSeek-Coder-V2-Li...

Original text: collect 20k code-related instruction data and 30k math-related data from DeepSeek-Coder and DeepSeek-Math. To maintain the general ability, we also sample data from the instruction data of DeepSeek-V2. Finally, we use an instruction dataset of 300M tokens. For training, we use a cosine schedule with 100 warm-up steps and an initial learning rate of $5 \times 10^{-6}$. We also use a batch size of 1M tokens and 1B tokens in total.

3.5.2 Reinforcement Learning

We further employ Reinforcement Learning (RL) techniques to fully simulate the capabilities of DeepSeek-Coder-V2, wh...

3.1 Training Strategy

(3.1 Training Strategy: see the original text.)

Original text: We use two training objectives for DeepSeek-Coder-V2 16B: Next-Token-Prediction and Fill-In-Middle (FIM) (Li et al., 2023b; Bavarian et al., 2022; Guo et al., 2024). For DeepSeek-Coder-V2 236B, we only utilize the Next-Token-Prediction objective. Here we give a brief introduction of the FIM training policy. We adopt the FIM training approach for the development of DeepSeek-Coder-V2-16B, leveraging the PSM (Prefix, Suffix, Middle) mode. This method structures the content reconstruction in the sequence Prefix, Suffix, Middle, as illustrated below: <｜fim_begin｜> $f_{pre}$ <｜fim_hol...
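
The PSM ordering described above can be sketched as a small data transform. This is a minimal illustration, not the authors' pipeline: the excerpt is cut off after the first sentinel, so `FIM_HOLE` and `FIM_END` are assumed names, the random character-level split is an assumption, and the `fim_rate` default of 0.5 is a placeholder.

```python
import random

# FIM_BEGIN matches the sentinel shown in the text; FIM_HOLE and FIM_END are
# assumed names for the two remaining sentinels.
FIM_BEGIN = "<｜fim_begin｜>"
FIM_HOLE = "<｜fim_hole｜>"
FIM_END = "<｜fim_end｜>"


def to_psm(document: str, rng: random.Random) -> str:
    """Cut a document at two random points and emit Prefix, Suffix, Middle."""
    i, j = sorted(rng.randint(0, len(document)) for _ in range(2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model sees prefix and suffix first, then learns to produce the middle.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"


def maybe_fim(document: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Apply the PSM transform with probability fim_rate (assumed value)."""
    return to_psm(document, rng) if rng.random() < fim_rate else document


rng = random.Random(0)
print(to_psm("def add(a, b):\n    return a + b\n", rng))
```

Because the middle span comes last, the ordinary next-token loss on the transformed string trains infilling without any change to the model or the objective.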

3.2 Model Architecture

(3.2 Model Architecture: see the original text.)

Original text: Our architecture aligns with that of DeepSeek-V2 (DeepSeek-AI, 2024). The hyperparameter settings, 16B and 236B, correspond to those used in DeepSeek-V2-Lite and DeepSeek-V2, respectively. Notably, we encountered instability during training and spikes in gradient values, which we attributed to the exponential normalization technique. To address this, we reverted to the conventional normalization method.

3.3 Training Hyper-Parameters

(3.3 Training Hyper-Parameters: see the original text.)

Original text: Consistent with the DeepSeek-V2 methodology (DeepSeek-AI, 2024), we utilize the AdamW optimizer (Loshchilov and Hutter, 2019), configured with $\beta_{1}=0.9$, $\beta_{2}=0.95$, and a weight decay of 0.1. Batch sizes and learning rates are adjusted according to DeepSeek-V2 specifications. For learning rate scheduling, we employ a cosine decay strategy, starting with 2000 warm-up steps and gradually reducing the learning rate to 10% of its initial value. Both DeepSeek-Coder-V2 and DeepSeek-Coder-V2-Lite are trained using the same metho...
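
The schedule described above (2000 warm-up steps, then cosine decay down to 10% of the initial value) can be written as a single function. This is a sketch under assumptions: the warm-up is taken to be linear, and `peak_lr` and `total_steps` are placeholders because the excerpt does not give them.

```python
import math


def lr_at(step: int, peak_lr: float, total_steps: int,
          warmup_steps: int = 2000, final_ratio: float = 0.1) -> float:
    """Linear warm-up followed by cosine decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        # Assumed linear ramp; reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)


# Placeholder values for illustration only.
PEAK_LR = 1e-4
TOTAL_STEPS = 10_000
for step in (0, 1999, 2000, 6000, 10_000):
    print(step, f"{lr_at(step, PEAK_LR, TOTAL_STEPS):.2e}")
```

Decaying to a floor rather than to zero keeps late-stage updates from vanishing, which matters when training continues from an intermediate checkpoint.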

3.4 Long Context Extension

(3.4 Long Context Extension: see the original text.)

Original text: Following DeepSeek-V2, we extend the context length of DeepSeek-Coder-V2 to 128K using YaRN (Peng et al., 2023). The hyper-parameters of YaRN are the same as in DeepSeek-V2: the scale $s$ is set to 40, $\alpha$ to 1, and $\beta$ to 32. We further continue training the model using two stages to enhance its capability for handling long contexts. In the first stage, we utilize a sequence length of 32K and a batch size of 1152 for 1000 steps. In the second stage, we train the model for an additional 1000 steps, employing a sequence length of 128K and a batch size of 288 sequences. It should be noted her...
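
A short calculation shows how the two stages relate. It assumes that "K" means 1024 tokens and that both batch sizes are counted in sequences (the text says so only for the second stage); `tokens_per_step` is an illustrative helper.

```python
K = 1024  # assumption: "32K" and "128K" are multiples of 1024 tokens

STAGES = [
    {"name": "stage 1", "seq_len": 32 * K, "batch_size": 1152, "steps": 1000},
    {"name": "stage 2", "seq_len": 128 * K, "batch_size": 288, "steps": 1000},
]


def tokens_per_step(stage: dict) -> int:
    """Tokens consumed by one optimizer step: sequences times sequence length."""
    return stage["seq_len"] * stage["batch_size"]


for stage in STAGES:
    per_step = tokens_per_step(stage)
    total = per_step * stage["steps"]
    print(f"{stage['name']}: {per_step:,} tokens/step, {total / 1e9:.1f}B tokens total")
```

Under these assumptions both stages process the same number of tokens per step: the sequence length grows fourfold while the batch size shrinks fourfold.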

3.5 Alignment

(3.5 Alignment: see the original text.)

Original text: 3.5.1 Supervised Fine-Tuning

To build DeepSeek-Coder-V2 Chat, we construct the instruction training dataset mixed with code and math data. We first collect 20k code-related instruction data and 30k math-related data from DeepSeek-Coder and DeepSeek-Math. To maintain the general ability, we also sample data from the instruction data of DeepSeek-V2. Finally, we use an instruction dataset of 300M tokens. For training, we use a cosine schedule with 100 warm-up steps and an initial learning rate of $5 \times 10^{-6}$. We also use a batch size of 1M tokens and 1B tokens in tota...
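
The fine-tuning budget implies a step count and a number of passes over the data, which the following arithmetic makes explicit. It assumes "M" and "B" denote 10^6 and 10^9 tokens and that the 1B figure is the total number of training tokens; the variable names are illustrative.

```python
DATASET_TOKENS = 300_000_000   # instruction dataset size (300M tokens)
BATCH_TOKENS = 1_000_000       # tokens per optimizer step (1M)
TOTAL_TOKENS = 1_000_000_000   # total tokens seen in fine-tuning (1B)
WARMUP_STEPS = 100             # warm-up steps quoted in the text

steps = TOTAL_TOKENS // BATCH_TOKENS
epochs = TOTAL_TOKENS / DATASET_TOKENS
warmup_share = WARMUP_STEPS / steps

print(f"optimizer steps: {steps}")
print(f"passes over the dataset: {epochs:.2f}")
print(f"warm-up share of training: {warmup_share:.0%}")
```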

Original text: all subsequent experiments.

Reinforcement Learning Algorithm: We employ Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as our RL algorithm, the same algorithm that DeepSeek-V2 uses. Notably, GRPO has proven quite effective and costs less than PPO, since there is no need to maintain an additional critic model.

Figure 3: Performances of Different Methods
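
The reason GRPO needs no critic is that it takes its baseline from a group of responses sampled for the same prompt. The sketch below shows that normalization step only, in the outcome-reward form; the use of the population standard deviation and the `eps` term are assumptions, and the policy update itself is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group.

    The group statistics play the role that a learned value function
    (critic) plays in PPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four sampled answers to one prompt, e.g. 1.0 when tests pass.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.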