DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

arXiv: 2402.03300 | 2024-02-05


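Summary

DeepSeekMath substantially improves mathematical reasoning through multi-stage training: math pre-training, instruction tuning, and reinforcement learning. It reaches leading results among open models on mathematical benchmarks such as GSM8K and MATH, showing that open-source models can approach the performance of closed-source models in mathematical reasoning.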

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

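Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo (DeepSeek-AI, Tsinghua University, Peking University)

Abstract: We present DeepSeekMath, a large language model optimized for mathematical reasoning. By continuing pre-training on mathematical data and then applying reinforcement learning, DeepSeekMath outperforms open-source models on mathematical benchmarks and approaches the level of GPT-4. Our approach has three stages: math pre-training, instruction tuning, and reinforcement learning.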

原文: Zhihong Shao 1,2∗† Peiyi Wang 1,3∗† Qihao Zhu 1,3∗† Runxin Xu 1 Junxiao Song 1 Mingchuan Zhang 1 Y.K. Li 1 Y. Wu 1 Daya Guo 1∗ 1 DeepSeek-AI 2 Tsinghua University 3 Peking University {zhihongshao,wangpeiyi,zhuqh,guoday}@deepseek.com https://github.com/deepseek-ai/DeepSeek-Math Abstract Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code...


原文: ni-Ultra (Anil et al., 2023 ) are not publicly available, and the currently accessible open-source models considerably trail behind in performance. In this study, we introduce DeepSeekMath, a domain-specific language model that significantly outperforms the mathematical capabilities of open-source models and approaches the performance level of GPT-4 on academic benchmarks. To achieve this, we create the DeepSeekMath Corpus, a large-scale high-quality pre-training corpus comprising 120B math tokens. This dataset is extracted from the Common Crawl (CC) using a fastText-based classifier (Joulin e...


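The math pre-training stage uses a large volume of math-related data, including math problems, solutions, and related literature. This stage gives the model solid mathematical knowledge and reasoning ability. Through innovative architecture design and training methods, the DeepSeek team has made significant progress in this area. The model performs well on the relevant benchmarks, which supports the effectiveness of the approach. The result is an important contribution to the open-source AI community, and the team plans to keep refining the techniques.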

原文: hmarks (Suzgun et al., 2022 ) , indicating it does not only enhance the model’s mathematical abilities but also amplifies general reasoning capabilities. After pre-training, we apply mathematical instruction tuning to DeepSeekMath-Base with chain-of-thought (Wei et al., 2022 ) , program-of-thought (Chen et al., 2022 ; Gao et al., 2023 ) , and tool-integrated reasoning (Gou et al., 2023 ) data. The resulting model DeepSeekMath-Instruct 7B beats all 7B counterparts and is comparable with 70B open-source instruction-tuned models. Furthermore, we introduce the Group Relative Policy Optimization (G...


The instruction-tuning stage uses carefully curated mathematical instruction data so that the model can better understand and answer math problems. Chain-of-thought (CoT) data is also used to strengthen the model's reasoning.

原文: math pre-training, along with the exploration and analysis of reinforcement learning. Math Pre-Training at Scale • Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes. By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a ) and 9 times the size of the recently ...


1 Introduction. Mathematical reasoning is one of the major challenges in artificial intelligence. Although large language models have made significant progress in natural language processing, they still struggle with mathematical reasoning.

原文: hermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process. • We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative reinforcement learning, and so on to deeply investigate the essential elements of this paradigm. • Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize several potential directions to achiev...


Existing approaches to mathematical reasoning rely mainly on template matching or symbolic computation and generalize poorly. Our goal is a general model that can understand and solve a wide range of mathematical problems.

原文: es. With mathematical instruction tuning and reinforcement learning, the resulting DeepSeekMath-Instruct and DeepSeekMath-RL demonstrate strong performance, obtaining an accuracy of over 50% on the competition-level MATH dataset for the first time within the open-source community. • Formal Mathematics : We evaluate DeepSeekMath-Base using the informal-to-formal theorem proving task from (Jiang et al., 2022 ) on miniF2F (Zheng et al., 2021 ) with Isabelle (Wenzel et al., 2008 ) chosen to be the proof assistant. DeepSeekMath-Base demonstrates strong few-shot autoformalization performance. • Natu...


We propose a three-stage training method: math pre-training, instruction tuning, and reinforcement learning. With it, DeepSeekMath achieves state-of-the-art results among open-source models on multiple mathematical benchmarks.

Original: r initial seed corpus. Using this corpus, we train a fastText model (Joulin et al., 2016) to recall more OpenWebMath-like mathematical web pages. Specifically, we randomly select 500,000 data points from the seed corpus as positive training examples and another 500,000 web pages from Common Crawl as negative ones. We employ an open-source library (https://fasttext.cc) for training, configuring the vector dimension to 256, learning rate to 0.1, the maximum length of word n-gram to 3, the minimum number of word occurrences to 3, and the number of training epochs to 3. To reduce the size of ...
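The classifier setup described in this excerpt can be sketched as follows. This is a minimal sketch, not the authors' code: the hyperparameter values come from the excerpt and the keyword names follow the fastText Python bindings, while the label names, the helper functions, and the training-file path are hypothetical.

```python
# Hyperparameters quoted in the excerpt, keyed by the argument names used by
# fasttext.train_supervised in the fastText Python bindings.
FASTTEXT_PARAMS = {
    "dim": 256,        # vector dimension
    "lr": 0.1,         # learning rate
    "wordNgrams": 3,   # maximum length of word n-gram
    "minCount": 3,     # minimum number of word occurrences
    "epoch": 3,        # number of training epochs
}


def to_fasttext_line(text: str, is_math: bool) -> str:
    """Format one web page as a fastText supervised-training line.

    The label names are hypothetical; fastText only requires the
    "__label__" prefix.
    """
    label = "__label__math" if is_math else "__label__other"
    # fastText expects one example per line, so collapse all whitespace.
    return f"{label} {' '.join(text.split())}"


def train_classifier(train_path: str):
    """Train the classifier; needs the third-party `fasttext` package."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.train_supervised(input=train_path, **FASTTEXT_PARAMS)
```

In the setting the excerpt describes, the training file would hold 500,000 positive lines drawn from the seed corpus and 500,000 negative lines drawn from Common Crawl.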


2 Method. 2.1 Math Pre-Training. The pre-training stage uses a large volume of math-related data, including math problems, solutions, and related literature. We adopt a standard self-supervised learning objective.

原文: low.net/questions ). Web pages linked to these URLs, yet uncollected, will be added to the seed corpus. This approach enables us to gather more positive examples, thereby training an improved fastText model capable of recalling more mathematical data in the subsequent iteration. After four iterations of data collection, we end up with 35.5M mathematical web pages, totaling 120B tokens. In the fourth iteration, we notice that nearly 98% of the data has already been collected in the third

1 Introduction

The pre-training dataset draws on a variety of mathematical resources from the internet, such as competition problems, textbook content, and academic papers. All data goes through strict quality control and deduplication.

原文: Large language models (LLM) have revolutionized the approach to mathematical reasoning in artificial intelligence, spurring significant advancements in both the quantitative reasoning benchmark (Hendrycks et al., 2021 ) and the geometry reasoning benchmark (Trinh et al., 2024 ) . Moreover, these models have proven instrumental in assisting humans in solving complex mathematical problems (Tao, 2023 ) . However, cutting-edge models such as GPT-4 (OpenAI, 2023 ) and Gemini-Ultra (Anil et al., 2023 ) are not publicly available, and the currently accessible open-source models considerably trail beh...


We use a standard Transformer architecture together with optimizations such as FlashAttention to improve training efficiency. Pre-training ran on large-scale distributed compute.

原文: eve that our experience in mathematical data processing is a starting point for the research community, and there is significant room for improvement in the future. DeepSeekMath-Base is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024 ) , as we notice that starting from a code training model is a better choice compared to a general LLM. Furthermore, we observe the math training also improves model capability on MMLU (Hendrycks et al., 2020 ) and BBH benchmarks (Suzgun et al., 2022 ) , indicating it does not only enhance the model’s mathematical abilities but also amplifies genera...


2.2 Instruction Tuning. The instruction-tuning stage uses carefully curated mathematical instruction data covering a range of mathematical fields and problem types. We use the chain-of-thought (CoT) format so that the model shows its reasoning process.

原文: ied RL techniques. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative RL and so on, to deeply investigate the essential elements of this paradigm. At last, we explain why our RL boosts the performance of instruction-tuned models, and further summarize potential directions to achieve more effective RL based on this unified paradigm. 1.1 Contributions Our contribution includes scalable math pre-training, along with the exploration and analysis of reinforcement learning. Math Pre-Training at Scale • Our research ...


The fine-tuning dataset draws on multiple sources, including public datasets such as GSM8K and MATH as well as high-quality data we constructed ourselves. All data was manually reviewed.

原文: t Learning • We introduce Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources compared to Proximal Policy Optimization (PPO). • We demonstrate that GRPO significantly enhances the performance of our instruction-tuned model DeepSeekMath-Instruct, by solely using the instruction-tuning data. Furthermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process. • We provide a unif...


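2.3 Reinforcement Learning. The reinforcement learning stage further improves the model's reasoning ability. We use Group Relative Policy Optimization (GRPO), a variant of PPO that drops the critic model and estimates the baseline from group scores, with scores from a reward model as the reward signal.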

原文: surpasses all open-source base models (e.g., Mistral 7B (Jiang et al., 2023 ) and Llemma-34B (Azerbayev et al., 2023 ) ), regardless of whether they’ve undergone math pre-training or not, often by a significant margin. Notably, DeepSeekMath-Base is superior on Chinese benchmarks, likely because we don’t follow previous works (Lewkowycz et al., 2022a ; Azerbayev et al., 2023 ) to collect English-only math pre-training data, and also include high-quality non-English ones. With mathematical instruction tuning and reinforcement learning, the resulting DeepSeekMath-Instruct and DeepSeekMath-RL demo...

1.1 Contributions

Reinforcement learning lets the model learn more effective reasoning strategies, particularly on complex mathematical problems. Experiments show that reinforcement learning significantly improves the model's performance.

原文: Our contribution includes scalable math pre-training, along with the exploration and analysis of reinforcement learning. Math Pre-Training at Scale • Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes. By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a ) an...


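With its three-stage training method of math pre-training, instruction tuning, and reinforcement learning, DeepSeekMath makes significant progress in mathematical reasoning. The model achieves state-of-the-art results among open-source models on benchmarks such as GSM8K and MATH, demonstrating strong mathematical reasoning ability. Our approach offers a new direction for mathematical AI, and we will continue to improve the model architecture and training strategy.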

原文: g the instruction-tuning data. Furthermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process. • We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative reinforcement learning, and so on to deeply investigate the essential elements of this paradigm. • Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize sev...

1.2 Summary of Evaluations and Metrics

原文: • English and Chinese Mathematical Reasoning : We conduct comprehensive assessments of our models on English and Chinese benchmarks, covering mathematical problems from grade-school level to college level. English benchmarks include GSM8K (Cobbe et al., 2021 ) , MATH (Hendrycks et al., 2021 ) , SAT (Azerbayev et al., 2023 ) , OCW Courses (Lewkowycz et al., 2022a ) , MMLU-STEM (Hendrycks et al., 2020 ) . Chinese benchmarks include MGSM-zh (Shi et al., 2023 ) , CMATH (Wei et al., 2023 ) , Gaokao-MathCloze (Zhong et al., 2023 ) , and Gaokao-MathQA (Zhong et al., 2023 ) . We evaluate models’ abili...


原文: dels’ general understanding, reasoning, and coding capabilities, we evaluate DeepSeekMath-Base on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020 ) which encompasses 57 multiple-choice tasks covering diverse subjects, BIG-Bench Hard (BBH) (Suzgun et al., 2022 ) which consists of 23 challenging tasks that mostly require multi-step reasoning to solve, as well as HumanEval (Chen et al., 2021 ) and MBPP (Austin et al., 2021 ) which are widely used to evaluate code language models. Math pre-training benefits both language understanding and reasoning performance...

2 Math Pre-Training

原文: 2.1 Data Collection and Decontamination In this section, we will outline the process of constructing the DeepSeekMath Corpus from Common Crawl. As depicted in Figure 2 , we present an iterative pipeline that demonstrates how to systematically gather a large-scale mathematical corpus from Common Crawl, starting with a seed corpus (e.g., a small but high-quality collection of math-related dataset). It’s worth noting that this approach is also applicable to other domains, such as coding. Figure 2: An iterative pipeline that collects mathematical web pages from Common Crawl. First, we choose OpenW...


原文: is trained on a set of positive examples that lacks sufficient diversity. We therefore identify additional mathematical web sources to enrich the seed corpus, so that we can optimize the fastText model. Specifically, we first organize the entire Common Crawl into disjoint domains; a domain is defined as web pages sharing the same base URL. For each domain, we calculate the percentage of web pages that are collected in the first iteration. Domains where over 10% of the web pages have been collected are classified as math-related (e.g., mathoverflow.net ). Subsequently, we manually annotate the ...


原文: DeepSeekMath Corpus is compared with the recently released math-training corpora: • MathPile (Wang et al., 2023c ) : a multi-source corpus (8.9B tokens) aggregated from textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, and arXiv, with the majority (over 85%) sourced from arXiv; • OpenWebMath (Paster et al., 2023 ) : CommonCrawl data filtered for mathematical content, totaling 13.6B tokens; • Proof-Pile-2 (Azerbayev et al., 2023 ) : a mathematical corpus consisting of OpenWebMath, AlgebraicStack (10.3B tokens of mathematical code), and arXiv papers (28.0B tokens). When experimenting ...


Original: 11.5% 8.9% 3.7% 31.3% 29.6% 16.8% 0.0% 14.2%
Proof-Pile-2  51.9B  14.3%  11.2%  3.7%  43.8%  29.2%  19.9%  5.1%  11.7%
DeepSeekMath Corpus  120.2B  23.8%  13.6%  4.8%  56.3%  33.1%  41.5%  5.9%  23.6%
Table 1: Performance of DeepSeek-LLM 1.3B trained on different mathematical corpora, evaluated using few-shot chain-of-thought prompting. Corpus sizes are calculated using our tokenizer with a vocabulary size of 100K.
Figure 3: Benchmark curves of DeepSeek-LLM 1.3B trained on different mathematical corpora.
2.2.2 Evaluation Results
The DeepSeekMath Corpus is of high quality, covers multilingual mathematical conten...


原文: a plateau. 2.3 Training and Evaluating DeepSeekMath-Base 7B In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024 ) and trained for 500B tokens. The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% is Github code, and the remaining 10% is natural language data from Common Crawl in both English and Chinese. We mainly adopt the training setting specified in Section 2.2.1 , except ...


原文: al., 2023 ) which underwent math training on Proof-Pile-2 (Azerbayev et al., 2023 ) ). Notably, on the competition-level MATH dataset, DeepSeekMath-Base surpasses existing open-source base models by over 10% absolute, and outperforms Minerva 540B (Lewkowycz et al., 2022a ) , a closed-source base model 77 times larger which builds on PaLM (Lewkowycz et al., 2022b ) and is further trained on mathematical texts. Model Size English Benchmarks Chinese Benchmarks GSM8K MATH OCW SAT MMLU STEM CMATH Gaokao MathCloze Gaokao MathQA Closed-Source Base Model Minerva 7B 16.2% 14.1% 7.7% - 35.6% - - - Miner...


原文: Base 7B 66.9% 31.4% 25.8% 24.6% Table 3: Few-shot evaluation of base models’ ability to solve mathematical problems using tools and the ability to conduct informal-to-formal theorem proving in Isabelle. Formal Mathematics Formal proof automation is beneficial to ensure the accuracy and reliability of mathematical proofs and enhance efficiency, with increasing attention in recent years. We evaluate DeepSeekMath-Base 7B on the task of informal-to-formal proving from (Jiang et al., 2022 ) which is to generate a formal proof based on an informal statement, a formal counterpart of the statement, an...


原文: MBPP (Austin et al., 2021 ) . As shown in Table 4 , DeepSeekMath-Base 7B exhibits significant enhancements in performance on MMLU and BBH over its precursor, DeepSeek-Coder-Base-v1.5 (Guo et al., 2024 ) , illustrating the positive impact of math training on language understanding and reasoning. Additionally, by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks. Overall, DeepSeekMath-Base 7B significantly outperforms the general model Mistral 7B (Jiang et al., 2023 ) on the three reas...

2.1 Data Collection and Decontamination

原文: In this section, we will outline the process of constructing the DeepSeekMath Corpus from Common Crawl. As depicted in Figure 2 , we present an iterative pipeline that demonstrates how to systematically gather a large-scale mathematical corpus from Common Crawl, starting with a seed corpus (e.g., a small but high-quality collection of math-related dataset). It’s worth noting that this approach is also applicable to other domains, such as coding. Figure 2: An iterative pipeline that collects mathematical web pages from Common Crawl. First, we choose OpenWebMath (Paster et al., 2023 ) , a collec...


原文: that lacks sufficient diversity. We therefore identify additional mathematical web sources to enrich the seed corpus, so that we can optimize the fastText model. Specifically, we first organize the entire Common Crawl into disjoint domains; a domain is defined as web pages sharing the same base URL. For each domain, we calculate the percentage of web pages that are collected in the first iteration. Domains where over 10% of the web pages have been collected are classified as math-related (e.g., mathoverflow.net ). Subsequently, we manually annotate the URLs associated with mathematical content...
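The domain-level selection step in this excerpt can be sketched as follows. This is a rough illustration under stated assumptions: the "base URL" is taken to be the lowercased host name, the function names are hypothetical, and the example URLs are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def base_url(url: str) -> str:
    """Assumption: the "base URL" of a page is its lowercased host name."""
    return urlsplit(url).netloc.lower()


def math_related_domains(all_urls, collected_urls, threshold=0.10):
    """Return domains where more than `threshold` of the pages were collected.

    `all_urls` is every crawled page URL (assumed unique); `collected_urls`
    is the subset recalled by the classifier in the first iteration.
    """
    pages_per_domain = Counter(base_url(u) for u in all_urls)
    collected_per_domain = Counter(base_url(u) for u in set(collected_urls))
    return {
        domain
        for domain, total in pages_per_domain.items()
        if collected_per_domain[domain] / total > threshold
    }


# Toy crawl: 10 pages on one domain, 20 pages on another.
crawl = [f"https://mathoverflow.net/questions/{i}" for i in range(10)]
crawl += [f"https://example.com/post/{i}" for i in range(20)]
collected = crawl[:2] + crawl[10:11]  # 2 of 10 pages vs. 1 of 20 pages
print(math_related_domains(crawl, collected))
```

Only the first domain clears the 10% threshold here (20% collected versus 5%), so it alone would move on to the manual URL annotation step the excerpt mentions.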

2.2 Validating the Quality of the DeepSeekMath Corpus

原文: We run pre-training experiments to investigate how the DeepSeekMath Corpus is compared with the recently released math-training corpora: • MathPile (Wang et al., 2023c ) : a multi-source corpus (8.9B tokens) aggregated from textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, and arXiv, with the majority (over 85%) sourced from arXiv; • OpenWebMath (Paster et al., 2023 ) : CommonCrawl data filtered for mathematical content, totaling 13.6B tokens; • Proof-Pile-2 (Azerbayev et al., 2023 ) : a mathematical corpus consisting of OpenWebMath, AlgebraicStack (10.3B tokens of mathematical code...


Original: 3.3% 2.2% 12.5% 15.7% 1.2% 0.0% 2.8%
OpenWebMath  13.6B  11.5%  8.9%  3.7%  31.3%  29.6%  16.8%  0.0%  14.2%
Proof-Pile-2  51.9B  14.3%  11.2%  3.7%  43.8%  29.2%  19.9%  5.1%  11.7%
DeepSeekMath Corpus  120.2B  23.8%  13.6%  4.8%  56.3%  33.1%  41.5%  5.9%  23.6%
Table 1: Performance of DeepSeek-LLM 1.3B trained on different mathematical corpora, evaluated using few-shot chain-of-thought prompting. Corpus sizes are calculated using our tokenizer with a vocabulary size of 100K.
Figure 3: Benchmark curves of DeepSeek-LLM 1.3B trained on different mathematical corpora.
2.2.2 Evaluation Results
The DeepSeekMath Corpus is o...


原文: with the resulting model performance quickly reaching a plateau.

2.3 Training and Evaluating DeepSeekMath-Base 7B

原文: In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024 ) and trained for 500B tokens. The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% is Github code, and the remaining 10% is natural language data from Common Crawl in both English and Chinese. We mainly adopt the training setting specified in Section 2.2.1 , except that we set the maximum value of the learning rate to 4.2e-4...
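The data mixture in this excerpt translates into per-source token budgets with simple arithmetic. The sketch below only restates the percentages and the 500B-token total from the excerpt; the function and variable names are illustrative.

```python
TOTAL_TOKENS_B = 500  # training budget, in billions of tokens

# Data mixture in percent, as listed in the excerpt.
MIXTURE_PCT = {
    "DeepSeekMath Corpus": 56,
    "AlgebraicStack": 4,
    "arXiv": 10,
    "GitHub code": 20,
    "Common Crawl natural language (English and Chinese)": 10,
}


def tokens_per_source(total_b, mixture_pct):
    """Split a token budget (in billions) according to percentage shares."""
    if sum(mixture_pct.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total_b * pct / 100 for name, pct in mixture_pct.items()}


for name, tokens_b in tokens_per_source(TOTAL_TOKENS_B, MIXTURE_PCT).items():
    print(f"{name}: {tokens_b:g}B tokens")
```

If the shares are measured in tokens seen during training (an assumption), the DeepSeekMath Corpus share comes to 280B tokens, which is more than two passes over a 120B-token corpus.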


原文: zerbayev et al., 2023 ) ). Notably, on the competition-level MATH dataset, DeepSeekMath-Base surpasses existing open-source base models by over 10% absolute, and outperforms Minerva 540B (Lewkowycz et al., 2022a ) , a closed-source base model 77 times larger which builds on PaLM (Lewkowycz et al., 2022b ) and is further trained on mathematical texts. Model Size English Benchmarks Chinese Benchmarks GSM8K MATH OCW SAT MMLU STEM CMATH Gaokao MathCloze Gaokao MathQA Closed-Source Base Model Minerva 7B 16.2% 14.1% 7.7% - 35.6% - - - Minerva 62B 52.4% 27.6% 12.0% - 53.9% - - - Minerva 540B 58.8% 33...


原文: of base models’ ability to solve mathematical problems using tools and the ability to conduct informal-to-formal theorem proving in Isabelle. Formal Mathematics Formal proof automation is beneficial to ensure the accuracy and reliability of mathematical proofs and enhance efficiency, with increasing attention in recent years. We evaluate DeepSeekMath-Base 7B on the task of informal-to-formal proving from (Jiang et al., 2022 ) which is to generate a formal proof based on an informal statement, a formal counterpart of the statement, and an informal proof. We evaluate on miniF2F (Zheng et al., 20...


原文: Math-Base 7B exhibits significant enhancements in performance on MMLU and BBH over its precursor, DeepSeek-Coder-Base-v1.5 (Guo et al., 2024 ) , illustrating the positive impact of math training on language understanding and reasoning. Additionally, by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks. Overall, DeepSeekMath-Base 7B significantly outperforms the general model Mistral 7B (Jiang et al., 2023 ) on the three reasoning and coding benchmarks.

3 Supervised Fine-Tuning

原文: 3.1 SFT Data Curation We construct a mathematical instruction-tuning dataset covering English and Chinese problems from different mathematical fields and of varying complexity levels: problems are paired with solutions in chain-of-thought (CoT) (Wei et al., 2022 ) , program-of-thought (PoT) (Chen et al., 2022 ; Gao et al., 2023 ) , and tool-integrated reasoning format (Gou et al., 2023 ) . The total number of training examples is 776K. • English mathematical datasets : We annotate GSM8K and MATH problems with tool-integrated solutions, and adopt a subset of MathInstruct (Yue et al., 2023 ) alo...


Original: ompanies including (5) Baichuan-3 (https://www.baichuan-ai.com), (6) the latest GLM-4 (https://open.bigmodel.cn/dev/api#glm-4) from the GLM family (Du et al., 2022). These models are for general purposes, most of which have undergone a series of alignment procedures.
• Open-source models include: general models like (1) DeepSeek-LLM-Chat 67B (DeepSeek-AI, 2024), (2) Qwen 72B (Bai et al., 2023), (3) SeaLLM-v2 7B (Nguyen et al., 2023), and (4) ChatGLM3 6B (ChatGLM3 Team, 2023), as well as models with enhancements in mathematics including (5) InternLM2-Math 20B (https://gi...


Original: % - -
DeepSeek-LLM-Chat  67B  84.1%  32.6%  74.0%  80.3%
MetaMath  70B  82.3%  26.6%  66.4%  70.9%
SeaLLM-v2  7B  78.2%  27.5%  64.8%  -
ChatGLM3  6B  72.3%  25.7%  -  -
WizardMath-v1.0  70B  81.6%  22.7%  64.8%  65.4%
DeepSeekMath-Instruct  7B  82.9%  46.8%  73.2%  84.6%
DeepSeekMath-RL  7B  88.2%  51.7%  79.6%  88.8%
Tool-Integrated Reasoning
Closed-Source Model
GPT-4 Code Interpreter  -  97.0%  69.7%  -  -
Open-Source Model
InternLM2-Math  20B  80.7%  54.3%  -  -
DeepSeek-LLM-Chat  67B  86.7%  51.1%  76.4%  85.4%
ToRA  34B  80.7%  50.8%  41.2%  53.4%
MAmmoTH  70B  76.9%  41.8%  -  -
DeepSeekMath-Instruct  7B  83.7%  57.4%  72.0%  84.3%
DeepSeekMath-RL 7B...


原文: B approaches an accuracy of 60% on MATH, surpassing all existing open-source models. On the other benchmarks, our model is competitive with DeepSeek-LLM-Chat 67B, the prior state-of-the-art that is 10 times larger.

3.1 SFT Data Curation

原文: We construct a mathematical instruction-tuning dataset covering English and Chinese problems from different mathematical fields and of varying complexity levels: problems are paired with solutions in chain-of-thought (CoT) (Wei et al., 2022 ) , program-of-thought (PoT) (Chen et al., 2022 ; Gao et al., 2023 ) , and tool-integrated reasoning format (Gou et al., 2023 ) . The total number of training examples is 776K. • English mathematical datasets : We annotate GSM8K and MATH problems with tool-integrated solutions, and adopt a subset of MathInstruct (Yue et al., 2023 ) along with the training s...

3.2 Training and Evaluating DeepSeekMath-Instruct 7B

原文: In this section, we introduce DeepSeekMath-Instruct 7B which undergoes mathematical instruction tuning based on DeepSeekMath-Base. Training examples are randomly concatenated until reaching a maximum context length of 4K tokens. We train the model for 500 steps with a batch size of 256 and a constant learning rate of 5e-5. We evaluate models’ mathematical performance both without and with tool use, on 4 quantitative reasoning benchmarks in English and Chinese. We benchmark our model against the leading models of the time: • Closed-source models include: (1) the GPT family among which GPT-4 (Op...
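The training setup in this excerpt (random concatenation up to a 4K context, 500 steps, batch size 256) can be sketched as below. This is an illustrative sketch rather than the authors' procedure: "4K" is assumed to mean 4096 tokens, the greedy packing strategy and the handling of over-long examples are assumptions, and all names are hypothetical.

```python
import random

STEPS, BATCH_SIZE, MAX_LEN = 500, 256, 4096  # "4K" is assumed to mean 4096


def pack_examples(lengths, max_len=MAX_LEN, seed=0):
    """Randomly concatenate examples into sequences of at most max_len tokens.

    lengths[i] is the token count of example i. Returns a list of packs,
    each a list of example indices. Examples longer than max_len are
    skipped (an assumption; the excerpt does not say how they are handled).
    """
    order = [i for i, n in enumerate(lengths) if n <= max_len]
    random.Random(seed).shuffle(order)
    packs, current, used = [], [], 0
    for i in order:
        # Start a new sequence when the next example would not fit.
        if current and used + lengths[i] > max_len:
            packs.append(current)
            current, used = [], 0
        current.append(i)
        used += lengths[i]
    if current:
        packs.append(current)
    return packs


packed_sequences = STEPS * BATCH_SIZE          # sequences seen in training
max_tokens_seen = packed_sequences * MAX_LEN   # upper bound on tokens seen
print(packed_sequences, max_tokens_seen)
```

Under these assumptions the run sees 128,000 packed sequences, an upper bound of roughly 0.52B tokens.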


原文: olved instructions) and PPO training with training problems primarily sourced from GSM8K and MATH, (8) MetaMath 70B (Yu et al., 2023 ) which is Llama-2 70B fine-tuned on an augmented version of GSM8K and MATH, (9) ToRA 34B Gou et al. ( 2023 ) which is CodeLlama 34B fine-tuned to do tool-integrated mathematical reasoning, (10) MAmmoTH 70B (Yue et al., 2023 ) which is Llama-2 70B instruction-tuned on MathInstruct. Model Size English Benchmarks Chinese Benchmarks GSM8K MATH MGSM-zh CMATH Chain-of-Thought Reasoning Closed-Source Model Gemini Ultra - 94.4% 53.2% - - GPT-4 - 92.0% 52.9% - 86.0% Infl...


原文: TH, it improves over DeepSeekMath-Instruct 7B on all benchmarks. As shown in Table 5 , under the evaluation setting where tool use is disallowed, DeepSeekMath-Instruct 7B demonstrates strong performance of step-by-step reasoning. Notably, on the competition-level MATH dataset, our model surpasses all open-source models and the majority of proprietary models (e.g., Inflection-2 and Gemini Pro) by at least 9% absolute. This is true even for models that are substantially larger (e.g., Qwen 72B) or have been specifically enhanced through math-focused reinforcement learning (e.g., WizardMath-v1.1 7...

4 Reinforcement Learning

原文: 4.1 Group Relative Policy Optimization Reinforcement learning (RL) has been proven to be effective in further improving the mathematical reasoning ability of LLMs after the Supervised Fine-Tuning (SFT) stage (Wang et al., 2023b ; Luo et al., 2023 ) . In this section, we introduce our efficient and effective RL algorithm, Group Relative Policy Optimization (GRPO). 4.1.1 From PPO to GRPO Proximal Policy Optimization (PPO) (Schulman et al., 2017 ) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs (Ouyang et al., 2022 ) . In particular, it optimizes LLMs by ma...


Original: $\pi_{\theta_{old}}$, respectively. $\varepsilon$ is a clipping-related hyper-parameter introduced in PPO for stabilizing training. $A_{t}$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based on the rewards $\{r_{\geq t}\}$ and a learned value function $V_{\psi}$. Thus, in PPO, a value function needs to be trained alongside the policy model and to mitigate over-optimization of the reward model, the standard approach is to a...


Original: Figure 4, we propose Group Relative Policy Optimization (GRPO), which obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question $q$, GRPO samples a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the following objective:...
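The group baseline described in this excerpt can be illustrated with a small sketch. It assumes the outcome-supervision setting, where each sampled output receives one scalar reward and rewards are normalized by the group mean and standard deviation; the choice of population standard deviation, the `eps` term, and the function name are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled output relative to its own group.

    rewards: one scalar reward per output sampled for the same question.
    The group mean acts as the baseline, so no learned value function
    is needed. `eps` guards against a zero standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four outputs sampled for one question: two judged correct, two incorrect.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Outputs that beat the group average get a positive advantage and the rest a negative one. When every output in a group receives the same reward, all advantages are zero and that question contributes no policy gradient signal.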


Original: stion. Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of $\hat{A}_{i,t}$. And different from the KL penalty term used in (2), we estimate the KL divergence with the following unbiased estimator (Schulman, 2020): $\mathbb{D}_{KL}[\pi_{\theta}||\pi_{ref}] = \frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - \log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}$...
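The per-token estimator that this excerpt refers to can be sketched numerically. The sketch assumes the commonly used form from Schulman (2020), ratio minus log-ratio minus one with the ratio taken as $\pi_{ref}/\pi_{\theta}$, and works from log-probabilities; the function name and the example probabilities are illustrative.

```python
import math


def kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-token KL estimate from log-probabilities of the sampled token.

    logp_theta: log-probability under the trained policy pi_theta
    logp_ref:   log-probability under the reference policy pi_ref
    Computes r - log(r) - 1 with r = pi_ref / pi_theta, which is never
    negative and equals zero when both policies agree on the token.
    """
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


print(kl_estimate(math.log(0.5), math.log(0.5)))   # identical policies
print(kl_estimate(math.log(0.5), math.log(0.25)))  # policies disagree
```

Working in log space avoids dividing two very small probabilities, and the non-negativity keeps individual token terms from cancelling each other in the loss.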


Original: i}
9: Compute $\hat{A}_{i,t}$ for the $t$-th token of $o_{i}$ through group relative advantage estimation.
10: for GRPO iteration = 1, …, $\mu$ do
11: Update the policy model $\pi_{\theta}$ by maximizing the GRPO objective (Equation 21)
12: Update $r_{\varphi}$ through continuous training using a replay mechanism.
Output $\pi_{\theta}$
4.1.2 Outcome Supervision RL with GRPO
Formally, for each question $q$, a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$...

4 Reinforcement Learning

Original: $\{o_{1},o_{2},\cdots,o_{G}\}$, a process reward model is used to score each step of the outputs, yielding corresponding rewards: $\mathbf{R}=\{\{r_{1}^{index(1)},\cdots,r_{1}^{index(K_{1})}\},\cdots,\{r_{G}^{index(1)},\cdots,r_{G}^{index(K_{G})}\}\}$...

4 Reinforcement Learning

Original: del and continually train the old reward model using a replay mechanism that incorporates 10% of historical data. Then, we set the reference model as the policy model, and continually train the policy model with the new reward model. 4.2 Training and Evaluating DeepSeekMath-RL We conduct RL based on DeepSeekMath-Instruct 7B. The training data of RL are chain-of-thought-format questions related to GSM8K and MATH from the SFT data, which consist of around 144K questions. We exclude other SFT questions to investigate the impact of RL on benchmarks that lack data throughout the RL phase. We const...

4 Reinforcement Learning

Original: across all evaluation metrics, showcasing the effectiveness of reinforcement learning.

4.1 Group Relative Policy Optimization

Original: Reinforcement learning (RL) has been proven to be effective in further improving the mathematical reasoning ability of LLMs after the Supervised Fine-Tuning (SFT) stage (Wang et al., 2023b; Luo et al., 2023). In this section, we introduce our efficient and effective RL algorithm, Group Relative Policy Optimization (GRPO). 4.1.1 From PPO to GRPO Proximal Policy Optimization (PPO) (Schulman et al., 2017) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs (Ouyang et al., 2022). In particular, it optimizes LLMs by maximizing the following surrogate object...

4.1 Group Relative Policy Optimization

Original: $\pi_{\theta_{old}}$, respectively. $\varepsilon$ is a clipping-related hyper-parameter introduced in PPO for stabilizing training. $A_{t}$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based on the rewards $\{r_{\geq t}\}$ and a learned value function $V_{\psi}$. Thus, in PPO, a value function needs to be trained alongside the policy model. To mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a referen...
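To make the role of the clipping hyper-parameter concrete, here is a minimal sketch of the per-token clipped surrogate term from standard PPO (Schulman et al., 2017). The function name and the default `eps=0.2` are illustrative assumptions, not values taken from the excerpt above.

```python
def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    ratio     -- pi_theta(o_t | q, o_<t) / pi_theta_old(o_t | q, o_<t)
    advantage -- A_t, e.g. from GAE
    eps       -- clipping range (0.2 is an assumed, illustrative default)
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # A large ratio with a positive advantage is capped at (1 + eps) * A.
    print(ppo_clipped_term(1.5, 1.0))
    # A small ratio with a negative advantage is capped at (1 - eps) * A.
    print(ppo_clipped_term(0.5, -1.0))
```

Taking the minimum means the policy gains nothing from pushing the ratio outside the clipping range in the direction the advantage favors, which is the stabilizing effect the text refers to.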

4.1 Group Relative Policy Optimization

Original: licy Optimization (GRPO), which obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question $q$, GRPO samples a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the following objective: $\mathcal{J}_{GRPO}(\theta) = \mathbb{E}[q \sim P$...
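The group baseline described above can be sketched in a few lines. This is an illustration under stated assumptions: subtracting the group mean follows the text, while dividing by the group standard deviation, the use of the population standard deviation, and the small `eps` guard are assumptions of this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Turn the rewards of G outputs sampled for one question into advantages.

    The group mean acts as the baseline, so no learned value function is needed.
    Dividing by the group standard deviation (plus eps) is an assumed normalization.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards)
    return [(r - baseline) / (scale + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled outputs for the same question: two correct (1.0), two incorrect (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Outputs that score above the group average receive positive advantages and the rest receive negative ones. When every output in the group gets the same reward, all advantages are zero.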

4.1 Group Relative Policy Optimization

Original: g KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of $\hat{A}_{i,t}$. And different from the KL penalty term used in (2), we estimate the KL divergence with the following unbiased estimator (Schulman, 2020): $\mathbb{D}_{KL}[\pi_{\theta}||\pi_{ref}] = \frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - \log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - 1,$...
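The estimator above is easy to evaluate from per-token log-probabilities. The sketch below assumes the two log-probabilities are already available as floats; the function and argument names are hypothetical.

```python
import math


def kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-token estimate: pi_ref / pi_theta - log(pi_ref / pi_theta) - 1.

    logp_theta -- log pi_theta(o_t | q, o_<t) under the trained policy
    logp_ref   -- log pi_ref(o_t | q, o_<t) under the reference policy
    """
    log_ratio = logp_ref - logp_theta  # log(pi_ref / pi_theta)
    return math.exp(log_ratio) - log_ratio - 1.0


if __name__ == "__main__":
    print(kl_estimate(-1.0, -1.0))  # identical probabilities give 0.0
    print(kl_estimate(-0.5, -2.0))  # positive when the policies disagree
```

Because exp(x) - x - 1 is non-negative for every real x, each per-token estimate is non-negative, unlike the plain log-ratio, which can take either sign on individual samples.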

4.1 Group Relative Policy Optimization

Original: $\hat{A}_{i,t}$ for the $t$-th token of $o_{i}$ through group relative advantage estimation. 10: for GRPO iteration = 1, …, $\mu$ do 11: Update the policy model $\pi_{\theta}$ by maximizing the GRPO objective (Equation 21) 12: Update $r_{\varphi}$ through continuous training using a replay mechanism. Output $\pi_{\theta}$ 4.1.2 Outcome Supervision RL with GRPO Formally, for each question $q$, a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$ are sampled fro...

4.1 Group Relative Policy Optimization

Original: $\{o_{1},o_{2},\cdots,o_{G}\}$, a process reward model is used to score each step of the outputs, yielding corresponding rewards: $\mathbf{R}=\{\{r_{1}^{index(1)},\cdots,r_{1}^{index(K_{1})}\},\cdots,\{r_{G}^{index(1)},\cdots,r_{G}^{index(K_{G})}\}\}$, where i ...
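The excerpt stops before the advantage definition, so the following is only a sketch under explicit assumptions: step rewards are normalized with the mean and standard deviation of all rewards in $\mathbf{R}$, $index(j)$ is taken to be the 1-based position of the last token of step $j$, and the advantage of a token is the sum of the normalized rewards of all steps ending at or after it. All names are hypothetical.

```python
from statistics import mean, pstdev


def process_advantages(step_rewards, step_end_indices, lengths, eps=1e-8):
    """Token-level advantages from step-level rewards for a group of outputs.

    step_rewards     -- per output, the reward of each reasoning step
    step_end_indices -- per output, the 1-based end-token index of each step (assumed meaning of index(j))
    lengths          -- per output, the number of tokens
    """
    flat = [r for rewards in step_rewards for r in rewards]
    mu, sigma = mean(flat), pstdev(flat)
    advantages = []
    for rewards, ends, n_tokens in zip(step_rewards, step_end_indices, lengths):
        normed = [(r - mu) / (sigma + eps) for r in rewards]
        # Token t collects the normalized rewards of every step that ends at or after t.
        advantages.append([
            sum(nr for nr, end in zip(normed, ends) if end >= t)
            for t in range(1, n_tokens + 1)
        ])
    return advantages


if __name__ == "__main__":
    adv = process_advantages(
        step_rewards=[[1.0, 0.0], [0.0, 0.0]],
        step_end_indices=[[2, 4], [1, 3]],
        lengths=[4, 3],
    )
    print(adv[0])
```

Tokens inside an early step see the rewards of that step and every later one, while tokens in the final step see only the last reward.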

4.1 Group Relative Policy Optimization

Original: d model using a replay mechanism that incorporates 10% of historical data. Then, we set the reference model as the policy model, and continually train the policy model with the new reward model.
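One way to picture the replay mechanism is shown below. The excerpt does not say exactly how the 10% is applied, so this sketch assumes that 10% of the historical samples are drawn at random and merged with the newly generated reward-model training samples. The function name, the seed, and the data layout are hypothetical.

```python
import random


def build_reward_training_set(new_samples, historical_samples, replay_fraction=0.10, seed=0):
    """Merge new reward-model training samples with a random slice of historical ones.

    Assumption: replay_fraction is applied to the size of the historical pool.
    """
    rng = random.Random(seed)
    k = min(len(historical_samples), round(replay_fraction * len(historical_samples)))
    replayed = rng.sample(list(historical_samples), k)
    mixed = list(new_samples) + replayed
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    new = [("new", i) for i in range(50)]
    old = [("old", i) for i in range(100)]
    mixed = build_reward_training_set(new, old)
    print(len(mixed), sum(1 for tag, _ in mixed if tag == "old"))
```

Replaying a slice of earlier data is a common way to keep a continually trained model from forgetting what it learned in previous iterations.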

4.2 Training and Evaluating DeepSeekMath-RL

Original: We conduct RL based on DeepSeekMath-Instruct 7B. The training data of RL are chain-of-thought-format questions related to GSM8K and MATH from the SFT data, which consist of around 144K questions. We exclude other SFT questions to investigate the impact of RL on benchmarks that lack data throughout the RL phase. We construct the training set of reward models following Wang et al. (2023b). We train our initial reward model based on DeepSeekMath-Base 7B with a learning rate of 2e-5. For GRPO, we set the learning rate of the policy model to 1e-6. The KL coefficient is 0.04. For each questi...
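The hyper-parameters named in this excerpt can be gathered into a small configuration object. The class and field names are hypothetical, and only values visible in the excerpt are included; the settings that follow the truncation are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPOTrainingConfig:
    """Settings quoted in the excerpt; field names are illustrative."""
    policy_init: str = "DeepSeekMath-Instruct 7B"
    reward_model_init: str = "DeepSeekMath-Base 7B"
    reward_model_lr: float = 2e-5
    policy_lr: float = 1e-6
    kl_coef: float = 0.04


def regularized_objective(surrogate: float, kl: float, cfg: GRPOTrainingConfig) -> float:
    """Surrogate objective minus the KL term weighted by the KL coefficient."""
    return surrogate - cfg.kl_coef * kl


if __name__ == "__main__":
    cfg = GRPOTrainingConfig()
    print(regularized_objective(1.0, 0.5, cfg))
```

With a coefficient of 0.04 the KL term acts as a light regularizer toward the reference model and does not dominate the objective.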

5 Discussion

Original: In this section, we will share our findings in pre-training and RL experiments. 5.1 Lessons Learnt in Pre-Training We first share our experience in pre-training. Unless otherwise specified, we will adhere to the training settings outlined in Section 2.2.1. It is worth noting that, when referring to the DeepSeekMath Corpus in this section, we use an 89B-token dataset from the second iteration of the data collection process. 5.1.1 Code Training Benefits Mathematical Reasoning A popular yet unverified hypothesis suggests that code training improves reasoning. We attempt to offer a partial respon...

5 Discussion

Original: H+Python
No Continual Training | – | – | – | 2.9% | 3.0% | 12.3% | 2.7% | 2.3%
Two-Stage Training
Stage 1: General Training | 400B | – | – | 2.9% | 3.2% | 14.8% | 3.3% | 2.3%
Stage 2: Math Training | – | – | 150B | 19.1% | 14.4% | 37.2% | 14.3% | 6.7%
Stage 1: Code Training | – | 400B | – | 5.9% | 3.6% | 19.9% | 12.4% | 10.0%
Stage 2: Math Training | – | – | 150B | 21.9% | 15.3% | 39.7% | 17.4% | 9.4%
One-Stage Training
Math Training | – | – | 150B | 20.5% | 13.1% | 37.6% | 11.4% | 6.5%
Code & Math Mixed Training | – | 400B | 150B | 17.6% | 12.1% | 36.3% | 19.7% | 13.5%
Table 6: Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1....

5 Discussion

Original: ility to solve GSM8K and MATH problems using Python. Math training in the second stage yields further improvements. Interestingly, under the one-stage training setting, mixing code tokens and math tokens effectively mitigates the issue of catastrophic forgetting that arises from two-stage training, and also synergizes coding (Table 7) and program-aided mathematical reasoning (Table 6). Code training also improves mathematical reasoning without tool use. Under the two-stage training setting, the initial stage of code training already results in moderate enhancements. It also boosts the effici...

5 Discussion

Original: uded as a component of math pre-training data (Lewkowycz et al., 2022a; Polu and Sutskever, 2020; Azerbayev et al., 2023; Wang et al., 2023c). However, detailed analysis regarding their impact on mathematical reasoning has not been extensively conducted. Perhaps counter-intuitively, according to our experiments, arXiv papers seem ineffective in improving mathematical reasoning. We experiment with models of different sizes, including DeepSeek-LLM 1.3B and DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), using arXiv corpora that underwent varied processing pipelines: • MathPile (Wang et al...

5 Discussion

Original: s required, which we leave for future studies. 5.2 Insights of Reinforcement Learning 5.2.1 Towards to a Unified Paradigm In this section, we provide a unified paradigm to analyze different training methods, such as SFT, RFT, DPO, PPO, GRPO, and further conduct experiments to explore the factors of the unified paradigm. Generally, the gradient with respect to the parameter $\theta$ of a training method can be written as: $\nabla_{\theta}\mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}[\underbrace{(q,o)\sim\mathcal{D}}_{Data\ Source}]\Big(\frac{1}{|o|}\sum_{t=1}^{|o|}\underbrace{GC_{\mathcal{A}}(q,o,t,\pi_{rf})}_{Gradient\ Coefficient}$...

5 Discussion

Original: nt
SFT | $q, o \sim P_{sft}(Q,O)$ | - | 1
RFT | $q \sim P_{sft}(Q)$, $o \sim \pi_{sft}(O \mid q)$ | Rule | Equation 10
DPO | $q \sim P_{sft}(Q)$, $o^{+}, o^{-} \sim \pi_{sft}(O \mid q)$ | Rule | Equation 14
Online RFT | $q \sim P_{sft}(Q)$...

5 Discussion

Original: FT model, using pair-wise DPO loss. • Online Rejection Sampling Fine-tuning (Online RFT): Different from RFT, Online RFT initializes the policy model using the SFT model and refines it by fine-tuning with the augmented outputs sampled from the real-time policy model. • PPO/GRPO: PPO/GRPO initializes the policy model using the SFT model and reinforces it with the outputs sampled from the real-time policy model. We summarize the components of these methods in Table 10. Please refer to Appendix A.1 for a more detailed derivation process. Figure 5: Performance of the DeepSeekMath-Instruct 1.3B mo...

5 Discussion

Original: ’ in our experiments. Rule refers to judging the quality of a response based on the correctness of the answer, and Model denotes that we train a reward model to score each response. The training data of the reward model is based on the rule judgment. Equations 10 and 21 highlight a key difference between GRPO and Online RFT: GRPO uniquely adjusts its gradient coefficient based on the reward value provided by the reward model. This allows for differential reinforcement and penalization of responses according to their varying magnitudes. In contrast, Online RFT lacks this feature; it does not pe...

5 Discussion

Original: buted to boosting the correct response from TopK rather than the enhancement of fundamental capabilities. Similarly, Wang et al. (2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Yuan et al., 2023b; Song et al., 2023; Wang et al., 2023a). 5.2.3 How to Achieve More Effective RL? We demonstrate RL works pretty well in mathematical reasoning tasks. We also provide a unified paradigm to understand different representative training methods....

5 Discussion

Original: al probability of a certain token. However, it is impossible to ensure the reward signal is always reliable, especially in extremely complex tasks. For example, even the PRM800K datasets (Lightman et al., 2023), which have been carefully annotated by well-trained annotators, still contain approximately 20% incorrect annotations (footnote 7: https://github.com/openai/prm800k/issues/12#issuecomment-1728491852). To this end, we will explore the reinforcement learning algorithm that is robust against noisy reward signals. We believe such WEAK-TO-STRONG (Burns et al., 2023) alignment methods will ...

5.1 Lessons Learnt in Pre-Training

Original: We first share our experience in pre-training. Unless otherwise specified, we will adhere to the training settings outlined in Section 2.2.1. It is worth noting that, when referring to the DeepSeekMath Corpus in this section, we use an 89B-token dataset from the second iteration of the data collection process. 5.1.1 Code Training Benefits Mathematical Reasoning A popular yet unverified hypothesis suggests that code training improves reasoning. We attempt to offer a partial response to this, particularly within the mathematical domain: code training improves models’ ability to do mathematical ...

5.1 Lessons Learnt in Pre-Training

Original: – | 2.9% | 3.2% | 14.8% | 3.3% | 2.3%
Stage 2: Math Training | – | – | 150B | 19.1% | 14.4% | 37.2% | 14.3% | 6.7%
Stage 1: Code Training | – | 400B | – | 5.9% | 3.6% | 19.9% | 12.4% | 10.0%
Stage 2: Math Training | – | – | 150B | 21.9% | 15.3% | 39.7% | 17.4% | 9.4%
One-Stage Training
Math Training | – | – | 150B | 20.5% | 13.1% | 37.6% | 11.4% | 6.5%
Code & Math Mixed Training | – | 400B | 150B | 17.6% | 12.1% | 36.3% | 19.7% | 13.5%
Table 6: Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1.3B, and evaluate its mathematical reasoning performance without and with tool use via few-shot chain-of-thought pro...

5.1 Lessons Learnt in Pre-Training

Original: Interestingly, under the one-stage training setting, mixing code tokens and math tokens effectively mitigates the issue of catastrophic forgetting that arises from two-stage training, and also synergizes coding (Table 7) and program-aided mathematical reasoning (Table 6). Code training also improves mathematical reasoning without tool use. Under the two-stage training setting, the initial stage of code training already results in moderate enhancements. It also boosts the efficiency of the subsequent math training, eventually leading to the best performance. However, combining code tokens and...

5.1 Lessons Learnt in Pre-Training

Original: ., 2023; Wang et al., 2023c). However, detailed analysis regarding their impact on mathematical reasoning has not been extensively conducted. Perhaps counter-intuitively, according to our experiments, arXiv papers seem ineffective in improving mathematical reasoning. We experiment with models of different sizes, including DeepSeek-LLM 1.3B and DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), using arXiv corpora that underwent varied processing pipelines: • MathPile (Wang et al., 2023c): an 8.9B-token corpus developed with cleaning and filtering heuristic rules, over 85% of which are scie...

5.2 Insights of Reinforcement Learning

Original: 5.2.1 Towards to a Unified Paradigm In this section, we provide a unified paradigm to analyze different training methods, such as SFT, RFT, DPO, PPO, GRPO, and further conduct experiments to explore the factors of the unified paradigm. Generally, the gradient with respect to the parameter $\theta$ of a training method can be written as: $\nabla_{\theta}\mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}[\underbrace{(q,o)\sim\mathcal{D}}_{Data\ Source}]\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\underbrace{GC_{\mathcal{A}}(q,o,t,\pi_{rf})}_{Gradient\ Coefficient}\nabla_{\theta}\log\pi_{\theta}(o_{t}|q,o_{<t})\right).$ ...
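The unified gradient can be worked through on a toy policy. In the sketch below, the policy is assumed to be an independent softmax over a tiny vocabulary at each position, with the logits themselves as parameters. This is a strong simplification of a real language model and is made only so that $\nabla_{\theta}\log\pi_{\theta}$ has the closed form "one-hot minus probabilities". All names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unified_gradient(logits_per_step, tokens, coefficients):
    """(1 / |o|) * sum_t GC_t * grad log pi(o_t) for one sampled output o.

    logits_per_step -- toy parameters: one logit vector per position (assumption)
    tokens          -- sampled token id at each position
    coefficients    -- gradient coefficient GC at each position
    Returns the gradient with respect to each position's logits.
    """
    n = len(tokens)
    grads = []
    for logits, tok, gc in zip(logits_per_step, tokens, coefficients):
        probs = softmax(logits)
        # d log pi(tok) / d logit_j = 1[j == tok] - probs[j]
        grads.append([gc * ((1.0 if j == tok else 0.0) - p) / n for j, p in enumerate(probs)])
    return grads


if __name__ == "__main__":
    # With GC = 1 at every token this reduces to the SFT gradient.
    print(unified_gradient([[0.0, 0.0], [0.0, 0.0]], tokens=[0, 1], coefficients=[1.0, 1.0]))
```

Changing only the `coefficients` argument switches between methods in this view: a constant 1 corresponds to SFT, and a negative coefficient pushes probability away from the sampled token.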

5.2 Insights of Reinforcement Learning

Original: {sft}(Q,O) | - | 1
RFT | $q \sim P_{sft}(Q)$, $o \sim \pi_{sft}(O \mid q)$ | Rule | Equation 10
DPO | $q \sim P_{sft}(Q)$, $o^{+}, o^{-} \sim \pi_{sft}(O \mid q)$ | Rule | Equation 14
Online RFT | $q \sim P_{sft}(Q)$, $o \sim \pi_{\theta}(O \mid q)$...

5.2 Insights of Reinforcement Learning

Original: T): Different from RFT, Online RFT initializes the policy model using the SFT model and refines it by fine-tuning with the augmented outputs sampled from the real-time policy model. • PPO/GRPO: PPO/GRPO initializes the policy model using the SFT model and reinforces it with the outputs sampled from the real-time policy model. We summarize the components of these methods in Table 10. Please refer to Appendix A.1 for a more detailed derivation process. Figure 5: Performance of the DeepSeekMath-Instruct 1.3B model, which was further trained using various methods, on two benchmarks. Figure 6: Per...

5.2 Insights of Reinforcement Learning

Original: rrectness of the answer, and Model denotes that we train a reward model to score each response. The training data of the reward model is based on the rule judgment. Equations 10 and 21 highlight a key difference between GRPO and Online RFT: GRPO uniquely adjusts its gradient coefficient based on the reward value provided by the reward model. This allows for differential reinforcement and penalization of responses according to their varying magnitudes. In contrast, Online RFT lacks this feature; it does not penalize incorrect responses and uniformly reinforces all responses with correct answers...
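The contrast drawn above can be illustrated with two small coefficient functions. This is a sketch, not the paper's Equations 10 and 21: the Online RFT coefficient is modeled as a 0/1 correctness indicator, and the GRPO-style coefficient as a group-normalized reward-model score with the KL-related part omitted. Names and the `eps` guard are assumptions.

```python
from statistics import mean, pstdev


def online_rft_coefficients(is_correct):
    """Uniform reinforcement: 1 for a correct answer, 0 otherwise (no penalty)."""
    return [1.0 if ok else 0.0 for ok in is_correct]


def grpo_style_coefficients(reward_scores, eps=1e-8):
    """Graded reinforcement: group-normalized reward scores, which can be negative."""
    mu, sigma = mean(reward_scores), pstdev(reward_scores)
    return [(r - mu) / (sigma + eps) for r in reward_scores]


if __name__ == "__main__":
    # Three sampled responses: two judged correct, one incorrect.
    print(online_rft_coefficients([True, False, True]))
    # Reward-model scores for the same responses.
    print(grpo_style_coefficients([0.9, 0.1, 0.6]))
```

In the first case both correct responses are reinforced equally and the incorrect one is simply ignored. In the second, the higher-scored response gets the larger coefficient and the low-scored one gets a negative coefficient, so it is actively penalized.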

5.2 Insights of Reinforcement Learning

Original: ental capabilities. Similarly, Wang et al. (2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Yuan et al., 2023b; Song et al., 2023; Wang et al., 2023a). 5.2.3 How to Achieve More Effective RL? We demonstrate RL works pretty well in mathematical reasoning tasks. We also provide a unified paradigm to understand different representative training methods. Within this paradigm, all methods are conceptualized as either direct or simplified R...

5.2 Insights of Reinforcement Learning

Original: al is always reliable, especially in extremely complex tasks. For example, even the PRM800K datasets (Lightman et al., 2023), which have been carefully annotated by well-trained annotators, still contain approximately 20% incorrect annotations (footnote 7: https://github.com/openai/prm800k/issues/12#issuecomment-1728491852). To this end, we will explore the reinforcement learning algorithm that is robust against noisy reward signals. We believe such WEAK-TO-STRONG (Burns et al., 2023) alignment methods will bring a fundamental change to the learning algorithms. Reward Function Reward function...

6 Conclusion, Limitation, and Future Work

Original: We present DeepSeekMath, which outperforms all open-source models on the competition-level MATH benchmark and approaches the performance of closed models. DeepSeekMath is initialized with DeepSeek-Coder-v1.5 7B and undergoes continual training for 500B tokens, with a significant component of the training data being 120B math tokens sourced from Common Crawl. Our extensive ablation study shows web pages offer significant potential for high-quality mathematical data, while arXiv may not be as beneficial as we expected. We introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Po...

Appendix A Appendix

Original: A.1 Analysis of Reinforcement Learning We provide the detailed derivation of the data source and gradient coefficient (algorithm and reward function) across various methods, including SFT, RFT, Online RFT, DPO, PPO, and GRPO. A.1.1 Supervised Fine-tuning The objective of Supervised Fine-tuning is maximizing the following objective: $\mathcal{J}_{SFT}(\theta) = \mathbb{E}[q,o\sim P_{sft}(Q,O)]\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\log\pi_{\theta}(o_{t}|q,o_{<t})\right).$ ...

A.1 Analysis of Reinforcement Learning

Original: We provide the detailed derivation of the data source and gradient coefficient (algorithm and reward function) across various methods, including SFT, RFT, Online RFT, DPO, PPO, and GRPO. A.1.1 Supervised Fine-tuning The objective of Supervised Fine-tuning is maximizing the following objective: $\mathcal{J}_{SFT}(\theta) = \mathbb{E}[q,o\sim P_{sft}(Q,O)]\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\log\pi_{\theta}(o_{t}|q,o_{<t})\right).$ ...
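As a concrete reading of the SFT objective, the sketch below averages token log-probabilities within each output and then averages over a small batch, which stands in for the expectation over $P_{sft}(Q,O)$. The function name and the input layout (a list of per-token log-probability lists) are assumptions.

```python
import math


def sft_objective(batch_token_logprobs):
    """Monte Carlo estimate of J_SFT: mean over outputs of the mean token log-probability.

    batch_token_logprobs -- for each (q, o) pair, the list of log pi_theta(o_t | q, o_<t)
    """
    per_output = [sum(lps) / len(lps) for lps in batch_token_logprobs]
    return sum(per_output) / len(per_output)


if __name__ == "__main__":
    batch = [
        [math.log(0.5), math.log(0.5)],                 # a 2-token output
        [math.log(0.25), math.log(0.5), math.log(0.5)], # a 3-token output
    ]
    print(sft_objective(batch))
```

The 1/|o| factor normalizes by output length, so long and short outputs contribute on the same scale. Maximizing this quantity raises the likelihood of the reference outputs token by token.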
