DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao 1,2∗†, Peiyi Wang 1,3∗†, Qihao Zhu 1,3∗†, Runxin Xu 1, Junxiao Song 1, Mingchuan Zhang 1, Y.K. Li 1, Y. Wu 1, Daya Guo 1∗
1 DeepSeek-AI, 2 Tsinghua University, 3 Peking University
{zhihongshao,wangpeiyi,zhuqh,guoday}@deepseek.com
https://github.com/deepseek-ai/DeepSeek-Math

Abstract
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code...
[原文]ni-Ultra (Anil et al., 2023 ) are not publicly available, and the currently accessible open-source models considerably trail behind in performance. In this study, we introduce DeepSeekMath, a domain-specific language model that significantly outperforms the mathematical capabilities of open-source models and approaches the performance level of GPT-4 on academic benchmarks. To achieve this, we create the DeepSeekMath Corpus, a large-scale high-quality pre-training corpus comprising 120B math tokens. This dataset is extracted from the Common Crawl (CC) using a fastText-based classifier (Joulin e...
[原文]hmarks (Suzgun et al., 2022 ) , indicating it does not only enhance the model’s mathematical abilities but also amplifies general reasoning capabilities. After pre-training, we apply mathematical instruction tuning to DeepSeekMath-Base with chain-of-thought (Wei et al., 2022 ) , program-of-thought (Chen et al., 2022 ; Gao et al., 2023 ) , and tool-integrated reasoning (Gou et al., 2023 ) data. The resulting model DeepSeekMath-Instruct 7B beats all 7B counterparts and is comparable with 70B open-source instruction-tuned models. Furthermore, we introduce the Group Relative Policy Optimization (G...
[原文]math pre-training, along with the exploration and analysis of reinforcement learning. Math Pre-Training at Scale • Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes. By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a ) and 9 times the size of the recently ...
[原文]hermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process. • We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative reinforcement learning, and so on to deeply investigate the essential elements of this paradigm. • Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize several potential directions to achiev...
[原文]es. With mathematical instruction tuning and reinforcement learning, the resulting DeepSeekMath-Instruct and DeepSeekMath-RL demonstrate strong performance, obtaining an accuracy of over 50% on the competition-level MATH dataset for the first time within the open-source community. • Formal Mathematics : We evaluate DeepSeekMath-Base using the informal-to-formal theorem proving task from (Jiang et al., 2022 ) on miniF2F (Zheng et al., 2021 ) with Isabelle (Wenzel et al., 2008 ) chosen to be the proof assistant. DeepSeekMath-Base demonstrates strong few-shot autoformalization performance. • Natu...
r initial seed corpus. Using this corpus, we train a fastText model (Joulin et al., 2016) to recall more OpenWebMath-like mathematical web pages. Specifically, we randomly select 500,000 data points from the seed corpus as positive training examples and another 500,000 web pages from Common Crawl as negative ones. We employ an open-source library (https://fasttext.cc) for training, configuring the vector dimension to 256, learning rate to 0.1, the maximum length of word n-gram to 3, the minimum number of word occurrences to 3, and the number of training epochs to 3. To reduce the size of ...
[原文]low.net/questions ). Web pages linked to these URLs, yet uncollected, will be added to the seed corpus. This approach enables us to gather more positive examples, thereby training an improved fastText model capable of recalling more mathematical data in the subsequent iteration. After four iterations of data collection, we end up with 35.5M mathematical web pages, totaling 120B tokens. In the fourth iteration, we notice that nearly 98% of the data has already been collected in the third
[原文]Large language models (LLM) have revolutionized the approach to mathematical reasoning in artificial intelligence, spurring significant advancements in both the quantitative reasoning benchmark (Hendrycks et al., 2021 ) and the geometry reasoning benchmark (Trinh et al., 2024 ) . Moreover, these models have proven instrumental in assisting humans in solving complex mathematical problems (Tao, 2023 ) . However, cutting-edge models such as GPT-4 (OpenAI, 2023 ) and Gemini-Ultra (Anil et al., 2023 ) are not publicly available, and the currently accessible open-source models considerably trail beh...
[原文]eve that our experience in mathematical data processing is a starting point for the research community, and there is significant room for improvement in the future. DeepSeekMath-Base is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024 ) , as we notice that starting from a code training model is a better choice compared to a general LLM. Furthermore, we observe the math training also improves model capability on MMLU (Hendrycks et al., 2020 ) and BBH benchmarks (Suzgun et al., 2022 ) , indicating it does not only enhance the model’s mathematical abilities but also amplifies genera...
[原文]ied RL techniques. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative RL and so on, to deeply investigate the essential elements of this paradigm. At last, we explain why our RL boosts the performance of instruction-tuned models, and further summarize potential directions to achieve more effective RL based on this unified paradigm. 1.1 Contributions Our contribution includes scalable math pre-training, along with the exploration and analysis of reinforcement learning. Math Pre-Training at Scale • Our research ...
[原文]t Learning • We introduce Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources compared to Proximal Policy Optimization (PPO). • We demonstrate that GRPO significantly enhances the performance of our instruction-tuned model DeepSeekMath-Instruct, by solely using the instruction-tuning data. Furthermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process. • We provide a unif...
[原文]surpasses all open-source base models (e.g., Mistral 7B (Jiang et al., 2023 ) and Llemma-34B (Azerbayev et al., 2023 ) ), regardless of whether they’ve undergone math pre-training or not, often by a significant margin. Notably, DeepSeekMath-Base is superior on Chinese benchmarks, likely because we don’t follow previous works (Lewkowycz et al., 2022a ; Azerbayev et al., 2023 ) to collect English-only math pre-training data, and also include high-quality non-English ones. With mathematical instruction tuning and reinforcement learning, the resulting DeepSeekMath-Instruct and DeepSeekMath-RL demo...
[原文]Our contribution includes scalable math pre-training, along with the exploration and analysis of reinforcement learning. Math Pre-Training at Scale • Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes. By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a ) an...
1.1 Contributions
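instruction-tuning data. Furthermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process.
• We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online vs. offline training, outcome vs. process supervision, single-turn vs. iterative reinforcement learning, and so on, to deeply investigate the essential elements of this paradigm.
• Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize several potential directions to achieve more effective reinforcement learning of LLMs.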
[原文]g the instruction-tuning data. Furthermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process. • We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative reinforcement learning, and so on to deeply investigate the essential elements of this paradigm. • Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize sev...
[原文]• English and Chinese Mathematical Reasoning : We conduct comprehensive assessments of our models on English and Chinese benchmarks, covering mathematical problems from grade-school level to college level. English benchmarks include GSM8K (Cobbe et al., 2021 ) , MATH (Hendrycks et al., 2021 ) , SAT (Azerbayev et al., 2023 ) , OCW Courses (Lewkowycz et al., 2022a ) , MMLU-STEM (Hendrycks et al., 2020 ) . Chinese benchmarks include MGSM-zh (Shi et al., 2023 ) , CMATH (Wei et al., 2023 ) , Gaokao-MathCloze (Zhong et al., 2023 ) , and Gaokao-MathQA (Zhong et al., 2023 ) . We evaluate models’ abili...
[原文]dels’ general understanding, reasoning, and coding capabilities, we evaluate DeepSeekMath-Base on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020 ) which encompasses 57 multiple-choice tasks covering diverse subjects, BIG-Bench Hard (BBH) (Suzgun et al., 2022 ) which consists of 23 challenging tasks that mostly require multi-step reasoning to solve, as well as HumanEval (Chen et al., 2021 ) and MBPP (Austin et al., 2021 ) which are widely used to evaluate code language models. Math pre-training benefits both language understanding and reasoning performance...
2 Math Pre-Training
[原文]is trained on a set of positive examples that lacks sufficient diversity. We therefore identify additional mathematical web sources to enrich the seed corpus, so that we can optimize the fastText model. Specifically, we first organize the entire Common Crawl into disjoint domains; a domain is defined as web pages sharing the same base URL. For each domain, we calculate the percentage of web pages that are collected in the first iteration. Domains where over 10% of the web pages have been collected are classified as math-related (e.g., mathoverflow.net ). Subsequently, we manually annotate the ...
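The domain-level discovery step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the host part of the URL stands in for the "base URL", and the inputs are toy lists. Only the 10% threshold comes from the text.

```python
from collections import Counter
from urllib.parse import urlparse


def base_url(url: str) -> str:
    """Assumption: the network location (host) plays the role of the base URL."""
    return urlparse(url).netloc.lower()


def math_related_domains(all_urls, collected_urls, threshold=0.10):
    """Return domains where more than `threshold` of the pages were collected."""
    total = Counter(base_url(u) for u in all_urls)
    hits = Counter(base_url(u) for u in collected_urls)
    # Counter returns 0 for domains with no collected pages.
    return {domain for domain, n in total.items() if hits[domain] / n > threshold}


# Toy usage: 1 of 2 mathoverflow.net pages was collected, 0 of 20 example.com pages.
all_pages = [
    "https://mathoverflow.net/questions/1",
    "https://mathoverflow.net/questions/2",
] + [f"https://example.com/page/{i}" for i in range(20)]
collected = ["https://mathoverflow.net/questions/1"]
candidate_domains = math_related_domains(all_pages, collected)
```

Domains flagged this way would still go through the manual URL annotation the text describes; the sketch covers only the automatic thresholding.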
[原文]DeepSeekMath Corpus is compared with the recently released math-training corpora: • MathPile (Wang et al., 2023c ) : a multi-source corpus (8.9B tokens) aggregated from textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, and arXiv, with the majority (over 85%) sourced from arXiv; • OpenWebMath (Paster et al., 2023 ) : CommonCrawl data filtered for mathematical content, totaling 13.6B tokens; • Proof-Pile-2 (Azerbayev et al., 2023 ) : a mathematical corpus consisting of OpenWebMath, AlgebraicStack (10.3B tokens of mathematical code), and arXiv papers (28.0B tokens). When experimenting ...
[原文]a plateau. 2.3 Training and Evaluating DeepSeekMath-Base 7B In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024 ) and trained for 500B tokens. The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% is Github code, and the remaining 10% is natural language data from Common Crawl in both English and Chinese. We mainly adopt the training setting specified in Section 2.2.1 , except ...
[原文]al., 2023 ) which underwent math training on Proof-Pile-2 (Azerbayev et al., 2023 ) ). Notably, on the competition-level MATH dataset, DeepSeekMath-Base surpasses existing open-source base models by over 10% absolute, and outperforms Minerva 540B (Lewkowycz et al., 2022a ) , a closed-source base model 77 times larger which builds on PaLM (Lewkowycz et al., 2022b ) and is further trained on mathematical texts. Model Size English Benchmarks Chinese Benchmarks GSM8K MATH OCW SAT MMLU STEM CMATH Gaokao MathCloze Gaokao MathQA Closed-Source Base Model Minerva 7B 16.2% 14.1% 7.7% - 35.6% - - - Miner...
[原文]Base 7B 66.9% 31.4% 25.8% 24.6% Table 3: Few-shot evaluation of base models’ ability to solve mathematical problems using tools and the ability to conduct informal-to-formal theorem proving in Isabelle. Formal Mathematics Formal proof automation is beneficial to ensure the accuracy and reliability of mathematical proofs and enhance efficiency, with increasing attention in recent years. We evaluate DeepSeekMath-Base 7B on the task of informal-to-formal proving from (Jiang et al., 2022 ) which is to generate a formal proof based on an informal statement, a formal counterpart of the statement, an...
[原文]MBPP (Austin et al., 2021 ) . As shown in Table 4 , DeepSeekMath-Base 7B exhibits significant enhancements in performance on MMLU and BBH over its precursor, DeepSeek-Coder-Base-v1.5 (Guo et al., 2024 ) , illustrating the positive impact of math training on language understanding and reasoning. Additionally, by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks. Overall, DeepSeekMath-Base 7B significantly outperforms the general model Mistral 7B (Jiang et al., 2023 ) on the three reas...
2.1 Data Collection and Decontamination
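In this section, we outline the process of constructing the DeepSeekMath Corpus from Common Crawl. As depicted in Figure 2, we present an iterative pipeline that demonstrates how to systematically gather a large-scale mathematical corpus from Common Crawl, starting with a seed corpus (e.g., a small but high-quality collection of math-related data). It is worth noting that this approach is also applicable to other domains, such as coding.

Figure 2: An iterative pipeline that collects mathematical web pages from Common Crawl.

First, we choose OpenWebMath (Paster et al., 2023), a collection of high-quality mathematical web texts, as our initial seed corpus. Using this corpus, we train a fastText model (Joulin et al., 2016) to recall more OpenWebMath-like mathematical web pages. Specifically, we randomly select 500,000 data points from the seed corpus as positive training examples and another 500,000 web pages from Common Crawl as negative ones. We employ an open-source library (https://fasttext.cc) for training, configuring the vector dimension to 256, the learning rate to 0.1, the maximum length of word n-grams to 3, the minimum number of word occurrences to 3, and the number of training epochs to 3. To reduce the size of the original Common Crawl, we employ URL-based deduplication and near-deduplication techniques, resulting in 40B HTML web pages. We then recall mathematical web pages from the deduplicated Common Crawl with the fastText model. To filter out low-quality mathematical content, we rank the collected pages according to their scores predicted by the fastText model, and only preserve the top-ranking ones. The volume of data preserved is assessed through pre-training experiments on the top 40B, 80B, 120B, and 160B tokens. In the first iteration, we choose to keep the top 40B tokens.

After the first iteration of data collection, numerous mathematical web pages remain uncollected, mainly because the fastText model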
[原文]In this section, we will outline the process of constructing the DeepSeekMath Corpus from Common Crawl. As depicted in Figure 2 , we present an iterative pipeline that demonstrates how to systematically gather a large-scale mathematical corpus from Common Crawl, starting with a seed corpus (e.g., a small but high-quality collection of math-related dataset). It’s worth noting that this approach is also applicable to other domains, such as coding. Figure 2: An iterative pipeline that collects mathematical web pages from Common Crawl. First, we choose OpenWebMath (Paster et al., 2023 ) , a collec...
[原文]that lacks sufficient diversity. We therefore identify additional mathematical web sources to enrich the seed corpus, so that we can optimize the fastText model. Specifically, we first organize the entire Common Crawl into disjoint domains; a domain is defined as web pages sharing the same base URL. For each domain, we calculate the percentage of web pages that are collected in the first iteration. Domains where over 10% of the web pages have been collected are classified as math-related (e.g., mathoverflow.net ). Subsequently, we manually annotate the URLs associated with mathematical content...
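For the fastText classifier used throughout this section, the quoted hyper-parameters map onto the argument names of the fastText Python bindings as sketched below. The label names (`__label__math`, `__label__other`), the helper functions, and the training-file path are hypothetical choices for illustration; the paper does not state how the training file was formatted or how scores were thresholded.

```python
# Hyper-parameters quoted in the text, keyed by fastText's Python argument names.
FASTTEXT_PARAMS = {
    "dim": 256,       # vector dimension
    "lr": 0.1,        # learning rate
    "wordNgrams": 3,  # maximum length of word n-gram
    "minCount": 3,    # minimum number of word occurrences
    "epoch": 3,       # number of training epochs
}


def to_fasttext_line(text: str, is_math: bool) -> str:
    """Format one example for supervised training (label prefix + single-line text)."""
    label = "__label__math" if is_math else "__label__other"  # hypothetical label names
    return f"{label} {' '.join(text.split())}"


def train_classifier(train_path: str):
    """Train the classifier; requires the `fasttext` package to be installed."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.train_supervised(input=train_path, **FASTTEXT_PARAMS)


def rank_by_math_score(model, pages):
    """Sort pages by the predicted probability of the (hypothetical) math label."""

    def score(page: str) -> float:
        labels, probs = model.predict(" ".join(page.split()), k=2)
        return dict(zip(labels, probs)).get("__label__math", 0.0)

    return sorted(pages, key=score, reverse=True)
```

Ranking by score and keeping only the top of the list mirrors the filtering step in the text; how many tokens to keep was decided there by pre-training experiments, not by a fixed score cutoff.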
2.2 Validating the Quality of the DeepSeekMath Corpus
[原文]We run pre-training experiments to investigate how the DeepSeekMath Corpus is compared with the recently released math-training corpora: • MathPile (Wang et al., 2023c ) : a multi-source corpus (8.9B tokens) aggregated from textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, and arXiv, with the majority (over 85%) sourced from arXiv; • OpenWebMath (Paster et al., 2023 ) : CommonCrawl data filtered for mathematical content, totaling 13.6B tokens; • Proof-Pile-2 (Azerbayev et al., 2023 ) : a mathematical corpus consisting of OpenWebMath, AlgebraicStack (10.3B tokens of mathematical code...
[原文]In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024 ) and trained for 500B tokens. The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% is Github code, and the remaining 10% is natural language data from Common Crawl in both English and Chinese. We mainly adopt the training setting specified in Section 2.2.1 , except that we set the maximum value of the learning rate to 4.2e-4...
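As a quick worked example of the data distribution above, the per-source token budgets follow directly from the stated percentages and the 500B-token total. The dictionary keys and the function name are illustrative only.

```python
TOTAL_TOKENS = 500e9  # 500B training tokens, as stated in the text

# Shares quoted in the text; the key strings are just labels for this sketch.
MIXTURE = {
    "DeepSeekMath Corpus": 0.56,
    "AlgebraicStack": 0.04,
    "arXiv": 0.10,
    "GitHub code": 0.20,
    "Common Crawl natural language (English and Chinese)": 0.10,
}


def tokens_per_source(total: float, mixture: dict) -> dict:
    """Split a total token budget according to mixture shares that sum to 1."""
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture shares must sum to 1")
    return {name: share * total for name, share in mixture.items()}


budget = tokens_per_source(TOTAL_TOKENS, MIXTURE)
for name, tokens in budget.items():
    print(f"{name}: {tokens / 1e9:.0f}B tokens")
```

Under these shares the DeepSeekMath Corpus accounts for roughly 280B of the 500B training tokens.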
zerbayev et al., 2023)). Notably, on the competition-level MATH dataset, DeepSeekMath-Base surpasses existing open-source base models by over 10% absolute, and outperforms Minerva 540B (Lewkowycz et al., 2022a), a closed-source base model 77 times larger which builds on PaLM (Lewkowycz et al., 2022b) and is further trained on mathematical texts.

English benchmarks: GSM8K, MATH, OCW, SAT, MMLU STEM. Chinese benchmarks: CMATH, Gaokao MathCloze, Gaokao MathQA.

| Model | Size | GSM8K | MATH | OCW | SAT | MMLU STEM | CMATH | Gaokao MathCloze | Gaokao MathQA |
|---|---|---|---|---|---|---|---|---|---|
| Closed-Source Base Model | | | | | | | | | |
| Minerva | 7B | 16.2% | 14.1% | 7.7% | - | 35.6% | - | - | - |
| Minerva | 62B | 52.4% | 27.6% | 12.0% | - | 53.9% | - | - | - |

Minerva 540B 58.8% 33...
[原文]of base models’ ability to solve mathematical problems using tools and the ability to conduct informal-to-formal theorem proving in Isabelle. Formal Mathematics Formal proof automation is beneficial to ensure the accuracy and reliability of mathematical proofs and enhance efficiency, with increasing attention in recent years. We evaluate DeepSeekMath-Base 7B on the task of informal-to-formal proving from (Jiang et al., 2022 ) which is to generate a formal proof based on an informal statement, a formal counterpart of the statement, and an informal proof. We evaluate on miniF2F (Zheng et al., 20...
[原文]Math-Base 7B exhibits significant enhancements in performance on MMLU and BBH over its precursor, DeepSeek-Coder-Base-v1.5 (Guo et al., 2024 ) , illustrating the positive impact of math training on language understanding and reasoning. Additionally, by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks. Overall, DeepSeekMath-Base 7B significantly outperforms the general model Mistral 7B (Jiang et al., 2023 ) on the three reasoning and coding benchmarks.
ompanies including (5) Baichuan-3 (https://www.baichuan-ai.com), (6) the latest GLM-4 (https://open.bigmodel.cn/dev/api#glm-4) from the GLM family (Du et al., 2022). These models are for general purposes, most of which have undergone a series of alignment procedures.
• Open-source models include: general models like (1) DeepSeek-LLM-Chat 67B (DeepSeek-AI, 2024), (2) Qwen 72B (Bai et al., 2023), (3) SeaLLM-v2 7B (Nguyen et al., 2023), and (4) ChatGLM3 6B (ChatGLM3 Team, 2023), as well as models with enhancements in mathematics including (5) InternLM2-Math 20B (https://gi...
[原文]B approaches an accuracy of 60% on MATH, surpassing all existing open-source models. On the other benchmarks, our model is competitive with DeepSeek-LLM-Chat 67B, the prior state-of-the-art that is 10 times larger.
[原文]We construct a mathematical instruction-tuning dataset covering English and Chinese problems from different mathematical fields and of varying complexity levels: problems are paired with solutions in chain-of-thought (CoT) (Wei et al., 2022 ) , program-of-thought (PoT) (Chen et al., 2022 ; Gao et al., 2023 ) , and tool-integrated reasoning format (Gou et al., 2023 ) . The total number of training examples is 776K. • English mathematical datasets : We annotate GSM8K and MATH problems with tool-integrated solutions, and adopt a subset of MathInstruct (Yue et al., 2023 ) along with the training s...
3.2 Training and Evaluating DeepSeekMath-Instruct 7B
[原文]In this section, we introduce DeepSeekMath-Instruct 7B which undergoes mathematical instruction tuning based on DeepSeekMath-Base. Training examples are randomly concatenated until reaching a maximum context length of 4K tokens. We train the model for 500 steps with a batch size of 256 and a constant learning rate of 5e-5. We evaluate models’ mathematical performance both without and with tool use, on 4 quantitative reasoning benchmarks in English and Chinese. We benchmark our model against the leading models of the time: • Closed-source models include: (1) the GPT family among which GPT-4 (Op...
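The concatenation of training examples up to the context limit can be sketched as a simple greedy packing routine. This is a sketch under assumptions: "4K tokens" is read as 4096, over-long examples are truncated, and examples are never split across sequences. The paper does not spell out these details, and the function name is hypothetical.

```python
import random

MAX_LEN = 4096  # assumption: "4K tokens" means 4096


def pack_examples(examples, max_len=MAX_LEN, seed=0):
    """Shuffle tokenized examples and concatenate them into sequences of at most max_len tokens."""
    order = list(examples)
    random.Random(seed).shuffle(order)
    packed, current = [], []
    for tokens in order:
        tokens = tokens[:max_len]  # assumption: truncate over-long examples
        if current and len(current) + len(tokens) > max_len:
            packed.append(current)  # close the current sequence and start a new one
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed


# Upper bound on tokens processed with the stated schedule (500 steps, batch size 256).
STEPS, BATCH_SIZE = 500, 256
max_tokens_seen = STEPS * BATCH_SIZE * MAX_LEN

toy_examples = [[1] * 3000, [2] * 2000, [3] * 1000]
toy_packed = pack_examples(toy_examples)
```

With fully packed sequences, the stated schedule processes at most 500 × 256 × 4096 ≈ 524M tokens in total.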
olved instructions) and PPO training with training problems primarily sourced from GSM8K and MATH, (8) MetaMath 70B (Yu et al., 2023) which is Llama-2 70B fine-tuned on an augmented version of GSM8K and MATH, (9) ToRA 34B (Gou et al., 2023) which is CodeLlama 34B fine-tuned to do tool-integrated mathematical reasoning, (10) MAmmoTH 70B (Yue et al., 2023) which is Llama-2 70B instruction-tuned on MathInstruct.

English benchmarks: GSM8K, MATH. Chinese benchmarks: MGSM-zh, CMATH.

| Model | Size | GSM8K | MATH | MGSM-zh | CMATH |
|---|---|---|---|---|---|
| Chain-of-Thought Reasoning | | | | | |
| Closed-Source Model | | | | | |
| Gemini Ultra | - | 94.4% | 53.2% | - | - |
| GPT-4 | - | 92.0% | 52.9% | - | 86.0% |

Infl...
[原文]TH, it improves over DeepSeekMath-Instruct 7B on all benchmarks. As shown in Table 5 , under the evaluation setting where tool use is disallowed, DeepSeekMath-Instruct 7B demonstrates strong performance of step-by-step reasoning. Notably, on the competition-level MATH dataset, our model surpasses all open-source models and the majority of proprietary models (e.g., Inflection-2 and Gemini Pro) by at least 9% absolute. This is true even for models that are substantially larger (e.g., Qwen 72B) or have been specifically enhanced through math-focused reinforcement learning (e.g., WizardMath-v1.1 7...
$\pi_{\theta_{old}}$, respectively. $\varepsilon$ is a clipping-related hyper-parameter introduced in PPO for stabilizing training. $A_{t}$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based on the rewards $\{r_{\geq t}\}$ and a learned value function $V_{\psi}$. Thus, in PPO, a value function needs to be trained alongside the policy model and to mitigate over-optimization of the reward model, the standard approach is to a...
Figure 4, we propose Group Relative Policy Optimization (GRPO), which obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question $q$, GRPO samples a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the following objective:...
stion. Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of $\hat{A}_{i,t}$. And different from the KL penalty term used in (2), we estimate the KL divergence with the following unbiased estimator (Schulman, 2020): $\mathbb{D}_{KL}\left[\pi_{\theta} \,\|\, \pi_{ref}\right] = \pi_{ref}(o_{i,t} \mid q, o_{i,\ldots}) \cdots$
i}
9: Compute $\hat{A}_{i,t}$ for the $t$-th token of $o_{i}$ through group relative advantage estimation.
10: for GRPO iteration = 1, …, $\mu$ do
11: Update the policy model $\pi_{\theta}$ by maximizing the GRPO objective (Equation 21)
12: Update $r_{\varphi}$ through continuous training using a replay mechanism.
Output $\pi_{\theta}$
4.1.2 Outcome Supervision RL with GRPO
Formally, for each question $q$, a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$...
$\{o_{1},o_{2},\cdots,o_{G}\}$, a process reward model is used to score each step of the outputs, yielding corresponding rewards: $\mathbf{R}=\{\{r_{1}^{index(1)},\cdots,r_{1}^{index(K_{1})}\},\cdots,\{r_{G}^{index(1)},\cdots,r_{G}^{index(K_{G})}\}\}$...
[原文]del and continually train the old reward model using a replay mechanism that incorporates 10% of historical data. Then, we set the reference model as the policy model, and continually train the policy model with the new reward model. 4.2 Training and Evaluating DeepSeekMath-RL We conduct RL based on DeepSeekMath-Instruct 7B. The training data of RL are chain-of-thought-format questions related to GSM8K and MATH from the SFT data, which consists of around 144K questions. We exclude other SFT questions to investigate the impact of RL on benchmarks that lack data throughout the RL phase. We const...
$\pi_{\theta_{old}}$, respectively. $\varepsilon$ is a clipping-related hyper-parameter introduced in PPO for stabilizing training. $A_{t}$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based on the rewards $\{r_{\geq t}\}$ and a learned value function $V_{\psi}$. Thus, in PPO, a value function needs to be trained alongside the policy model and to mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a referen...
licy Optimization (GRPO), which obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question $q$, GRPO samples a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the following objective: $\mathcal{J}_{GRPO}(\theta) = \mathbb{E}[q \sim P \ldots$
g KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of $\hat{A}_{i,t}$. And different from the KL penalty term used in (2), we estimate the KL divergence with the following unbiased estimator (Schulman, 2020): $\mathbb{D}_{KL}\left[\pi_{\theta} \,\|\, \pi_{ref}\right] = \pi_{ref}(o_{i,t} \mid q, o_{i,\ldots}) \cdots$
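The estimator is cut off in the text above. The sketch below assumes it is the estimator of the form $r - \log r - 1$ with $r = \pi_{ref}/\pi_{\theta}$ from the cited note (Schulman, 2020). That is an assumption about the truncated equation, and the function names and toy distributions are illustrative. The code checks numerically that, averaged over tokens drawn from $\pi_{\theta}$, this quantity equals the exact KL divergence.

```python
import math


def kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-token estimate r - log(r) - 1 with r = pi_ref / pi_theta (assumed form)."""
    log_r = logp_ref - logp_theta
    return math.exp(log_r) - log_r - 1.0


def exact_kl(p_theta, p_ref):
    """Exact KL[pi_theta || pi_ref] for two categorical distributions."""
    return sum(pt * math.log(pt / pr) for pt, pr in zip(p_theta, p_ref))


def expected_estimate(p_theta, p_ref):
    """Expectation of the per-token estimate when tokens are sampled from pi_theta."""
    return sum(
        pt * kl_estimate(math.log(pt), math.log(pr))
        for pt, pr in zip(p_theta, p_ref)
    )


# Toy next-token distributions over a 3-token vocabulary.
p_theta = [0.5, 0.3, 0.2]
p_ref = [0.4, 0.4, 0.2]
gap = abs(expected_estimate(p_theta, p_ref) - exact_kl(p_theta, p_ref))
```

Each per-token value is non-negative and is zero when the two policies assign the same probability to the sampled token, which keeps the penalty well behaved when averaged over a few samples.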
$\hat{A}_{i,t}$ for the $t$-th token of $o_{i}$ through group relative advantage estimation.
10: for GRPO iteration = 1, …, $\mu$ do
11: Update the policy model $\pi_{\theta}$ by maximizing the GRPO objective (Equation 21)
12: Update $r_{\varphi}$ through continuous training using a replay mechanism.
Output $\pi_{\theta}$
4.1.2 Outcome Supervision RL with GRPO
Formally, for each question $q$, a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$ are sampled fro...
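Group relative advantage estimation under outcome supervision can be sketched as follows. The text states that the group's average reward serves as the baseline. Dividing by the group's standard deviation, using the population standard deviation, and adding a small `eps` guard are assumptions of this sketch, as are the function names.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one reward per sampled output against its own group."""
    baseline = mean(rewards)  # the group average replaces a learned value function
    scale = pstdev(rewards)   # assumption: population standard deviation
    return [(r - baseline) / (scale + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """With outcome supervision, every token of output i shares output i's advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Toy group of G = 4 sampled outputs for one question, scored 1 (correct) or 0 (wrong).
group_rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(group_rewards)
token_adv = broadcast_to_tokens(adv, lengths=[3, 2, 4, 1])
```

Because the baseline is computed from the group itself, no critic model has to be trained or held in memory, which is the resource saving the text attributes to GRPO.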
$\{o_{1},o_{2},\cdots,o_{G}\}$, a process reward model is used to score each step of the outputs, yielding corresponding rewards: $\mathbf{R}=\{\{r_{1}^{index(1)},\cdots,r_{1}^{index(K_{1})}\},\cdots,\{r_{G}^{index(1)},\cdots,r_{G}^{index(K_{G})}\}\}$, where i ...
[原文]d model using a replay mechanism that incorporates 10% of historical data. Then, we set the reference model as the policy model, and continually train the policy model with the new reward model.
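The replay mechanism for continually training the reward model might look like the sketch below. The text only says that the replay "incorporates 10% of historical data". Reading this as "historical examples make up 10% of each mixed training set" is an assumption, and the function name and toy data are hypothetical.

```python
import random


def build_replay_mix(new_data, historical_data, replay_fraction=0.10, seed=0):
    """Mix new reward-model data with replayed history (assumed: history = 10% of the mix)."""
    rng = random.Random(seed)
    # Number of historical examples so they form `replay_fraction` of the mixed set.
    n_hist = round(len(new_data) * replay_fraction / (1.0 - replay_fraction))
    n_hist = min(n_hist, len(historical_data))
    mixed = list(new_data) + rng.sample(list(historical_data), n_hist)
    rng.shuffle(mixed)
    return mixed


# Toy usage: integers stand in for training examples.
new_examples = list(range(900))
old_examples = list(range(1000, 2000))
mixed_set = build_replay_mix(new_examples, old_examples)
```

Replaying a slice of older data is a common way to limit forgetting while the reward model is updated on samples from the newest policy.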
[原文]We conduct RL based on DeepSeekMath-Instruct 7B. The training data of RL are chain-of-thought-format questions related to GSM8K and MATH from the SFT data, which consists of around 144K questions. We exclude other SFT questions to investigate the impact of RL on benchmarks that lack data throughout the RL phase. We construct the training set of reward models following (Wang et al., 2023b ) . We train our initial reward model based on the DeepSeekMath-Base 7B with a learning rate of 2e-5. For GRPO, we set the learning rate of the policy model as 1e-6. The KL coefficient is 0.04. For each questi...
[原文]In this section, we will share our findings in pre-training and RL experiments. 5.1 Lessons Learnt in Pre-Training We first share our experience in pre-training. Unless otherwise specified, we will adhere to the training settings outlined in Section 2.2.1 . It is worth noting that, when referring to the DeepSeekMath Corpus in this section, we use an 89B-token dataset from the second iteration of the data collection process. 5.1.1 Code Training Benefits Mathematical Reasoning A popular yet unverified hypothesis suggests that code training improves reasoning. We attempt to offer a partial respon...
5 Discussion
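| Training Setting | Training Tokens: General | Training Tokens: Code | Training Tokens: Math | … | … | … | … | …H+Python |
|---|---|---|---|---|---|---|---|---|
| No Continual Training | – | – | – | 2.9% | 3.0% | 12.3% | 2.7% | 2.3% |
| Two-Stage Training | | | | | | | | |
| Stage 1: General Training | 400B | – | – | 2.9% | 3.2% | 14.8% | 3.3% | 2.3% |
| Stage 2: Math Training | – | – | 150B | 19.1% | 14.4% | 37.2% | 14.3% | 6.7% |
| Stage 1: Code Training | – | 400B | – | 5.9% | 3.6% | 19.9% | 12.4% | 10.0% |
| Stage 2: Math Training | – | – | 150B | 21.9% | 15.3% | 39.7% | 17.4% | 9.4% |
| One-Stage Training | | | | | | | | |
| Math Training | – | – | 150B | 20.5% | 13.1% | 37.6% | 11.4% | 6.5% |
| Code & Math Mixed Training | – | 400B | 150B | 17.6% | 12.1% | 36.3% | 19.7% | 13.5% |

Table 6: Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1.3B, and evaluate its mathematical reasoning performance without and with tool use via few-shot chain-of-thought prompting and few-shot program-of-thought prompting, respectively.

| Training Setting | Training Tokens: General | Training Tokens: Code | Training Tokens: Math | MMLU | BBH | HumanEval (Pass@1) | MBPP (Pass@1) |
|---|---|---|---|---|---|---|---|
| No Continual Training | – | – | – | 24.5% | 28.1% | 12.2% | 13.0% |
| Two-Stage Training | | | | | | | |
| Stage 1: General Training | 400B | – | – | 25.9% | 27.7% | 15.2% | 13.6% |
| Stage 2: Math Training | – | – | 150B | 33.1% | 32.7% | 12.8% | 13.2% |
| Stage 1: Code Training | – | 400B | – | 25.0% | 31.5% | 25.0% | 40.0% |
| Stage 2: Math Training | – | – | 150B | 36.2% | 35.3% | 12.2% | 17.0% |
| One-Stage Training | | | | | | | |
| Math Training | – | – | 150B | 32.3% | 32.5% | 11.6% | 13.2% |
| Code & Math Mixed Training | – | 400B | 150B | 33.5% | 35.6% | 29.3% | 39.4% |

Table 7: Investigation of how different settings of code and math training affect model performance on language understanding, reasoning, and coding. We experiment with DeepSeek-LLM 1.3B. We evaluate the models on MMLU and BBH using few-shot chain-of-thought prompting. On HumanEval and MBPP, we conduct zero-shot and few-shot evaluations, respectively.

Results. Tables 6 and 7 show the downstream performance under different training settings. Code training benefits program-aided mathematical reasoning, under both the two-stage and one-stage training settings. As shown in Table 6, under the two-stage training setting, code training alone already significantly enhances the ability to solve GSM8K and MATH problems using Python.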
[原文]H+Python No Continual Training – – – 2.9% 3.0% 12.3% 2.7% 2.3% Two-Stage Training Stage 1: General Training 400B – – 2.9% 3.2% 14.8% 3.3% 2.3% Stage 2: Math Training – – 150B 19.1% 14.4% 37.2% 14.3% 6.7% Stage 1: Code Training – 400B – 5.9% 3.6% 19.9% 12.4% 10.0% Stage 2: Math Training – – 150B 21.9% 15.3% 39.7% 17.4% 9.4% One-Stage Training Math Training – – 150B 20.5% 13.1% 37.6% 11.4% 6.5% Code & Math Mixed Training – 400B 150B 17.6% 12.1% 36.3% 19.7% 13.5% Table 6: Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1....
[原文]ility to solve GSM8K and MATH problems using Python. Math training in the second stage yields further improvements. Interestingly, under the one-stage training setting, mixing code tokens and math tokens effectively mitigates the issue of catastrophic forgetting that arises from two-stage training, and also synergizes coding (Table 7 ) and program-aided mathematical reasoning (Table 6 ). Code training also improves mathematical reasoning without tool use. Under the two-stage training setting, the initial stage of code training already results in moderate enhancements. It also boosts the effici...
[原文]uded as a component of math pre-training data (Lewkowycz et al., 2022a ; Polu and Sutskever, 2020 ; Azerbayev et al., 2023 ; Wang et al., 2023c ) . However, detailed analysis regarding their impact on mathematical reasoning has not been extensively conducted. Perhaps counter-intuitively, according to our experiments, arXiv papers seem ineffective in improving mathematical reasoning. We experiment with models of different sizes, including DeepSeek-LLM 1.3B and DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024 ) , using arXiv corpora that underwent varied processing pipelines: • MathPile (Wang et al...
[原文]s required, which we leave for future studies. 5.2 Insights of Reinforcement Learning 5.2.1 Towards a Unified Paradigm In this section, we provide a unified paradigm to analyze different training methods, such as SFT, RFT, DPO, PPO, and GRPO, and further conduct experiments to explore the factors of the unified paradigm. Generally, the gradient with respect to the parameter $\theta$ of a training method can be written as: $\nabla_{\theta}\mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}[\underbrace{(q,o)\sim\mathcal{D}}_{\text{Data Source}}]\Big(\frac{1}{|o|}\sum_{t=1}^{|o|}\underbrace{GC_{\mathcal{A}}(q,o,t,\pi_{rf})}_{\text{Gradient...}}$
[原文]FT model, using pair-wise DPO loss. • Online Rejection Sampling Fine-tuning (Online RFT) : Different from RFT, Online RFT initiates the policy model using the SFT model and refines it by fine-tuning with the augmented outputs sampled from the real-time policy model. • PPO/GRPO : PPO/GRPO initializes the policy model using the SFT model and reinforces it with the outputs sampled from the real-time policy model. We summarize the components of these methods in Table 10 . Please refer to Appendix A.1 for a more detailed derivation process. Figure 5: Performance of the DeepSeekMath-Instruct 1.3B mo...
[原文]’ in our experiments. Rule refers to judging the quality of a response based on the correctness of the answer, and Model denotes that we train a reward model to score each response. The training data of the reward model is based on the rule judgment. Equations 10 and 21 highlight a key difference between GRPO and Online RFT: GRPO uniquely adjusts its gradient coefficient based on the reward value provided by the reward model. This allows for differential reinforcement and penalization of responses according to their varying magnitudes. In contrast, Online RFT lacks this feature; it does not pe...
[原文]buted to boosting the correct response from TopK rather than the enhancement of fundamental capabilities. Similarly, (Wang et al., 2023a ) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Yuan et al., 2023b ; Song et al., 2023 ; Wang et al., 2023a ) . 5.2.3 How to Achieve More Effective RL? We demonstrate RL works pretty well in mathematical reasoning tasks. We also provide a unified paradigm to understand different representative training methods....
[原文]al probability of a certain token. However, it is impossible to ensure the reward signal is always reliable, especially in extremely complex tasks. For example, even the PRM800K dataset (Lightman et al., 2023), which has been carefully annotated by well-trained annotators, still contains approximately 20% incorrect annotations (see https://github.com/openai/prm800k/issues/12#issuecomment-1728491852). To this end, we will explore reinforcement learning algorithms that are robust against noisy reward signals. We believe such weak-to-strong (Burns et al., 2023) alignment methods will ...
5.1 Lessons Learnt in Pre-Training
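We first share our experience in pre-training. Unless otherwise specified, we adhere to the training settings outlined in Section 2.2.1. Note that, when referring to the DeepSeekMath Corpus in this section, we use an 89B-token dataset from the second iteration of the data collection process.

5.1.1 Code Training Benefits Mathematical Reasoning

A popular yet unverified hypothesis suggests that code training improves reasoning. We attempt to offer a partial response to this, particularly within the mathematical domain: code training improves models' ability to do mathematical reasoning both with and without tool use.

To study how code training affects mathematical reasoning, we experimented with the following two-stage and one-stage training settings:

Two-Stage Training
• Code Training for 400B Tokens → Math Training for 150B Tokens: We train DeepSeek-LLM 1.3B for 400B code tokens followed by 150B math tokens;
• General Training for 400B Tokens → Math Training for 150B Tokens: As a control experiment, the first stage uses general tokens (sampled from a large-scale general corpus created by DeepSeek-AI) instead of code tokens, in order to investigate the advantage of code tokens over general tokens in improving mathematical reasoning.

One-Stage Training
• Math Training for 150B Tokens: We train DeepSeek-LLM 1.3B for 150B math tokens;
• Training on a Mixture of 400B Code Tokens and 150B Math Tokens: Math training following code training degrades coding performance. We investigate whether code tokens, when mixed with math tokens for one-stage training, would still improve mathematical reasoning and also alleviate catastrophic forgetting.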
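As a rough illustration of the one-stage mixed setting, the sketch below computes per-source sampling probabilities when 400B code tokens and 150B math tokens are combined into a single run. It assumes that sources are sampled in proportion to their token budgets, which the text does not state, and the function name `mixture_weights` is hypothetical.

```python
def mixture_weights(token_budgets):
    """Return sampling probabilities proportional to each source's token budget.

    Assumption: proportional sampling. The paper does not specify the
    actual mixing schedule, so this is only an illustration.
    """
    total = sum(token_budgets.values())
    if total <= 0:
        raise ValueError("token budgets must sum to a positive number")
    return {name: count / total for name, count in token_budgets.items()}


# Token budgets (in billions) for the one-stage mixed setting.
budgets_b = {"code": 400, "math": 150}
weights = mixture_weights(budgets_b)
for name, weight in weights.items():
    print(f"{name}: {weight:.1%}")
```

Under this assumption, roughly 73% of the sampled tokens would be code and 27% math, so math tokens are a minority of each batch in the mixed setting.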
[原文]We first share our experience in pre-training. Unless otherwise specified, we will adhere to the training settings outlined in Section 2.2.1 . It is worth noting that, when referring to the DeepSeekMath Corpus in this section, we use an 89B-token dataset from the second iteration of the data collection process. 5.1.1 Code Training Benefits Mathematical Reasoning A popular yet unverified hypothesis suggests that code training improves reasoning. We attempt to offer a partial response to this, particularly within the mathematical domain: code training improves models’ ability to do mathematical ...
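| Training Setting | General Tokens | Code Tokens | Math Tokens | GSM8K (w/o tool use) | MATH (w/o tool use) | CMATH (w/o tool use) | GSM8K+Python (w/ tool use) | MATH+Python (w/ tool use) |
|---|---|---|---|---|---|---|---|---|
| No Continual Training | – | – | – | 2.9% | 3.0% | 12.3% | 2.7% | 2.3% |
| Two-Stage Training | | | | | | | | |
| Stage 1: General Training | 400B | – | – | 2.9% | 3.2% | 14.8% | 3.3% | 2.3% |
| Stage 2: Math Training | – | – | 150B | 19.1% | 14.4% | 37.2% | 14.3% | 6.7% |
| Stage 1: Code Training | – | 400B | – | 5.9% | 3.6% | 19.9% | 12.4% | 10.0% |
| Stage 2: Math Training | – | – | 150B | 21.9% | 15.3% | 39.7% | 17.4% | 9.4% |
| One-Stage Training | | | | | | | | |
| Math Training | – | – | 150B | 20.5% | 13.1% | 37.6% | 11.4% | 6.5% |
| Code & Math Mixed Training | – | 400B | 150B | 17.6% | 12.1% | 36.3% | 19.7% | 13.5% |

Table 6: Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1.3B and evaluate its mathematical reasoning performance without and with tool use via few-shot chain-of-thought prompting and few-shot program-of-thought prompting, respectively.

| Training Setting | General Tokens | Code Tokens | Math Tokens | MMLU | BBH | HumanEval (Pass@1) | MBPP (Pass@1) |
|---|---|---|---|---|---|---|---|
| No Continual Training | – | – | – | 24.5% | 28.1% | 12.2% | 13.0% |
| Two-Stage Training | | | | | | | |
| Stage 1: General Training | 400B | – | – | 25.9% | 27.7% | 15.2% | 13.6% |
| Stage 2: Math Training | – | – | 150B | 33.1% | 32.7% | 12.8% | 13.2% |
| Stage 1: Code Training | – | 400B | – | 25.0% | 31.5% | 25.0% | 40.0% |
| Stage 2: Math Training | – | – | 150B | 36.2% | 35.3% | 12.2% | 17.0% |
| One-Stage Training | | | | | | | |
| Math Training | – | – | 150B | 32.3% | 32.5% | 11.6% | 13.2% |
| Code & Math Mixed Training | – | 400B | 150B | 33.5% | 35.6% | 29.3% | 39.4% |

Table 7: Investigation of how different settings of code and math training affect language understanding, reasoning, and coding performance. We experiment with DeepSeek-LLM 1.3B. We evaluate the models on MMLU and BBH using few-shot chain-of-thought prompting. On HumanEval and MBPP, we conduct zero-shot and few-shot evaluations, respectively.

Results. Tables 6 and 7 show the downstream performance under different training settings. Code training benefits program-aided mathematical reasoning under both the two-stage and one-stage training settings. As shown in Table 6, under the two-stage training setting, code training alone already significantly enhances the ability to solve GSM8K and MATH problems using Python. Math training in the second stage yields further improvements.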
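To make the two-stage comparison concrete, the snippet below recomputes the gap between the code-first and general-first pipelines from the Table 6 numbers. The accuracies are copied from the table; the dictionary keys and the helper `delta` are illustrative names, not part of the original work.

```python
# Table 6 accuracies (%) for DeepSeek-LLM 1.3B, copied from the table above.
BENCHMARKS = ["GSM8K", "MATH", "CMATH", "GSM8K+Python", "MATH+Python"]
TABLE6 = {
    "general_stage1": [2.9, 3.2, 14.8, 3.3, 2.3],
    "general_then_math": [19.1, 14.4, 37.2, 14.3, 6.7],
    "code_stage1": [5.9, 3.6, 19.9, 12.4, 10.0],
    "code_then_math": [21.9, 15.3, 39.7, 17.4, 9.4],
}


def delta(row_a, row_b):
    """Per-benchmark difference (row_a minus row_b) in percentage points."""
    return {
        name: round(a - b, 1)
        for name, a, b in zip(BENCHMARKS, TABLE6[row_a], TABLE6[row_b])
    }


# Gain of code training over general training after stage 1 only.
stage1_gain = delta("code_stage1", "general_stage1")
# Gain that remains after both pipelines finish stage 2 math training.
final_gain = delta("code_then_math", "general_then_math")

print("after stage 1:", stage1_gain)
print("after stage 2:", final_gain)
```

After stage 1 the advantage of code training is largest on the tool-use benchmarks (+9.1 points on GSM8K+Python and +7.7 on MATH+Python), and the code-first pipeline stays ahead on all five benchmarks after stage 2.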
[原文]– 2.9% 3.2% 14.8% 3.3% 2.3% Stage 2: Math Training – – 150B 19.1% 14.4% 37.2% 14.3% 6.7% Stage 1: Code Training – 400B – 5.9% 3.6% 19.9% 12.4% 10.0% Stage 2: Math Training – – 150B 21.9% 15.3% 39.7% 17.4% 9.4% One-Stage Training Math Training – – 150B 20.5% 13.1% 37.6% 11.4% 6.5% Code & Math Mixed Training – 400B 150B 17.6% 12.1% 36.3% 19.7% 13.5% Table 6: Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1.3B, and evaluate its mathematical reasoning performance without and with tool use via few-shot chain-of-thought pro...
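Interestingly, under the one-stage training setting, mixing code tokens and math tokens effectively mitigates the catastrophic forgetting that arises from two-stage training, and also synergizes coding (Table 7) and program-aided mathematical reasoning (Table 6).

Code training also improves mathematical reasoning without tool use. Under the two-stage training setting, the initial stage of code training already results in moderate enhancements. It also boosts the efficiency of the subsequent math training, eventually leading to the best performance. However, combining code tokens and math tokens for one-stage training compromises mathematical reasoning without tool use. One conjecture is that DeepSeek-LLM 1.3B, due to its limited scale, lacks the capacity to fully assimilate both code and mathematical data simultaneously.

| Model | Size | ArXiv Corpus | GSM8K | MATH | OCW | SAT | MMLU STEM | CMATH | Gaokao MathCloze | Gaokao MathQA |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-LLM | 1.3B | No Math Training | 2.9% | 3.0% | 2.9% | 15.6% | 19.5% | 12.3% | 0.8% | 17.9% |
| DeepSeek-LLM | 1.3B | MathPile | 2.7% | 3.3% | 2.2% | 12.5% | 15.7% | 1.2% | 0.0% | 2.8% |
| DeepSeek-LLM | 1.3B | ArXiv-RedPajama | 3.3% | 3.4% | 4.0% | 9.4% | 9.0% | 7.4% | 0.8% | 2.3% |
| DeepSeek-Coder-Base-v1.5 | 7B | No Math Training | 29.0% | 12.5% | 6.6% | 40.6% | 38.1% | 45.9% | 5.9% | 21.1% |
| DeepSeek-Coder-Base-v1.5 | 7B | MathPile | 23.6% | 11.5% | 7.0% | 46.9% | 35.8% | 37.9% | 4.2% | 25.6% |
| DeepSeek-Coder-Base-v1.5 | 7B | ArXiv-RedPajama | 28.1% | 11.1% | 7.7% | 50.0% | 35.2% | 42.6% | 7.6% | 24.8% |

Table 8: Effect of math training on different arXiv datasets. GSM8K, MATH, OCW, SAT, and MMLU STEM are English benchmarks; CMATH, Gaokao MathCloze, and Gaokao MathQA are Chinese benchmarks. Model performance is evaluated with few-shot chain-of-thought prompting.

| ArXiv Corpus | miniF2F-valid | miniF2F-test |
|---|---|---|
| No Math Training | 20.1% | 21.7% |
| MathPile | 16.8% | 16.4% |
| ArXiv-RedPajama | 14.8% | 11.9% |

Table 9: Effect of math training on different arXiv corpora, with DeepSeek-Coder-Base-v1.5 7B as the base model. We evaluate informal-to-formal proving in Isabelle.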
[原文]Interestingly, under the one-stage training setting, mixing code tokens and math tokens effectively mitigates the issue of catastrophic forgetting that arises from two-stage training, and also synergizes coding (Table 7 ) and program-aided mathematical reasoning (Table 6 ). Code training also improves mathematical reasoning without tool use. Under the two-stage training setting, the initial stage of code training already results in moderate enhancements. It also boosts the efficiency of the subsequent math training, eventually leading to the best performance. However, combining code tokens and...
[原文]., 2023 ; Wang et al., 2023c ) . However, detailed analysis regarding their impact on mathematical reasoning has not been extensively conducted. Perhaps counter-intuitively, according to our experiments, arXiv papers seem ineffective in improving mathematical reasoning. We experiment with models of different sizes, including DeepSeek-LLM 1.3B and DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024 ) , using arXiv corpora that underwent varied processing pipelines: • MathPile (Wang et al., 2023c ) : an 8.9B-token corpus developed with cleaning and filtering heuristic rules, over 85% of which are scie...
[原文]T) : Different from RFT, Online RFT initiates the policy model using the SFT model and refines it by fine-tuning with the augmented outputs sampled from the real-time policy model. • PPO/GRPO : PPO/GRPO initializes the policy model using the SFT model and reinforces it with the outputs sampled from the real-time policy model. We summarize the components of these methods in Table 10 . Please refer to Appendix A.1 for a more detailed derivation process. Figure 5: Performance of the DeepSeekMath-Instruct 1.3B model, which was further trained using various methods, on two benchmarks. Figure 6: Per...
[原文]rrectness of the answer, and Model denotes that we train a reward model to score each response. The training data of the reward model is based on the rule judgment. Equations 10 and 21 highlight a key difference between GRPO and Online RFT: GRPO uniquely adjusts its gradient coefficient based on the reward value provided by the reward model. This allows for differential reinforcement and penalization of responses according to their varying magnitudes. In contrast, Online RFT lacks this feature; it does not penalize incorrect responses and uniformly reinforces all responses with correct answers...
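The contrast between Online RFT and GRPO described above can be sketched on a single group of sampled responses. This is a minimal illustration under stated assumptions, not the authors' implementation: the GRPO coefficient is taken to be the reward normalized by the group mean and population standard deviation, the KL-related term is omitted, and the correctness threshold, function names, and reward values are all hypothetical.

```python
from statistics import mean, pstdev


def online_rft_coefficients(rewards, threshold=0.5):
    """Online RFT style: 1.0 for responses judged correct, 0.0 otherwise.

    Incorrect responses get no penalty, and every correct response is
    reinforced equally. The threshold is an illustrative assumption.
    """
    return [1.0 if reward > threshold else 0.0 for reward in rewards]


def grpo_coefficients(rewards, eps=1e-8):
    """GRPO style (assumed form): rewards normalized within the group.

    Responses above the group mean get a positive coefficient and those
    below it a negative one, scaled by how far they are from the mean.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Hypothetical reward-model scores for four responses to one question.
group_rewards = [0.9, 0.6, 0.2, 0.1]
rft = online_rft_coefficients(group_rewards)
grpo = grpo_coefficients(group_rewards)
print("Online RFT:", rft)
print("GRPO:      ", [round(c, 3) for c in grpo])
```

With these toy scores, Online RFT treats the two best responses identically and ignores the other two, whereas the group-normalized coefficient ranks all four and pushes the two weakest responses down.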
[原文]ental capabilities. Similarly, (Wang et al., 2023a ) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Yuan et al., 2023b ; Song et al., 2023 ; Wang et al., 2023a ) . 5.2.3 How to Achieve More Effective RL? We demonstrate RL works pretty well in mathematical reasoning tasks. We also provide a unified paradigm to understand different representative training methods. Within this paradigm, all methods are conceptualized as either direct or simplified R...
[原文]al is always reliable, especially in extremely complex tasks. For example, even the PRM800K dataset (Lightman et al., 2023), which has been carefully annotated by well-trained annotators, still contains approximately 20% incorrect annotations (see https://github.com/openai/prm800k/issues/12#issuecomment-1728491852). To this end, we will explore reinforcement learning algorithms that are robust against noisy reward signals. We believe such weak-to-strong (Burns et al., 2023) alignment methods will bring a fundamental change to the learning algorithms. Reward Function Reward function...
[原文]We present DeepSeekMath, which outperforms all open-source models on the competition-level MATH benchmark and approaches the performance of closed models. DeepSeekMath is initialized with DeepSeek-Coder-v1.5 7B and undergoes continual training for 500B tokens, with a significant component of the training data being 120B math tokens sourced from Common Crawl. Our extensive ablation study shows that web pages offer significant potential for high-quality mathematical data, while arXiv may not be as beneficial as we expected. We introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Po...
[原文]A.1 Analysis of Reinforcement Learning We provide the detailed derivation of the data source and gradient coefficient (algorithm and reward function) across various methods, including SFT, RFT, Online RFT, DPO, PPO, and GRPO. A.1.1 Supervised Fine-tuning The objective of Supervised Fine-tuning is maximizing the following objective: $\mathcal{J}_{SFT}(\theta) = \mathbb{E}[q, o \sim P_{sft}(Q, O)]\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\log \pi_{\theta}(o_t \mid q, o_{<t})\right)$
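As a small numeric illustration of the SFT objective above, the sketch below evaluates the length-normalized log-likelihood on a toy batch. It assumes the expectation is approximated by a plain average over the batch and that the per-token probabilities $\pi_{\theta}(o_t \mid q, o_{<t})$ are already available as numbers; the function name `sft_objective` and the probabilities are made up for the example.

```python
import math


def sft_objective(token_probs_per_sample):
    """Batch estimate of the SFT objective.

    Each inner list holds the model's probability for every token of one
    reference output. The per-sample term is the mean token log-probability
    (the 1/|o| sum), and the batch mean stands in for the expectation.
    """
    if not token_probs_per_sample:
        raise ValueError("batch must contain at least one sample")
    per_sample = []
    for token_probs in token_probs_per_sample:
        if not token_probs:
            raise ValueError("each output needs at least one token")
        log_likelihood = sum(math.log(p) for p in token_probs)
        per_sample.append(log_likelihood / len(token_probs))
    return sum(per_sample) / len(per_sample)


# Toy batch: a 2-token output and a 4-token output.
batch = [[0.5, 0.25], [1.0, 0.5, 0.5, 1.0]]
value = sft_objective(batch)
print(f"J_SFT estimate: {value:.4f}")
```

Because each sample is divided by its own length, the 4-token output does not dominate the 2-token one; maximizing this quantity pushes every token probability toward 1.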