
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning


arXiv: 2501.12948 · 2025-01-22 · PDF

Summary

The DeepSeek-R1 reasoning models incentivize reasoning capability in LLMs through reinforcement learning (RL) and excel in math, code, and science. The paper introduces a training pipeline in which a small amount of cold-start supervised fine-tuning is followed by large-scale RL that substantially deepens and broadens the model's reasoning. R1 matches or surpasses OpenAI-o1-1217 on benchmarks such as AIME, MATH-500, and GPQA, marking a milestone for open-source reasoning models.

DeepSeek-AI (research@deepseek.com). Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors, including self-verification, reflection, and the generation of long chains of thought (CoT).
Original: DeepSeek-AI research@deepseek.com Abstract We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which…

To achieve general reasoning, we experimented with various approaches, including process-based reward models, reinforcement learning, and search algorithms such as Monte Carlo Tree Search and Beam Search. However, none of these methods achieved general reasoning performance comparable to OpenAI's o1 series. Our key insight is that, through large-scale pure reinforcement learning, a model can develop reasoning capability naturally, without relying on human-annotated chain-of-thought data.
Original: …rious approaches, including process-based reward models (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Xin et al., 2024; Trinh et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models. In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to…

After fine-tuning with the new supervised data, the checkpoint undergoes an additional RL process that takes prompts from all scenarios into account. After these steps, we obtain a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217. We further explore distillation from DeepSeek-R1 into smaller dense models. Using Qwen2.5-32B as the base model, direct distillation from DeepSeek-R1 outperforms applying RL to it directly.
Original: …fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217. We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the d…

Key findings:
- The reasoning patterns of larger models can be distilled into smaller models, yielding better performance than the reasoning patterns discovered through RL on small models.
- The open-sourced DeepSeek-R1 and its API will enable the research community to distill better small models.
- Fine-tuned on the generated reasoning data, small models also acquire substantial reasoning capability.
Original: …els Can Be Powerful Too • We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future. • Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well…

On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability for fact-based queries; a similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark. DeepSeek-R1 also excels across a wide range of other tasks, including creative writing, general question answering, editing, and summarization, achieving an 87.6% length-controlled win rate on AlpacaEval 2.0 and a 92.3% win rate on ArenaHard.
Original: …er closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses 4o on this benchmark. • Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. It achieves an impressive length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard, showcasing its strong ability to intelli…

In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by some exciting results, hoping to provide the community with valuable insights. 2.2.1 Reinforcement Learning Algorithm: Group Relative Policy Optimization (GRPO). To save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO), which forgoes the critic model that is typically the same size as the policy model and instead estimates the baseline from group scores.
Original: …y supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with valuable insights. 2.2.1 Reinforcement Learning Algorithm: Group Relative Policy Optimization. In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. Sp…

Advantage computation: $A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})}$, where $r_i$ is the reward of the $i$-th output in the group. The advantage is thus estimated by standardizing rewards within each group, so no separate value model is needed.
Original: …and $A_i$ is the advantage, computed using a group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:
$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}. \tag{3}$$
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags…
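
A minimal sketch of this group-relative advantage computation in Python; the `eps` guard against zero variance is an assumption for numerical safety, not part of Eq. (3):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of scalar rewards into advantages (Eq. 3).

    Each output's advantage is its reward standardized against the group
    mean and standard deviation, so no critic/value model is required.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)  # eps avoids division by zero

# Example: G = 4 sampled responses to one prompt
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 1. -1. -1.  1.]
```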

1 Introduction

In recent years, large language models (LLMs) have been undergoing rapid iteration and evolution, progressively narrowing the gap to artificial general intelligence (AGI). Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align the model with social values, and adapt to user preferences, all while requiring relatively little compute compared with pre-training.
Original: In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (OpenAI, 2024a; Anthropic, 2024; Google, 2024), progressively diminishing the gap towards Artificial General Intelligence (AGI). Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources against pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series…

After thousands of RL steps, DeepSeek-R1-Zero exhibits superb performance on reasoning benchmarks. For instance, its pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting the score further improves to 86.7%, matching OpenAI-o1-0912. However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline.
Original: …thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912. However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands o…



1.1 Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model
- We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero.
Original: Post-Training: Large-Scale Reinforcement Learning on the Base Model • We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized…

…performance comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on the Qwen2.5 and Llama3 series to the community.
Original: …rable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

1.2 Summary of Evaluation Results

Reasoning tasks: (1) DeepSeek-R1 achieves 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500 it attains an impressive 97.3%, on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert-level performance in code competitions, achieving a 2,029 Elo rating on Codeforces and outperforming 96.3% of human participants.
Original: • Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces, outperforming 96.3% of human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real-world tasks. • Knowled…

2 Approach

2.1 Overview
Previous work has relied heavily on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without supervised fine-tuning (SFT) as a cold start. Performance can be enhanced further by including a small amount of cold-start data.
Original: 2.1 Overview Previous work has heavily relied on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and (2) DeepSeek-R1, which applies RL starting from a…

GRPO objective: maximize $\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[\min\left(\text{ratio} \cdot A_i,\ \mathrm{clip}(\text{ratio},\, 1-\varepsilon,\, 1+\varepsilon) \cdot A_i\right) - \beta\, \mathbb{D}_{KL}\right]$, where ratio $= \pi_\theta(o_i|q)/\pi_{\theta_{old}}(o_i|q)$ is the probability ratio between the new and old policies and $A_i$ is the advantage.
Original: …maximizing the following objective:
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)}\ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \mathrm{clip}\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\left( \pi_\theta \,\|\, \pi_{ref} \right) \right) \tag{1}$$
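
A simplified sequence-level sketch of this clipped objective in Python, assuming precomputed log-probabilities and a per-output KL estimate; the `eps` and `beta` defaults are placeholders, not the paper's values:

```python
import torch

def grpo_objective(logp_new, logp_old, advantages, kl_ref, eps=0.2, beta=0.04):
    """Clipped surrogate objective with a KL penalty, per Eq. (1).

    logp_new, logp_old: log-probs of each sampled output o_i under the
    current and old policies; advantages: group-normalized A_i (Eq. 3);
    kl_ref: KL estimate against the frozen reference policy (Eq. 2).
    """
    ratio = torch.exp(logp_new - logp_old)              # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # trust-region clip
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return (surrogate - beta * kl_ref).mean()           # maximize this value
```

Note that the paper applies the ratio and KL term at the token level; collapsing to one ratio per output keeps the sketch short.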

2.2.2 Reward Modeling
The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system consisting mainly of two types of rewards:
- Accuracy rewards: evaluate whether the response is correct.
- Format rewards: evaluate whether the output follows the required format.
Original: …during training. 2.2.2 Reward Modeling The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards: • Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compile…
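
A minimal sketch of such a rule-based reward in Python; the tag layout and boxed-answer convention follow the excerpts above, while the equal weighting of the two rewards is an assumption:

```python
import re

TAGGED = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if reasoning and answer are wrapped in the required tags."""
    return 1.0 if TAGGED.fullmatch(response.strip()) else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the final \\boxed{...} answer matches the ground truth."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def total_reward(response: str, gold: str) -> float:
    # Assumed combination; the paper does not state exact weights here.
    return accuracy_reward(response, gold) + format_reward(response)
```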

Table 2 compares DeepSeek-R1-Zero and the OpenAI o1 models on reasoning-related benchmarks. DeepSeek-R1-Zero reaches 71.0% pass@1 on AIME 2024, improving to 86.7% with majority voting, which matches OpenAI-o1-0912.
Original:
| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| OpenAI-o1-0912 | 74.4 | 83.3 | 94.8 | 77.3 | 63.4 | 1843 |
| DeepSeek-R1-Zero | 71.0 | 86.7 | 95.9 | 73.3 | 50.0 | 1444 |
Table 2: Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.
Figure 2: AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation. Figure 2 depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent enha…
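
A sketch of the two AIME metrics reported here, pass@1 averaged over sampled answers and majority-vote (cons@64) accuracy, with made-up inputs:

```python
from collections import Counter

def pass_at_1(answers, gold):
    """Mean correctness over k sampled answers (the paper samples 16)."""
    return sum(a == gold for a in answers) / len(answers)

def majority_vote_accuracy(answers, gold):
    """cons@k: 1.0 if the most frequent sampled answer is correct."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return float(top_answer == gold)

samples = ["42", "41", "42", "42"]
print(pass_at_1(samples, "42"))               # 0.75
print(majority_vote_accuracy(samples, "42"))  # 1.0
```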

DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time. Its self-evolution process is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously.
Original: …ocess. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time. Self-evolution Process of DeepSeek-R1-Zero: The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously. By initiating RL directly from the base model, we can closely monitor the model's progression without the influence of the supervised fine-tuning stage. This approach provides a clear view of how the model evolves over time, particularly in terms of its ability to handle complex reasoning tasks. As depicted…

During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how RL can lead to unexpected and sophisticated outcomes.
Original: …rsion of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes. This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply provide it wi…

The model exhibits a typical reflection process: reevaluating its steps one by one to identify the correct approach to the summation. This self-correction ability is a key property that emerges from RL training.
Original: …reevaluate this step-by-step to identify if the correct sum can be ⋯ We started with the equation: $\sqrt{a-\sqrt{a+x}}=x$. First, let's square both sides: $a-\sqrt{a+x}=x^{2} \implies \sqrt{a+x}=a-x^{2}$. Next, I could square both sides again, treating the equation: ⋯ Table 3: An interesting "aha moment" of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning. 2.3 DeepSeek-R1: Reinfor…

DeepSeek-R1-Zero's outputs are not well suited for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response.
Original: …not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results. • Potential: By carefully designing the pattern for cold-start data w…

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (supervised fine-tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains.
Original: …ted RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below. Reasoning data: We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous sta…
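
A minimal sketch of rejection sampling for SFT data collection; `sample_k` and `is_correct` are hypothetical stand-ins for the RL checkpoint's sampler and the correctness checker (rule-based, or DeepSeek-V3 as judge):

```python
def collect_sft_data(prompts, sample_k, is_correct, k=8):
    """Keep only correct generations from the RL checkpoint as SFT targets.

    sample_k(prompt, k) returns k candidate responses; is_correct(prompt,
    response) returns True for responses that pass the correctness check.
    """
    dataset = []
    for prompt in prompts:
        for response in sample_k(prompt, k):
            if is_correct(prompt, response):  # the rejection step
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```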

…the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined earlier.
Original: …model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and trainin…

…even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving exploration of the RL stage to the broader research community.
Original: …tage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community.


2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works. However, those works depended heavily on supervised data, which is time-consuming and expensive to gather.
Original: Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Wang et al., 2023; Shao et al., 2024). However, these works heavily depended on supervised data, which are time-intensive to gather. In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with…

KL divergence penalty: $\mathbb{D}_{KL}(\pi_\theta \,\|\, \pi_{ref}) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log\frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1$, where $\varepsilon$ and $\beta$ are hyperparameters.
Original: …
$$\mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log\frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1, \tag{2}$$
where $\varepsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed using a group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:
$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}. \tag{3}$$…
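
A short sketch of this KL estimator in Python, given per-token (or per-output) log-probabilities; the use of torch here is an assumption for illustration:

```python
import torch

def kl_to_reference(logp_theta, logp_ref):
    """Estimator from Eq. (2): r - log r - 1, with r = pi_ref / pi_theta.

    Unbiased for D_KL(pi_theta || pi_ref) under samples from pi_theta,
    and nonnegative for every sample, which keeps the penalty well behaved.
    """
    log_ratio = logp_ref - logp_theta  # log(pi_ref / pi_theta)
    return torch.exp(log_ratio) - log_ratio - 1.0
```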

…between '<think>' and '</think>' tags. We do not apply outcome or process neural reward models in developing DeepSeek-R1-Zero, because we find that a neural reward model may suffer from reward hacking during large-scale reinforcement learning.
Original: …between '<think>' and '</think>' tags. We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline. 2.2.3 Training Template: To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As depicted in Table 1, this template requires DeepSeek-R1-Zero…
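
A sketch of the Table 1 style training template as a Python string, reconstructed from the conversation text quoted earlier (in the Eq. 3 chunk); the exact wording should be checked against the paper's Table 1:

```python
# Template applied to every prompt; {question} is filled per training example.
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the "
    "answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively. "
    "User: {question} Assistant:"
)

print(TEMPLATE.format(question="What is 12 * 12?"))
```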

…comparable to OpenAI-o1-0912. This significant improvement highlights the efficacy of our RL algorithm in optimizing the model's performance over time. Table 2 provides a comparative analysis of DeepSeek-R1-Zero and OpenAI's o1-0912 model across a variety of reasoning-related benchmarks.
Original: …ble to OpenAI-o1-0912. This significant improvement highlights the efficacy of our RL algorithm in optimizing the model's performance over time. Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI's o1-0912 models across a variety of reasoning-related benchmarks. The findings reveal that RL empowers DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model's ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepS…

The model naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation, ranging from generating hundreds to thousands of reasoning tokens, which allows it to explore and refine its thought processes in greater depth.
Original: …he model. DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth. One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection, where the model revisits and reevaluates its previous steps, and the exploration of alternative approaches to proble…

…paving the way for more autonomous and adaptive models in the future. Drawbacks of DeepSeek-R1-Zero: Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several challenges, including poor readability and language mixing.
Original: …s, paving the way for more autonomous and adaptive models in the future. Drawback of DeepSeek-R1-Zero: Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data. Question: If $a>1$, then the sum of the real solutions of $\sqrt{a-\sqrt{a+x}}=x$…

2.3 DeepSeek-R1: Reinforcement Learning with Cold Start

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning performance be further improved, or convergence accelerated, by incorporating a small amount of high-quality data as a cold start? 2) How can we train a more user-friendly reasoning model?
Original: Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The pipeline consists of four stages, outlined as follows. 2.3.1 Cold Start: Unlike DeepSeek-R1-Zero, to prevent the early unstable cold star…

…better performance compared with DeepSeek-R1-Zero. We believe iterative training is a better way for reasoning models. 2.3.2 Reasoning-oriented Reinforcement Learning: After fine-tuning DeepSeek-V3-Base on the cold-start data, we apply the same large-scale RL training process as in DeepSeek-R1-Zero.
Original: …ter performance against DeepSeek-R1-Zero. We believe the iterative training is a better way for reasoning models. 2.3.2 Reasoning-oriented Reinforcement Learning: After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses on enhancing the model's reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions. During the training process, we observe that CoT…

…should be evaluated using rule-based rewards. However, in this stage we expand the dataset by incorporating additional data, some of which use a generative reward model: the ground truth and model predictions are fed into DeepSeek-V3 for judgment.
Original: …uld be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chains of thought with mixed languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning-related training samples. Non-Reasoning data: For non-…

…focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response.
Original: …cus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness a…

2.4 Distillation: Empower Small Models with Reasoning Capability

To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1's, we directly fine-tune open-source models such as Qwen and Llama using the 800k samples curated with DeepSeek-R1.
Original: To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in § 2.3.3. Our findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its reasoning capability is slightly better than…
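
A minimal sketch of this distillation-as-SFT recipe using Hugging Face transformers; the base model name and sequence length are illustrative, and `sample` stands for one of the 800k curated prompt/response pairs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# One of the student base models listed above; others follow the same recipe.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B")

def build_example(sample):
    """Tokenize prompt + teacher-generated trace for plain next-token SFT.

    Distillation here is ordinary supervised fine-tuning on DeepSeek-R1's
    outputs; no RL is applied to the distilled students in the paper.
    """
    text = sample["prompt"] + sample["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=4096)
```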

3 Experiment

Benchmarks: We evaluate models on MMLU, MMLU-Redux, MMLU-Pro, C-Eval, CMMLU, IFEval, FRAMES, GPQA Diamond, SimpleQA, C-SimpleQA, SWE-Bench Verified, Aider, LiveCodeBench, Codeforces, CNMO 2024, and more.
Original: Benchmarks: We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider (https://aider.chat), LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces (https://codeforces.com), Chinese National High School Mathematics Olympiad (CNMO 2024)…

Performance on LiveCodeBench is evaluated using CoT format, with data collected between August 2024 and January 2025. The Codeforces dataset is evaluated using problems from 10 Div.2 contests along with expert-crafted test cases.
Original: …rmance on LiveCodeBench is evaluated using CoT format, with data collected between August 2024 and January 2025. The Codeforces dataset is evaluated using problems from 10 Div.2 contests along with expert-crafted test cases, after which the expected ratings and percentages of competitors are calculated. SWE-Bench Verified results are obtained via the agentless framework (Xia et al., 2024). Aider-related benchmarks are measured using a "diff" format. DeepSeek-R1 outputs are capped at a maximum of 32,768 tokens for each benchmark. Baselines: We conduct comprehensive evaluations against several…

Table 4 compares DeepSeek-R1 with other representative models. DeepSeek-R1 scores 90.8% on MMLU and 84.0% on MMLU-Pro, significantly outperforming other open-source models.
Original:
| Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 | OpenAI-o1-mini | OpenAI-o1-1217 | DeepSeek-R1 |
| Architecture | - | - | MoE | - | - | MoE |
| # Activated Params | - | - | 37B | - | - | 37B |
| # Total Params | - | - | 671B | - | - | 671B |
| MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
| MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
| MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
| DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
| IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
| GPQA Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
| SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
| FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
| AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
| ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
Code …

DeepSeek-R1 performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily because it tends to refuse certain queries after safety RL. Without safety RL, DeepSeek-R1 could achieve an accuracy of over 70%.
Original: …Seek-R1 performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due to its tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1 could achieve an accuracy of over 70%. DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a model's ability to follow format instructions. These improvements can be linked to the inclusion of instruction-following data during the final stages of supervised fine-tuning (SFT) and RL training. Furthermore, remarkable performance is observed on AlpacaEval2.0 and ArenaHard, in…


3.1 DeepSeek-R1 Evaluation

Table 4, reproduced above, presents comparison results for DeepSeek-R1 against models including Claude-3.5-Sonnet-1022, GPT-4o-0513, DeepSeek-V3, OpenAI-o1-mini, and OpenAI-o1-1217.


3.2 Distilled Model Evaluation

Table 5 compares the distilled models with other comparable models. DeepSeek-R1-Distill-Qwen-32B reaches 72.6% on AIME 2024 and 94.3% on MATH-500, and DeepSeek-R1-Distill-Llama-70B reaches 70.0% and 94.5%, demonstrating the effectiveness of distillation.
Original:
| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| QwQ-32B-Preview | 50.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |
Table 5: Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks. As shown in Table 5, simply distilling DeepS…

4 Discussion

4.1 Distillation vs. Reinforcement Learning
Table 6 compares the distilled and RL-trained models: DeepSeek-R1-Distill-Qwen-32B reaches 72.6% on AIME 2024, far surpassing QwQ-32B-Preview's 50.0%, and distillation outperforms direct RL training of the small model across the reported benchmarks.
Original: 4.1 Distillation v.s. Reinforcement Learning
| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) |
| QwQ-32B-Preview | 50.0 | 60.0 | 90.6 | 54.5 | 41.9 |
| DeepSeek-R1-Zero-Qwen-32B | 47.0 | 60.0 | 91.6 | 55.0 | 40.2 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 |
Table 6: Comparison of distilled and RL models on reasoning-related benchmarks. In Section 3.2, we can see that by distilling DeepSeek-R1, the small model can achieve impressive results. However, there is still one question left: can the model achieve comparable performance through the large-scale RL training discussed…

Process reward models (PRMs) have three main limitations in solving reasoning tasks. First, it is challenging to explicitly define a fine-grained step in general reasoning.
Original: …el toward better approaches for solving reasoning tasks (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grain step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacki…

To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation, and training a fine-grained value model is inherently difficult.
Original: …address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-grained value model is inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo's core success relied on training a value model to progressively enhance its performance, this principle proves difficult to replicate in our setup due to the complexities of token generation. In conclusion, while MCT…


4.2 Unsuccessful Attempts

In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks. We share these failure experiences to provide insight, but this does not imply that the approaches are incapable of producing effective reasoning models. The process reward model (PRM) is one promising method that nevertheless faces challenges in practice.
Original: In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. We share our failure experiences here to provide insights, but this does not imply that these approaches are incapable of developing effective reasoning models. Process Reward Model (PRM): PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a f…

The iterative refinement process faces challenges when scaling up. First, unlike chess, where the search space is relatively well defined, token generation presents an exponentially larger search space.
Original: …lting question-answer pairs to train both the actor model and the value model, iteratively refining the process. However, this approach encounters several challenges when scaling up the training. First, unlike chess, where the search space is relatively well-defined, token generation presents an exponentially larger search space. To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-graine…

5 Conclusion, Limitations, and Future Work

In this work, we share our journey of enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning.
Original: In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks. We further explore distilling the reasoning capability to small dense models. We use DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune sev…

…evaluation times impact the efficiency of the RL process, so large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process.
Original: …aluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.

Appendix A Contributions and Acknowledgments

Core Contributors: Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao.
Original: Core Contributors: Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao. Contributors: Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo*, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Honghui Ding, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jingchang Chen, Jingyang Yuan, Jinhao Tu, Junjie Qiu, Junlong Li, J.L. Cai, Jiaqi Ni, Jian…

Within each role, authors are listed alphabetically by first name. Names marked with * denote individuals who have departed from our team.
Original: …n Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu*, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang. Within each role, authors are listed alphabetically by the first name. Names marked with * denote individuals who have departed from our team.