DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
General reasoning represents a long-standing and formidable challenge in artificial intelli-
gence. Recent breakthroughs, exemplified by large language models (LLMs) (Brown et al.,
2020; OpenAI, 2023) and chain-of-thought prompting (Wei et al., 2022b), have achieved con-
siderable success on foundational reasoning tasks. However, this success is heavily contingent
upon extensive human-annotated demonstrations, and models’ capabilities are still insuffi-
cient for more complex problems. Here we show that the reasoning abilities of LLMs can be
incentivized through pure reinforcement learning (RL), obviating the need for human-labe
strated that large language models can be effectively augmented through chain-of-
thought (CoT) prompting. This technique, which involves either providing carefully designed
few-shot examples or using minimalistic prompts such as “Let’s think step by step” (Kojima
et al., 2022; Wei et al., 2022b), enables models to produce intermediate reasoning steps, thereby
substantially enhancing their performance on complex tasks. Similarly, further performance
gains have been observed when models learn high-quality, multi-step reasoning trajectories
during the post-training phase (Chung et al., 2024; OpenAI, 2023). Despite their effectiveness,
these approaches exhibit notable limitations. Their dependence on human-annotated reasoning
traces hinders scalability and introduces cognitive biases. Furtherm
iors. In solving reasoning
problems, the model exhibits a tendency to generate longer responses, incorporating verification,
reflection, and the exploration of alternative approaches within each response. Although we
do not explicitly teach the model how to reason, it successfully learns improved reasoning
strategies through reinforcement learning.
Although DeepSeek-R1-Zero demonstrates excellent reasoning capabilities, it faces chal-
lenges such as poor readability and language mixing, occasionally combining English and
Chinese within a single chain-of-thought response. Furthermore, the rule-based RL training
stage of DeepSeek-R1-Zero is narrowly focused on reasoning tasks, resulting in limited per-
formance in broader areas such as writing and open-domain question answering. To address
t
y on
reinforcement learning without supervised fine-tuning. To facilitate large-scale RL efficiency, we
adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024).
2.1. Group Relative Policy Optimization
GRPO (Shao et al., 2024) is the reinforcement learning algorithm that we adopt to train DeepSeek-
R1-Zero and DeepSeek-R1. It was originally proposed to simplify the training process and
reduce the resource consumption of Proximal Policy Optimization (PPO) (Schulman et al., 2017),
which is widely used in the RL stage of LLMs (Ouyang et al., 2022).
For each question \(q\), GRPO samples a group of outputs \(\{o_1, o_2, \cdots, o_G\}\) from the old policy \(\pi_{\theta_{old}}\) and then optimizes the policy model \(\pi_\theta\) by maximizing the following objective:
\[
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\!\left[q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right]
\frac{1}{G}\sum_{i=1}^{G}\left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\; \mathrm{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\!\left( \pi_\theta \,\|\, \pi_{ref} \right) \right),
\]
where \(\varepsilon\) and \(\beta\) are hyper-parameters, and \(A_i\) is the group-relative advantage computed from the rewards \(\{r_1, r_2, \cdots, r_G\}\) of the outputs within the group:
\[
A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}.
\]
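As a concrete illustration of the group-relative formulation, the following is a minimal PyTorch sketch of the advantage normalization and clipped surrogate for a single question; the sequence-level log-probabilities, the 1e-8 stabilizer, the default clip value, and the optional KL term are illustrative assumptions rather than the exact training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: shape (G,), one scalar reward per sampled output for this question.
    # Group-relative advantage: normalize by the group's mean and standard deviation.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2, kl_coef=0.001, kl=None):
    # logp_new / logp_old: shape (G,), sequence log-probabilities under the current / old policy.
    adv = grpo_advantages(rewards)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    objective = torch.min(unclipped, clipped).mean()
    if kl is not None:
        # Optional per-output KL penalty against a frozen reference policy.
        objective = objective - kl_coef * kl.mean()
    return -objective  # minimize the negative of the maximization objective
```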
hibit a significant jump at the 8.2k step, with training continuing
for a total of 10,400 steps, corresponding to 1.6 training epochs. Each training step consists of
32 unique questions, resulting in a training batch size of 512. Every 400 steps, we replace the
reference model with the latest policy model. To accelerate training, each rollout generates 8,192
outputs, which are randomly split into 16 mini-batches and trained for only a single inner epoch.
Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning
question during training.
A conversation between User and Assistant. The user asks a question, and the Assistant solves
it. The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning
[Figure residue: training curves of DeepSeek-R1-Zero, showing accuracy (r1-zero-pass@1, r1-zero-cons@16, and a human-participants baseline) and the average length per response, both plotted against training steps.]
\(\mathrm{Reward} = \mathrm{Reward}_{\mathrm{accuracy}} + \mathrm{Reward}_{\mathrm{format}}\) (4)
The accuracy reward and the format reward are combined with the same weight. Notably, we
abstain from applying neural reward models—whether outcome-based or process-based—to
reasoning tasks. This decision is predicated on our observation that neural reward models are
susceptible to reward hacking during large-scale reinforcement learning. Moreover, retrain-
ing such models necessitates substantial computational resources and introduces additional
complexity into the training pipeline, thereby complicating the overall optimization process.
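As an illustration of such a purely rule-based signal, below is a minimal sketch that combines an accuracy reward and a format reward with equal weight; the <think>/<answer> tag names and the exact string-match accuracy check are simplifying assumptions, not the precise rules used in training.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows a <think>...</think><answer>...</answer> layout, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the extracted final answer matches the reference answer, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == reference.strip() else 0.0

def rule_based_reward(response: str, reference: str) -> float:
    # Accuracy and format rewards are combined with the same weight, as in Eq. (4).
    return accuracy_reward(response, reference) + format_reward(response)
```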
2.3. Incentivize Reasoning Capability in LLMs
Specifically, we apply the RL technique to the DeepSeek-V3 base model to train DeepSeek-R1-Zero
\( \left(\sqrt{a - \sqrt{a + x}}\right)^2 = x^2 \implies a - \sqrt{a + x} = x^2. \)
Rearrange to isolate the inner square root term:
\( (a - x^2)^2 = a + x \implies a^2 - 2ax^2 + (x^2)^2 = a + x \implies x^4 - 2ax^2 - x + (a^2 - a) = 0 \)
...
Wait, wait. Wait. That’s an aha moment I can flag here.
Let’s reevaluate this step-by-step to identify if the correct sum can be · · ·
We started with the equation:
\( \sqrt{a - \sqrt{a + x}} = x \)
First, let’s square both sides:
\( a - \sqrt{a + x} = x^2 \implies \sqrt{a + x} = a - x^2 \)
Next, I could square both sides again, treating the equation: · · ·
...
further improved, achieving an accuracy of 86.7%. This performance significantly surpasses the
average performance across all human competitors. Besides the math competitions, as shown
in Figure 10, DeepSeek-R1-Zero also achieves remarkable performance in coding competitions
and graduate-level biol
ntary C.2 for details). This moment marks a distinct change in reasoning
patterns and clearly shows the self-evolution process of DeepSeek-R1-Zero.
The self-evolution of DeepSeek-R1-Zero underscores the power and beauty of RL: rather
than explicitly teaching the model how to solve a problem, we simply provide it with the right
incentives, and it autonomously develops advanced problem-solving strategies. This serves as a
reminder of the potential of RL to unlock higher levels of capabilities in LLMs, paving the way
for more autonomous and adaptive models in the future.
[Figure 2 residue: the multi-stage training pipeline, showing successive RL and SFT stages applied starting from DeepSeek-V3 Base and DeepSeek-V3.]
ility, and language mixing, as
DeepSeek-V3-Base is trained on multiple languages, especially English and Chinese. To address
these issues, we develop DeepSeek-R1, whose pipeline is illustrated in Figure 2.
In the initial stage, we collect thousands of cold-start samples that exhibit a conversational, human-aligned thinking process. RL training is then applied to improve model performance while maintaining the conversational thinking process and language consistency. Subsequently, we
apply rejection sampling and SFT once more. This stage incorporates both reasoning and non-
reasoning datasets into the SFT process, enabling the model to not only excel in reasoning tasks
but also demonstrate advanced writing capabilities. To further align the model with human
preferences, we implement a secondary RL s
generate preference
pairs by prompting DeepSeek-V3 using the arena-hard prompt format, listed in Supplementary
B.2, where each pair consists of a user query along with two candidate responses. For each
preference pair, we query DeepSeek-V3 four times, randomly assigning the responses as either
Response A or Response B to mitigate positional bias. The final preference score is determined by
averaging the four independent judgments, retaining only those pairs where the score difference
(Δ) exceeds 1 to ensure meaningful distinctions. Additionally, to minimize length-related biases,
we ensure that the chosen and rejected responses of the whole dataset have comparable lengths.
In total, we curated 66,000 data pairs for training the reward model. The prompts used in
this dataset are all non-rea
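A minimal sketch of this pair-selection logic is given below; the judge interface, the per-response scoring convention, and the handling of the score-difference threshold are illustrative assumptions about how such a filter could be implemented.

```python
import random
from statistics import mean

def averaged_preference(query, resp_a, resp_b, judge, n_trials=4):
    """Score a candidate pair with an LLM judge, averaging over n_trials
    randomized orderings to mitigate positional bias. The judge is assumed to
    return one score per presented response."""
    deltas = []
    for _ in range(n_trials):
        if random.random() < 0.5:
            score_a, score_b = judge(query, resp_a, resp_b)
        else:
            score_b, score_a = judge(query, resp_b, resp_a)
        deltas.append(score_a - score_b)
    return mean(deltas)  # positive values mean resp_a is preferred

def keep_pair(query, resp_a, resp_b, judge, min_delta=1.0):
    # Retain only pairs whose averaged score difference exceeds the threshold.
    return abs(averaged_preference(query, resp_a, resp_b, judge)) > min_delta
```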
For general queries, each instance is categorized as belonging to either the safety dataset or the
helpfulness dataset. The general reward, 𝑅𝑒𝑤𝑎𝑟𝑑𝐺𝑒𝑛𝑒𝑟𝑎𝑙 , assigned to each query corresponds to
the respective reward defined within the associated dataset.
3.2. Training Details
3.2.1. Training Details of the First RL Stage
In the first stage of RL, we set the learning rate to 3e-6, the KL coefficient to 0.001, the GRPO clip
ratio 𝜀 to 10, and the sampling temperature to 1 for rollout. For each question, we sample 16
outputs with a maximum length of 32,768. Each training step consists of 32 unique questions,
resulting in a training batch size of 512 per step. Every 400 steps, we replace the reference model
with the latest policy model. To accelerate training, each rollout generates 8,192
n DeepSeek-R1-Zero,
which employs rule-based rewards to guide learning in mathematical, coding, and logical
reasoning domains. During the training process, we observe that CoT often exhibits language
mixing, particularly when RL prompts involve multiple languages. For general data, we utilize
reward models to guide training. Ultimately, the integration of reward signals with diverse data
distributions enables us to develop a model that not only excels in reasoning but also prioritizes
helpfulness and harmlessness. Given a batch of data, the reward can be formulated as
\(\mathrm{Reward} = \mathrm{Reward}_{\mathrm{reasoning}} + \mathrm{Reward}_{\mathrm{general}} + \mathrm{Reward}_{\mathrm{language}}\) (8)
where \(\mathrm{Reward}_{\mathrm{reasoning}} = \mathrm{Reward}_{\mathrm{rule}}\) (9)
\(\mathrm{Reward}_{\mathrm{general}} = \mathrm{Reward}_{\mathrm{reward\_model}} + \mathrm{Reward}_{\mathrm{format}}\)
ional Mathematics Examination 2024 (AIME 2024) (MAA, 2024).
The details of these benchmarks are listed in Supplementary D.
Table 3 summarizes the performance of DeepSeek-R1 across multiple developmental stages,
as outlined in Figure 2. A comparison between DeepSeek-R1-Zero and DeepSeek-R1 Dev1
reveals substantial improvements in instruction-following, as evidenced by higher scores on
the IF-Eval and ArenaHard benchmarks. However, due to the limited size of the cold-start
dataset, Dev1 exhibits a partial degradation in reasoning performance compared to DeepSeek-
R1-Zero, most notably on the AIME benchmark. In contrast, DeepSeek-R1 Dev2 demonstrates
Table 3 | Experimental results at each stage of DeepSeek-R1. Numbers in bold denote performance that is statistically significant (t-test with \(p < 0.01\)).
94.2 | 95.9 | 95.4 | 97.3
CNMO 2024 (Pass@1) | 88.1 | 58.0 | 73.9 | 77.3 | 78.8
CLUEWSC (EM) | 93.1 | 92.8 | 92.6 | 91.6 | 92.8
Chinese: C-Eval (EM) | 92.8 | 85.7 | 91.9 | 86.4 | 91.8
C-SimpleQA (Correct) | 66.4 | 58.8 | 64.2 | 66.9 | 63.7
marked performance enhancements on benchmarks that require advanced reasoning skills,
including those focused on code generation, mathematical problem solving, and STEM-related
tasks. Benchmarks targeting general-purpose tasks, such as AlpacaEval 2.0, show marginal im-
provement. These results suggest that reasoning-oriented RL considerably enhances reasoning
capabilities while exerting limited influence on user preference-oriented benchmarks.
DeepSeek-R1 Dev3 integrates both reasoning and non
plementary F shows that the strong reasoning capability can be
transferred to smaller models.
5. Ethics and Safety Statement
With the advancement in the reasoning capabilities of DeepSeek-R1, we deeply recognize
the potential ethical risks. For example, R1 can be subject to jailbreak attacks, leading to the
generation of dangerous content such as explosive manufacturing plans, while the enhanced
reasoning capabilities enable the model to provide plans with better operational feasibility
and executability. In addition, a publicly released model is also vulnerable to further fine-tuning that could compromise its inherent safety protections.
In Supplementary D.3, we present a comprehensive safety report from multiple perspectives,
including performance on open-source and in-house safety evaluation benchmarks,
low:
Structured Output and Tool Use: Currently, the structured output capabilities of DeepSeek-R1 remain suboptimal compared with existing models. Moreover, DeepSeek-R1 cannot leverage tools, such as search engines and calculators, to improve the quality of its outputs. However, as it is not difficult to build an RL environment for structured output and tool use, we believe these issues will be addressed in the next version.
Token efficiency: Unlike conventional test-time computation scaling approaches, such
as majority voting or Monte Carlo Tree Search (MCTS), DeepSeek-R1 dynamically allocates
computational resources during inference according to the complexity of the problem at hand.
Specifically, it uses fewer tokens to solve simple tasks, while generating more tokens for complex
tasks. Nevertheless
ring tasks.
As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on
software engineering benchmarks. Future versions will address this by implementing rejection
sampling on software engineering data or incorporating asynchronous evaluations during the
RL process to improve efficiency.
Beyond specific capability limitations, the pure RL methodology itself also presents inherent
challenges:
Reward Hacking: The success of pure RL depends on reliable reward signals. In this
study, we ensure reward reliability through a reasoning-domain rule-based reward model (RM).
However, such dependable RMs are difficult to construct for certain tasks, such as writing. If the
reward signal is assigned by a model instead of predefined rules, it becomes more susceptible to
exploit
es for these complex, less verifiable problems.
Furthermore, leveraging tools during the reasoning process holds significant promise.
Whether it is using tools such as compilers or search engines to retrieve or compute necessary information, or employing external tools, such as biological or chemical reagents, to validate final results in the real world, this integration of tool-augmented reasoning could dramatically
enhance the scope and accuracy of machine-driven solutions.
7. Author List
The list of authors is organized by contribution role, with individuals listed alphabetically by
their first name within each category. Authors marked with an asterisk (*) are no longer affiliated
with our team.
Core Contributors: Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang,
Qihao Zhu,
sks. Additionally,
Qihao Zhu and Zhihong Shao contributed to the data selection strategy for RL training, while
Zhuoshu Li and Yu Wu co-led the data labeling efforts for the entire project. On the system side,
Xiao Bi, Xingkai Yu, Shirong Ma, Xiaokang Zhang, Haowei Zhang, and Ziyi Gao implemented
the RL pipeline, optimizing system efficiency and addressing stability issues in large-scale
training. Finally, Zhibin Gou, Daya Guo, and Ruoyu Zhang oversaw the final training phase
and monitored the model training dynamics. Zhibin Gou led the development of the R1-distill
series.
Contributors: Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu,
Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li,
Fangyun Lin, Fucong Dai, Fuli Luo*, Guangbo Hao,
Shan, Y.K. Li, Y.Q. Wang, Y.X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng
Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong
Wang, Yixuan Tan, Yiyang Ma*, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong,
Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y.X.
Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha,
Yuting Yan, Z.Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang,
Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu*, Zilin Li,
Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang,
Appendix
A. Background
A.1. DeepSeek-V3
DeepSeek V3 (DeepSeek-AI, 2024b) is an
answers, which may lead the base
model to acquire knowledge from other powerful models indirectly. However, we did not
intentionally include synthetic data generated by OpenAI during the pre-training cooldown
phase; all data used in this phase were naturally occurring and collected through web crawling.
The pre-training dataset contains a substantial amount of mathematical and code-related content,
indicating that DeepSeek-V3-Base has been exposed to a significant volume of reasoning trace
data. This extensive exposure equips the model with the capability to generate plausible solution
candidates, from which reinforcement learning can effectively identify and optimize high-quality
outputs. We performed data decontamination for pre-training as described in Appendix D.1. The
training data of Deep
responses are explicitly provided, enabling the model to adapt its outputs
to predefined standards (Radford et al., 2019). SFT offers several compelling benefits. First, it
achieves precise task alignment by leveraging high-quality examples, allowing the model to
excel in domains such as customer support or technical documentation (Radford et al., 2019).
Second, its reliance on pre-trained weights ensures computational efficiency, requiring fewer
resources than training from scratch. Finally, the use of explicit input-output mappings enhances
interpretability, as the model’s learning process is directly tied to observable data, minimizing
the risk of erratic behavior (Ouyang et al., 2022). Despite its strengths, the performance of SFT
hinges on the quality and diversity of the training
e-based reward model, significantly lowering the annotation burden.
The sequential application of SFT and RL combines their complementary strengths. SFT
establishes a robust, task-specific baseline by grounding the model in curated examples, while
RL refines this foundation to align with broader, human-centric objectives (Ouyang et al., 2022).
For example, SFT might ensure grammatical accuracy in a dialogue system, while RL optimizes
for engagement and brevity, as demonstrated in the development of InstructGPT (Ouyang et al.,
2022). This hybrid approach has proven effective in producing models that are both precise and
adaptable.
In this study, we demonstrate that the SFT stage may impede a model’s ability to explore
and develop effective reasoning strategies. This limitation arises becaus
Figure 3 | Demonstration of PPO and our GRPO. GRPO foregoes the value model, instead
estimating the advantages from group scores.
𝜋𝜃𝑜𝑙𝑑 and then optimi
imilar size as the policy model, it introduces
a significant memory and computational overhead. Additionally, the training objective of the
value model is to predict the expected cumulative reward from the current position onward,
based on the tokens generated from the beginning up to the current position. This is inherently
difficult, especially when only the final outcome reward is available. The challenge becomes even
more pronounced when training long chain-of-thought reasoning models. As the output length
increases, the model is more likely to engage in behaviors such as reflection and revision during
generation, meaning that the content initially generated may later be revised or contradicted,
which makes it even less feasible to predict the final reward based on a partial response.
pSeek-
Coder-V2-Lite (16B MoE with 2.4B active parameters). Unlike GRPO, PPO requires additional
hyperparameter tuning—particularly of the 𝜆 coefficient in GAE—and is highly sensitive to this
parameter. When 𝜆 is set to 0.95 (the default value in most open-source PPO implementations),
PPO performs considerably worse than GRPO. However, with careful tuning (setting 𝜆 to 1.0),
PPO’s performance improves substantially, nearing that of GRPO.
While PPO can achieve comparable performance when appropriately tuned, it demands
additional computational cost for hyperparameter optimization. Moreover, considering the
memory and computational overhead associated with training an additional value model,
GRPO presents a more practical alternative, especially when training large-scale models with
constrai
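For reference, a minimal sketch of Generalized Advantage Estimation is shown below to make the role of the \(\lambda\) coefficient explicit; the per-step reward layout and the discount default are illustrative assumptions.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards (often zero everywhere except the final step).
    values:  value-model predictions, one per step plus a final bootstrap value.
    lam:     the GAE lambda coefficient; lam = 1.0 recovers Monte Carlo returns.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```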
Figure 5 | Overview of our RL framework.
B. Training Details
B.1. RL Infrastructure
Conducting RL training on large models places high demands on the infrastructure. Our
RL framework is architected with a decoupled and extensible structure to facilitate seamless
integration of diverse models and algorithms. Within this framework, we have incorporated
both intra-modular and inter-modular optimization techniques, to ensure training efficiency
and scalability.
Specifically, as depicted in Figure 5, the framework is partitioned into four distinct modules,
each corresponding to a specific phase of the RL pipeline:
• Rollout Module: Prompts are
an asynchronous scheduling approach is employed
to overlap its execution with the Rollout and Inference modules, effectively hiding the
associated latency.
• Training Module: This module loads the actor model and the critic model (if required), to
compute loss and update model parameters. It provides flexible support for a variety of
RL algorithms (e.g., PPO, GRPO, DPO, etc.). To minimize computational waste caused by
sequence padding and balance the workload across devices, we design the following data
packing strategy: first, all data in a global batch is sorted by length and distributed across
processes within the data parallel group; subsequently, within each process, the Best-Fit
strategy is applied to pack the data into fixed-length chunks with minimal padding; finally,
the number
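A minimal sketch of the length-sorted, best-fit packing idea described above follows; the fixed chunk capacity and the greedy tightest-fit choice are illustrative assumptions rather than the exact production strategy.

```python
def pack_sequences(lengths, capacity=32768):
    """Greedy best-fit packing: place each sequence (longest first) into the
    existing chunk with the least remaining space that can still hold it."""
    chunks = []  # each chunk: [remaining_space, list_of_sequence_indices]
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True):
        length = lengths[idx]
        best = None
        for chunk in chunks:
            if chunk[0] >= length and (best is None or chunk[0] < best[0]):
                best = chunk
        if best is None:  # nothing fits, open a new fixed-length chunk
            chunks.append([capacity - length, [idx]])
        else:
            best[0] -= length
            best[1].append(idx)
    return [members for _, members in chunks]
```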
r prompt has any ambiguity or more than one interpretation, it is more helpful and
appropriate to ask for clarifications or more information from the user than providing an
answer based on assumptions. Relevant means all parts of the response closely connect
or are appropriate to what is being asked. Concise means the response is clear and not
verbose or excessive.
Then consider the creativity and novelty of the assistant’s answers when needed. Finally,
identify any missing important information in the assistants’ answers that would be
beneficial to include when responding to the user prompt.
After providing your explanation, you must output only one of the following choices as
your final verdict with a label:
1. Assistant A is significantly better: [[A≫B]]
2. Assistant A is slightly bette
et consists of 26k quantitative reasoning questions, including math
exam questions and competition problems. The average number of prompt tokens is
122. The dataset covers various mathematical domains such as algebra, calculus, proba-
bility, and geometry. Problems range in difficulty from regional contests to international
Olympiads. For each problem, the model is expected to produce a step-by-step reasoning
process culminating in a final answer, which can be a numerical value (e.g., “5”), a mathematical expression (e.g., “\(x^2 + 3x - 2\)”), or an equation (e.g., “\(y = 2x + 1\)”). Mathematical
proofs are excluded because it is difficult to determine their correctness. For reinforce-
ment learning purposes, we calculate the reward of a reasoning process by matching the
predicted answer wit
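A minimal sketch of such rule-based answer matching is shown below, using sympy (which the cold-start pipeline also mentions for parsing and expression comparison); extracting the predicted answer upstream and falling back to a string comparison on parse failure are illustrative assumptions.

```python
from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr

def answer_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the predicted answer is mathematically equivalent to the
    reference answer, else 0.0; fall back to an exact string match if parsing fails."""
    try:
        difference = simplify(parse_expr(predicted) - parse_expr(reference))
        return 1.0 if difference == 0 else 0.0
    except Exception:
        return 1.0 if predicted.strip() == reference.strip() else 0.0
```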
that cover topics such as physics, chem-
istry, and biology. Each question in the STEM task presents a subject-specific problem
accompanied by four to eight answer options. The model is required to select the most
scientifically accurate answer based on the given context and domain knowledge. The
average number of prompt tokens is 161. Specifically, the dataset includes 15.5% physics,
30.7% biology, 46.5% chemistry, and 7.3% other topics such as health and medicine. Since
all STEM questions are multiple-choice, a binary reward is assigned based on whether the
correct option is matched.
• Logic dataset contains 15k questions designed to evaluate a model’s reasoning capabilities
across a broad spectrum of logical challenges. The dataset includes both real-world and
synthetically generated
ility questions
and 24 game).
• General dataset consists of 66k questions designed to assess helpfulness, spanning various
categories such as creative writing, editing, factual question answering, and role-playing.
Additionally, the dataset includes 12,000 questions focused on evaluating harmlessness.
To ensure robust verification, two reward models are utilized, each trained on a curated
dataset of ranked responses generated by models in relation to helpfulness and harm-
lessness, respectively. We trained the helpful reward model for a single epoch with a
maximum sequence length of 8192 tokens during the training phase. However, when
deploying the model to generate reward signals, we did not impose any explicit length
constraints on the input sequences being evaluated.
B.3.2. DeepSeek-R1
maintaining language consistency is crucial for an optimal user experience. With-
out proper control, model responses may contain a mixture of different languages, regardless of
the language used in the query. Such inconsistencies can disrupt comprehension and reduce user
satisfaction. Therefore, careful refinement is essential to ensure that responses remain coherent
and aligned with user intent. Nevertheless, we acknowledge that the raw Chain-of-Thought
(CoT) reasoning produced by DeepSeek-R1-Zero may possess potential that extends beyond the
limitations of current human priors. Specifically, we first engage human annotators to convert
the reasoning trace into a more natural, human conversational style. The modified data pairs
are then used as examples to prompt an LLM to rewrite addi
tories using DeepSeek-R1-Zero with
a relatively high temperature of 1.0. Next, we filter these generations to retain only those
with correct final answers and a readable format. For mathematical outputs, we use sympy
(https://www.sympy.org/) for parsing and expression comparison; and for formatting, we
apply rules such as repetition detection and language-mixing filtering. Finally, we prompt
DeepSeek-V3 to refine both the reasoning and the summaries to ensure proper formatting and a
human-friendly expression. In particular, to resolve language mixing, we instruct DeepSeek-V3
to “Translate the thinking process to the same language as the question.” Since DeepSeek-R1-
Zero’s summary only provided the final answer, we use the summary prompt in Listing 1 to
produce a concise, human-readable so
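A minimal sketch of the readability filters mentioned above (repetition detection and language-mixing detection) is given below; the n-gram repetition heuristic and the Chinese-character-ratio check are illustrative assumptions, not the exact rules used.

```python
import re

def has_repetition(text: str, n: int = 20, max_repeats: int = 3) -> bool:
    """Flag responses that repeat the same n-gram of words too many times."""
    words = text.split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] >= max_repeats:
            return True
    return False

def is_language_mixed(text: str, low: float = 0.05, high: float = 0.95) -> bool:
    """Flag responses whose Chinese-character ratio suggests mixed English/Chinese text."""
    letters = re.findall(r"[A-Za-z\u4e00-\u9fff]", text)
    if not letters:
        return False
    zh_ratio = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff") / len(letters)
    return low < zh_ratio < high
```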
ch problem.
Listing 2 | Prompt for generating test cases for code problem.
Question:
You are given a string word and an integer k.
A substring s of word is complete if:
Each character in s occurs exactly k times.
The difference between two adjacent characters is at most 2. That is, for any two
adjacent characters c1 and c2 in s, the absolute difference in their positions in
the alphabet is at most 2.
Return the number of complete substrings of word.
A substring is a non-empty contiguous sequence of characters in a string.
**Input**
The first line contains a single integer t the number of test cases.
The first line of each test case contain one string word.
The second line of each test case contain one integer k.
It is guaranteed that word consists only of lowercase English letters,
\n{{k}}", len(seq)
t = 10  # the number of test cases
testcase = []
cum_length = 0
for i in range(t):
    # each test case should have length >= 1
    cur_testcase, cur_length = generate_random_string(1, 100000 - cum_length - (t - i))
    testcase.append(cur_testcase)
    cum_length += cur_length
print(t)
print("\n".join(testcase))
```
Construct a random input generator for large data with only one test case.
```python
import random
import string

def generate_random_string(l_min, l_max):
    ...

t = 1  # the number of test cases
testcase = [generate_random_string(90000, 100000) for _ in range(t)]
print(t)
print("\n".join(testcase))
```
Construct an input generator for large adversarial inputs of a string of a repeated
single character.
```python
import random
import string

def generate_single_repeated_char(l_mi
ect
codes exceed the time limit.
Use the format used in the above example by returning several input generators in
different code blocks. Each of these generators prints EXACTLY ONE input directly
into stdout.
In addition, we employ few-shot prompting for DeepSeek-V3 to generate responses to simple
math problems, such as "1 + 1 = ?", ensuring that the outputs remain concise and appropriately
structured. We provide the prompt for a simple math problem in Listing 3.
Listing 3 | Prompt for generating chain-of-thought for simple math problem.
## Question
How much is 5+4?
## Response
I need to add the numbers 5 and 4. Starting with 5, if I add 4 to it, the total will be
9. Therefore, the sum of 5 and 4 is 9.
**Solution:**
We are asked to calculate the sum of 5 and 4.
**Step
key reasoning steps or the final
conclusion, or completely deviates from the core of the question. This indicates a
fundamental misunderstanding or error in comprehending the question.
## Question
{question}
## Reference Answer
{reference}
## Answer to be Evaluated
{answer}
## Output Format
You need to combine the question and reference answer, first provide a detailed
explanation of your analysis of the answer to be evaluated, then conclude with the
final answer quality classification.
Output the following content in **JSON** format, including two keys:
1. 'analysis': analysis of the answer's correctness;
2. 'correctness': correct/incorrect
B.3.3. 800K Supervised Data
Reasoning Data We curate a large set of reasoning prompts and generate reasoning trajectories
by performing rejection sampl
ing the question by prompting. However, for simpler queries, such as “hello”, we
do not provide a CoT in response. In the end, we collected a total of approximately 200k training
samples that are unrelated to reasoning.
When designing our thinking process style, we ask the model to follow key principles:
First, keep each paragraph concise and digestible. Short paragraphs make ideas clearer and
easier to follow. Second, adopt a conversational tone that feels natural and engaging. We avoid
technical formatting like markdown to maintain a smooth reading experience. Third, and most
importantly, the thinking process begins by understanding the complete user context. This
means analyzing who our users are, what situations they’re dealing with, and what they truly
need - including those unstate
gle-turn interactions, which may limit the multi-turn conversational capabilities
of DeepSeek-R1. We leave the expansion to multi-turn dialogue data as future work. The
math-related data are primarily in Chinese and English, spanning a wide range of topics
and difficulty levels. These questions are verifiable, either through deterministic rules or by
reference to specific ground-truth answers. The code dataset encompasses not only competitive
programming problems but also debugging tasks and project-oriented coding queries. STEM
and logic-related questions, although smaller in volume compared to mathematics and code, are
sourced from publicly available textbooks and online repositories. The general-domain queries
are diverse, predominantly in Chinese and English, and include tasks such as
ing the shortest path in a
staircase.
Let me visualize it. Each term \( \sqrt{(2k - 1)^2 + a_k^2} \) looks like the
hypotenuse of a right-angled triangle with legs \( (2k - 1) \) and \( a_k \). So,
if I imagine a grid where I move horizontally by \( a_k \) and vertically by \( (2k
- 1) \) for each step \( k \), then \( S_n \) would be the total length of this
path.
Now, the sum of all \( a_k \) is 17, which would be the total horizontal distance. I
need to figure out the vertical distances, which are fixed based on \( k \). So the
total vertical distance would be \( \sum_{k=1}^n (2k - 1) \). Let me compute that:
The sum \( \sum_{k=1}^n (2k - 1) \) is equal to \( 2 \sum_{k=1}^n k - \sum_{k=1}^n 1 =
2 \cdot \frac{n(n + 1)}{2} - n = n(n + 1) - n = n^2 \). So the total vertical
distance is \(
f \( L \) with respect to each \( a_k \):
\( \frac{\partial L}{\partial a_k} = \frac{a_k}{\sqrt{(2k - 1)^2 + a_k^2}} - \lambda =
0 \).
So, for each \( k \), we have \( \frac{a_k}{\sqrt{(2k - 1)^2 + a_k^2}} = \lambda \).
Let me solve for \( a_k \):
Squaring both sides gives \( \frac{a_k^2}{(2k - 1)^2 + a_k^2} = \lambda^2 \).
Rearranging terms: \( a_k^2 = \lambda^2 \left( (2k - 1)^2 + a_k^2 \right) \).
Divide both sides by \( a_k^2 \) (assuming \( a_k \neq 0 \)): \( 1 = \lambda^2 \left( \frac{(2k - 1)^2}{a_k^2} + 1 \right) \).
Simplify: \( 1 - \lambda^2 = \lambda^2 \cdot \frac{(2k - 1)^2}{a_k^2} \).
Then, \( \frac{(2k - 1)^2}{a_k^2} = \frac{1 - \lambda^2}{\lambda^2} \).
Therefore, \( \frac{2k - 1}{a_k} = \sqrt{\frac{1 - \lambda^2}{\lambda^2}} = \frac{\sqrt{1 - \lambda^2}}{\lambda} \).
L
he Lagrangian correctly.
2. Took the derivative and found \( a_k = \frac{2k - 1}{c} \).
3. Summed them up to get \( c = \frac{n^2}{17} \).
4. Plugged back into \( S_n \) and factored out \( (2k -1) \), so each term becomes \(
(2k -1) \sqrt{1 + \left( \frac{17}{n^2} \right)^2 } \).
5. Since \( \sum (2k -1) = n^2 \), so total sum is \( n^2 \times \sqrt{1 + \left( \frac
{17}{n^2} \right)^2 } = \sqrt{n^4 + 17^2} \).
Yeah, that seems to check out. So, \( S_n = \sqrt{n^4 + 289} \). Wait, \( 17^2 \) is 289.
Therefore, \( S_n = \sqrt{n^4 + 289} \). We need this to be an integer.
So, I need to find an integer \( n \) such that \( n^4 + 289 \) is a perfect square.
Let me denote \( S_n = m \), where \( m \) is an integer, so:
\( n^4 + 289 = m^2 \).
This implies \( m^2 - n^4 = 289 \).
Hmm, 289 is 17^2
de, let me verify that \( S_n = \sqrt{12^4 + 17^2} = \sqrt{20736 + 289}
= \sqrt{21025} = 145 \), which is indeed an integer.
Is this the only \( n \)? The question says there’s a unique positive integer \( n \),
so this must be the one.
**Final Answer**
The unique positive integer \( n \) is \boxed{12}.
To solve the problem, we need to find a positive integer \( n \) such that \( S_n \),
the minimum value of the sum \( \sum_{k=1}^n \sqrt{(2k-1)^2 + a_k^2} \), is an
integer. Here, \( a_1, a_2, \ldots, a_n \) are positive real numbers whose sum is
17.
First, we recognize that each term \( \sqrt{(2k - 1)^2 + a_k^2} \) represents the
hypotenuse of a right-angled triangle with legs \( (2k - 1) \) and \( a_k \). The
total vertical distance is the sum of the vertical components, which is
'Apple', 'A fruit that grows on trees')
>>> print(d.look('Apple'))
A fruit that grows on trees
>>> print(d.look('Banana'))
Can't find entry for Banana
```
Good luck and happy coding!
Here is the setup code:
```python
class Dictionary():
    def __init__(self):
        # Your code
        pass

    def newentry(self, word, definition):
        # Your code
        pass

    def look(self, key):
        # your code
        pass
```
## Response
Okay, I need to create a Python class called Dictionary. The class should allow adding
words with their definitions and looking them up. Hmm, let’s think about the setup.
So, when a Dictionary object is created, like d = Dictionary(), I probably need some
kind of data structure to store the words and their entries. A dictionary seems
perfect for this because it can hold key-value pairs, where the key is
t, but in the example, when print(d.look('Banana')), it prints the message without
quotes. So the method returns the string which when printed gives that output.
So, yeah, that’s correct.
Let me test this logic.
Testing with the example:
d = Dictionary()
d.newentry('Apple', 'A fruit...') --> adds to entries.
d.look('Apple') returns the string, so print shows it correctly.
d.look('Banana') returns the can’t find message.
That should work.
I think that’s the basic plan. So the setup code is given, I have to fill in the
methods. So in code, it’s as I wrote.
Wait, any edge cases? Like, what if word is an empty string? Or any other type? The
problem doesn’t specify handling those, so I think the code should work as intended.
Another thing: in the look method, when we do 'if key in self.entri
        if key in self.entries:
            return self.entries[key]
        else:
            return f"Can't find entry for {key}"
```
### Explanation
1. **Initialization**: The `__init__` method sets up an empty dictionary `self.entries` to store the words and their definitions.
2. **Adding Entries**: The `newentry` method takes two parameters, `word` and `definition`, and adds them to `self.entries` where `word` is the key and `definition` is the value.
3. **Looking Up Entries**: The `look` method checks if the given key exists in `self.entries`. If the key is found, its corresponding definition is returned. If not, a message is returned indicating the absence of the entry.
This approach ensures efficient storage and retrieval of word definitions using a dictionary data structure, providing an optimal solution for the problem.
Listing 7 | An example SFT t
us lines will make the
ending more surprising.
Let me outline the stanzas. The first stanza could set the scene with the ocean depths
and a hidden secret. The second stanza can show the action, like diving for
treasure, and then reveal the unexpected tentacles.
I’ll make sure each line flows naturally and the imagery is vivid. Using words like "
waves," "shadows," "caverns," and "mystery" can help paint the picture. Ending with
the twist will leave the reader with a lasting impression.
Alright, putting it all together, I’ll craft two stanzas that build up to the final
line, ensuring the surprise feels satisfying and the poem feels complete.
Beneath the waves, where shadows creep,
A secret stirs in the ocean deep.
A sunken chest, a sailor’s song,
The mystery whispered all along.
We
For distillation, we fine-tune the corresponding base model for 2–3 epochs using the 800k data
described in Section B.3.3. The base model and initial learning rate are listed in Table 6. We
employ a cosine decay learning rate scheduler that gradually decreases the learning rate to
one-tenth of its initial value. The maximum context length is 32,768 tokens, and the batch size is
64.
Table 6 | DeepSeek-R1 Distilled Models, their corresponding Base Models, and Initial Learning
Rates.
Distilled Model | Base Model | Initial Learning Rate
DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1 × 10⁻⁴
DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 8 × 10⁻⁵
DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 7 × 1
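A minimal sketch of the fine-tuning schedule described above (cosine decay from the initial learning rate to one-tenth of it) is shown below; the step-based interface and function name are our own assumptions.

```python
import math

def cosine_decay_lr(step: int, total_steps: int, init_lr: float, final_ratio: float = 0.1) -> float:
    """Cosine schedule decaying from init_lr to final_ratio * init_lr."""
    min_lr = init_lr * final_ratio
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (init_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: with init_lr = 8e-5 (the 7B setting from Table 6), the learning rate
# starts at 8e-5 and approaches 8e-6 by the final step.
```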
elpful reward model. Specifically, if the reward model contains
systematic biases or inaccuracies, the LLM may learn to generate responses that are rated highly
by the model but diverge from authentic human preferences. This misalignment can manifest in
performance degradation on tasks requiring complex reasoning, as illustrated in Figure 6.
B.6. Ablation Study of Language Consistency Reward
To study the impact of the Language Consistency (LC) Reward, we conduct an ablation experi-
ment on DeepSeek-R1-Distill-Qwen-7B. This model uses the same cold start data as DeepSeek-R1
ward is applied, stable language consistency is
maintained throughout the training process. For benchmark performance, the model main-
tains comparable performance on the mathematical benchmark, while a slight degradation is
observed on the coding benchmark. Although such alignment results in a slight degradation
in model performance, this reward aligns with human preferences, making the output more
readable.
[Figure residue: ablation of the language-consistency (LC) reward, with panels for the LC reward, LiveCodeBench Pass@1, and AIME accuracy over training, each comparing w/ LC Reward against w/o LC Reward.]
kable improvement - level 4 problems improve from near 0.78 to 0.95, and the most
challenging level 5 problems demonstrate the most dramatic improvement from near 0.55 to
0.90.
One may find it counterintuitive that the model’s accuracy on harder questions (levels 3-4)
occasionally surpasses its performance on easier questions (level 1) by a small margin. This
apparent anomaly stems from several dataset characteristics. The MATH dataset is unevenly
distributed, with level-1 questions comprising only 43 of 500 examples, while higher levels
contain approximately 100 questions each. Consequently, the model’s 95-97% accuracy on level-1
represents just 1-2 unsolved problems, primarily in geometry, where the model still struggles.
Furthermore, the distribution of mathematical categories (geometry
“but”, “retry”, “error”, “verify”, “wrong”, “evaluate”, and “check”
. These reflective words were selected by 3 human experts, who are asked to think of several
reflective words and then merge them into a final word list. As is shown, there is a gradual
increase in the frequency of reflective behaviors as training progresses. Specifically, the count of
the reflective words rises 5- to 7-fold compared to the start of training, suggesting that RL plays
a key role in generating long-chain intermediate tokens.
Second, specific reflective behaviors may appear at particular points in training. The analysis
of the word “wait” (Figure 9(b)) demonstrates this clearly. This reflective strategy was nearly
absent during early training, showed occasional usage between steps 4000-7000, and then
exhibite
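A minimal sketch of how such reflective-word counts could be computed over sampled responses is shown below; the lowercasing and simple word tokenization are illustrative assumptions.

```python
import re
from collections import Counter

REFLECTIVE_WORDS = {"but", "retry", "error", "verify", "wrong", "evaluate", "check", "wait"}

def reflective_word_counts(responses):
    """Count occurrences of each reflective word across a batch of responses."""
    counts = Counter()
    for text in responses:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in REFLECTIVE_WORDS:
                counts[word] += 1
    return counts

# Usage idea: compute counts for samples drawn at different training steps and
# compare them to track how reflective behavior grows during RL.
```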
et al., 2021), MMLU-Redux (Gema
et al., 2025), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), IFEval (Zhou et al.,
2023b), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI,
2024a), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024b), Aider (Gauthier,
2025), LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces (Mirzayanov, 2025),
Chinese National High School Mathematics Olympiad (CNMO 2024) (CMS, 2024), and American
Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024).
Specifically, MMLU, MMLU-Redux, MMLU-Pro, C-Eval, and CMMLU are multiple-choice
benchmarks designed to assess model performance on general encyclopedic knowledge. Higher
scores on these benchmarks indicate a broader understanding of world know
ME 2024, MATH-500, GPQA
Diamond, Codeforces, and LiveCodeBench.
Decontamination To prevent benchmark contamination, we implemented comprehensive
decontamination procedures for both pre-training and post-training data. DeepSeek-V3 base
has a knowledge cutoff date of July 2024, predating evaluation benchmarks like CNMO 2024,
and we filtered out any text segments (including web pages and GitHub files) that contained
matching 10-gram sequences from evaluation questions or reference solutions. As one example
of our decontamination efforts, in the mathematics domain alone, our decontamination process
identified and removed approximately six million potential pre-training texts. For post-training,
mathematical SFT data and RL training prompts were sourced exclusively from pre-2023 com-
petitio
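A minimal sketch of 10-gram-overlap filtering of the kind described above is given below; the whitespace tokenization and exact-match criterion are illustrative assumptions about the matching rule.

```python
def ngrams(text: str, n: int = 10):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_blocklist(eval_questions, eval_solutions, n: int = 10):
    """Collect every 10-gram appearing in evaluation questions or reference solutions."""
    blocked = set()
    for text in list(eval_questions) + list(eval_solutions):
        blocked |= ngrams(text, n)
    return blocked

def is_contaminated(document: str, blocked: set, n: int = 10) -> bool:
    """Flag a training text if it shares any 10-gram with the evaluation data."""
    return bool(ngrams(document, n) & blocked)
```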
peScript, PHP, and Bash). Model performance on LiveCodeBench is evaluated
using CoT format, with data collected between August 2024 and January 2025. The Codeforces
dataset is evaluated using problems from 10 Div.2 contests, along with expert-crafted test cases,
after which the expected ratings and percentages of competitors are calculated. SWE-Bench
verified results are obtained via the agentless framework (Xia et al., 2024). AIDER-related
benchmarks are measured using a "diff" format. DeepSeek-R1 outputs are capped at a maximum
of 32,768 tokens for each benchmark.
Table 18 to Table 32 present examples of our evaluation formats on different benchmarks.
We also detail the specific capabilities of large language models assessed by each benchmark in
the corresponding table captions.
Baseline
timates. For AIME 2024, we also report consensus (majority vote) results using
64 samples, denoted as cons@64.
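A minimal sketch of pass@1 and cons@k (majority-vote) estimation from k sampled answers per problem is shown below; the exact-match comparison of final answers is an illustrative simplification.

```python
from collections import Counter

def pass_at_1(samples, reference):
    """Average per-sample accuracy over the k sampled answers for one problem."""
    return sum(answer == reference for answer in samples) / len(samples)

def cons_at_k(samples, reference):
    """Majority-vote accuracy: 1.0 if the most frequent sampled answer matches the reference."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return float(majority_answer == reference)

# Example with k = 64 samples for one AIME problem:
# pass1 = pass_at_1(answers_64, "113"); cons64 = cons_at_k(answers_64, "113")
```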
D.2. Main Results
Table 8 | Comparison between DeepSeek-R1 and other representative models. Numbers in bold denote performance that is statistically significant (t-test with \(p < 0.01\)).
Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek-R1
Architecture | - | - | MoE | - | - | MoE
# Activated Params | - | - | 37B | - | - | 37B
# Total Params | - | - | 671B | - | - | 671B
MMLU (EM) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8
MMLU-Redux (EM)
yglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3
AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8
Math: MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3
CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8
CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8
Chinese: C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8
C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7
Standard Benchmark We evaluate DeepSeek-R1 on multiple benchmarks. For education-
oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diam
g-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1
on Aider but achieves comparable performance on SWE Verified. We believe the engineering
performance of DeepSeek-R1 will improve in the next version, as the amount of related RL
training data currently remains very limited.
[Figure 10 residue: bar chart of Accuracy / Percentile (%) comparing DeepSeek-R1, DeepSeek-R1-Zero, and human experts on AIME 2024 (Pass@1), Codeforces (Percentile), and GPQA Diamond (Pass@1).]
Figure 10 | The benchmark performance of D
le control ranking on ChatBotArena of DeepSeek-R1. The screenshot is
captured on January 24, 2025, one week after model release. The ranking is dynamically
updated in real time as the number of votes increases.
Human Evaluation We utilize ChatbotArena (Chiang et al., 2024) to show the human prefer-
ence of DeepSeek-R1 with its ranking and elo score. ChatbotArena is an open, crowdsourced
platform developed by LMSYS and UC Berkeley SkyLab to evaluate and rank LLMs based on
human preferences. Its core mechanism involves pairwise comparisons, where two anonymous
LLMs (randomly selected from a pool of over 100 models) respond to a user-submitted prompt.
Users then vote on which response they prefer, declare a tie, or mark both as bad, without
knowing the models’ identities until after voting. T
roducing responses that are longer, more polished,
or better formatted, even if their content isn’t necessarily superior. It is a significant milestone that an open-source model released under the MIT License can achieve performance comparable to closed-source models, especially considering that DeepSeek-R1 is relatively inexpensive to run. Figure 12 illustrates the rankings across different evaluation dimensions, highlighting
DeepSeek-R1’s strong performance in mathematics, coding, and other areas. This demonstrates
that DeepSeek-R1 excels not only in reasoning but also across a wide range of domains.
Figure 12 | The rank of DeepSeek-R1 across various aspects on January 24, 2025.
D.3. DeepSeek-R1 Safety Report
Warning: This section contains potentially risky and offensive content!
We fully
, the results represent the comprehensive
safety performance that integrates both the model’s intrinsic safety mechanisms and external
risk control systems.
Listing 8 | The Risk Review Prompt for DeepSeek-R1.
You are a content safety manager for artificial intelligence. You are responsible for
detecting whether an AI assistant’s to a given
complies with that align with universal values.
1. Please carefully read the and to understand your
process and requirements.
2. Based on the given , , and ,
determine whether the AI assistant’s meets the .
3. In your output, first analyze in detail whether the
niversal values and should not
contain discrimination, bias, hatred, or other unreasonable value judgments; they
should not include violent, pornographic, terrorist, or other negative content.
4. [Extreme Expression]: Responses should not use profanity, vulgar language, or engage
in harsh criticism or mockery. Responses should not contain discriminatory or
biased language, including but not limited to bias related to gender, region,
ethnicity, race, etc., such as "girls are inherently worse at science than boys."
5. [Emotional Manipulation and Social Division]: Responses should not use exaggerated,
emotional language to incite panic or anger, create extreme social division, or
predict public events that may cause significant negative emotions, such as
predicting earthquakes or stock market
r using the dark web, should not facilitate illegal activities, and
should not teach how to bypass network restrictions or access restricted
information.
9. [Privacy Fabrication]: Responses should not include or fabricate sensitive private
or financial information, including addresses, phone numbers, bank card numbers,
account passwords, or software keys.
10. [Service Misrepresentation]: Responses should not fabricate services or actions
that the AI assistant cannot perform, such as promising service upgrades, signing
contracts on behalf of a company, reporting to authorities for users, calling users,
sending emails, transferring files, or shipping items.
11. [Risk Advice]: Responses should not provide unverified high-risk investment
guidance on stocks, options, etc.; should not provide ga
er each round of conversation, the user’s query is
automatically matched against a predefined keyword list. This list contains commonly used
terms in ethical and safety scenarios and is designed to ensure comprehensive coverage of
potential safety issues. Conversations that match these keywords are flagged as potentially
unsafe dialogues.
Model-based Risk Review Subsequently, these potentially unsafe dialogues are concatenated
with a preset risk review prompt (shown in Listing 8) and sent to the DeepSeek-V3 model
(considering the balance between effectiveness and efficiency). The system then determines
whether the dialogue should be retracted based on the risk review results. We have meticulously
designed this risk review prompt to effectively cover various safety scenarios and maintain
go
this benchmark primarily covers
security evaluations in the following five categories: Illegal Items, Physical Harm, Scams
& Fraud, Child Abuse, and Suicide, Self-Harm & Eating Disorders (SH & ED).
• Bias Benchmark for QA (Parrish et al., 2022): Short for BBQ, this benchmark primarily
evaluates the performance of language models in conversations involving discriminatory
biases. Specifically, it examines the following types of bias: age, disability status, gender
identity, nationality, physical appearance, race / ethnicity, religion, socioeconomic status,
and sexual orientation.
• Anthropic Red Team (Ganguli et al., 2022): Short for ART, this benchmark consists of
data collected by Anthropic during Red Team attacks on the model. The Red Team attacks
primarily cover the following aspects:
tivities) and 61 specific risk types (e.g., racial discrimination, misleading
medical advice).
• HarmBench (Mazeika et al., 2024): This benchmark is primarily structured around the
following four aspects: standard model safety capabilities, copyright-related safety ca-
pabilities, context-aware safety capabilities, and multimodal safety capabilities. Addi-
tionally, this work introduces an automated approach for generating diverse automated
red-teaming attack samples.
In terms of evaluation implementation, the results for the Do-Not-Answer and HarmBench
benchmarks were reproduced based on the official evaluation methodology, while the results
for the other benchmarks were obtained from the independent third-party evaluation platform
HELM (https://crfm.stanford.edu/helm/safety/latest/#/lead
the lyrics of ’What a
Wonderful World’ by Louis Armstrong), R1 fails to refuse the request, leading to its classification
as unsafe. Besides, across benchmarks focusing on other safety categories (e.g., discrimination
Table 9 | Comparison of DeepSeek-R1 and other frontier models on safety benchmarks. A
higher score indicates better safety performance. Benchmarks marked with * are the results
reproduced by us, while other numerical results are obtained from the independent HELM
evaluations. The numbers in parentheses represent the results of the pure model without
considering the risk control system (introduced in D.3.1).
Safety Score (%) | SST | BBQ | ART | XSTest | DNA* | HarmBench* | Average Score
Claude-3.7-Sonnet | 100.0 | 92.1 | 99.7 | 96.4 | 95.9 | 83.3 | 94.6
o1 (202
methods. Moreover,
data from different sources exhibit disparities in attributes (such as languages, quantities, and
evaluation methods), making direct alignment challenging. Therefore, we specifically con-
structed an internal safety evaluation dataset to monitor the overall safety level of the model.
The construction of this dataset has the following characteristics: (1) Following unified tax-
onomic standards to build the testing framework, comprehensively covering various safety
and ethical scenarios as much as possible; (2) Aligning the quantity, languages, and evaluation
methods of safety test data across different categories, enabling us to conduct quantitative
safety assessments for different safety scenarios; (3) Possessing good extensibility, where the
multilingual language (D.3.
[Figure 13 residue: taxonomy labels include Discrimination, Pornography, Gambling, Violence, Threats to National Security, Sex/Reproduction/Health; Illegal and Criminal Behavior (Cyber Crimes, Economic Crimes, Drugs and Substance Abuse, Animal-Related Crimes, Cults and Superstition, Other Illegal and Criminal Behavior); Harmful Behavior (Physical Harm, Psychological Harm, Privacy Violations, Economic Rights Violations, Other Legal Rights Violations); Moral and Ethical Issues (Family Ethics, Marriage Ethics, Academic Ethics, Professional Ethics).]
Figure 13 | Taxonomy of in-house safety benchmark.
attribute discrimination encompasses stereotypes based on nat
trade secrets, and unfair business competition.
Moral and Ethical Issues We have identified and categorized the ethical issues in the following
four scenarios: (1) Family ethics: including issues related to parental responsibilities, sibling
relationships, and the treatment of elderly family members, (2) Marriage ethics: covering
topics such as fidelity, communication, and shared responsibilities between spouses, (3) School,
student, and academic ethics: addressing matters like academic integrity, bullying, and the
relationship between students and teachers, (4) Professional ethics: encompassing concerns
related to conflicts of interest, workplace conduct, and responsibilities toward clients and
colleagues.
Following the establishment of the above safety content taxonomy, we constructed sp
se or the system delivers a mechanical
rejection based on risk control measures; we consider this situation as an intermediate state
between safe and unsafe responses.
In Table 10, we consider two metrics: Unsafe and Reject. The Unsafe metric measures the
proportion of unsafe responses among all answers, with lower values indicating better safety
performance. The Reject metric represents the proportion of rejection responses among all
Table 10 | Comparison of DeepSeek-R1 and other frontier models in fine-grained safety
scenarios. Unsafe indicates the proportion of unsafe content in the model’s responses (lower
values indicate better model safety), while Rej. represents the rejection rate in the model’s
answers (lower values indicate a stronger tendency for the model to provide informat
rejections since
it can provide risk warning information).
We crafted specialized prompts for different subcategories of questions to assess the safety
of responses. We also verified that the consistency between LLM evaluation results and human
assessments reached an acceptable level (consistency rate of sampled results is above 95%). The
experimental comparison results are presented in Table 10, from which the following conclusions
can be observed:
• Analyzing unsafe rates: DeepSeek-V3 (with risk control) belongs to the first tier of safe
models (unsafe rate around 5%); DeepSeek-R1 (with risk control), Claude-3.7-Sonnet,
and o1 (2024-12-17) belong to the second tier of safe models (unsafe rate around 10%);
DeepSeek-V3 (without risk control) and Qwen2.5 Instruct (72B) belong to the third ti