The rapid development of reasoning models, such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models. However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source R1 models and the o3-mini model, on both safety benchmark and attack, suggesting more safety effort on R1 is needed. (2) The distilled reasoning model shows poorer safety performance compared to its safety-aligned base models. (3) The stronger the model's reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (4) The thinking process in R1 models pose greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models' safety to close the gap.
We investigate the safety performance of large reasoning models in handling malicious queries. We begin by analyzing their overall performance, and identifying a distinct safety behavior from them. Then, we analyze their behavioral patterns on selected representative datasets.
Safety classification alone is not sufficient to comprehensively assess models' safety, as not all responses classified as unsafe are equally harmful - some provide minimal information, while others offer detailed, actionable guidance that aids malicious intent. To capture this, we define the harmfulness level of an unsafe response as the degree of helpfulness it provides to a malicious query.
This section evaluates the models' safety performance against two types of adversarial attacks: the jailbreak attack, which forces the model to respond to harmful queries, and the prompt injection attack, which aims to override the models' intended behavior or bypass restrictions.
We compare the safety of the thinking process from R1 models and their final answer when given harmful queries. Specifically, we take the content between <think>
and </think>
from the models' output and use the same evaluation prompt to judge the safety.
We can observe that the safety rate of the thinking process is lower than the final answer.
After investigating the models' responses, we identify two main types of cases where the thinking process contains 'hidden' safety risks that are not reflected in the final answer.
Two examples where the safety of the reasoning content is worse than the final completion.
Left: The reasoning content directly provides techniques that help the malicious query.
Right: The reasoning content provides safe paraphrasing techniques that are relevant to the malicious query. Red text is the potentially unsafe content.
@misc{zhou2025hiddenriskslargereasoning,
title={The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1},
author={Kaiwen Zhou and Chengzhi Liu and Xuandong Zhao and Shreedhar Jangam and Jayanth Srinivasa and Gaowen Liu and Dawn Song and Xin Eric Wang},
year={2025},
eprint={2502.12659},
archivePrefix={arXiv},
primaryClass={cs.CY},
url={https://arxiv.org/abs/2502.12659},
}