
Why does DeepSeek-R1 hallucinate so much?

A deeper dive on why DeepSeek R1 hallucinates more than DeepSeek V3


Introduction

DeepSeek R1 has been making waves since its launch, and rightfully so! But amidst all the excitement, we at Vectara uncovered a surprising trend: DeepSeek R1 hallucinates significantly more (14.3% hallucination rate) than its predecessor, DeepSeek V3 (3.9% hallucination rate).

You can find more details in our previous blog post here. If you're curious about where DeepSeek R1 stands in the broader LLM landscape, check out Vectara’s hallucination leaderboard.

In this blog post, we'll dive deeper into why DeepSeek R1 exhibits this behavior. Is it a trade-off for its advanced reasoning capabilities? Let’s find out!

Is reasoning the culprit?

In our last blog, we posed a burning question: Is hallucination the price we pay for better reasoning? Looking at Vectara’s hallucination leaderboard, we noticed a pattern. Reasoning LLMs often seemed to have higher hallucination rates than their non-reasoning counterparts.

| Non-reasoning LLM (hallucination rate) | Reasoning LLM (hallucination rate) |
|---|---|
| DeepSeek-V3: 3.9% | DeepSeek-R1: 14.3% |
| GPT-4o: 1.5% | o1: 2.4% |
| Gemini-2.0-Flash-Exp: 1.3% | Gemini-2.0-Flash-Think-Exp: 1.8% |
| Qwen2.5-32B-Instruct: 3.0% | Qwen-QwQ-32B-Preview: 16.1% |

Table 1: Hallucination rates for non-reasoning vs. reasoning LLMs.

To understand why, we ran an experiment. For each query, we took DeepSeek R1's “thinking content” (the reasoning steps it takes to generate a summary) and injected it into DeepSeek V3’s prompt.

We then instructed V3 to use this reasoning to generate a concise summary. We also tested V3’s internal reasoning by adding “Let’s think step-by-step” to its prompt.
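
To make the setup concrete, here is a minimal sketch of the two variants, assuming the OpenAI-compatible DeepSeek API; the model names, prompt wording, and the `reasoning_content` field are illustrative assumptions, not the exact code behind our experiments.

```python
# Minimal sketch of the thinking-content injection experiment (illustrative only).
from openai import OpenAI

# Assumption: the OpenAI-compatible DeepSeek API, where "deepseek-reasoner" is R1
# and "deepseek-chat" is V3.
client = OpenAI(api_key="...", base_url="https://api.deepseek.com")

def r1_summary_and_thinking(article: str):
    """Ask R1 for a summary and capture its reasoning ("thinking content")."""
    resp = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": f"Summarize the following passage concisely:\n\n{article}"}],
    )
    msg = resp.choices[0].message
    # Assumption: R1's reasoning steps are exposed in a `reasoning_content` field.
    return msg.content, getattr(msg, "reasoning_content", "")

def v3_with_injected_thinking(article: str, thinking: str) -> str:
    """Variant 1: inject R1's thinking content into V3's prompt."""
    prompt = (
        f"Passage:\n{article}\n\n"
        f"Reasoning steps about the passage:\n{thinking}\n\n"
        "Using this reasoning, write a concise summary grounded only in the passage."
    )
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def v3_step_by_step(article: str) -> str:
    """Variant 2: simply add "Let's think step-by-step" to V3's prompt."""
    prompt = f"Summarize the following passage concisely. Let's think step-by-step.\n\n{article}"
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```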

Here’s what we found:

| Model | Hallucination rate |
|---|---|
| DeepSeek R1 | 14.3% |
| DeepSeek V3 | 3.9% |
| DeepSeek V3 + DeepSeek R1’s thinking content | 3.3% |
| DeepSeek V3 + “Let’s think step-by-step” | 1.5% |

Table 2: Hallucination rates for DeepSeek models with prompt adaptations.

Surprisingly, both methods reduced DeepSeek V3’s hallucination rate. This suggests that reasoning might not be the primary reason behind DeepSeek-R1’s higher hallucination rate.

So what could be causing this?

A closer look at R1 hallucinations

Let’s look at an example. We gave DeepSeek R1 the following article:

Weekly Idol. The show is hosted by comedian Jeong Hyeong-don and rapper Defconn.

R1 generated this summary:

**Summary:** *Weekly Idol* is a South Korean television show hosted by comedian Jeong Hyeong-don and rapper Defconn.

The phrase "South Korean" is a hallucination (or factual inconsistency), as it’s not in the original text. However, it’s also a “benign” hallucination. FaithBench defines this as “clearly a hallucination, but supported by world knowledge, common sense, or logical reasoning, such that a reader finds it acceptable or welcomed”.

This led us to the following hypothesis: Is DeepSeek R1 “overhelping”? Is it adding information it thinks is relevant, even when asked to stick strictly to the source text?

Validating the findings

To validate our evaluation, we randomly selected 50 pairs of R1 and V3 samples where HHEM’s classifications differed. In other words, for summaries generated from the same article, if HHEM labeled the R1 summary as hallucinated, the corresponding V3 summary was labeled as non-hallucinated, and vice versa. This validation dataset was designed to stress-test which model produces outputs more consistent with the source by focusing on instances where the models are most likely to diverge.

We then had five human annotators review these summaries to determine actual hallucinations, classifying each summary into one of three categories: consistent, benign hallucination, or inconsistent (non-benign) hallucination (see the FaithBench paper mentioned above). The full validation results can be found here.
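
As a concrete illustration of the pair-selection step, here is a minimal sketch assuming HHEM’s binary labels are stored in a CSV with one row per article; the file and column names are hypothetical.

```python
# Sketch of selecting the 50 validation pairs where HHEM disagrees between R1 and V3.
# The file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("hhem_labels.csv")  # columns: article_id, r1_hallucinated, v3_hallucinated

# Keep only articles where HHEM labels the R1 and V3 summaries differently:
# one marked hallucinated, the other not.
disagreements = df[df["r1_hallucinated"] != df["v3_hallucinated"]]

# Randomly select 50 such pairs for human annotation.
validation_set = disagreements.sample(n=50, random_state=0)
validation_set.to_csv("validation_pairs.csv", index=False)
```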

We found that for R1, 46 out of 50 samples were marked as hallucinated by the human annotators, whereas for V3 only 19 out of 50 were, as shown in Table 3:

| Model | Hallucinated | Benign hallucinated | Non-benign hallucinated |
|---|---|---|---|
| DeepSeek R1 | 46 | 33 | 13 |
| DeepSeek V3 | 19 | 7 | 12 |

Table 3: Number of hallucinated samples for DeepSeek models in the validation set.

From this table we also see that, of the 46 R1 summaries marked as hallucinated, 33 (71.7%) are benign, whereas for V3 that share is 36.8% (7 of 19). The 71.7% figure may be artificially inflated by how we selected the validation dataset; nevertheless, we were intrigued that R1 shows a much higher proportion of benign hallucinations.

HHEM vs. LLM-as-a-Judge

Benign hallucinations – cases where a response includes facts that are not present in the source text but are supported by world knowledge – are challenging to identify.

Before we explore this type of hallucination further, let’s look at a few more examples:

  • Example 1:
    • Article: Rio (2011 film). A sequel , Rio 2 , was released on April 11 , 2014 .
    • R1 generated summary: The 2011 animated film *Rio* received a sequel titled *Rio 2*, which premiered on April 11, 2014.
    • (R1 added “animated”, which is correct but not in the article. See here.)

  • Example 2:
    • Article: The plants were found during the search of a warehouse near Ashbourne on Saturday morning. Police said they were in "an elaborate grow house". A man in his late 40s was arrested at the scene.
    • R1 generated summary: Police discovered an elaborate cannabis grow house during a Saturday morning search of a warehouse near Ashbourne, arresting a man in his late 40s at the scene.
    • (R1 said "cannabis" instead of "plants", which is also correct according to the source but not mentioned in the given article.)

  • Example 3:
    • Article: As of 14 March, over 150,000 cases have been reported in around 150 countries and territories; more than 5,600 people have died from the disease and over 73,000 have recovered.
    • R1 generated summary: As of 14 March, over 150,000 COVID-19 cases were reported globally across approximately 150 countries and territories, resulting in more than 5,600 deaths and over 73,000 recoveries.
    • (R1 added "COVID-19", which is supported by world knowledge but not inferable from the given article)

We wanted to explore how effectively benign hallucinations can be detected, comparing HHEM against LLM-as-a-judge.

For LLM-as-a-judge we created three variants (a prompt sketch follows the list):

  • Vanilla: Simply asks whether the summary contains hallucinations given the article.
  • Reasoning: Adds “think step-by-step” to encourage structured reasoning, using the same prompt as Google’s FACTS Grounding.
  • Sentence-level Reasoning: Adds “analyze the response sentence by sentence” to the reasoning LLM-as-a-judge, modifying the FACTS Grounding prompt.
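
Here is a sketch of what the three judge prompts look like. The wording is paraphrased for illustration (the Reasoning and Sentence-level variants reuse and modify Google’s FACTS Grounding judge prompt in our actual experiments), and the judge model name below is a placeholder.

```python
# Paraphrased sketches of the three LLM-as-a-judge variants (not the exact FACTS Grounding prompts).

VANILLA = (
    "Article:\n{article}\n\nSummary:\n{summary}\n\n"
    "Does the summary contain any information that is not supported by the article? "
    "Answer 'hallucinated' or 'not hallucinated'."
)

REASONING = (
    "Article:\n{article}\n\nSummary:\n{summary}\n\n"
    "Think step-by-step about whether every claim in the summary is supported by the article, "
    "then answer 'hallucinated' or 'not hallucinated'."
)

SENTENCE_LEVEL = (
    "Article:\n{article}\n\nSummary:\n{summary}\n\n"
    "Analyze the response sentence by sentence. For each sentence, decide whether it is fully "
    "supported by the article. If any sentence is not supported, answer 'hallucinated'; "
    "otherwise answer 'not hallucinated'."
)

def judge(client, article: str, summary: str, template: str, model: str = "judge-model") -> str:
    """Run one judge variant and return its verdict ('judge-model' is a placeholder name)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": template.format(article=article, summary=summary)}],
    )
    return resp.choices[0].message.content.strip().lower()
```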

Here are the results:

| Method | R1 agreement rate with human annotations | V3 agreement rate with human annotations |
|---|---|---|
| HHEM | 88% | 70% |
| Vanilla LLM-as-a-judge | 14% | 63% |
| Reasoning LLM-as-a-judge | 17% | 67% |
| Sentence-level reasoning LLM-as-a-judge | 78% | 62% |

Table 4: HHEM vs. LLM-as-a-judge: agreement with human annotations.

The table clearly shows that HHEM has the highest agreement with human annotators on both R1-generated summaries (which contain more benign hallucinations) and V3-generated summaries. The vanilla and reasoning LLM-as-a-judge variants seem unable to detect most of the benign hallucinations. The sentence-level reasoning LLM-as-a-judge helps with benign hallucination detection, but it can sometimes be too harsh, causing a drop in detection accuracy on V3-generated summaries.
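
Here, agreement rate can be read as the fraction of the 50 validation samples on which a detector’s hallucinated / not-hallucinated call matches the human annotation; a minimal sketch:

```python
# Agreement rate of a detector (HHEM or an LLM judge) with the human annotations.
def agreement_rate(detector_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of samples where the detector's hallucinated/not-hallucinated call
    matches the human annotation."""
    matches = sum(d == h for d, h in zip(detector_labels, human_labels))
    return matches / len(human_labels)
```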

But is HHEM truly detecting the benign hallucinations, or is it just by chance? To further investigate this, we visualize token importance in HHEM’s predictions using the three examples from the previous section.

In the screenshots below, tokens are highlighted in bold, with the opacity of the highlight reflecting that token’s importance, i.e., its contribution to the overall hallucination score. The importance score is derived from the gradient norms of HHEM’s prediction score with respect to the input token embeddings.

Image 1: hallucination annotation for Example 1
Image 2: hallucination annotation for Example 2
Image 3: hallucination annotation for Example 3

As you can see from the screenshots above, HHEM captures all of the benign hallucinations correctly!

For instance, in Example 1, the article says “Rio (2011 film). A sequel, Rio 2, was released on April 11, 2014,” and does not mention that Rio is an animated film, which is the main reason this is classified as a benign hallucination. The annotation in Image 1 clearly highlights the token “animated” in the deepest shade of red.
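
If you want to produce a similar visualization yourself, here is a minimal sketch of gradient-norm token importance for a generic Hugging Face sequence-classification model. It illustrates the technique only; the checkpoint name is a placeholder, and HHEM itself has its own loading code and input format.

```python
# Sketch of gradient-norm token importance for an encoder-style consistency classifier.
# Illustrative only: the model name below is a placeholder, not HHEM's actual checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "your-consistency-classifier"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def token_importance(article: str, summary: str):
    """Return (tokens, per-token gradient norms) for the model's prediction score."""
    enc = tokenizer(article, summary, return_tensors="pt", truncation=True)
    # Look up the input embeddings explicitly so we can differentiate with respect to them.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    score = out.logits.view(-1)[-1]  # assumption: the last logit is the consistency score
    score.backward()
    # Importance of a token = L2 norm of the gradient of the score w.r.t. its embedding.
    importances = embeds.grad.norm(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return tokens, importances
```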

Summary

So, what have we learned?

Reasoning doesn't seem to be the main driver behind DeepSeek R1's higher hallucination rate. Instead, R1 appears to “overhelp,” adding information that’s not in the text, even if it’s factually correct.

Why does R1 do that?

We can only hypothesize. It might be due to the specific training protocol used to create R1. What is important to realize is that R1 does demonstrate high hallucination rates in RAG settings, and thus may not be the best LLM to use for RAG until these issues are addressed.

The other thing we learned is that HHEM is particularly good at catching these “benign” hallucinations, which seem to be the main culprit behind R1’s hallucination spike, whereas LLM-as-a-judge often misses them. This provides yet another good reason to use HHEM over LLM-as-a-judge.

As always, we’d love to hear your feedback! Connect with us on our forums or on our Discord.

To experience the full strength of Vectara’s Responsible Enterprise RAG-as-a-service platform, with commercial-strength hallucination detection in every query, sign up for our free trial and experience first-hand the benefits of Vectara’s enterprise RAG platform.
