
DeepSeek-R1 hallucinates more than DeepSeek-V3

DeepSeek has shaken the AI industry with its release of Deepseek-R1, but it turns out that Deepseek-R1 has a hallucination problem.


Introduction

On January 20th, 2025, DeepSeek released their reasoning model, DeepSeek-R1. It immediately drew widespread attention, since it showed incredible reasoning results, similar to those previously available only with OpenAI's o1 model.

According to DeepSeek, the model was created with a total investment of only $5.5 million (although that figure is highly debated online); more importantly, the model is much cheaper to run, at roughly 25x lower cost than OpenAI's o1, and is open-sourced by DeepSeek under an MIT license.

We wanted to measure DeepSeek-R1's hallucination rate as part of our work to add it to Vectara's HHEM leaderboard.

The results were surprising: DeepSeek-R1's hallucination rate was 14.3%, much higher than that of its non-reasoning predecessor, DeepSeek-V3.

Measuring the hallucination rate for DeepSeek-R1

As is our standard protocol for any model on the HHEM leaderboard, we generated summaries from the source articles in our hallucination leaderboard dataset using DeepSeek-R1 and DeepSeek-V3 respectively. We then scored the generated summaries with two methods that measure how well the content of each summary is supported by its corresponding source passage:

  1. Vectara’s HHEM (a dedicated discriminative model for catching hallucinations)
  2. The strategy from Google’s FACTS work, which uses the average from three LLMs (GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro) as judges
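On the HHEM side, turning per-summary consistency scores into a leaderboard-style hallucination rate is essentially a thresholding step. The sketch below is illustrative only: the `scores` values are made up, and the 0.5 cutoff is an assumption for demonstration, not necessarily the leaderboard's exact setting.

```python
def hallucination_rate(scores, threshold=0.5):
    """Fraction of summaries judged unfaithful to their source passage.

    HHEM scores range from 0 (fully unfaithful) to 1 (fully consistent),
    so a score below the threshold counts as a hallucination.
    """
    return sum(1 for s in scores if s < threshold) / len(scores)

# Made-up consistency scores for five summaries:
scores = [0.95, 0.88, 0.31, 0.97, 0.42]
print(f"{hallucination_rate(scores):.0%}")  # 2 of 5 fall below 0.5 -> prints "40%"
```

The reported rate is therefore sensitive to where the cutoff sits, which matters later when we look at borderline scores.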

The results are shown in Table 1 below.

| | DeepSeek-R1 | DeepSeek-V3 |
|---|---|---|
| Vectara's HHEM 2.1 | 14.3% | 3.9% |
| Google's FACTS w/ GPT-4o & Claude-3.5-Sonnet | 4.37% | 2.99% |
| Google's FACTS w/ GPT-4o & Gemini-1.5-Pro | 3.09% | 1.99% |
| Google's FACTS w/ Claude-3.5-Sonnet & Gemini-1.5-Pro | 3.89% | 2.69% |

Table 1: Hallucination rates of DeepSeek R1 and V3 by various hallucination judgment approaches. Lower hallucination rates are better.

Hence our surprise: consistently across all judgment approaches, DeepSeek-R1 hallucinates at significantly higher rates than DeepSeek-V3.

The 4x jump in hallucination rate per HHEM 2.1

Per HHEM 2.1, the hallucination rate of R1 is about 4x that of V3. This surprised us, so we looked into the predictions from HHEM 2.1.

Table 2 below shows the mean, median, and standard deviation of HHEM scores for summaries generated by Deepseek-R1 vs. Deepseek-V3. When interpreting the results, note that HHEM scores range from 0 (fully unfaithful) to 1 (fully consistent).

| | DeepSeek-R1 | DeepSeek-V3 |
|---|---|---|
| Mean | 0.82 | 0.92 |
| Median | 0.91 | 0.93 |
| Standard Deviation | 0.23 | 0.06 |

Table 2: Statistics of HHEM scores of DeepSeek R1 and V3. Higher HHEM scores are better.
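The rows of Table 2 are ordinary summary statistics over the per-summary HHEM scores, and can be reproduced with the Python standard library (the scores below are made-up placeholders, not the actual leaderboard data):

```python
from statistics import mean, median, stdev

# Placeholder HHEM scores for illustration only.
scores = [0.97, 0.94, 0.91, 0.88, 0.62]

print(f"Mean:   {mean(scores):.2f}")
print(f"Median: {median(scores):.2f}")
print(f"Stdev:  {stdev(scores):.2f}")
```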


Looking at Table 2 above, our observations are:

  1. The means and medians show that R1 does hallucinate more than V3.
  2. However, the means and medians also show that both R1 and V3 are consistent for most samples.
  3. In particular, the medians of R1 and V3 are nearly the same.
  4. However, the HHEM scores for R1 are much more dispersed than those for V3, with a standard deviation nearly 4x higher.

Based on the nearly-equal median, but much lower mean and higher standard deviation on R1 than V3, we hypothesize that R1 produces many more samples that are borderline hallucinated than V3 and hence HHEM shows a 4x increase in hallucination rate.
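This hypothesis is easy to illustrate with synthetic numbers. In the toy example below (scores invented for demonstration, not real HHEM output), two sets of summaries share almost the same median, but the second has a fatter left tail of borderline-low scores, which drags down the mean, inflates the standard deviation, and quadruples the thresholded hallucination rate:

```python
from statistics import mean, median, stdev

# Synthetic score distributions (illustration only):
v3_like = [0.93, 0.92, 0.94, 0.91, 0.93, 0.92, 0.90, 0.94, 0.93, 0.45]  # tight cluster
r1_like = [0.93, 0.92, 0.94, 0.91, 0.93, 0.92, 0.30, 0.40, 0.35, 0.45]  # fatter left tail

for name, scores in [("V3-like", v3_like), ("R1-like", r1_like)]:
    rate = sum(s < 0.5 for s in scores) / len(scores)  # 0.5 cutoff, chosen for illustration
    print(f"{name}: median={median(scores):.2f} mean={mean(scores):.2f} "
          f"stdev={stdev(scores):.2f} hallucination_rate={rate:.0%}")
```

The medians barely move, yet the thresholded rate goes from 10% to 40%, mirroring the V3-to-R1 pattern in Tables 1 and 2.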

Is hallucination the cost of reasoning?

The observations above may suggest that a reasoning-enhanced LLM hallucinates more than its generic counterpart. We tried to verify this on the GPT series by comparing GPT-o1 (reasoning-enhanced) with GPT-4o (generic).

| | GPT-o1 | GPT-4o |
|---|---|---|
| Vectara's HHEM 2.1 | 2.4% | 1.5% |
| Google's FACTS w/ GPT-4o & Claude-3.5-Sonnet | 1.00% | 1.49% |
| Google's FACTS w/ GPT-4o & Gemini-1.5-Pro | 0.90% | 1.39% |
| Google's FACTS w/ Claude-3.5-Sonnet & Gemini-1.5-Pro | 1.39% | 1.89% |

Table 3: Hallucination rates of GPT-o1 and 4o by various hallucination judgment approaches. Lower hallucination rates are better.

According to Table 3, while HHEM 2.1 shows that the reasoning LLM (GPT-o1) has a higher hallucination rate than the generic LLM (GPT-4o), FACTS shows otherwise. The conclusions by FACTS for the GPT series differ from those for the DeepSeek series.

| | GPT-o1 | GPT-4o |
|---|---|---|
| Mean | 0.89 | 0.90 |
| Median | 0.93 | 0.94 |
| Standard Deviation | 0.12 | 0.11 |

Table 4: Statistics of HHEM scores of GPT-o1 and 4o. Higher HHEM scores are better.


According to Table 4, consistent with the DeepSeek series, the GPT series also shows a lower mean and median and higher standard deviation of HHEM scores for its reasoning LLM (GPT-o1 here) than its generic LLM (GPT-4o).

Therefore, we feel it is still too early to draw conclusions about any degradation of reasoning-enhanced LLMs in terms of hallucination rates.

But it does seem that the GPT series sacrifices less faithfulness for reasoning ability than the DeepSeek series. In other words, with more careful training, the DeepSeek team could perhaps have avoided this degradation, or at least reduced its current extent.

Is HHEM better than LLM-as-a-judge?

HHEM may be better at catching hallucinations than LLM-as-a-judge approaches such as FACTS. In Tables 1 and 2, when HHEM detects a large increase in hallucination from V3 to R1 (a 4x jump in hallucination rate and a dip in mean HHEM score), FACTS picks up the same trend. But in Tables 3 and 4, when HHEM detects only a slight increase in hallucination rate from 4o to o1, FACTS does not yet register the shift.

The question needs further investigation and Vectara’s ML group is working hard on that. Stay tuned!

We are excited about the promise of reasoning models, and to see what 2025 brings in terms of further advances in this space. At the same time, if you are building a RAG or agentic RAG solution, paying attention to the hallucination rate remains critical when selecting your LLM.

As always, we’d love to hear your feedback! Connect with us on our forums or on our Discord. If you’d like to see what Vectara can offer you for retrieval augmented generation on your application or website, sign up for an account!
