
Reducing Hallucinations in LLMs

Advanced research techniques to reduce hallucinations


Hallucination is one of the major problems in utilizing Large Language Models (LLMs) in applications. It refers to the phenomenon in which an LLM makes things up in its responses or generates non-factual content. For example, suppose you ask an LLM a question:

  • LLM Input: What is the capital of Washington state?
  • LLM Output: The capital city of Washington state is Seattle.

The LLM fails to give the correct answer: Olympia. In this case, the LLM is answering from its parametric knowledge, as captured from its training set. A likely cause is that the knowledge embedded in the LLM’s parameters is outdated or was incorrect in the first place.

A solution to this issue is Retrieval-Augmented Generation, also known as RAG. With RAG, you hook up an external knowledge base to the LLM: when a query is issued, relevant search results are retrieved from the knowledge base and provided to the LLM as context. For example, asking for the capital of Washington state with RAG looks like this:

LLM Input: 

Context: Washington, officially the State of Washington, is the westernmost state in the Pacific Northwest region of the United States. Olympia is the state capital, and the most populous city is Seattle.

Question: What is the capital of Washington state?

LLM Output:

The capital city of Washington state is Olympia.

In this case, the LLM finds the answer in the context provided without relying on its own parametric knowledge.
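Assembling a RAG prompt like the one above is straightforward string templating. The sketch below is illustrative only — the function name and prompt wording are assumptions, not Vectara's actual API, and a real system would obtain the passages from a retriever rather than a hard-coded list:

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble a RAG prompt: retrieved passages first, then the question."""
    context = "\n\n".join(passages)
    return f"Context: {context}\n\nQuestion: {question}"

# In a real pipeline these passages come from a retrieval step.
passages = [
    "Washington, officially the State of Washington, is the westernmost "
    "state in the Pacific Northwest region of the United States. Olympia "
    "is the state capital, and the most populous city is Seattle."
]
prompt = build_rag_prompt("What is the capital of Washington state?", passages)
```

Because the retrieved passage contains the fact "Olympia is the state capital", the LLM can answer from the context instead of its parametric knowledge.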

So yes, RAG is an excellent way to reduce hallucination. But does RAG solve the hallucination problem altogether?

Well – not always. It depends on whether the LLM can find the facts needed to answer the question in the context provided, and also on how good the LLM is at summarization.

The Hughes Hallucination Evaluation Model (HHEM) Leaderboard evaluates the factual consistency of LLMs when generating a summary of a source document. Even GPT-4-Turbo, the top-ranked LLM on the leaderboard as of this writing, has a hallucination rate of 2.5%.

Reducing hallucinations continues to be an area of active research in the LLM community, and in this blog, we share three methods for reducing LLM hallucinations:

  • Decoding strategy
  • Factuality alignment
  • Post Editing

Using Mistral-7B-Instruct-v0.1 (shortened to Mistral-7B later in the blog) with greedy decoding as the baseline, we explore how to reduce the hallucination rate using these methods. The code to reproduce the experiments is publicly available here.

Decoding Strategy

LLMs take a sequence of tokens (words or parts of words) as input. The model predicts the next token in the sequence based on the tokens it has seen so far; this process is called autoregressive generation. At each generation step, the model outputs a probability distribution over all possible tokens for the next position, and the next token is decided using a decoding strategy.

This process repeats, generating one token at a time, until the EOS (End-of-Sequence) token is generated or the token limit is reached.

Greedy decoding is a simple yet popular decoding strategy. At each generation step, the model selects the most likely next token from the predicted probability distribution. The greedy method optimizes each decision locally, based only on current information, without considering future consequences. It can therefore produce suboptimal (less factual) sequences, especially when the highest-probability token at each step does not lead to the overall best (more factual) sequence.
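The autoregressive loop with greedy selection can be sketched in a few lines. The `next_token_logits` function below is a made-up stand-in for a real model's forward pass, just to make the mechanics concrete:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy "model": maps the current sequence to logits over a 4-token
# vocabulary. Token 3 plays the role of the EOS token.
def next_token_logits(seq):
    table = {
        (0,):      [0.1, 2.0, 0.5, 0.0],  # after token 0, token 1 is most likely
        (0, 1):    [0.0, 0.2, 1.5, 0.3],  # then token 2
        (0, 1, 2): [0.0, 0.1, 0.2, 3.0],  # then EOS
    }
    return table[tuple(seq)]

def greedy_decode(prompt, max_new_tokens=10, eos_id=3):
    seq = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(next_token_logits(seq))
        tok = probs.index(max(probs))  # pick the locally most likely token
        seq.append(tok)
        if tok == eos_id:
            break
    return seq

print(greedy_decode([0]))  # [0, 1, 2, 3]
```

Note that each `argmax` decision is final: once a token is appended, no later step can reconsider it, which is exactly the limitation beam search addresses.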

Beam search is a decoding strategy designed to improve on the limitations of greedy decoding by considering multiple candidate sequences simultaneously. At each generation step, each candidate sequence is extended with its most probable next tokens, and the num_beams sequences with the highest cumulative probabilities are kept and expanded further. Depending on the num_beams chosen, beam search can produce higher-quality sequences than greedy decoding, at the cost of increased computational complexity.
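A minimal beam search sketch, again over a made-up toy model (not a real LLM). The toy logits are constructed so that the locally best first token is *not* on the globally best path, which is precisely the case where beam search beats greedy decoding:

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def beam_search(next_token_logits, prompt, num_beams=2, steps=2):
    # Each beam is (sequence, cumulative log-probability).
    beams = [(list(prompt), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            logps = log_softmax(next_token_logits(seq))
            for tok, lp in enumerate(logps):
                candidates.append((seq + [tok], score + lp))
        # Keep only the num_beams highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0][0]

# Toy logits: greedy would pick token 1 first (1.0 > 0.9), but the
# continuation after token 2 is far stronger, so the best overall
# sequence starts with token 2.
def toy_logits(seq):
    if seq == [0]:    return [0.0, 1.0, 0.9]
    if seq == [0, 1]: return [0.0, 0.0, 0.0]  # flat: no good continuation
    if seq == [0, 2]: return [5.0, 0.0, 0.0]  # strong continuation
    return [0.0, 0.0, 0.0]

print(beam_search(toy_logits, [0]))  # [0, 2, 0]
```

With `num_beams=2`, the beam that starts with token 2 survives the first step and wins overall, while greedy decoding would have committed to token 1 and never recovered.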

You can find more information on beam search and greedy decoding here.

DoLa is a recently developed contrastive decoding technique by Chuang, Yung-Sung, et al. that emphasizes the logits of informative entity tokens at inference time. The authors observed that the logits of easy function tokens, such as “the”, “a”, or “or”, stop changing in the middle layers of an LLM, while the logits of informative tokens representing entities keep changing in the later layers. To emphasize the output logits for informative tokens, DoLa contrasts the final layer’s log probabilities with those of an earlier layer, which increases the logits of tokens representing factual knowledge. DoLa can also be stacked with other decoding strategies.

In our experiments, we used the HHEM to compare the hallucination rates of Greedy decoding (as a baseline) to beam search (with num_beams=10) and to DoLa+Greedy, as you can see below:

| Decoding strategy            | Hallucination Rate (%) |
|------------------------------|------------------------|
| Greedy (baseline)            |                        |
| Beam search (num_beams = 10) |                        |
| DoLa + Greedy                |                        |

Table 1: Hallucination rates with beam search or DoLa vs baseline.

Factuality Alignment

Alignment in an LLM refers to the goal of ensuring that the behavior of an LLM is aligned with human intentions, values, and ethical principles such as safety, honesty, and usefulness. 

The Direct Preference Optimization (DPO) algorithm is a recent technique for alignment that fine-tunes an LLM on a pairwise preference dataset, where each sample has a prompt, a more favored response, and a less favored response. DPO uses a binary cross-entropy-style loss to increase the likelihood of the favored response and lower the likelihood of the other response.
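The per-sample DPO loss can be written out directly. A minimal sketch, where the log-probabilities are made-up numbers standing in for the summed token log-probs of each response under the policy being fine-tuned and the frozen reference model:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Binary cross-entropy-style DPO loss for one preference pair.
    pi_* / ref_*: log-probs of the favored (w) and disfavored (l)
    responses under the policy and the frozen reference model."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# At the start of training, policy == reference, so the margin is 0
# and the loss is -log(0.5):
loss_before = dpo_loss(pi_w=-10.0, pi_l=-10.0, ref_w=-10.0, ref_l=-10.0)

# Raising the favored response's likelihood (and lowering the
# disfavored one's) relative to the reference lowers the loss:
loss_after = dpo_loss(pi_w=-8.0, pi_l=-12.0, ref_w=-10.0, ref_l=-10.0)
```

Minimizing this loss pushes the policy to widen the gap between the favored and disfavored responses, while the `beta`-scaled comparison against the reference model keeps it from drifting too far from the original LLM.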

Tian, Katherine, et al. propose using DPO to fine-tune language models for factuality; let’s see how that works.

The paper provides methods to generate the preference data automatically, for biography generation and medical QA, by checking the factuality of responses against external references or via model confidence. In our experiment, we adopt a similar setting, but for the summarization task.

The first step is to prepare the responses. We ask the Mistral-7B model to generate multiple summaries of source documents from the CNN/DailyMail, XSum/BBC, and VitaminC datasets. Using temperature=1.0, we have the LLM generate n=6 diverse responses for each source document.

The next step is to construct the preference data by ranking the model responses by their factuality. We employ the HHEM model to score the responses, then select responses with an HHEM score > 0.8 as the more favored responses and responses with an HHEM score < 0.5 as the less favored responses. The generated preference data looks like this:
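The selection step above is a simple filter over (response, score) pairs. A sketch, where the HHEM scores are made-up numbers standing in for real scores from the HHEM model:

```python
def build_preference_pairs(source, scored_responses,
                           favored_min=0.8, disfavored_max=0.5):
    """Pair high-HHEM (factually consistent) summaries with low-HHEM
    ones to form DPO preference samples for one source document."""
    favored = [r for r, score in scored_responses if score > favored_min]
    disfavored = [r for r, score in scored_responses if score < disfavored_max]
    return [{"prompt": source, "chosen": c, "rejected": r}
            for c in favored for r in disfavored]

# Made-up HHEM scores for three of the n=6 sampled summaries:
scored = [("summary A", 0.95), ("summary B", 0.62), ("summary C", 0.31)]
pairs = build_preference_pairs("source document text", scored)
```

Responses in the middle band (here, summary B at 0.62) are discarded, so each pair contrasts a clearly consistent summary with a clearly inconsistent one.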

Factuality alignment data


Using the data constructed, we fine-tune the Mistral-7B model using DPO. The results are shown in table 2 below:

| Method            | Hallucination Rate (%) |
|-------------------|------------------------|
| Greedy (baseline) |                        |
| DPO + Greedy      |                        |

Table 2: Hallucination rate with factuality alignment vs baseline.

Post Editing

When humans write text, it is quite common to produce a first draft followed by multiple rounds of revision until we reach a final version we are happy with.

Post editing applies the same idea to LLMs: an initial response is revised by the same or another LLM. To train a post-editing model that corrects the factual errors in an LLM’s response, you need data mapping factually inconsistent responses to consistent ones.

One recent work using this approach is FAVA by Mishra, Abhika, et al., which synthesizes the training data automatically by prompting GPT-4 to insert errors into factual text. The authors then fine-tune a Llama2-7B model on the synthetic data. The FAVA model takes reference texts and an LLM’s response as input and outputs the edited response. Here is an example:

FAVA Input:

Read the following references:

Banff National Park is Canada’s oldest national park, established in 1885 as Rocky Mountains Park. Located in Alberta’s Rocky Mountains, 110–180 kilometers (68–112 mi) west of Calgary, Banff encompasses 6,641 square kilometers (2,564 sq mi) of mountainous terrain.

Please identify all the errors in the following text using the information in the references provided and suggest edits if necessary:

[Text] Canada’s oldest national park, Banff, was established in 1886. It recently won a Nature’s Choice 2023 award for its beautiful mountainous terrain. It’s the best national park ever.


FAVA Output:

Edited: Canada’s oldest national park, Banff, was established in <entity><mark>1885</mark><delete>1886</delete></entity>. <invented><delete>It recently won a Nature’s Choice 2023 award for its beautiful mountainous terrain.</delete></invented> <subjective>It’s the best national park ever.</subjective>
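Assembling a FAVA input like the one above is plain string templating; the wording below is taken from the example, and the function name is our own:

```python
def build_fava_input(references: list[str], text: str) -> str:
    """Assemble the FAVA input: reference passages, then the text to edit."""
    refs = "\n\n".join(references)
    return (
        "Read the following references:\n\n"
        f"{refs}\n\n"
        "Please identify all the errors in the following text using the "
        "information in the references provided and suggest edits if "
        "necessary:\n\n"
        f"[Text] {text}"
    )

prompt = build_fava_input(
    ["Banff National Park is Canada’s oldest national park, "
     "established in 1885 as Rocky Mountains Park."],
    "Canada’s oldest national park, Banff, was established in 1886.",
)
```

The post-editing model then returns the text with markup such as `<entity>`, `<invented>`, and `<subjective>` tags identifying each error, as in the example output above.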

We experimented with FAVA-based post-editing on the greedy decoding output of the Mistral-7B model, with the results shown in Table 3:

| Method                 | Hallucination Rate (%) |
|------------------------|------------------------|
| Greedy (baseline)      |                        |
| Greedy + FAVA (Greedy) |                        |

Table 3: Hallucination rate with FAVA post-editing vs baseline.


We have seen four different approaches proposed by the research community to reduce hallucinations in LLMs: beam search, DoLa, DPO for factuality, and post editing.

It’s encouraging to see that all these methods provide some improvement over the baseline; at the same time, it’s important to understand some of their limitations:

  • DoLa requires changes in the inference code because it needs access to logits across different layers of the LLM.
  • DPO requires extra computation resources to fine-tune the model you want to deploy.
  • The extent to which hallucination is reduced, for both DPO and Post Editing, tends to be dependent on the dataset.
  • It is quite common to stream the output tokens of an LLM (i.e., send the output one token at a time). Unfortunately, beam search is not streaming-friendly because it tracks more than one candidate sequence during generation, and the best sequence is not determined until generation ends. Post editing is also problematic for streaming because it needs a second round of inference to refine the initial response.

| Method            | Code Change | Model Change | Data/Metric Dependent | Bad for Streaming |
|-------------------|-------------|--------------|-----------------------|-------------------|
| Beam Search       |             |              |                       | ✓                 |
| DoLa              | ✓           |              |                       |                   |
| DPO on Factuality |             | ✓            | ✓                     |                   |
| Post Editing      |             |              | ✓                     | ✓                 |

Table 4: The features of each method for hallucination reduction.

It’s also worth mentioning that some of these methods can be combined together to obtain a better effect in reducing the hallucination rate of an LLM. 

For more information about the experiments and code examples discussed in this blog, please visit the github repository.

Vectara’s RAG-as-a-service platform already goes a long way toward reducing hallucinations, and we are very excited to see more advanced research that helps reduce them further.

If you are interested in trying Vectara’s RAG platform, you can sign up here.

Recommended Content


Hallucination Mitigation Notebook

A Jupyter notebook with the details of all experiments in this blog about hallucination mitigation techniques.
