Correcting Hallucinations in Large Language Models
In this blog post, we share the results of our initial experiments aimed at correcting hallucinations generated by Large Language Models (LLMs). Our focus is on the open-book setting, which encompasses tasks such as summarization and Retrieval-Augmented Generation (RAG).
Overview
In the context of LLMs, hallucination refers to the phenomenon where the model makes up information when responding to a user’s prompt or question. While the exact causes of hallucinations remain unclear and are the subject of ongoing research, these occurrences can have significant real-world consequences, especially in enterprise applications and even more so with Agentic RAG.
As we mention in our hallucination detection blog post, one of the most effective methods for reducing hallucinations is grounding LLM responses in a set of provided documents (also called references), in other words, augmenting generation with retrieval, aka RAG. However, even RAG is not a catch-all solution to the hallucination problem: as Vectara’s Hallucination Leaderboard shows, modern LLMs hallucinate anywhere from 1% to nearly 30% of the time even when generating outputs based on reference sources, a setting known as open-book generation.
Vectara has historically provided the Hughes Hallucination Evaluation Model (HHEM) as a detector of hallucinations for the open book generation use case. Today, we are excited to share our early findings on the next frontier: hallucination correction.
The Hallucination Correction Model
We have been exploring a specific technique for hallucination correction called “post-editing”, in which the model corrects hallucinations after a summary has been generated.
Here’s how it works. The Hallucination Correction Model (aka HCM) receives the reference documents and the generated response to the user query, and generates a “corrected” response, as shown in Figure 1 below:
Note that we do not use the query as input to HCM. Furthermore, HCM does not generate a whole new response; instead, it attempts to fix only the parts of the original response that are hallucinated, leaving the other parts and the overall response structure intact.
Because our model functions as a post-editing tool, it can be added on as a post-processing stage to any existing RAG (or other open-book generation) pipeline without requiring modifications to other components.
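To make the integration concrete, here is a minimal sketch of where such a post-editing stage sits in a pipeline. The function and parameter names are illustrative placeholders standing in for whatever retrieval, generation, and correction components an existing pipeline already uses, not an actual Vectara API.

```python
from typing import Callable, List

def rag_with_correction(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    correct: Callable[[List[str], str], str],
) -> str:
    """Run an existing RAG pipeline, then post-edit the draft response
    with a hallucination-correction step (hypothetical wiring)."""
    references = retrieve(query)           # existing retrieval stage
    draft = generate(query, references)    # existing generation stage
    # Post-editing stage: the corrector sees only the references and the
    # draft response (not the query) and returns a minimally edited
    # version with unsupported statements fixed.
    return correct(references, draft)
```

Because the correction step only consumes the references and the draft, it can be bolted onto the end of any open-book generation pipeline without touching the retrieval or generation components.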
Hallucination Correction Model Performance
We report the performance of HCM on a few different datasets and quantify how well it’s able to correct hallucinations from several different open-weight and commercially available LLMs.
We also report a brief error analysis and some interesting failure modes of the model.
HHEM Leaderboard Benchmark
Vectara maintains a hallucination evaluation leaderboard, which serves as our first data source for evaluating HCM. We select a diverse range of models from the leaderboard and run their generated summaries through HCM to correct any hallucinations, as described above. To assess the effectiveness of our corrections, we use HHEM (specifically HHEM-2.1) to measure the Factuality Rate (FR), defined as the complement of the hallucination rate, i.e., factuality_rate = 1.0 – hallucination_rate, both before and after the corrections. The results of this evaluation are presented in Figure 2.
As shown in the plot, the summaries corrected by HCM demonstrate a significant improvement in FR. The enhancements are especially notable for models that initially had low FR. Even for models with higher original FR, HCM still provides a boost in scores.
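As a concrete illustration of the metric, the sketch below computes FR from a batch of HHEM consistency scores. The 0.5 decision threshold and the score values are assumptions for illustration only, not the leaderboard’s actual numbers.

```python
# Factuality rate from HHEM-style consistency scores in [0, 1].
# Scores and threshold below are illustrative, not real leaderboard data.

def factuality_rate(hhem_scores, threshold=0.5):
    """Fraction of summaries judged consistent with their references."""
    consistent = sum(1 for s in hhem_scores if s >= threshold)
    return consistent / len(hhem_scores)

scores_before = [0.91, 0.32, 0.78, 0.45, 0.88]  # original summaries
scores_after  = [0.93, 0.81, 0.79, 0.72, 0.90]  # HCM-corrected summaries

fr_before = factuality_rate(scores_before)  # 0.6
fr_after = factuality_rate(scores_after)    # 1.0
print(f"hallucination rate before: {1.0 - fr_before:.2f}")
print(f"hallucination rate after:  {1.0 - fr_after:.2f}")
```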
To measure whether our model introduces unnecessary alterations to the original response during the correction process, we calculate both the ROUGE score and BERTScore between the original and corrected responses. These metrics are a measure of lexical and semantic similarity respectively and range from 0 to 1, where a score of 1 indicates a perfect match, meaning HCM made no changes, and a score of 0 indicates a complete mismatch.
For our purposes, we aim for high scores, since we want HCM to minimize changes to the original summaries; our assumption is that hallucinations make up only a small portion of the original summary. Figure 3 illustrates the ROUGE and BERTScore on the HHEM Leaderboard for different models.
We see that ROUGE and BERTScore are both high for most models, which confirms that HCM leaves the majority of each summary intact.
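For readers who want to run the same kind of similarity check on their own outputs, here is a small sketch using the rouge-score and bert-score packages. The example summaries are made up, and ROUGE-L is assumed since the post does not specify the exact ROUGE variant used.

```python
# pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

original = "The bridge opened in 1937 and spans 2.7 kilometers."
corrected = "The bridge opened in 1937 and spans 1.7 kilometers."

# Lexical similarity (0 to 1): an unchanged summary scores 1.0.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(original, corrected)["rougeL"].fmeasure

# Semantic similarity (roughly 0 to 1): small, targeted edits stay high.
_, _, f1 = bert_score([corrected], [original], lang="en")

print(f"ROUGE-L F1: {rouge_l:.3f}, BERTScore F1: {f1.item():.3f}")
```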
FAVABENCH and NonFactS Datasets Benchmark
We also compare our model’s performance on two other publicly available datasets, FAVABENCH and NonFactS. One point to note here is that the hallucinations in these datasets are not natural hallucinations produced by LLMs; instead, they are injected using artificial techniques. For example, FAVABENCH prompts ChatGPT to insert errors from different taxonomies, and NonFactS uses a fine-tuned BART-base model to inject hallucinations by passing it incomplete information during inference. Therefore, these hallucinations do not completely match the distribution of real-world hallucinations.
We asked HCM to correct the hallucinated summaries in these datasets and the FR of the corrected answers are plotted in Figure 4.
As we can see, HCM shows strong capability in correcting nonfactual information.
Figure 5 plots the ROUGE and BERTScore values between original and corrected summaries for the FAVABENCH and NonFactS datasets. As previously noted, the hallucinations in these datasets are more “artificial” and therefore easier to detect, which allows HCM to make more substantial changes to the original responses. Consequently, this results in lower ROUGE and BERTScore values, as we can see in Figure 5.
RAGTruth Dataset
We also evaluate HCM’s performance on RAGTruth, a publicly available, human-annotated hallucination dataset. RAGTruth is arguably the hardest of these datasets on which to detect and correct hallucinations.
Unlike datasets such as FAVABENCH and NonFactS, which rely on synthetic techniques to inject hallucinations, RAGTruth contains hallucinations that are naturally generated by LLMs; they therefore tend to be subtle and more representative of real-world hallucinations. Hallucination correction techniques that improve performance on this benchmark should thus indicate generalizable FR gains.
The performance comparison with and without our HCM is plotted in Figure 6. ROUGE and BERTScore scores for individual models in the dataset are plotted in Figure 7.
For this analysis, we utilized the entire test split of the RAGTruth dataset for the summarization task.
It is clear from Figures 6 and 7 that there is a notable improvement in the HHEM factuality rate of post-edited responses originally generated by Mistral-7B-Instruct and Llama-2-Chat (7B, 13B, and 70B).
GPT models, which typically exhibit a high factuality rate by default, continued to demonstrate strong performance with the application of HCM, indicating that the model lets factual summaries pass through unchanged.
Analysis
In this section we show some examples where HCM does not succeed in correcting hallucinations, and try to analyze the performance in different scenarios.
One particular scenario where our model fails to achieve a high FR (see Figure 2) is when correcting answers generated by the Falcon-7B-Instruct model on the HHEM Leaderboard. Manual analysis reveals that Falcon-7B-Instruct often deviates from the instructions it is given and draws on its own knowledge to add information while generating answers. The extra information is not necessarily hallucinated per se, and we usually find it to be grounded in the real world, but it cannot be directly inferred from the provided documents.
From the perspective of retrieval augmented generation, where you want your answers to only depend on the supporting documents, this is a form of hallucination. Table 1 below shows a few examples of such cases.
Table 1: Analysis of Falcon-7B-Instruct on HHEM Leaderboard.
As is clearly evident, Falcon-7B-Instruct often generates information that is not directly supported by the provided documents. These examples turn out to be problematic because HCM aims to make the minimal changes needed to fix non-factual pieces of information while maintaining the overall structure of the original answer. As a result, the model is not accustomed to making larger changes to summaries, such as deleting or adding large pieces of text, and thus fails to fully correct the outputs generated by Falcon-7B-Instruct.
Conclusion
In this blog, we presented our Hallucination Correction Model (HCM) and evaluated its performance on several public benchmarks as well as our HHEM leaderboard.
Our analysis revealed a significant improvement in the factuality rate of the generated summaries across all datasets and leading LLMs. Additionally, we examined some edge cases where our model encounters challenges and identified several shortcomings that we intend to address in future iterations.
Reducing hallucinations in LLMs, especially in enterprise RAG pipelines, remains an important area of research, and we consider HCM a substantial step towards that goal.
As we work to further improve our offering in this field, we are excited to share these initial results with the community.
Appendix
We randomly selected ~100 examples with original and corrected summaries along with references from our evaluation datasets and have made them available on HuggingFace.