HHEM v2: A New and Improved Factual Consistency Scoring Model

Featuring multilinguality, unlimited context window, and calibration – The Hughes Hallucination Evaluation Model (HHEM) v2 is a major upgrade from v1

Note: The term “hallucination” has various definitions. In this blog article, we use it in the sense of factual consistency (a hallucination is output that is not factually consistent with its input) and use the two terms interchangeably hereafter.

Introduction

Hallucinations pose a big challenge in deploying GenAI, particularly for Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems. A hallucination occurs when an LLM generates text containing information not grounded in the provided input. Specifically, in RAG applications, this can lead to summaries including content that was not in the retrieved source documents. As such, detecting the occurrence of hallucinations is crucial in developing and deploying trustworthy RAG-based solutions.

However, measuring hallucinations is not easy. A common practice is to employ an “LLM judge”: an LLM is presented with two pieces of text and asked to determine whether the second is evidenced by the first. Unfortunately, in practice, LLM judges are not only slow and expensive but, in many cases, inaccurate as well.
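
To make the LLM-judge approach concrete, here is a minimal sketch of how such a judge might be invoked. The prompt wording and the judge_consistency helper are illustrative assumptions, not the prompts used by the papers or products discussed in this post.

```python
# A minimal sketch of an "LLM judge" for factual consistency.
# The prompt wording below is an assumption for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Source:\n{source}\n\n"
    "Summary:\n{summary}\n\n"
    "Is every claim in the summary supported by the source? Answer Yes or No."
)

def judge_consistency(source: str, summary: str, model: str = "gpt-4") -> bool:
    """Return True if the judge deems the summary factually consistent with the source."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

print(judge_consistency("I had a burger for lunch.",
                        "I had lunch. Apollo 11 Mission put astronauts on the moon."))
```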

In November 2023, Vectara open-sourced the Hughes Hallucination Evaluation Model (HHEM), which has received wide community endorsement, including, at the time of writing, over 100,000 downloads. Recently, we announced the release of the Factual Consistency Score, which is based on an improved version of HHEM and is accessible exclusively through the Vectara platform.

Hallucinations 

With RAG, you start with a text dataset, known as the “corpus”. This could be a set of PDF documents, website content, data from an enterprise application like Salesforce or JIRA, or any other meaningful text data. 

When issuing a query, results retrieved from that corpus are fed into an LLM, which then generates a summary to answer the query (the “G” in RAG). Therefore, even with RAG, some hallucinations are inevitable because LLMs hallucinate [1].

Hallucinations appear in many different forms, some subtle and others more obvious. Let’s look at a few examples.

The easiest type to spot occurs when the LLM output contains something completely unrelated to the input. The extraneous content may itself be true or fabricated. Consider this example:

  • LLM input: “I had a burger for lunch.”
  • LLM output: “I had lunch. Apollo 11 Mission put astronauts on the moon.” 

Here, “Apollo 11 Mission put astronauts on the moon” in the output is a hallucination, even though it is true; more precisely, it is factual but hallucinated. Pagnoni et al. from CMU term these hallucinations “out-of-article errors”, while Maynez et al. from Google refer to them as “extrinsic hallucinations”. Unfortunately, there is no industry or community consensus on hallucination taxonomy yet, because LLMs and the study of their hallucinations are still so new.

Another common type of hallucination is the “discourse error”, which occurs when unrelated pieces of information are “stitched” together. In the example below, taken from the SummaC dataset, the descriptions of goldfish and koi carp are mixed up. 

  • LLM input: Goldfish are being caught weighing up to 2kg and koi carp up to 8kg and one meter in length.
  • LLM output: Goldfish are being caught weighing up to 8kg and one meter in length. 

Yet another common type of hallucination occurs when an entity, a predicate (a verb or relation), or the referent of a pronoun is substituted with something else that may be obviously or subtly different from the original. This is illustrated by the example below:

  • LLM input: Jian-Yang says his ex-CFO is suing PipedPiper because he was forced out. 
  • LLM output: Jian-Yang is prosecuting Hooli after he was forced out.

This example exhibits hallucinations at the entity, predicate, and resolution levels. The entity name changed from “PipedPiper” to “Hooli”. The predicate changed from “suing” to “prosecuting”, which are subtly different. Finally, the reference of “he” changed from “his ex-CFO” to “Jian-Yang”. 

[1] See, for example, Calibrated Language Models Must Hallucinate (Kalai and Vempala, 2024), which discusses the link between the training objectives of an LLM and hallucinations.

Hallucination Detection

Hallucination detection is the task of determining whether the output of an LLM is supported by the input fed into it. There are generally two approaches: building a dedicated model/function or using an “LLM-as-a-judge.” The former is much cheaper to operate and has much lower latency than the latter, because its underlying model is usually far smaller than an LLM.

Due to the short history of LLMs, we really know very little about detecting hallucinations. It is not a well-solved problem and continues to be an active area of research. To illustrate, consider the conclusions of two well-known research papers in the field:

The AggreFact benchmark, co-developed by Salesforce Research and UT Austin in 2023, reports that the best hallucination detector it benchmarked achieves 70.2% accuracy on its SOTA subset, which covers summaries from three large generative models: T5, Pegasus, and BART. The AggreFact paper further found (Table 4) that a ChatGPT-based LLM judge underperforms the dedicated models by about 10 percentage points in accuracy.

The RAGTruth benchmark reports that no hallucination detector it benchmarked, including those based on GPT-4, can achieve both precision above 50% and recall above 50% at the same time (Table 5 in their paper) for summarization or question answering. In other words, no hallucination detector evaluated in the RAGTruth paper can satisfy both of the statements below (the snippet after this list expresses these conditions as precision and recall):

  • More than half of the detected hallucinations are actually hallucinations. 
  • More than half of actual hallucinations are caught. 
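
For readers who prefer formulas, here is a minimal sketch of the two metrics, where “positive” means “flagged as a hallucination”; the counts in the example are made up and are not taken from the RAGTruth paper.

```python
# Precision and recall for hallucination detection, matching the two statements above.
# "Positive" means the detector flagged a hallucination.

def precision(true_positives: int, false_positives: int) -> float:
    # Of everything flagged as a hallucination, how much really was one?
    return true_positives / (true_positives + false_positives)

def recall(true_positives: int, false_negatives: int) -> float:
    # Of all real hallucinations, how many did the detector catch?
    return true_positives / (true_positives + false_negatives)

# Made-up counts: 40 hallucinations caught, 10 false alarms, 60 hallucinations missed.
print(precision(40, 10))  # 0.8 -> the first statement holds
print(recall(40, 60))     # 0.4 -> the second statement fails
```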

Debuted on December 31, 2023, the RAGTruth benchmark was co-developed by Newsbreak and UIUC and covers RAG summaries generated by GPT-3.5, GPT-4, Llama-2 (7B, 13B, and 70B), and Mistral 7B.

In addition, the paper On the Intractability to Synthesize Factual Inconsistencies in Summarization, which debuted at EACL 2024, reports that the best hallucination detector it studied achieves an accuracy of 75.8% on the SummaC benchmark, another widely used hallucination benchmark that substantially overlaps with AggreFact.

Vectara’s Factual Consistency Score

Vectara released the open-source HHEM in November 2023. Given a pair of texts, HHEM yields a score between 0 and 1. The higher the score, the more likely it is that the LLM output is factually consistent with its input. 
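
As an illustration, here is a minimal sketch of scoring (LLM input, LLM output) pairs with the open-source checkpoint on Hugging Face. It assumes the sentence-transformers CrossEncoder interface shown on the original model card; the model card remains the source of truth for the currently recommended usage.

```python
# A minimal sketch of scoring (LLM input, LLM output) pairs with the open-source HHEM.
# Assumes the sentence-transformers CrossEncoder interface from the original model card.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

pairs = [
    # Examples from earlier in this post.
    ["I had a burger for lunch.",
     "I had lunch. Apollo 11 Mission put astronauts on the moon."],
    ["Goldfish are being caught weighing up to 2kg and koi carp up to 8kg and one meter in length.",
     "Goldfish are being caught weighing up to 8kg and one meter in length."],
]

# One score in [0, 1] per pair; higher means more likely factually consistent.
scores = model.predict(pairs)
print(scores)
```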

Since its release, HHEM has been receiving over 40,000 downloads per month, a sign of the industry’s interest in this topic. We have also developed an internal version of this model, termed HHEM v2, that delivers a few key improvements: 

  • Multilinguality: HHEM v2 works not only in English, but also in German and French, with support for more languages planned. This aligns with Vectara’s vision of providing a truly language-agnostic platform, i.e. one that handles input in all languages equally well.
  • Unlimited context window: To the extent permitted by hardware, it has an unlimited context window, making it practical for RAG applications which often have lengthy source texts. 
  • Calibration: HHEM v2 produces a calibrated score. This means that the score has a probabilistic meaning: 0.8 means that there is an 80% chance that the LLM output is factually consistent with the corresponding input. 

Within the Vectara platform, we’ve deployed HHEM v2 to power the Factual Consistency Score that is automatically calculated and attached to every summary the platform generates.

Why Calibration is Important 

Most machine learning models produce a raw score between 0 and 1 that is used to compute the loss guiding the training of neural networks. But such a raw score doesn’t convey meaning in practice. A calibrated score, on the other hand, translates the raw score into the probability that a sample is positive. Here “positive” is defined as factually consistent. 

The figure below shows an example of score distributions before and after calibration. Before calibration, the model’s raw output scores are overwhelmingly close to the two extremes, 0 and 1, showing that the model is overconfident about its predictions, which may be a sign of overfitting. After calibration, the distribution of scores is less extreme; in particular, the peak at the lower end shifts from the lowest interval to the interval between 0.1 and 0.2.

 

[Figures: HHEM v2 score distributions before and after calibration]

HHEM v2 is calibrated against the test split (a split is a subset of data) of the AggreFact benchmark and the summarization subset of RAGTruth, both of which contain carefully human-annotated ground truth, to ensure that the scores provided by Vectara are in line with detection probabilities on authoritative data. 
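
Vectara has not published the exact calibration technique used in HHEM v2, so the sketch below is only illustrative: it shows Platt scaling, one common approach, which fits a logistic regression that maps raw scores to the probability of the “consistent” label on human-annotated data. The scores and labels in the example are made up.

```python
# Illustrative only: Platt scaling as one common way to calibrate a raw score.
# The raw scores and labels below are made up, not taken from AggreFact or RAGTruth.
import numpy as np
from sklearn.linear_model import LogisticRegression

raw_scores = np.array([0.02, 0.05, 0.11, 0.93, 0.97, 0.99])  # raw model outputs
labels = np.array([0, 0, 1, 1, 1, 1])  # 1 = factually consistent (human annotation)

calibrator = LogisticRegression()
calibrator.fit(raw_scores.reshape(-1, 1), labels)

# After calibration, a raw score is read as the probability of consistency.
print(calibrator.predict_proba([[0.8]])[0, 1])
```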

Performance of HHEM v2

HHEM v2 has been thoroughly tested against two of the latest hallucination benchmarks, AggreFact and RAGTruth, discussed above. The performance of HHEM v2 and of two LLM-judge baselines is given in the table below: 

 

                   AggreFact, SOTA subset   RAGTruth, summarization subset
                   (AggreFact-SOTA)         (RAGTruth-Summarization)
                   Balanced accuracy        Precision for            Recall for
                                            hallucinations           hallucinations
HHEM v2            73%                      81.48%                   10.78%
GPT-3.5 *          56.3% – 62.7% **         100%                     1.00% ***
GPT-4 *            80%                      46.9%                    82.2% ***

Notes:
* The prompts for GPT-3.5 and GPT-4 are direct prompts such as the one used in Table 10 of the RAGTruth paper.
** Reported by AggreFact paper
*** Reported by RAGTruth paper 
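
For reference, balanced accuracy (the metric reported on AggreFact-SOTA above) is the average of the recall on each class, consistent and hallucinated, which keeps a detector from looking good simply by predicting the majority class. Below is a minimal sketch using scikit-learn with made-up labels.

```python
# Balanced accuracy: the average of the recall on the "consistent" class
# and the recall on the "hallucinated" class.
from sklearn.metrics import balanced_accuracy_score

# Made-up labels for illustration only (1 = factually consistent, 0 = hallucinated).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

# (5/6 + 2/4) / 2 ≈ 0.667
print(balanced_accuracy_score(y_true, y_pred))
```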

Please note that the training data of HHEM v2 includes neither RAGTruth nor any data generated by other LLMs. This puts it at a disadvantage relative to the GPT-3.5- and GPT-4-based LLM judges, because about one-third of the RAGTruth data set was generated by these two models, and LLMs have an advantage when evaluating their own output.

The bottom line: despite being far more efficient than GPT-based LLM judges, HHEM v2 is clearly superior to the much larger GPT-3.5.

Latency

Another challenge in building a RAG system is maintaining reasonably low latency, which is the time it takes to generate the response to a user query. At Vectara, we work hard to save our customers every bit of time and money by optimizing the latency at each step of the RAG pipeline, including hallucination detection. 

Towards this goal, computing the Factual Consistency Score on Vectara’s platform takes less than 50ms in most cases. In comparison, a GPT-3.5/4-based hallucination detector introduces delays of a few seconds into a RAG pipeline, which is unacceptable for many enterprise deployments. 

Conclusion 

HHEM v2 is an improved version of HHEM v1 that powers Vectara’s Factual Consistency Score. It estimates the probability that one piece of text is factually consistent with another. 

It delivers performance superior to GPT-3.5 at a fraction of the cost and latency, and it is accessible through Vectara’s API. We encourage you to sign up for a free Vectara account today to see why Vectara is the leader in trusted generative AI. 

Acknowledgments

Amin Ahmad and Ofer Mendelevitch helped in composing this blog post.

