HHEM 2.1: A Better Hallucination Detection Model and a New Leaderboard
The upgraded HHEM-2.1 outperforms both GPT-3.5-Turbo and GPT-4 for hallucination detection and is powering our updated HHEM leaderboard.
Introduction
In April 2024, we announced HHEM-2.0, an improved hallucination detection model that is better than HHEM-1.0, supports three languages (English, French, and German), and powers Vectara’s Factual Consistency Score (FCS).
Today we are excited to announce HHEM-2.1, an improved version of HHEM-2.0, along with an open-source version and a new LLM hallucination leaderboard powered by HHEM-2.1.
RAG and Hallucination Detection Models
Large Language Models (LLMs) are known to hallucinate. When used directly, they sometimes simply respond with the wrong answer; when used within a RAG pipeline, they may generate responses that are not grounded in the search results from the retrieval engine.
Hallucination detection models provide GenAI developers with a score that reflects how much the response from the RAG pipeline can be trusted. Concretely, this is the Factual Consistency Score (FCS), which tells us the extent to which the summary is consistent with the facts presented to it by the RAG retrieval engine. The score ranges from 0 (very likely a hallucination) to 1 (summary is very consistent with the facts).
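As a minimal illustration of how an application might act on this score (the 0.5 threshold below is a hypothetical choice, not a Vectara recommendation; tune it for your own use case):

# Illustrative only: gate or flag a RAG response based on its Factual Consistency Score (FCS).
def handle_response(summary: str, fcs: float, threshold: float = 0.5) -> str:
    """Return the summary, prefixed with a warning when the FCS suggests a hallucination."""
    if fcs < threshold:
        # Low FCS: the summary is likely not grounded in the retrieved facts.
        return f"[Low factual consistency: {fcs:.2f}] {summary}"
    return summary

print(handle_response("The film was released in 2014.", fcs=0.93))
print(handle_response("The film won eleven Academy Awards.", fcs=0.21))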
Prior to HHEM, most hallucination detection methods used an LLM (like GPT-4) to decide whether a response is a hallucination or not, an approach called “LLM-as-a-judge”. Although this approach seems to work well in some cases, it has two major drawbacks.
First, it tends to be slow and expensive, which makes it difficult to use in real-world production deployments that require low latency.
Second, it uses an LLM to judge a decision that a similar or different LLM made, which can create an “echo chamber” effect, a form of overfitting.
Vectara’s HHEM, launched in late 2023, was the first industry-standard hallucination evaluation model to demonstrate excellent performance in hallucination detection while being a pure classification model that does not rely on an LLM-as-a-judge. This makes HHEM both fast and reliable, and better suited for enterprise GenAI applications.
Alongside the model, we also built the HHEM leaderboard, which ranks LLMs based on their propensity to hallucinate, measuring a “hallucination rate” for leading commercial and open-source LLMs.
HHEM Version 2
In April of 2024, we announced HHEM-2.0, a new version of HHEM with better accuracy, unlimited sequence length (as opposed to the 512-token limit of HHEM-1.0), and support for three languages (English, French, and German).
Since that announcement, we’ve been hard at work to make HHEM even better, and today we are excited to share HHEM-2.1, which demonstrates improved hallucination detection accuracy over HHEM-2.0.
HHEM-2.1 is already integrated into Vectara’s RAG-as-a-service platform and is automatically included with every call to Vectara’s Query API (called “FCS” or factual consistency score). This makes it easy for enterprise developers to build trusted GenAI applications with reduced hallucinations.
Along with HHEM-2.1, we are releasing HHEM-2.1-Open as a publicly available open-source model on Hugging Face and Kaggle. HHEM-2.1-Open significantly improves hallucination detection performance over our previous open-source model, HHEM-1.0-Open. While there are other hallucination detection models out there, HHEM-2.1-Open can easily be run on consumer-level GPUs such as the RTX 3080 without sacrificing computational precision. You can even run HHEM-2.1-Open on modern desktop CPUs: the latency on an Intel Xeon w7-3445 is under 1.5 seconds for a 2k combined length of the premise and the hypothesis.
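To give a sense of how simple this is, here is a minimal sketch of scoring premise/hypothesis pairs with HHEM-2.1-Open locally. It follows the usage pattern shown on the Hugging Face model card, which exposes a predict() helper via trust_remote_code; consult the model card for the authoritative interface.

# Sketch: score (premise, hypothesis) pairs with HHEM-2.1-Open.
# Requires the `transformers` and `torch` packages.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

pairs = [
    # (premise from the retrieval engine, hypothesis/summary from the LLM)
    ("The capital of France is Paris.", "The capital of France is Berlin."),
    ("I am in California.", "I am in the United States."),
]

# Returns one score per pair: close to 0 means likely hallucination,
# close to 1 means the hypothesis is consistent with the premise.
scores = model.predict(pairs)
print(scores)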
A Qualitative Analysis of HHEM-2.1
HHEM-2.1 is more accurate than its predecessors not only in identifying hallucinations where they occur (“recall”) but also in being correct when it predicts that a summary response is a hallucination (“precision”).
Let’s take a look at an example. Consider the following source context:
Boyhood (film). Filmed from 2002 to 2013, Boyhood depicts the childhood and adolescence of Mason Evans, Jr. (Coltrane) from ages six to eighteen as he grows up in Texas with divorced parents (Arquette and Hawke).
With the following summary:
The film is directed by Richard Linklater and written by Linklater and Vanessa Taylor. Boyhood is a coming-of-age film that follows Mason Evans, Jr. (Coltrane) as he grows up in Texas with divorced parents (Arquette and Hawke).
Notice that the source paragraph mentions neither who directed the film nor who wrote it, yet the summary states both, which is clearly a hallucination. With HHEM-1.0-Open, the factual consistency score for this source/summary pair is 0.9788 (close to 1.0, incorrectly suggesting the summary is factually consistent), whereas with HHEM-2.1 it is 0.2576, correctly flagging the hallucination. This clearly demonstrates HHEM-2.1’s superior ability to detect hallucinations.
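You can run this check yourself with the open model; this is a sketch using the same predict() interface as above, and HHEM-2.1-Open may not return exactly the 0.2576 produced by the hosted HHEM-2.1, but it should still assign a clearly low score.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

premise = (
    "Boyhood (film). Filmed from 2002 to 2013, Boyhood depicts the childhood and "
    "adolescence of Mason Evans, Jr. (Coltrane) from ages six to eighteen as he "
    "grows up in Texas with divorced parents (Arquette and Hawke)."
)
summary = (
    "The film is directed by Richard Linklater and written by Linklater and "
    "Vanessa Taylor. Boyhood is a coming-of-age film that follows Mason Evans, Jr. "
    "(Coltrane) as he grows up in Texas with divorced parents (Arquette and Hawke)."
)

# A low score flags the unsupported director/writer claims as a hallucination.
print(model.predict([(premise, summary)]))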
Benchmarking HHEM-2.1
To evaluate HHEM-2.1’s ability to detect hallucinations more broadly, we benchmarked it directly against GPT-3.5-Turbo and GPT-4 used as zero-shot “LLM-as-a-judge” evaluators, as well as against RAGAS (which uses GPT-3.5-Turbo and GPT-4 internally). Note that in all cases we used GPT-3.5-Turbo version 01-25 and GPT-4 version 06-13.
We used AggreFact and RAGTruth, two prominent hallucination benchmark datasets. From AggreFact we use the SOTA subset (AggreFact-SOTA), and from RAGTruth we report performance on the summarization (RAGTruth-Summ) and question-answering (RAGTruth-QA) subsets. All evaluation sets contain English-only data.
When using an LLM to detect hallucinations, we used the zero-shot template described in “ChatGPT as a Factual Consistency Evaluator for Text Summarization”.
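For illustration, that zero-shot judging prompt looks roughly like the following; the wording here is a paraphrase, not a verbatim copy of the template, so consult the paper for the exact text.

# Illustrative zero-shot "LLM-as-a-judge" prompt, paraphrased from the template in
# "ChatGPT as a Factual Consistency Evaluator for Text Summarization".
PROMPT = (
    "Decide if the following summary is consistent with the corresponding article. "
    "Note that consistency means all information in the summary is supported by "
    "the article.\n\n"
    "Article: {article}\n"
    "Summary: {summary}\n"
    "Answer (yes or no):"
)

def build_judge_prompt(article: str, summary: str) -> str:
    # The filled-in prompt is sent to GPT-3.5-Turbo or GPT-4; a "no" answer
    # is interpreted as a hallucination.
    return PROMPT.format(article=article, summary=summary)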
We measure performance using the common metrics of precision, recall, and F1, as defined here, applied to the task of detecting hallucinations. In all cases, higher values are better.
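Concretely, treating “hallucinated” as the positive class, these metrics can be computed as in the short sketch below (the labels are made up purely for illustration):

from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = hallucinated (the positive class), 0 = factually consistent.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # human annotations
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # detector predictions

print("precision:", precision_score(y_true, y_pred))  # of predicted hallucinations, how many are real
print("recall:   ", recall_score(y_true, y_pred))     # of real hallucinations, how many are caught
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall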
In terms of F1 score (Figures 1 to 3), HHEM-2.1 outperforms all baselines by significant margins on all evaluation sets, except when compared with RAGAS-Faithfulness using GPT-4 as the underlying model. RAGAS has the downside of being relatively slow: it may take RAGAS as much as 35 seconds (and a substantial cost) to make a judgment with GPT-4 for a 4096-token context, whereas HHEM-2.1 takes only 0.6 seconds on a consumer-level RTX 3090 GPU. By F1 score, HHEM-2.1 is at least 1.5x better than GPT-3.5-Turbo on RAGTruth’s Summarization and Question Answering subsets, and over 30% (relative) better than GPT-4.
Next, let’s look at precision and recall (Figures 4 to 6).
We notice that zero-shot judgments from GPT-4 and GPT-3.5-Turbo have a very unbalanced precision/recall profile. Despite using the same prompt template, the zero-shot performance of GPT-3.5-Turbo and GPT-4 measured by Vectara differs significantly from that reported in the RAGTruth v2 paper: precision and recall are nearly flipped for both models. Furthermore, the RAGTruth v1 and v2 papers themselves report moderately different numbers.
We hypothesize that the differences in precision and recall are due to constant model updates from OpenAI and the different times at which the experiments were carried out.
These observations suggest that the “LLM-as-a-judge” technique may not be robust for hallucination detection: a model update can flip the assessment or otherwise produce unreliable outcomes.
In contrast, HHEM performs better than both GPT-3.5-Turbo and GPT-4 (in terms of both balanced accuracy and F1 score), with a more balanced precision/recall trade-off, making it reliable and consistent.
Note that all benchmarks above evaluate HHEM-2.1’s ability to detect hallucinations in English only, since these datasets (AggreFact and RAGTruth) provide only English language data.
HHEM Leaderboard v2
Last but not least, today we are also releasing a new HHEM leaderboard, powered by HHEM-2.1, which improves the accuracy of ranking LLMs by their likelihood to hallucinate (also available on GitHub).
Here is the ranking of the top 25 LLMs as of the time this blog is published:
As you can see, OpenAI models continue to perform well with impressively low hallucination rates, followed by Orca-2, Intel Neural-chat, Snowflake, Phi-3, and Llama-3.1-405B.
This new ranking reflects HHEM-2.1’s ability to evaluate hallucinations on longer sequences (up to 4096 tokens), as well as its improved precision and recall, making the leaderboard a more accurate reflection of the true hallucination rate of LLMs.
We will continue to evaluate new LLMs as they become available and add them to the leaderboard. If you have a specific LLM you want us to review, please submit your request here (under “Submit here”).
Conclusion
Detecting hallucinations continues to be a critical component in building trusted enterprise RAG applications, and at Vectara we continue to invest our research efforts into building better and more accurate hallucination detection models.
HHEM-2.1, available immediately to all Vectara customers through Vectara’s API, provides higher accuracy in detecting hallucinations while maintaining low latency. This enables developers to build more trusted GenAI solutions without being subject to the long latency, extra cost, or unstable performance of “LLM-as-a-judge” approaches.
With this release, we also provide an open-source variant of HHEM-2.1 hosted on Hugging Face and Kaggle, as well as a new and more accurate HHEM leaderboard.
The work to reduce hallucinations in LLMs continues, and quite a few additional approaches are being explored in academia and industry alike. You can read more about some of these techniques in our blog post on the topic.
To use HHEM-2.1 and experience its full capability in hallucination detection, we encourage you to sign up for a free Vectara account today and build your own AI assistant using Vectara.