Measuring Hallucinations in RAG Systems

Today, we’re happy to announce the release of our open-source Hallucination Evaluation Model!

One of the top concerns of enterprises considering adopting generative AI has been the possibility of LLMs generating hallucinations. Hallucinations come in many forms that could negatively affect a business:

Large errors where, instead of answering an end-user question, the generative model goes completely off the rails and potentially causes reputational damage.
The generative system draws on its body of knowledge and produces copyrighted works in its output.
More nuanced and harder to spot errors where the model takes liberties in its response, for example, by introducing “facts” that are not based in reality.
The introduction of specific biases due to the training data.

While many businesses acknowledge the benefits of generative AI, the risk of these hallucinations has held many back. Some attempts have been made in the past to quantify, or at least qualify, when and how much a generative model is hallucinating. However, many of these have been too abstract, or based on subjects too controversial, to be useful to most enterprises.

At Vectara, we believe that the real power of LLMs in the enterprise comes from Retrieval Augmented Generation (RAG). RAG helps mitigate all of the above classes of hallucinations on its own (relative to fine-tuning a single generative model) by feeding only relevant data into the model at query time and instructing the LLM to use only the data provided to it from the retrieval step. However, this introduces its own challenge: how do you know that the LLM is actually using only the data provided to it when generating its output? This is where Vectara’s Hallucination Evaluation Model (HEM) comes in.
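To make the "use only the data provided" step concrete, here is a minimal sketch of how a RAG prompt might be assembled from retrieved passages. The function name `build_grounded_prompt` and the prompt wording are illustrative assumptions, not the prompt Vectara uses.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that asks the LLM to rely only on retrieved passages.

    Hypothetical helper for illustration; the instruction text is an assumption.
    """
    # Number each passage so the model (and a reviewer) can refer back to it.
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "What license is the model released under?",
    ["The model is released under an Apache 2.0 license."],
)
print(prompt)
```

An instruction like this does not guarantee that the model stays within the passages, which is why the output still needs to be checked.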

Open Source

With this release, we have produced an open-source model – HEM – that can evaluate how well generative LLMs summarize a series of results in a RAG system. We define “well” here as a response that accurately summarizes the results without producing hallucinations in the process. You can use this model to help you evaluate the trustworthiness of your RAG system, including which LLMs are best for your particular use case. Details on exactly how we produced this model and the resulting scores can be found in our technical blog, and you can always find the most up-to-date version of the model on our Hugging Face account.

Our goal is to give enterprises the quantified analysis they need to adopt generative systems with confidence. We’ve open-sourced the model under an Apache 2.0 license so that you can tune it to your own specific needs.

Like our users and customers, at Vectara we’re also invested in the quality of these generative LLMs, since we currently use third parties for our summarization functionality. Today, we offer a custom-tuned prompt on top of GPT-3.5 for our Growth users and another on top of GPT-4 for our Scale customers. In the future, we expect to build and deploy other generative LLMs for our customers, and we want you to know we’re improving the quality (and reducing the hallucinations) of the generative output in the process.

Keeping Current

In the future, we may offer additional generative LLMs to our Growth and Scale users. In light of that, and in the general interest of our customers and the public, we’ve created an evaluation scorecard showing how often some of the most widely used models hallucinate. You can think of this as something like a FICO score for hallucinations in RAG systems.

Below is the current summary of the scores. We’re going to update this scorecard periodically as new information arrives and as the various LLMs are updated, and we will add new LLMs as they are released. There are four numeric columns in this scorecard:

The “Answer Rate” is how often the model attempted to summarize the results in response to the question. Sometimes, models incorrectly conclude that the retrieved results do not give them enough information to answer the question.
The “Accuracy” and “Hallucination Rate” numbers are complements of one another: the Hallucination Rate is the percentage of summaries that included some hallucination, and the Accuracy is 100% minus that number. Details of exactly how these hallucinations were evaluated can be found in the technical blog post.
The “Average Summary Length” is the average number of words per summary. We include this because if you’re looking for concise summaries, you may want to optimize for this number as well and weigh it against the hallucination rate.
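As a rough illustration of how the four columns relate, the sketch below computes them from per-summary results. The `SummaryResult` record, the `consistency` score in [0, 1], and the 0.5 threshold are all assumptions made for this example; the actual evaluation procedure is described in the technical blog post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SummaryResult:
    # Hypothetical record: one entry per question sent to the LLM.
    summary: Optional[str]        # None when the model declined to summarize
    consistency: Optional[float]  # assumed evaluator score in [0, 1]


def scorecard(results: list[SummaryResult], threshold: float = 0.5) -> dict:
    """Compute the four scorecard columns (threshold is an assumed cutoff)."""
    answered = [r for r in results if r.summary is not None]
    if not answered:
        raise ValueError("no summaries to score")
    # A summary counts as hallucinated when its score falls below the cutoff.
    hallucinated = [r for r in answered if r.consistency < threshold]
    hallucination_rate = 100.0 * len(hallucinated) / len(answered)
    return {
        "answer_rate": 100.0 * len(answered) / len(results),
        "accuracy": 100.0 - hallucination_rate,
        "hallucination_rate": hallucination_rate,
        "avg_summary_length": sum(len(r.summary.split()) for r in answered)
        / len(answered),
    }


results = [
    SummaryResult("Paris is the capital of France.", 0.95),
    SummaryResult("The model is open source.", 0.90),
    SummaryResult("It was released in 1999.", 0.10),  # below the cutoff
    SummaryResult("The license is Apache 2.0.", 0.85),
    SummaryResult(None, None),                        # model declined to answer
]
print(scorecard(results))
```

Here the hallucination rate is taken over the summaries that were actually produced, following the column description above, so a declined answer lowers the Answer Rate but not the Accuracy.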

Model	Answer Rate	Accuracy	Hallucination Rate	Average Summary Length
GPT4	100%	97.0%	3.0%	81.1 words
GPT3.5	99.6%	96.5%	3.5%	84.1 words
Llama 2 70B	99.9%	94.9%	5.1%	84.9 words
Llama 2 7B	99.6%	94.4%	5.6%	119.9 words
Llama 2 13B	99.8%	94.1%	5.9%	82.1 words
Cohere-Chat	98.0%	92.5%	7.5%	74.4 words
Cohere	99.8%	91.5%	8.5%	59.8 words
Anthropic Claude 2	99.3%	91.5%	8.5%	87.5 words
Mistral 7B	98.7%	90.6%	9.4%	96.1 words
Google Palm	92.4%	87.9%	12.1%	36.2 words
Google Palm-Chat	88.8%	72.8%	27.2%	221.1 words

Table 1: Leaderboard of LLM Hallucination data from the Hallucination Evaluation Model (HEM)
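If summary length matters for your use case, one simple way to read the table is to cap the average length and pick the lowest hallucination rate among the models that remain. The sketch below does this with the hallucination rates and lengths copied from Table 1; the helper `best_under_length` and the word limits are illustrative assumptions, not a recommended selection method.

```python
# (model, hallucination rate in %, average summary length in words), from Table 1
ROWS = [
    ("GPT4", 3.0, 81.1),
    ("GPT3.5", 3.5, 84.1),
    ("Llama 2 70B", 5.1, 84.9),
    ("Llama 2 7B", 5.6, 119.9),
    ("Llama 2 13B", 5.9, 82.1),
    ("Cohere-Chat", 7.5, 74.4),
    ("Cohere", 8.5, 59.8),
    ("Anthropic Claude 2", 8.5, 87.5),
    ("Mistral 7B", 9.4, 96.1),
    ("Google Palm", 12.1, 36.2),
    ("Google Palm-Chat", 27.2, 221.1),
]


def best_under_length(rows, max_words):
    """Return the row with the lowest hallucination rate whose average
    summary length does not exceed max_words, or None if none qualify."""
    candidates = [row for row in rows if row[2] <= max_words]
    if not candidates:
        return None
    return min(candidates, key=lambda row: row[1])


print(best_under_length(ROWS, max_words=90))
print(best_under_length(ROWS, max_words=80))
```

Tightening the limit from 90 to 80 words changes the pick from GPT4 to Cohere-Chat, which is the kind of tradeoff the Average Summary Length column is meant to expose.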

As new and updated models get re-evaluated, we’ll keep the scorecard up to date on the hallucination leaderboard GitHub repository.

We are not stopping at releasing this model and leaderboard. As mentioned, we will continue to maintain and update the leaderboard regularly so we can track improvements in the models over time. We will also continue to improve our open-source model and release updated versions as it improves. We look forward to collaborating with the community on keeping the scorecard and model open source and up to date.

Our focus at Vectara is on producing reliable search, so in the coming months we will be adding the capabilities of this model to our platform: the answers Vectara provides will come with factual consistency scores powered by the latest HEM. As we go forward, we will also be looking to develop our own summarization models that reduce hallucination rates below those of GPT-3.5 and GPT-4. We know that the ability to quantify and reduce hallucinations will be key to making sure we offer our customers the best possible generative capabilities.

As always, we’d love to hear your feedback! Connect with us on our forums or on our Discord. If you’d like to see what Vectara can offer you for retrieval augmented generation, sign up for an account!