Cut the Bull... Detecting Hallucinations in Large Language Models

(Hero Image Credit: Andrey Suslov/Shutterstock enhanced by Dreamstudio)

Arguably the biggest barrier to everyday adoption of new AI technologies, such as ChatGPT and other LLMs, is the problem of hallucination. A hallucination occurs when the model makes up information in response to a user’s question or prompt. Hallucinations are most prevalent in generative AI models, such as text-to-image models like DALL-E and LLMs like PaLM 2, Claude, and GPT-4. In image generation models this can result in unwanted behavior such as the model drawing the wrong number of fingers.

Often hallucinations can be very subtle and may go unnoticed by the user. For example, try to spot the hallucination in this image that Bing Chat generated for me the other week when I asked for an image of “Kirby swallowing donkey kong”:

Figure 1: Bing Chat-generated image from the prompt “Kirby swallowing donkey kong”

Did you spot the hallucination?

Kirby does not have teeth, which Bing Chat will gladly tell you if you ask it:

Figure 2: Bing Chat answer to the question “Does Kirby have teeth”

Thus these two parts of GPT-4, the vision and language models, don’t seem to agree with one another. This is not the sort of contradiction a person would usually make. While these image examples may seem funny and harmless, what happens when a language model hallucinates in response to a text query?

Figure 3: ChatGPT hallucinating for the question “what is heavier: kilo of water or a kilo of air?”

Image Credit: Reddit

People use search engines such as Google and Bing to get factual answers that guide their decision-making. That may be for work, when they are researching a new project or product line, or for personal questions, such as a medical inquiry or a request for legal advice. When the advice they get is made up, or ‘hallucinated’, by a language model, this can have very real-world consequences, as one lawyer found when he cited legal precedent based on cases made up by ChatGPT:

Figure 4: Newspaper headline of a lawyer citing cases that were made up, or hallucinated, by ChatGPT

See: https://www.forbes.com/sites/mollybohannon/2023/06/08/lawyer-used-chatgpt-in-court-and-cited-fake-cases-a-judge-is-considering-sanctions/?sh=44f28a6d7c7f

Arguably the best approach for reducing hallucinations in LLM responses is to ground the responses in an existing knowledge source, such as a Wikipedia article or a business application, or in some document from your knowledge base. At Vectara, we call this concept “Grounded Generation,” but it’s also commonly known as “Retrieval Augmented Generation” (RAG) in academic literature. This has been shown in a number of studies to reduce the rates of hallucinations in LLMs (Benchmarking Large Language Models in Retrieval-Augmented Generation, Mitigating the Hallucinations of Large Language Models with Retrieval Augmentation, Retrieval Augmentation Reduces Hallucination in Conversation).

Using RAG to answer questions changes the paradigm of question answering from that of a closed book setting to that of an open book setting. In a closed book question answering system, such as ChatGPT, the LLM generates an answer using only its own knowledge acquired via pre-training. This treats the LLM as the knowledge source that generates the answer to the user’s question. This is illustrated by the diagram on the left below:

Figure 5: Closed vs Open Book Q&A Systems

In a RAG system, the role of the LLM changes from that of a knowledge source to that of a reader of the retrieved information, performing open book question answering, as illustrated in Figure 5 above (on the right). The LLM reads and parses the retrieved information, summarizing it to concisely answer the user’s original question. Thus in a RAG system, to eliminate hallucination, it’s essential that the LLM provides an answer that is factually consistent with the information provided by the retrieval system, and that it correctly summarizes the retrieved content. While Vectara offers a platform that allows companies to perform RAG on their own data, it’s important to point out that this is also how Bing Chat and Google’s chat integration for web search work. Thus if we can measure how accurate an LLM is at summarizing data, i.e., acting as a reader model, we can estimate how accurate these systems are when provided with accurate search results.
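The open book flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Vectara's implementation: the word-overlap retriever, the `retrieve` and `build_grounded_prompt` helpers, the toy documents, and the prompt wording are all hypothetical stand-ins, and the final call to an LLM is left out.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Toy retriever (assumption): rank documents by word overlap with the question."""
    question_tokens = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Ask the LLM to act as a reader of the retrieved passages, not as a knowledge source."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the passages below.\n"
        f"{context}\n"
        f"Question: {question}"
    )

# Hypothetical knowledge base of three short documents.
documents = [
    "Kirby is a pink, round video game character.",
    "Donkey Kong is a gorilla who first appeared in 1981.",
    "A kilogram of water and a kilogram of air have the same mass.",
]
question = "Which is heavier, a kilogram of water or a kilogram of air?"

passages = retrieve(question, documents, top_k=1)
prompt = build_grounded_prompt(question, passages)
print(prompt)
```

A production system would replace the overlap score with a proper search or embedding-based retriever, but the division of labor is the same: retrieval supplies the facts, and the LLM only reads and summarizes them.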

To help evaluate the rate at which a reader model hallucinates, we looked into the academic literature on factual consistency in summarization models. This research area predates the release of ChatGPT and looks into methods for training models to detect factual inconsistencies in abstractive summaries, i.e., summaries that paraphrase the original source material. There are two main datasets used in the literature to evaluate factual consistency, the SummaC Dataset and the TRUE Dataset (see links for details). Leveraging this research, we fine-tuned a small language model (184M parameters) as a binary classifier to classify a summary as factually consistent (or not) with the source document. You can find this model on Hugging Face here (available for commercial usage under the Apache 2 license). We then evaluated the performance of our hallucination evaluation model against the two SummaC Models, the TrueTeacher Model, and the AlignScore Model, as these models currently achieve the highest scores on the SummaC and TRUE benchmarks according to their authors. We achieved the following results:

Hallucination Evaluation Model	TRUE (AUC Score)	TRUE Summary (AUC Score)	SummaC (Balanced Accuracy)	SummaC (AUC Score)	AnyScale Ranking Task (Accuracy)	Overall
Vectara	0.872	0.850	0.764	0.831	86.7 %	0.837
SummaC Conv	0.795	0.775	0.724	0.743	83.1 %	0.774
SummaC ZS	0.796	0.751	0.721	0.778	85.0 %	0.779
TrueTeacher	0.864	0.868	0.739	0.819	85.2 %	0.828
AlignScore – Large	0.838	0.854	0.745	0.785	85.2 %	0.815

Table 1: Performance of hallucination evaluation models on various summary consistency benchmarks

Best scoring model in bold for each dataset.
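The post does not say how the Overall column is derived, but the figures in Table 1 are consistent with an unweighted mean of the five benchmark columns, with the ranking accuracy expressed as a fraction rather than a percentage. The sketch below checks that assumption; the dictionary layout and the `overall` helper are illustrative only.

```python
# Scores copied from Table 1, in column order:
# TRUE AUC, TRUE Summary AUC, SummaC balanced accuracy, SummaC AUC,
# AnyScale ranking accuracy (converted from a percentage to a fraction).
table_1 = {
    "Vectara": [0.872, 0.850, 0.764, 0.831, 0.867],
    "SummaC Conv": [0.795, 0.775, 0.724, 0.743, 0.831],
    "SummaC ZS": [0.796, 0.751, 0.721, 0.778, 0.850],
    "TrueTeacher": [0.864, 0.868, 0.739, 0.819, 0.852],
    "AlignScore - Large": [0.838, 0.854, 0.745, 0.785, 0.852],
}

# The Overall column as reported in Table 1.
reported_overall = {
    "Vectara": 0.837,
    "SummaC Conv": 0.774,
    "SummaC ZS": 0.779,
    "TrueTeacher": 0.828,
    "AlignScore - Large": 0.815,
}

def overall(scores: list[float]) -> float:
    """Assumed definition: unweighted mean of the benchmark scores, to 3 decimals."""
    return round(sum(scores) / len(scores), 3)

for model, scores in table_1.items():
    print(f"{model}: computed {overall(scores):.3f}, reported {reported_overall[model]:.3f}")
```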

The TRUE dataset metrics are computed on 9 of the 11 TRUE datasets, with the FEVER and VitaminC metrics removed because the base model was trained on those datasets. The TRUE Summary dataset is the subset of 5 of these datasets selected in the TrueTeacher paper. For the SummaC benchmark scores, we used the test split of the SummaC dataset and computed the balanced accuracy ourselves, tuning the thresholds per dataset on the SummaC validation dataset as described in the original paper. We did this because we were unable to reproduce the much larger scores on that dataset claimed by the AlignScore authors, so we downloaded their model and computed the scores ourselves for all models, using the scikit-learn balanced accuracy metric and the scikit-learn AUC score metric.
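To make the SummaC procedure concrete, here is a minimal sketch of tuning a decision threshold on validation data and then reporting balanced accuracy on test data. It is not the evaluation code used for Table 1: balanced accuracy is reimplemented in plain Python to keep the example dependency-free (for binary labels it computes the same quantity as scikit-learn's balanced accuracy metric), the `classify` and `tune_threshold` helpers are hypothetical, and the scores and labels are made-up toy data.

```python
def balanced_accuracy(labels: list[int], predictions: list[int]) -> float:
    """Mean of the recall on the positive class and the recall on the negative class."""
    pos = [p for y, p in zip(labels, predictions) if y == 1]
    neg = [p for y, p in zip(labels, predictions) if y == 0]
    recall_pos = sum(1 for p in pos if p == 1) / len(pos)
    recall_neg = sum(1 for p in neg if p == 0) / len(neg)
    return (recall_pos + recall_neg) / 2

def classify(scores: list[float], threshold: float) -> list[int]:
    """Label a summary consistent (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

def tune_threshold(val_scores: list[float], val_labels: list[int]) -> float:
    """Pick the candidate threshold with the best balanced accuracy on validation data."""
    candidates = sorted(set(val_scores))
    return max(
        candidates,
        key=lambda t: balanced_accuracy(val_labels, classify(val_scores, t)),
    )

# Toy data (made up): consistency scores in [0, 1]; label 1 = factually consistent.
val_scores = [0.95, 0.80, 0.62, 0.40, 0.35, 0.10]
val_labels = [1, 1, 1, 0, 1, 0]
test_scores = [0.90, 0.70, 0.45, 0.30, 0.20]
test_labels = [1, 1, 0, 1, 0]

threshold = tune_threshold(val_scores, val_labels)
test_bacc = balanced_accuracy(test_labels, classify(test_scores, threshold))
print(f"threshold={threshold}, test balanced accuracy={test_bacc:.3f}")
```

In the per-dataset setting described above, this tuning step would be repeated once for each SummaC dataset, so every dataset gets its own threshold.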

To compare LLMs by hallucination rate, we then took around one thousand documents of varying length, including a set of news articles from the CNN / Daily Mail corpus, and asked a selection of the top open-source and proprietary LLMs to provide summaries of those documents without deviating from the source material (i.e., without providing additional information). Using these summaries and our hallucination evaluation model, we then computed a hallucination score for each model, allowing us to construct a leaderboard of LLMs based on their predicted hallucination rate, as follows:

Model	Accuracy	Hallucination Rate	Average Summary Length	Answer Rate
GPT4	97.0%	3.0%	81.1 words	100%
GPT3.5	96.5%	3.5%	84.1 words	99.6%
Llama 2 70B	94.9%	5.1%	84.9 words	99.9%
Llama 2 7B	94.4%	5.6%	119.9 words	99.6%
Llama 2 13B	94.1%	5.9%	82.1 words	99.8%
Cohere-Chat	92.5%	7.5%	74.4 words	98.0%
Cohere	91.5%	8.5%	59.8 words	99.8%
Anthropic Claude 2	91.5%	8.5%	87.5 words	99.3%
Mistral 7B	90.6%	9.4%	96.1 words	98.7%
Google PaLM	87.9%	12.1%	36.2 words	92.4%
Google PaLM-Chat	72.8%	27.2%	221.1 words	88.8%

Table 2: Hallucination score for various Large Language Models (LLMs)

Note: Computed as of November 1st, 2023. Please see the Vectara Hallucination Leaderboard for the latest leaderboard rankings and the summaries used to compute these rankings. The leaderboard also contains details about the API calls made and which specific model versions were used.

The prompt we used to generate the summaries for the leaderboard was:

You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question ‘Provide a concise summary of the following passage, covering the core pieces of information described.’ <PASSAGE>’

We then replaced <PASSAGE> with the source document. You can find the source documents and the LLM generated summaries here.
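As a minimal sketch, the substitution step might look like the following. The template text is copied verbatim from the prompt above (including its quote marks); the `build_summary_prompt` helper is a hypothetical name, and the call to each LLM's API is omitted because it differs per provider.

```python
# Prompt copied from the post; <PASSAGE> is the placeholder for the source document.
PROMPT_TEMPLATE = (
    "You are a chat bot answering questions using data. "
    "You must stick to the answers provided solely by the text in the passage provided. "
    "You are asked the question ‘Provide a concise summary of the following passage, "
    "covering the core pieces of information described.’ <PASSAGE>’"
)

def build_summary_prompt(passage: str) -> str:
    """Substitute the source document for the <PASSAGE> placeholder."""
    return PROMPT_TEMPLATE.replace("<PASSAGE>", passage)

passage = (
    "Veeru Devgan is an Indian stunt and action choreographer "
    "and film director in Bollywood."
)
prompt = build_summary_prompt(passage)
print(prompt)
```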

It’s important to stress that while our model is highly accurate and competitive with the state-of-the-art models, it is still a model and does not classify hallucinations with 100% accuracy. In Table 2, “Accuracy” refers to the proportion of documents that were correctly summarized (without factual errors or additions), “Hallucination Rate” is 100 – Accuracy, and “Answer Rate” is the proportion of documents the LLM summarized. The answer rate is below 100% for some models because references to a porn site or violent acts in some documents (from the news articles) caused the models to refuse to respond, despite the content being fairly innocuous. To ensure a fair comparison that is not influenced by the answer rate, the final accuracy numbers are computed only on documents that every model provided a summary for.
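The Table 2 columns can be illustrated with a small sketch. This is an assumption-laden toy, not the leaderboard code: the `judgements` data is made up, the helper names are hypothetical, and in practice each True/False value would come from the hallucination evaluation model's verdict on a summary.

```python
# Toy judgements (made up): True = summary judged factually consistent,
# False = judged hallucinated, None = the model refused to summarize the document.
judgements = {
    "model_a": {"d1": True, "d2": True, "d3": False, "d4": True},
    "model_b": {"d1": True, "d2": None, "d3": False, "d4": False},
}

def answer_rate(doc_results: dict) -> float:
    """Percentage of documents the model actually summarized."""
    answered = [v for v in doc_results.values() if v is not None]
    return 100.0 * len(answered) / len(doc_results)

def shared_documents(all_results: dict) -> set:
    """Documents that every model provided a summary for."""
    answered_sets = [
        {doc for doc, verdict in results.items() if verdict is not None}
        for results in all_results.values()
    ]
    return set.intersection(*answered_sets)

def accuracy(doc_results: dict, shared: set) -> float:
    """Percentage of shared documents whose summary was judged consistent."""
    return 100.0 * sum(1 for doc in shared if doc_results[doc]) / len(shared)

shared = shared_documents(judgements)
for model, results in judgements.items():
    acc = accuracy(results, shared)
    print(
        f"{model}: accuracy={acc:.1f}%, "
        f"hallucination rate={100.0 - acc:.1f}%, "
        f"answer rate={answer_rate(results):.1f}%"
    )
```

Restricting accuracy to the shared documents means a model cannot improve its score simply by refusing the documents it finds hard to summarize.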

Some interesting patterns emerge from the data. Some of the models with the lowest answer rates were also among the models that hallucinated the most. Since none of the content was actually inappropriate (the documents were taken primarily from published newspaper articles), the ability to correctly judge whether content should be rejected appears to be correlated with the ability to correctly summarize it. The PaLM models also show a huge difference in the length of their responses, with PaLM Chat being very verbose and often adding a lot of information not present in the source document (see the examples below). In contrast, the regular PaLM model (text-bison-001) was very concise, and while it had a much lower hallucination rate than the chat model (chat-bison-001), it was still quite far behind the other models.

To help better understand the process described above, and the ways in which the different models hallucinate, it’s helpful to include some examples of hallucinations uncovered by our model. Note that I am selecting smaller documents for ease of illustration, but most of the source texts in the dataset were much longer.

Original Passage

The plants were found during the search of a warehouse near Ashbourne on Saturday morning. Police said they were in “an elaborate grow house.” A man in his late 40s was arrested at the scene.

Summary from PaLM

Police have arrested a man in his late 40s after cannabis plants worth an estimated £100,000 were found in a warehouse near Ashbourne.

You can see that PaLM inferred the article was about cannabis plants and added an estimated street value, neither of which was in the source text.

Original Passage

Veeru Devgan is an Indian stunt and action choreographer and film director in Bollywood.

Summary (PaLM Chat)

Veeru Devgan is an Indian stunt and action choreographer and film director in Bollywood. He is the father of actor Ajay Devgn and director Anil Devgan. He has worked in over 100 films, including some of the biggest blockbusters of Indian cinema. He is known for his innovative and exciting action sequences, which have helped to make him one of the most sought-after action choreographers in the industry. He is also a successful film director, having directed several hit films, including “Phool Aur Kaante” and “Zakhm”.

You can see the considerable extra detail that the PaLM Chat model added to the original text (everything after the first sentence), despite being told to stick to the source text. The added text also contains a number of hallucinations. For instance, according to Google web search, Mr Devgan was not the director of Zakhm (that was Mahesh Bhatt), and he is the brother of Anil Devgan, not his father.

The models would also flip some relationships. For instance, in one article about the boxer Manny Pacquiao, the source text states that Mark Wahlberg is one of his fans, whereas the LLM stated that Manny was a fan of Wahlberg’s.

What’s Next?

We are not stopping at releasing this model and leaderboard. We will continue to maintain and update the leaderboard regularly so we can track improvements in the models over time. We will also continue to improve our open-source model and release updated versions as it improves. As our focus at Vectara is on producing reliable search, in the coming months we will be adding the capabilities of this model to our platform, providing factual consistency scores, powered by our latest hallucination evaluation model, alongside the answers Vectara provides. We will also add leaderboards focused on measuring hallucinations in other RAG tasks, such as measuring citation accuracy. One of the issues we have noticed with these models is that they sometimes misattribute the sources provided to them when answering a question, and we’d like to be able to quantify and detect that. Finally, we will be looking to develop our own summarization models that lower hallucination rates further than GPT-3.5 and GPT-4, which are what we currently offer our customers through our platform.

If you’d like to get started with Vectara and see how our RAG platform can help your business succeed, sign up for a free account today.