
Celebrating 2 Million Downloads of HHEM

Vectara’s Hallucination Evaluation Model surpasses 2 million downloads as the fight against LLM hallucinations continues

Introduction

Since we launched the Hughes Hallucination Evaluation Model (HHEM) less than a year ago, we have seen tremendous interest in its ability to detect and score hallucinations in enterprise RAG pipelines, and it continues to be the #1 hallucination detection model on Hugging Face.

Since the initial launch, we’ve continued to improve HHEM, with the most recent update being HHEM-2.1 (launched alongside an open-source version, HHEM-2.1-Open) and a new LLM hallucination leaderboard powered by HHEM-2.1.

Today we are excited to share a new milestone: HHEM has been downloaded more than 2 million times by users around the world.

LLM Hallucinations

We are at the beginning of a foundational technology innovation cycle powered by Large Language Models (LLMs) such as OpenAI’s GPT-4, Anthropic’s Claude 3.5, Google’s Gemini, and Meta AI’s Llama 3.2.

Yet, with all the power packed into these models, their tendency to hallucinate remains one of the biggest challenges for enterprise adoption.

RAG (Retrieval-Augmented Generation) is a technique to address and reduce these hallucinations, accelerating enterprise adoption of LLM technology, and driving successful deployments of LLM-powered applications.

So what is a hallucination detection model?

It is a model that provides a score quantifying how much the response from a RAG pipeline can be trusted, based on how well it is supported by the retrieved sources. Before HHEM, most other approaches to hallucination detection used an LLM (like GPT-4) to decide whether a response is a hallucination or not, an approach called “LLM-as-a-judge”. This approach has two major flaws, high cost and high latency, which make it impractical for enterprise production deployments. On top of that, recent academic research points to concerns around LLM bias and overall reliability for this purpose.

In contrast, HHEM is fast and inexpensive to run, making it well suited for practical use in production-grade RAG applications.
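For readers who want to try the open-source model directly, here is a minimal sketch of scoring (source, response) pairs with HHEM-2.1-Open from Hugging Face. The model ID and the predict() helper reflect our reading of the model card; consult the card for the exact, up-to-date usage.

```python
# Minimal sketch: scoring source/response pairs with HHEM-2.1-Open.
# The model ID and predict() helper are based on the Hugging Face model card;
# check the card for the authoritative, up-to-date API.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model",  # HHEM-2.1-Open
    trust_remote_code=True,
)

# Each pair is (source passage, generated response).
pairs = [
    ("The capital of France is Paris.", "Paris is the capital of France."),
    ("The capital of France is Paris.", "The capital of France is Berlin."),
]

# Scores close to 1.0 indicate the response is consistent with the source;
# scores close to 0.0 indicate a likely hallucination.
scores = model.predict(pairs)
print(scores)
```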

For Vectara customers, HHEM is applied to every RAG query API call in the form of the Factual Consistency Score (FCS), allowing them to use it in real time to reduce hallucinations in their RAG or agentic RAG applications. A minimal sketch of this pattern appears below.
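As an illustration of how the FCS can be used in real time, here is a hedged sketch of a simple guardrail that gates responses on the score. The field name and the 0.5 threshold are illustrative assumptions, not the exact Vectara API schema; see the Vectara documentation for the actual response fields.

```python
# Illustrative sketch: gating a RAG answer on its factual consistency score.
# The score argument and threshold below are assumptions for illustration,
# not the exact Vectara API schema.
from typing import Optional

FCS_THRESHOLD = 0.5  # hypothetical cutoff; tune for your application


def gate_on_fcs(answer: str, factual_consistency_score: float) -> Optional[str]:
    """Return the answer only if its consistency score clears the threshold."""
    if factual_consistency_score >= FCS_THRESHOLD:
        return answer
    # Below threshold: withhold the answer instead of showing a possible hallucination.
    return None


response = gate_on_fcs("The warranty covers parts for 12 months.", 0.83)
print(response if response else "I'm not confident enough in this answer to show it.")
```

In practice the threshold is a product decision: a stricter cutoff reduces the risk of surfacing hallucinations at the cost of suppressing more borderline answers.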

HHEM Adoption for Mission-critical RAG

We are excited to see significant growth in the usage of Vectara’s hallucination evaluation model.

Since its launch in late 2023, HHEM has been downloaded more than 2 million times from Hugging Face, with 1 million of those downloads coming in September 2024 alone.

This recent acceleration is a clear indication that the RAG market is maturing and moving toward mission-critical production use cases where the accuracy of results is paramount.

In addition, we continue to see interest in the HHEM leaderboard, which provides visibility into how various commercial and open-source LLMs compare in their propensity to hallucinate. Here is the ranking of the top 25 LLMs at the time of publication:

[Image: HHEM leaderboard ranking of the top 25 LLMs]

Conclusion

Detecting hallucinations continues to be a critical component of any enterprise RAG application, and we see growing interest in an accurate, low-latency, and low-cost solution to this problem, especially from enterprise customers.

At Vectara we remain committed to helping reduce hallucinations, not only through better hallucination detection, but also via techniques for hallucination correction (see this blog post, for example).

HHEM’s model weights are available on Hugging Face and Kaggle, and the leaderboard is available here. You may also find these notebooks on Kaggle useful; they demonstrate how to use HHEM.

To use HHEM in production RAG, we encourage you to sign up for a free Vectara account today and build your own AI assistant. You can see examples of how HHEM/FCS can be integrated into a user experience in some of our question-answering demos, like ask-legalaid.

As always, for any questions or comments, you can find us in the Vectara discussion forum or on Discord.
