
Towards a Gold Standard for RAG Evaluation

Vectara's new open-source framework for comprehensive RAG evaluation, open-rag-eval, represents a significant leap forward in ensuring that AI systems deliver accurate, relevant responses.


Introduction

Most organizations implement RAG solutions without a systematic approach to measuring the quality of their responses.

Not using an evaluation framework isn’t just a technical oversight; it's a business risk that can undermine your entire AI strategy.

Your RAG stack includes multiple complex components, and each of these can be configured in multiple ways. There is no clear guide for what configuration settings will result in the highest quality responses to users.

For example:

  • Whether to use fixed token chunking or semantic chunking;
  • Whether to use hybrid or vector search and what value to use for lambda in hybrid search deployments;
  • Which LLM to use and how to optimize RAG prompts;
  • Which threshold to use for hallucination detection and correction, and more.

As the saying goes: “If you cannot measure it, you cannot improve it.” – and if you cannot measure response quality properly, not only are you blind to quality issues that do exist in your implementation, but it is also difficult to know which lever to pull once you do identify a problem.

Enter RAG evaluation.

With a reliable RAG evaluation framework, you can compare various RAG stacks or different configurations of the same stack to see which one works better.

The result is more relevant and accurate responses to the users of your RAG application, leading to higher user satisfaction, increased productivity, reduced hallucinations, and stronger returns on your AI investments.

What is RAG evaluation?

The basic idea behind RAG is to augment the LLM with a powerful information retrieval sub-system that pulls relevant information from a large dataset (or corpus) and presents this information to the LLM for response generation, as shown in Figure 1:

Figure 1: Basic RAG architecture

From a user’s point of view, response quality comes down to one question: “how accurate and helpful is the response to the question I asked?” Users want the answer to be correct (based on accurate facts) and also helpful.

This boils down to two major types of metrics:

  1. Retrieval metrics: these metrics tell us if the facts retrieved from the source data are relevant to the question and can be successfully used as grounding data for the response.
  2. Generation metrics: these metrics evaluate the response itself, conditioned on the facts - is the response properly grounded in the facts?

When something goes wrong and response quality is low, inspecting these types of metrics often provides valuable insights and helps you adjust your RAG implementation. For example, if the relevance of the retrieved facts is low, it might mean that your retrieval pipeline is not good enough, or perhaps you don’t actually have any relevant information in your dataset. If your facts are relevant but the generated LLM response is hallucinated, perhaps you should use a stronger LLM, update the RAG prompt, or both.

The point is: RAG evaluation can be very actionable, and can help you not only see the issues but also fix them.

Introducing Open-RAG-Eval

Today we are excited to share open-rag-eval - Vectara’s open source RAG evaluation package.

As part of our development effort, we have been thinking a lot about metrics. It’s important to have not only metrics that seem “reasonable”, but also metrics that are robust, validated and practical to implement for large enterprise deployments.

In collaboration with Professor Jimmy Lin's team at the University of Waterloo, which brings decades of experience creating world-class benchmarks and datasets for information retrieval evaluation, we designed open-rag-eval around a set of robust retrieval and generation metrics that the group developed with collaborators for the TREC RAG challenge. That challenge drew participation from roughly two dozen groups around the world, who collectively submitted over 100 runs; to our knowledge, this is the largest community-wide evaluation of RAG (see results here and here). These metrics have been shown in rigorous studies to correlate with human perceptions of quality, and we will continue building on them with open-rag-eval.

We also designed open-rag-eval to be easy to use for the evaluation of any RAG pipeline, whether it’s Vectara’s trusted GenAI platform or a custom RAG solution based on open-source components like LangChain or LlamaIndex.

This is just the first step in a journey - we made open-rag-eval open source and Apache 2.0 licensed for two main reasons:

  • So that the implementation of each metric is well understood and open for review and improvements.
  • To enable additional contributions from the AI community.

Figure 2: open-rag-eval architecture

As shown in Figure 2, open-rag-eval includes two steps.

First, we take a list of queries and retrieve the context (facts) and response for each of them. This is accomplished via a RAG connector, which connects to Vectara or any other RAG pipeline of your choice. Alternatively, if you already have the RAG input/output data, you can use it directly.

Once the RAG outputs are collected, in the second step open-rag-eval runs on each query/facts/response instance, to generate a “RAG evaluation” result. Each such evaluation uses one or more evaluators, which provide a set of metrics.

Metrics are the core of the evaluation. Each metric has a different focus and evaluates a different aspect of the RAG system, and each can utilize one or more “models” - LLMs, classifiers, rule-based systems, and so on - that judge the quality of the RAG system along that dimension.
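
To make this structure concrete, here is a minimal conceptual sketch of the two-step flow; the class and function names below are illustrative only, not the actual open-rag-eval API.

```python
# Conceptual sketch only: names are illustrative, not the real open-rag-eval API.
from dataclasses import dataclass, field

@dataclass
class RAGResult:
    """Step 1 output: what the RAG pipeline returned for one query."""
    query: str
    passages: list[str]   # retrieved facts / context
    response: str         # generated answer

@dataclass
class EvaluationResult:
    """Step 2 output: metric scores for one query/facts/response instance."""
    query: str
    scores: dict[str, float] = field(default_factory=dict)

def run_evaluation(results: list[RAGResult], metrics) -> list[EvaluationResult]:
    """Apply each metric (backed by an LLM judge, classifier, etc.) to each instance."""
    return [
        EvaluationResult(
            query=r.query,
            scores={m.name: m.score(r.query, r.passages, r.response) for m in metrics},
        )
        for r in results
    ]
```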

Metrics in Open-RAG-Eval

Okay, so let’s get to the cool stuff - the metrics.

In this first release of open-rag-eval, we have included the following metrics.

Retrieval metrics

UMBRELA

These retrieval metrics are based on the UMBRELA paper. In a nutshell, an LLM is used as a judge to rate each retrieved passage from the corpus in conjunction with the query. For each passage, the LLM is asked to assign a score from 0-3 based on the following criteria:

💡 UMBRELA score criteria

Given a query and a passage, you must provide a score on an integer scale of 0 to 3 with the following meanings:
0 = represents that the passage has nothing to do with the query,
1 = represents that the passage seems related to the query but does not answer it,
2 = represents that the passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information and
3 = represents that the passage is dedicated to the query and contains the exact answer.
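
To make the judging step concrete, here is a minimal sketch of how the criteria above can be wrapped into an LLM call; the judge_llm callable is a placeholder for whatever LLM client you use and is not part of open-rag-eval.

```python
# Sketch of an UMBRELA-style judging call; judge_llm is a placeholder for your LLM client.
UMBRELA_PROMPT = """Given a query and a passage, provide a score on an integer scale of 0 to 3:
0 = the passage has nothing to do with the query,
1 = the passage seems related to the query but does not answer it,
2 = the passage has some answer for the query, but it may be unclear or hidden amongst extraneous information,
3 = the passage is dedicated to the query and contains the exact answer.

Query: {query}
Passage: {passage}
Score:"""

def umbrela_score(query: str, passage: str, judge_llm) -> int:
    """Grade one (query, passage) pair on the 0-3 UMBRELA scale."""
    reply = judge_llm(UMBRELA_PROMPT.format(query=query, passage=passage))
    score = int(reply.strip()[0])   # parse the leading digit from the judge's reply
    return min(max(score, 0), 3)    # clamp to the valid range
```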

In technical parlance, these are called relevance judgments, and from them it is possible to compute precision@k, recall@k, nDCG, or any other standard retrieval metric you prefer.

In our implementation, we use the UMBRELA scores directly and average them across all the retrieval results for a query. The detailed result output also includes the score assigned to each passage, in case manual inspection is desired. If users prefer, they can use UMBRELA in their own pipeline to exhaustively label all passages in their dataset for a given query and thus obtain the more traditional retrieval metrics such as nDCG, recall@k, and precision@k.
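
For reference, here is a minimal sketch (not the package's internal code) of both aggregation paths: averaging the per-passage UMBRELA scores for a query, and computing nDCG@k if you have labeled every passage.

```python
import math

def mean_umbrela(scores: list[int]) -> float:
    """Average the 0-3 UMBRELA scores over the retrieved passages for one query."""
    return sum(scores) / len(scores) if scores else 0.0

def ndcg_at_k(retrieved_grades: list[int], all_grades: list[int], k: int) -> float:
    """nDCG@k using UMBRELA grades as graded relevance judgments."""
    def dcg(grades):
        return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))
    ideal = dcg(sorted(all_grades, reverse=True))
    return dcg(retrieved_grades) / ideal if ideal > 0 else 0.0

# For the three example passages discussed below (scored 3, 2 and 1):
print(mean_umbrela([3, 2, 1]))             # 2.0
print(ndcg_at_k([3, 2, 1], [3, 2, 1], 3))  # 1.0 (already in ideal order)
```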

As an example, consider the query “accredited investor definition”. Let’s look at some of the passages and the assigned UMBRELA scores:

The passage “The federal securities laws define the term accredited investor in Rule 501 of Regulation D as: a bank, insurance company, registered investment company, business development company, or small business investment company; an employee benefit plan, within the meaning of the Employee Retirement Income Security Act, if a bank, insurance company, or registered investment adviser makes the investment decisions, or if the plan has total assets in excess of $5 million; a charitable organization, corporation, or partnership with assets exceeding $5 million; a director, executive officer, or general partner of the company selling the securities; a business in which all the equity owners are accredited investors; a natural person who has individual net worth, or joint net worth with the person’s spouse, that exceeds $1 million at the time of the purchase, excluding the value of the primary residence of such person; a natural person with income exceeding $200,000 in each of the two most recent years or joint income with a spouse exceeding $300,000 for those years and a reasonable expectation of the same income level in the current year; or a trust with assets in excess of $5 million, not formed to acquire the securities offered, whose purchases a sophisticated person makes. No citizenship/residency requirements.” receives a score of 3, meaning it contains the exact answer.

In contrast, the passage “The (U.S.) ‘accredited investor’ laws apply to investments in the U.S. Foreign countries may or may not have their own laws regarding investment in startups, and if so, the foreign laws apply. One way around the net worth minimum is to be a member of the management team. ‘Active’ (management) investors don't need to be accredited because they can see what's going on on a day to day basis. The accredited investor laws apply to the target companies, not to the investors. Basically, a start-up company can't take ‘other people's money’ from a non-accredited investor. But you can invest ‘your own’ money in it if you are a manager.” receives a score of 2.

And the passage “Does me holding stock in the company make me an accredited investor with this company in particular? No. But maybe the site will let you trade your shares to another accredited investor. Just ask; if the site operators have a securities lawyer they should be able to accommodate” receives a score of 1.

It’s important to note that UMBRELA generates relevance judgments automatically without human effort, and yet has been shown to correlate well with manual (human) judgments. This means that evaluations can happen without human intervention or a “golden answers” set (which is nearly impossible to get in most practical circumstances).

Generation metrics

AutoNugget

These metrics are based on the AutoNuggetizer paper. They score the generated response in a two-step process, which works as follows:

1. Nugget Creation: An LLM judge is asked to create a set of ‘Nuggets’ of information based on retrieved passages. One way to understand nuggets is as ‘atomic facts’ that should appear in your responses. Each nugget gets assigned a ‘vital’ or ‘okay’ rating depending on whether it must absolutely be present in the generated response to the query or whether it’s good to have but not absolutely necessary.

2. Nugget Assignment: In this step, the nuggets from step 1 are treated as a list of desirable facts in the output and an LLM judge then determines whether the nugget appears in the generated response from the RAG system or not. In the AutoNuggetizer terminology, this answers the question of whether the generated response supports the nugget or not.

Using the nugget assignment and the nugget importance measure (‘vital’ or ‘okay’) we can compute a variety of scores described in the paper about how good the generated response is. open-rag-eval implements all the scores, and they differ only in how much weight they assign to nuggets with different labels and nugget-matching criteria.

As an example - for the query from above (“accredited investor definition”) and the passages that we show above in the UMBRELA section, here is a subset of example nuggets.

Note that the nuggets are derived from the retrieved passages based on the query and the support assignment is done based on the generated response from the RAG pipeline.

In our example, the generated response is:

💡 Generated Response

Based on the provided sources, an accredited investor is defined as:
* A bank, insurance company, registered investment company, business development company, or small business investment company,
* An employee benefit plan with assets exceeding $5 million,
* A charitable organization, corporation, or partnership with assets exceeding $5 million,
* A director, executive officer, or general partner of the company selling the securities,
* A business in which all the equity owners are accredited investors,
* A natural person who has individual net worth, or joint net worth with the person’s spouse, that exceeds $1 million at the time of the purchase, excluding the value of the primary residence of such person,
* A natural person with income exceeding $200,000 in each of the two most recent years or joint income with a spouse exceeding $300,000 for those years and a reasonable expectation of the same income level in the current year,
* A trust with assets in excess of $5 million, not formed to acquire the securities offered, whose purchases a sophisticated person makes,

Additionally, accredited investors are required to have:
* $1 million in assets (not including primary residence) or $200,000/yr income for the last 3 years. It’s worth noting that being an accredited investor does not necessarily mean that one is a member of the management team, and holding stock in a company does not automatically make one an accredited investor with that company.

And the nuggets and their scores are:

  • Nugget: 'Defined in SEC Rule 501 of Regulation D' (Importance: Vital; Support: No)
  • Nugget: 'Net worth over $1 million, exclude primary home' (Importance: Vital; Support: Yes)
  • Nugget: 'Income $200,000 or $300,000 with spouse' (Importance: Vital; Support: Yes)
  • Nugget: 'Trust with over $5 million assets qualifies' (Importance: Vital; Support: Yes)
  • Nugget: 'Includes banks, insurance, investment companies' (Importance: Vital; Support: Yes)
  • Nugget: 'Businesses with all accredited equity owners' (Importance: Vital; Support: Yes)
  • Nugget: 'SEC regulations indicate high-risk investing' (Importance: Okay; Support: No)
  • Nugget: 'Critical for investments in U.S startups' (Importance: Okay; Support: No)
  • Nugget: 'About 1% of U.S. are accredited investors' (Importance: Okay; Support: No)
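
As a rough illustration of how these nugget labels turn into scores, the sketch below computes a vital-only score and a weighted score for the example above. The 0.5 weight for 'okay' nuggets is an illustrative assumption; the exact weighting schemes are the ones defined in the AutoNuggetizer paper and implemented in open-rag-eval.

```python
# Example nuggets from above as (importance, supported) pairs.
nuggets = [
    ("vital", False), ("vital", True), ("vital", True), ("vital", True),
    ("vital", True), ("vital", True),
    ("okay", False), ("okay", False), ("okay", False),
]

def vital_score(nuggets):
    """Fraction of vital nuggets supported by the generated response."""
    vital = [supported for importance, supported in nuggets if importance == "vital"]
    return sum(vital) / len(vital)

def weighted_score(nuggets, okay_weight=0.5):
    """All nuggets count, with 'okay' nuggets down-weighted (weight is illustrative)."""
    gained = sum((1.0 if imp == "vital" else okay_weight) * sup for imp, sup in nuggets)
    possible = sum(1.0 if imp == "vital" else okay_weight for imp, _ in nuggets)
    return gained / possible

print(vital_score(nuggets))     # 5/6 ≈ 0.83
print(weighted_score(nuggets))  # 5/7.5 ≈ 0.67
```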

Citation

The citation metric uses an LLM judge to determine if the citations in the generated RAG response are supported by the passages the generative model says they came from. The generated response from the RAG pipeline is broken down into sentences that carry a citation, and each such sentence is compared by the LLM judge to the passage its citation points to. One of three levels of support is assigned to each sentence-passage pair:

  • Full Support: All of the information in the response sentence is factually consistent and fully supported in the cited passage.
  • Partial Support: Some of the information in the response sentence is factually consistent in the cited passage, but other parts of the information are not supported.

  • No Support: The cited passage is completely irrelevant and does not support any part of the response sentence.
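
One straightforward way to turn these labels into a single number is to map each level to a score and average over the cited sentences; the mapping below is an illustrative assumption, not necessarily the exact weighting open-rag-eval uses.

```python
# Illustrative label-to-score mapping; open-rag-eval's exact weighting may differ.
SUPPORT_SCORES = {"full": 1.0, "partial": 0.5, "none": 0.0}

def citation_score(sentence_labels: list[str]) -> float:
    """Average citation support over all cited sentences in a response."""
    if not sentence_labels:
        return 0.0
    return sum(SUPPORT_SCORES[label] for label in sentence_labels) / len(sentence_labels)

print(citation_score(["full", "full", "partial"]))  # ≈ 0.83
```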

Hallucination

The hallucination metric uses Vectara’s Hallucination Evaluation Model, a leading model for detecting RAG hallucinations, as described in this blog post.

Note that this metric has some similarities with the Citation metric above. The difference is that the citation metric is computed per cited sentence and only considers if the information in the sentence came from the passage that was cited, whereas the hallucination metric treats the generated summary holistically to determine if it is consistent with the retrieved passages.
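
If you want to experiment with the underlying model directly, HHEM is available on Hugging Face as vectara/hallucination_evaluation_model. The snippet below follows the usage pattern shown on that model card at the time of writing; verify it against the current card before relying on it.

```python
from transformers import AutoModelForSequenceClassification

# HHEM scores (premise, hypothesis) pairs; scores near 1.0 mean the generated
# summary (hypothesis) is factually consistent with the retrieved passages (premise).
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

pairs = [(
    "The federal securities laws define the term accredited investor in Rule 501 of Regulation D.",
    "An accredited investor is defined in SEC Rule 501 of Regulation D.",
)]
scores = model.predict(pairs)  # one consistency score per (premise, hypothesis) pair
print(scores)
```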

You can see complete examples for the Citations and Hallucination metrics in our full walk-through example notebook.

Running Open-RAG-Eval

To use open-rag-eval, you first have to run your RAG pipeline on a set of queries and record both the retrieved chunks and the generated responses in a CSV file. If you are using Vectara, we have already created a connector to a Vectara corpus that automates this process and makes it easy.

Once you have this file (let’s call it rag_results.csv), you can run the evaluation. Using the TRECEvaluator class is quite simple:
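
A minimal sketch of that step is shown below. TRECEvaluator and plot_metrics are named in this post, but the exact import path, constructor arguments, and evaluation method name are assumptions here; check the open-rag-eval repository and the example notebook for the real API.

```python
# Sketch only: verify the exact open-rag-eval API against the repository.
from open_rag_eval.evaluators import TRECEvaluator  # import path assumed

evaluator = TRECEvaluator()                             # judge-LLM credentials typically via env vars
scored_results = evaluator.evaluate("rag_results.csv")  # method name assumed; scores every query in the file
evaluator.plot_metrics(scored_results)                  # visualize the averaged metrics
```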


This evaluation may take a while depending on how many queries you are testing.

Once it’s done, you can look at the scored_results for a detailed query-by-query view, or create a graph to show them more visually (using TRECEvaluator.plot_metrics).

The full code walk-through is available in this example notebook.

You have now completed a full run of open-rag-eval. To compare multiple configurations, you can generate additional RAG results output files (e.g. rag_results1.csv, rag_results2.csv), each corresponding to a different RAG configuration, and then compare the results to see which configuration provides the best response quality for your RAG application.
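
Continuing the sketch above (with the same assumed API), comparing configurations then amounts to evaluating each result file and comparing the averaged metrics:

```python
# Sketch only, reusing the assumed evaluator API from the previous snippet.
for name, path in [("config_A", "rag_results1.csv"), ("config_B", "rag_results2.csv")]:
    scored = evaluator.evaluate(path)  # method name assumed
    print(name, scored)                # inspect or aggregate per-metric averages to compare configurations
```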

RAG evaluation is a journey

The initial use of a RAG evaluation framework like open-rag-eval is to understand your current state in terms of retrieval accuracy and the quality of generated responses, and that in itself is very important.

But RAG is often a dynamic system: the dataset is regularly updated with new files, and the capabilities of your RAG stack continue to improve. This is why a RAG evaluation system is not only useful at a single point in time, to compare RAG system A with RAG system B; it is also impactful on a day-to-day basis.

Imagine for example transitioning from OpenAI’s GPT-4o to GPT-5o, or from Llama3 to Llama4. Do you need to keep all the other configurations as they are or would you change anything to get better responses? With reliable evaluation, you can quickly run a test to understand if a new operating point for your RAG system is valuable.

Open-rag-eval can and should be used regularly, to provide a continuous view of retrieval and generation metrics, and ultimately response quality.

Conclusion: From RAG implementation to RAG excellence

For today's technology leaders, deploying AI systems is a critical initiative, allowing any business to take advantage of the immense power of LLM technology and remain competitive. RAG is at the heart of most enterprise AI initiatives.

The open-rag-eval framework represents a significant advancement in this critical aspect of enterprise AI deployment, offering the scientific rigor and practical utility needed to build high-performing RAG systems that deliver on their promises. By addressing the full spectrum of evaluation needs—from retrieval quality to factual consistency—our framework helps bridge the gap between technical capabilities and business outcomes.

The stakes are high. Organizations that implement robust evaluation methodologies will build RAG systems that drive competitive advantage through superior information access and utilization. Those who neglect evaluation risk building on uncertain foundations, potentially wasting significant investment and eroding trust in their AI initiatives.

Looking ahead, we're exploring several promising directions:

  1. Domain-Specific Evaluation: Tailoring evaluation methodologies to the unique requirements of different industries and use cases.
  2. Multi-Modal RAG Assessment: Extending evaluation capabilities to RAG systems that incorporate content from tables, images, audio, and other non-textual sources.
  3. Continuous Learning: Developing evaluation approaches that assess how RAG systems adapt and improve over time.
  4. User Experience Correlation: Strengthening the connection between technical metrics and actual user satisfaction.
  5. Agentic Metrics: Adding high-quality metrics for evaluating agentic RAG workflows that include LLM decision-making, tool use, and more.

Through our ongoing collaboration with the University of Waterloo and other partners in the AI ecosystem, we're committed to pushing the boundaries of what's possible in RAG evaluation, ensuring that our customers always have access to the most advanced tools for optimizing their information access capabilities.

Are you ready to try open-rag-eval?

Give it a try by following the guidance in this blog post or contact our team and we can help.

And if you want to suggest a new metric please submit a PR.
