
Towards a Gold Standard for RAG Evaluation

Vectara's new open-source framework for comprehensive RAG evaluation, open-rag-eval, represents a significant leap forward in ensuring that AI systems deliver accurate, relevant responses.


Introduction

Most organizations implement RAG solutions without a systematic approach to measuring the quality of their responses.

Not using an evaluation framework isn’t just a technical oversight; it's a business risk that can undermine your entire AI strategy.

Your RAG stack includes multiple complex components, and each of these can be configured in multiple ways. There is no clear guide for what configuration settings will result in the highest quality responses to users.

For example:

  • Whether to use fixed token chunking or semantic chunking;
  • Whether to use hybrid or vector search and what value to use for lambda in hybrid search deployments;
  • Which LLM to use and how to optimize RAG prompts;
  • Which threshold to use for hallucination detection and correction, and more.

As the saying goes: “If you cannot measure it, you cannot improve it.” – and if you cannot measure response quality properly, not only are you blind to quality issues that do exist in your implementation, but it is also difficult to know which lever to pull once you do identify a problem.

Enter RAG evaluation.

With a reliable RAG evaluation framework, you can compare various RAG stacks or different configurations of the same stack to see which one works better.

The result is more relevant and accurate responses to the users of your RAG application, leading to higher user satisfaction, increased productivity, reduced hallucinations, and stronger returns on your AI investments.

What is RAG evaluation?

The basic idea behind RAG is to augment the LLM with a powerful information retrieval sub-system that pulls relevant information from a large dataset (or corpus) and presents this information to the LLM for response generation, as shown in Figure 1:

Figure 1: Basic RAG architecture

From a user’s point of view, response quality comes down to one question: “how accurate and helpful is the response to the question I asked?” Users want the answer to be correct (based on accurate facts) and also helpful.

This boils down to two major types of metrics:

  1. Retrieval metrics: these metrics tell us if the facts retrieved from the source data are relevant to the question and can be successfully used as grounding data for the response.
  2. Generation metrics: these metrics evaluate the response itself, conditioned on the facts - is the response properly grounded in the facts?

When something goes wrong and response quality is low, inspecting these types of metrics often provides valuable insights and helps you adjust your RAG implementation. For example, if the relevance of the retrieved facts is low, it might mean that your retrieval pipeline is not good enough, or perhaps you don’t actually have any relevant information in your dataset. If your facts are relevant but the generated LLM response is hallucinated, perhaps you should use a stronger LLM, update the RAG prompt, or both.

The point is: RAG evaluation can be very actionable, and can help you not only see the issues but also fix them.

Introducing Open-RAG-Eval

Today we are excited to share open-rag-eval - Vectara’s open source RAG evaluation package.

As part of our development effort, we have been thinking a lot about metrics. It’s important to have not only metrics that seem “reasonable”, but also metrics that are robust, validated and practical to implement for large enterprise deployments.

In collaboration with Professor Jimmy Lin's team at the University of Waterloo, which brings decades of experience creating world-class benchmarks and datasets for information retrieval evaluation, we designed open-rag-eval around a set of robust retrieval and generation metrics that the group developed with collaborators for the TREC RAG challenge. That challenge drew participation from roughly two dozen groups around the world, who collectively submitted over 100 runs; to our knowledge, this is the largest community-wide evaluation of RAG (see results here and here). These metrics have been shown in rigorous studies to correlate with human perceptions of quality, and we will continue building on them with open-rag-eval.

We also designed open-rag-eval to be easy to use for the evaluation of any RAG pipeline, whether it’s Vectara’s trusted GenAI platform or a custom RAG solution based on open-source components like LangChain or LlamaIndex.

This is just the first step in a journey - we made open-rag-eval open source and Apache 2.0 licensed for two main reasons:

  • So that the implementation of each metric is well understood and open for review and improvements.
  • To enable additional contributions from the AI community.

Figure 2: open-rag-eval architecture

As shown in Figure 2, open-rag-eval includes two steps.

First, we take a list of queries and retrieve the context (facts) and response for each of them. This is accomplished via a RAG connector, which connects to Vectara or any other RAG pipeline of your choice. Alternatively, if you already have the RAG input/output data, you can use it directly.

Once the RAG outputs are collected, in the second step open-rag-eval runs on each query/facts/response instance, to generate a “RAG evaluation” result. Each such evaluation uses one or more evaluators, which provide a set of metrics.

Metrics are the core of the evaluation. Each metric has a different focus and evaluates a different aspect of the RAG system, and each can utilize one or more “models” - LLMs, classifiers, rule-based systems, and so on - that judge the quality of the RAG system along that dimension.
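
To make this structure concrete, here is a minimal conceptual sketch of the two-step flow; the class and function names below are illustrative only, not the actual open-rag-eval API.

```python
# Conceptual sketch only: names are illustrative, not the real open-rag-eval API.
from dataclasses import dataclass, field

@dataclass
class RAGResult:
    """Step 1 output: what the RAG pipeline returned for one query."""
    query: str
    passages: list[str]   # retrieved facts / context
    response: str         # generated answer

@dataclass
class EvaluationResult:
    """Step 2 output: metric scores for one query/facts/response instance."""
    query: str
    scores: dict[str, float] = field(default_factory=dict)

def run_evaluation(results: list[RAGResult], metrics) -> list[EvaluationResult]:
    """Apply each metric (backed by an LLM judge, classifier, etc.) to each instance."""
    return [
        EvaluationResult(
            query=r.query,
            scores={m.name: m.score(r.query, r.passages, r.response) for m in metrics},
        )
        for r in results
    ]
```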

Metrics in Open-RAG-Eval

Okay, so let’s get to the cool stuff - the metrics.

In this first release of open-rag-eval, we have included the following metrics.

Retrieval metrics

UMBRELA

These retrieval metrics are based on the UMBRELA paper. In a nutshell, an LLM is used as a judge to rate each retrieved passage from the corpus in conjunction with the query. For each passage, the LLM is asked to assign a score from 0-3 based on the following criteria:

💡 UMBRELA score criteria

Given a query and a passage, you must provide a score on an integer scale of 0 to 3 with the following meanings:
0 = represents that the passage has nothing to do with the query,
1 = represents that the passage seems related to the query but does not answer it,
2 = represents that the passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information and
3 = represents that the passage is dedicated to the query and contains the exact answer.
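
To make the judging step concrete, here is a minimal sketch of how the criteria above can be wrapped into an LLM call; the judge_llm callable is a placeholder for whatever LLM client you use and is not part of open-rag-eval.

```python
# Sketch of an UMBRELA-style judging call; judge_llm is a placeholder for your LLM client.
UMBRELA_PROMPT = """Given a query and a passage, provide a score on an integer scale of 0 to 3:
0 = the passage has nothing to do with the query,
1 = the passage seems related to the query but does not answer it,
2 = the passage has some answer for the query, but it may be unclear or hidden amongst extraneous information,
3 = the passage is dedicated to the query and contains the exact answer.

Query: {query}
Passage: {passage}
Score:"""

def umbrela_score(query: str, passage: str, judge_llm) -> int:
    """Grade one (query, passage) pair on the 0-3 UMBRELA scale."""
    reply = judge_llm(UMBRELA_PROMPT.format(query=query, passage=passage))
    score = int(reply.strip()[0])   # parse the leading digit from the judge's reply
    return min(max(score, 0), 3)    # clamp to the valid range
```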

In technical parlance, these are called relevance judgments, and from them it is possible to compute precision@k, recall@k, nDCG, or any other standard retrieval metric you prefer.

In our implementation, we use the UMBRELA scores directly and average them across all the retrieval results for a query. The detailed result output also includes the score assigned to each passage, in case manual inspection is desired. If users prefer, they can use UMBRELA in their own pipeline to exhaustively label all passages in their dataset for a given query and thus obtain the more traditional retrieval metrics such as nDCG, recall@k, and precision@k.
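
For reference, here is a minimal sketch (not the package's internal code) of both aggregation paths: averaging the per-passage UMBRELA scores for a query, and computing nDCG@k if you have labeled every passage.

```python
import math

def mean_umbrela(scores: list[int]) -> float:
    """Average the 0-3 UMBRELA scores over the retrieved passages for one query."""
    return sum(scores) / len(scores) if scores else 0.0

def ndcg_at_k(retrieved_grades: list[int], all_grades: list[int], k: int) -> float:
    """nDCG@k using UMBRELA grades as graded relevance judgments."""
    def dcg(grades):
        return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))
    ideal = dcg(sorted(all_grades, reverse=True))
    return dcg(retrieved_grades) / ideal if ideal > 0 else 0.0

# For the three example passages discussed below (scored 3, 2 and 1):
print(mean_umbrela([3, 2, 1]))             # 2.0
print(ndcg_at_k([3, 2, 1], [3, 2, 1], 3))  # 1.0 (already in ideal order)
```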

As an example, consider the query “accredited investor definition”. Let’s look at some of the passages and the assigned UMBRELA scores:

The passage “The federal securities laws define the term accredited investor in Rule 501 of Regulation D as: a bank, insurance company, registered investment company, business development company, or small business investment company; an employee benefit plan, within the meaning of the Employee Retirement Income Security Act, if a bank, insurance company, or registered investment adviser makes the investment decisions, or if the plan has total assets in excess of $5 million; a charitable organization, corporation, or partnership with assets exceeding $5 million; a director, executive officer, or general partner of the company selling the securities; a business in which all the equity owners are accredited investors; a natural person who has individual net worth, or joint net worth with the person’s spouse, that exceeds $1 million at the time of the purchase, excluding the value of the primary residence of such person; a natural person with income exceeding $200,000 in each of the two most recent years or joint income with a spouse exceeding $300,000 for those years and a reasonable expectation of the same income level in the current year; or a trust with assets in excess of $5 million, not formed to acquire the securities offered, whose purchases a sophisticated person makes. No citizenship/residency requirements.” receives a score of 3, meaning it contains the exact answer.

In contrast, the passage “The (U.S.) ‘accredited investor’ laws apply to investments in the U.S. Foreign countries may or may not have their own laws regarding investment in startups, and if so, the foreign laws apply. One way around the net worth minimum is to be a member of the management team. ‘Active’ (management) investors don't need to be accredited because they can see what's going on on a day to day basis. The accredited investor laws apply to the target companies, not to the investors. Basically, a start-up company can't take ‘other people's money’ from a non-accredited investor. But you can invest ‘your own’ money in it if you are a manager.” receives a score of 2.

And the passage “Does me holding stock in the company make me an accredited investor with this company in particular? No. But maybe the site will let you trade your shares to another accredited investor. Just ask; if the site operators have a securities lawyer they should be able to accommodate” receives a score of 1.

It’s important to note that UMBRELA generates relevance judgments automatically without human effort, and yet has been shown to correlate well with manual (human) judgments. This means that evaluations can happen without human intervention or a “golden answers” set (which is nearly impossible to get in most practical circumstances).

Generation metrics

AutoNugget

These metrics are based on the AutoNuggetizer paper. They score the generated response in a two-step process, which works as follows:

1. Nugget Creation: An LLM judge is asked to create a set of ‘Nuggets’ of information based on retrieved passages. One way to understand nuggets is as ‘atomic facts’ that should appear in your responses. Each nugget gets assigned a ‘vital’ or ‘okay’ rating depending on whether it must absolutely be present in the generated response to the query or whether it’s good to have but not absolutely necessary.

2. Nugget Assignment: In this step, the nuggets from step 1 are treated as a list of desirable facts in the output and an LLM judge then determines whether the nugget appears in the generated response from the RAG system or not. In the AutoNuggetizer terminology, this answers the question of whether the generated response supports the nugget or not.

Using the nugget assignment and the nugget importance measure (‘vital’ or ‘okay’) we can compute a variety of scores described in the paper about how good the generated response is. open-rag-eval implements all the scores, and they differ only in how much weight they assign to nuggets with different labels and nugget-matching criteria.

As an example - for the query from above (“accredited investor definition”) and the passages that we show above in the UMBRELA section, here is a subset of example nuggets.

Note that the nuggets are derived from the retrieved passages based on the query and the support assignment is done based on the generated response from the RAG pipeline.

In our example, the generated response is:

💡 Generated Response

Based on the provided sources, an accredited investor is defined as:
* A bank, insurance company, registered investment company, business development company, or small business investment company,
* An employee benefit plan with assets exceeding $5 million,
* A charitable organization, corporation, or partnership with assets exceeding $5 million,
* A director, executive officer, or general partner of the company selling the securities,
* A business in which all the equity owners are accredited investors,
* A natural person who has individual net worth, or joint net worth with the person’s spouse, that exceeds $1 million at the time of the purchase, excluding the value of the primary residence of such person,
* A natural person with income exceeding $200,000 in each of the two most recent years or joint income with a spouse exceeding $300,000 for those years and a reasonable expectation of the same income level in the current year,
* A trust with assets in excess of $5 million, not formed to acquire the securities offered, whose purchases a sophisticated person makes,

Additionally, accredited investors are required to have:
* $1 million in assets (not including primary residence) or $200,000/yr income for the last 3 years. It’s worth noting that being an accredited investor does not necessarily mean that one is a member of the management team, and holding stock in a company does not automatically make one an accredited investor with that company.

And the nuggets and their scores are:

  • Nugget: 'Defined in SEC Rule 501 of Regulation D' (Importance: Vital; Support: No)
  • Nugget: 'Net worth over $1 million, exclude primary home' (Importance: Vital; Support: Yes)
  • Nugget: 'Income $200,000 or $300,000 with spouse' (Importance: Vital; Support: Yes)
  • Nugget: 'Trust with over $5 million assets qualifies' (Importance: Vital; Support: Yes)
  • Nugget: 'Includes banks, insurance, investment companies' (Importance: Vital; Support: Yes)
  • Nugget: 'Businesses with all accredited equity owners' (Importance: Vital; Support: Yes)
  • Nugget: 'SEC regulations indicate high-risk investing' (Importance: Okay; Support: No)
  • Nugget: 'Critical for investments in U.S startups' (Importance: Okay; Support: No)
  • Nugget: 'About 1% of U.S. are accredited investors' (Importance: Okay; Support: No)
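
As a rough illustration of how these nugget labels turn into scores, the sketch below computes a vital-only score and a weighted score for the example above. The 0.5 weight for 'okay' nuggets is an illustrative assumption; the exact weighting schemes are the ones defined in the AutoNuggetizer paper and implemented in open-rag-eval.

```python
# Example nuggets from above as (importance, supported) pairs.
nuggets = [
    ("vital", False), ("vital", True), ("vital", True), ("vital", True),
    ("vital", True), ("vital", True),
    ("okay", False), ("okay", False), ("okay", False),
]

def vital_score(nuggets):
    """Fraction of vital nuggets supported by the generated response."""
    vital = [supported for importance, supported in nuggets if importance == "vital"]
    return sum(vital) / len(vital)

def weighted_score(nuggets, okay_weight=0.5):
    """All nuggets count, with 'okay' nuggets down-weighted (weight is illustrative)."""
    gained = sum((1.0 if imp == "vital" else okay_weight) * sup for imp, sup in nuggets)
    possible = sum(1.0 if imp == "vital" else okay_weight for imp, _ in nuggets)
    return gained / possible

print(vital_score(nuggets))     # 5/6 ≈ 0.83
print(weighted_score(nuggets))  # 5/7.5 ≈ 0.67
```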

Citation

The citation metric uses an LLM judge to determine if the citations in the generated RAG response are supported by the passages the generative model says they came from. The generated response from the RAG pipeline is broken down into sentences that carry a citation, and each such sentence is compared by the LLM judge to the passage its citation points to. One of three levels of support is assigned to each sentence-passage pair:

  • Full Support: All of the information in the response sentence is factually consistent and fully supported in the cited passage.
  • Partial Support: Some of the information in the response sentence is factually consistent in the cited passage, but other parts of the information are not supported.

  • No Support: The cited passage is completely irrelevant and does not support any part of the response sentence.
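
One straightforward way to turn these labels into a single number is to map each level to a score and average over the cited sentences; the mapping below is an illustrative assumption, not necessarily the exact weighting open-rag-eval uses.

```python
# Illustrative label-to-score mapping; open-rag-eval's exact weighting may differ.
SUPPORT_SCORES = {"full": 1.0, "partial": 0.5, "none": 0.0}

def citation_score(sentence_labels: list[str]) -> float:
    """Average citation support over all cited sentences in a response."""
    if not sentence_labels:
        return 0.0
    return sum(SUPPORT_SCORES[label] for label in sentence_labels) / len(sentence_labels)

print(citation_score(["full", "full", "partial"]))  # ≈ 0.83
```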

Hallucination

The hallucination metric uses Vectara’s Hallucination Evaluation Model, a leading model for detecting RAG hallucinations, as described in this blog post.

Note that this metric has some similarities with the Citation metric above. The difference is that the citation metric is computed per cited sentence and only considers if the information in the sentence came from the passage that was cited, whereas the hallucination metric treats the generated summary holistically to determine if it is consistent with the retrieved passages.
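
If you want to experiment with the underlying model directly, HHEM is available on Hugging Face as vectara/hallucination_evaluation_model. The snippet below follows the usage pattern shown on that model card at the time of writing; verify it against the current card before relying on it.

```python
from transformers import AutoModelForSequenceClassification

# HHEM scores (premise, hypothesis) pairs; scores near 1.0 mean the generated
# summary (hypothesis) is factually consistent with the retrieved passages (premise).
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

pairs = [(
    "The federal securities laws define the term accredited investor in Rule 501 of Regulation D.",
    "An accredited investor is defined in SEC Rule 501 of Regulation D.",
)]
scores = model.predict(pairs)  # one consistency score per (premise, hypothesis) pair
print(scores)
```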

You can see complete examples for the Citations and Hallucination metrics in our full walk-through example notebook.

Running Open-RAG-Eval

To use open-rag-eval, you first have to run your RAG pipeline on a set of queries and record both the retrieved chunks and the generated responses in a CSV file. If you are using Vectara, we have already created a connector to a Vectara corpus that automates this process and makes it easy.

Once you have this file (let’s call it rag_results.csv), you can run the evaluation. Using the TRECEvaluator class is quite simple:
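
A minimal sketch of that step is shown below. TRECEvaluator and plot_metrics are named in this post, but the exact import path, constructor arguments, and evaluation method name are assumptions here; check the open-rag-eval repository and the example notebook for the real API.

```python
# Sketch only: verify the exact open-rag-eval API against the repository.
from open_rag_eval.evaluators import TRECEvaluator  # import path assumed

evaluator = TRECEvaluator()                             # judge-LLM credentials typically via env vars
scored_results = evaluator.evaluate("rag_results.csv")  # method name assumed; scores every query in the file
evaluator.plot_metrics(scored_results)                  # visualize the averaged metrics
```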


This evaluation may take a while depending on how many queries you are testing.

Once it’s done, you can look at the scored_results for a detailed query-by-query view, or create a graph to show them more visually (using TRECEvaluator.plot_metrics).

The full code walk-through is available in this example notebook.

You have now completed a full run of open-rag-eval. To compare multiple configurations, you can generate additional RAG results output files (e.g. rag_results1.csv, rag_results2.csv), each corresponding to a different RAG configuration, and then compare the results to see which configuration provides the best response quality for your RAG application.
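
Continuing the sketch above (with the same assumed API), comparing configurations then amounts to evaluating each result file and comparing the averaged metrics:

```python
# Sketch only, reusing the assumed evaluator API from the previous snippet.
for name, path in [("config_A", "rag_results1.csv"), ("config_B", "rag_results2.csv")]:
    scored = evaluator.evaluate(path)  # method name assumed
    print(name, scored)                # inspect or aggregate per-metric averages to compare configurations
```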

RAG evaluation is a journey

The initial use of a RAG evaluation framework like open-rag-eval is to understand your current state in terms of retrieval accuracy and the quality of generated responses, and that in itself is very important.

But RAG is often a dynamic system: the dataset is regularly updated with new files, and the capabilities of your RAG stack continue to improve. This is why a RAG evaluation system is not only useful at a single point in time, to compare RAG system A with RAG system B; it is also impactful on a day-to-day basis.

Imagine for example transitioning from OpenAI’s GPT-4o to GPT-5o, or from Llama3 to Llama4. Do you need to keep all the other configurations as they are or would you change anything to get better responses? With reliable evaluation, you can quickly run a test to understand if a new operating point for your RAG system is valuable.

Open-rag-eval can and should be used regularly, to provide a continuous view of retrieval and generation metrics, and ultimately response quality.

Conclusion: From RAG implementation to RAG excellence

For today's technology leaders, deploying AI systems is a critical initiative, allowing any business to take advantage of the immense power of LLM technology and remain competitive. RAG is at the heart of most enterprise AI initiatives.

The open-rag-eval framework represents a significant advancement in this critical aspect of enterprise AI deployment, offering the scientific rigor and practical utility needed to build high-performing RAG systems that deliver on their promises. By addressing the full spectrum of evaluation needs—from retrieval quality to factual consistency—our framework helps bridge the gap between technical capabilities and business outcomes.

The stakes are high. Organizations that implement robust evaluation methodologies will build RAG systems that drive competitive advantage through superior information access and utilization. Those who neglect evaluation risk building on uncertain foundations, potentially wasting significant investment and eroding trust in their AI initiatives.

Looking ahead, we're exploring several promising directions:

  1. Domain-Specific Evaluation: Tailoring evaluation methodologies to the unique requirements of different industries and use cases.
  2. Multi-Modal RAG Assessment: Extending evaluation capabilities to RAG systems that incorporate content from tables, images, audio, and other non-textual sources.
  3. Continuous Learning: Developing evaluation approaches that assess how RAG systems adapt and improve over time.
  4. User Experience Correlation: Strengthening the connection between technical metrics and actual user satisfaction.
  5. Agentic Metrics: Adding high-quality metrics for evaluating agentic RAG workflows that include LLM decision-making, tool use, and more.

Through our ongoing collaboration with the University of Waterloo and other partners in the AI ecosystem, we're committed to pushing the boundaries of what's possible in RAG evaluation, ensuring that our customers always have access to the most advanced tools for optimizing their information access capabilities.

Are you ready to try open-rag-eval?

Give it a try by following the guidance in this blog post or contact our team and we can help.

And if you want to suggest a new metric please submit a PR.
