Evaluating RAG with RAGAs

Introduction

One of the emerging challenges in building enterprise-scale RAG solutions like chatbots and question-answering applications is measuring the quality of the responses and choosing the settings that perform best over your data: for example, whether or not to use hybrid search or max marginal relevance (MMR), which LLM to use, and which generative prompt works best.

This is important for the initial MVP and becomes increasingly critical as you evolve your application and increase the amount, type, and diversity of the data supporting your RAG pipeline.

What does “optimal performance” mean in the context of RAG?

Well, it’s really about the quality of the responses that users see to their questions. Is the response correct given the data it is grounded on? Does it hallucinate or does it ultimately provide the right answer to the end-user?

Enter RAG quality metrics: a set of metrics commonly used to measure the quality of a RAG pipeline.

Once measurement is in place, you can not only understand the performance of your RAG pipeline, but also compare and optimize its performance over your data, and pick the best configuration that provides optimal quality to your end users.

RAGAs: an Open Source Framework for RAG Metrics

In this blog post, we will show how to use RAGAs, an open-source RAG evaluation tool, to measure the performance of Vectara’s RAG-as-a-service and optimize the quality over a specific dataset.

RAGAs provides two main areas of functionality: RAG metrics evaluation and synthetic data generation for evaluation.

RAG Metrics

To explain how RAG metrics work, we first need to define some terminology. In a RAG pipeline we have 4 components:

Question: the query from the user
Context: an array of relevant text chunks (or facts) returned by the retrieval engine, and presented to the LLM in order to generate its response.
Answer: the generated response at the output of the RAG pipeline.
Ground truth answer: a generated response we consider to be the “correct response” that we hope the RAG pipeline will generate.

There are nine types of RAG metrics included in RAGAs. We won’t cover all of them; instead, we describe the key ones that we use in this blog post:

Faithfulness measures the factual consistency of the generated response against the retrieved context: if all the claims that are made in the answer can be inferred from the given context then the response is considered “faithful” to the provided context.
Answer similarity measures the semantic resemblance between the generated answer and the ground truth answer. This is done via cosine similarity between the embedding vectors of the ground truth answer and the generated answer.
Answer relevancy measures how pertinent the generated answer is to the given question. This is computed by generating a number of artificial questions based on the answer and measuring the similarity between the original question and those artificial questions.
Answer correctness measures how accurate the generated answer is relative to a “golden” answer that is deemed to be the correct answer. It is based on a weighted sum of factual consistency and the semantic similarity between the ground-truth answer and the generated response.
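To make these definitions concrete, here is a minimal sketch of the arithmetic behind three of the metrics. It is not the RAGAs implementation: in RAGAs the per-claim verdicts come from an LLM and the vectors come from an embedding model, whereas here they are passed in as plain values. The function names and the 0.75/0.25 weights are assumptions made for illustration.

```python
import math
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Fraction of the claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (used for answer similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_correctness_score(factual: float, similarity: float,
                             weights=(0.75, 0.25)) -> float:
    """Weighted sum of a factual-consistency score and a semantic-similarity score.

    The default weights are an assumption for this sketch.
    """
    w_fact, w_sim = weights
    return (w_fact * factual + w_sim * similarity) / (w_fact + w_sim)
```

For example, an answer with four claims, three of which are supported by the context, gets a faithfulness of 0.75 under this definition.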

Synthetic Data Generation

One of the challenges in RAG evaluation is that you need, at a minimum, a golden set of questions and “curated answers” to run the evaluation against.

As the RAG developer, you can spend some time coming up with these question/answer pairs, but it is a tedious task and one that often requires a lot of knowledge of the dataset.

Synthetic data generation is a relatively new capability in RAGAs, but it’s quite useful exactly for this reason: given a dataset of documents, it generates a set of artificial questions and “curated answers” that can be used for RAG evaluation. Inspired by this paper, RAGAs can generate pretty diverse and comprehensive question/answer pairs that can be used as-is or adapted by a human reviewer.

RAGAs Faithfulness and Vectara’s HHEM

The idea behind the “faithfulness” score is a key capability for LLMs in the context of RAG: can the LLM, given the right facts, respond to the user query while being “faithful” to these facts?

That is exactly the idea behind Vectara’s recently released HHEM (Hughes Hallucination Evaluation Model), which has become super popular for evaluating LLMs for their tendency to hallucinate. As this leaderboard demonstrates, not all LLMs are equal and some hallucinate more than others.

The RAGAs faithfulness is computed using an LLM-as-a-judge approach, whereas HHEM is a classification model, making it more reliable, robust, and an overall better way to judge hallucinations and compute the factual consistency score.

HHEM is available as an open source model and has recently also been integrated into Vectara’s RAG-as-a-service platform. You can now get the FCS (factual consistency score; also known as HHEM v2) as part of our generative summary Query API call.

Running RAGAs with Vectara RAG

As a Vectara developer, it’s super easy to use RAGAs to evaluate the performance of your RAG application. Even though there are many metrics in RAGAs, here we will focus on four key metrics: faithfulness, answer similarity, answer relevancy, and answer correctness.

For this example, we have downloaded all the pages in the Vectara documentation website as HTML pages and indexed them into our Vectara corpus. The first step of course is to create a set of question/answer pairs based on this dataset – we use RAGAs synthetic data generation capability.

We load the documents using LangChain’s DirectoryLoader() and then use the TestsetGenerator class to generate 50 question-and-answer pairs.
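A rough sketch of this step is shown below. It assumes the ragas 0.1.x API (later releases changed the test set generation interface), the langchain-community package with its HTML loading dependencies, and an OPENAI_API_KEY in the environment. The function name, the glob pattern, and the mix of question types are our own choices for illustration, not necessarily what was used for this post.

```python
def generate_testset(docs_dir: str, test_size: int = 50):
    """Generate synthetic question / ground-truth pairs from a folder of HTML files.

    Assumes ragas 0.1.x and langchain-community are installed and that an
    OpenAI API key is available in the environment.
    """
    # Imports live inside the function so this module loads without the packages.
    from langchain_community.document_loaders import DirectoryLoader
    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context

    # Load every HTML page under docs_dir as a LangChain document.
    documents = DirectoryLoader(docs_dir, glob="**/*.html").load()

    # Default OpenAI generator and critic models; the distribution is an assumption.
    generator = TestsetGenerator.with_openai()
    testset = generator.generate_with_langchain_docs(
        documents,
        test_size=test_size,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )
    return testset.to_pandas()
```

Calling generate_testset("vectara-docs/") with a hypothetical folder name would return a data frame of generated questions and ground-truth answers.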

Let’s look at some of the results:

The first question is: “What are some examples of use cases for the Vectara platform?” and the generated “ground truth” answer is: “The Vectara platform has a unique ability to understand and process information, using hybrid search to find the most relevant products, support cases, and documents that answer user’s questions first. It can power chatbots, Q&A systems, conversational applications, and websites based on relevant information. Vectara also provides result recommendations and enables global collaboration through cross-language search.”

The 2nd question is: “How can developers customize prompts with metadata using Vectara’s Custom Retrieval Augmented Generation (RAG) Prompt Engine?” with the response: “Vectara empowers developers with a flexible way of customizing prompts with metadata through the Custom Retrieval Augmented Generation (RAG) Prompt Engine. Developers can use available prompt variables and functions to customize prompts based on their needs.”

This is pretty good. The questions seem very relevant to the content, are interesting, and responses are correct and informative.

Of course, it’s important to review the question/answer pairs to ensure the synthetic generation does not result in duplicate questions. You can also add question/answer pairs of your own, which is often a great way to test additional questions that RAGAs did not come up with.

Now let’s run a RAGAs evaluation using these question/answer pairs. For this evaluation, we will only focus on question/answer pairs, and ignore metrics related to context. Let’s keep the question and ground_truth answer and turn this into a Pandas data frame:
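As a sketch of this step, the snippet below keeps only the two columns we need and drops repeated questions. The small hand-built data frame stands in for the generated test set; its column names and placeholder rows are assumptions for illustration.

```python
import pandas as pd


def to_eval_frame(testset_df: pd.DataFrame) -> pd.DataFrame:
    """Keep the question and ground_truth columns and drop duplicate questions."""
    return (
        testset_df[["question", "ground_truth"]]
        .drop_duplicates(subset="question")
        .reset_index(drop=True)
    )


# Placeholder rows standing in for the synthetic test set (not real results).
testset_df = pd.DataFrame(
    [
        {"question": "placeholder question A", "ground_truth": "placeholder answer A",
         "evolution_type": "simple"},
        {"question": "placeholder question B", "ground_truth": "placeholder answer B",
         "evolution_type": "reasoning"},
        {"question": "placeholder question A", "ground_truth": "placeholder answer A",
         "evolution_type": "simple"},
    ]
)

eval_df = to_eval_frame(testset_df)
print(eval_df)
```

Dropping duplicates here is a convenience; it does not replace a manual review of the generated pairs.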

We will use the VectaraClient helper class from vectaraClient.py to call the Vectara API.

First, let’s define credentials:
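One way to do this without hard-coding secrets is to read them from environment variables, as sketched below. The variable names and the VectaraCredentials container are hypothetical; use whatever your own setup and the VectaraClient helper expect.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VectaraCredentials:
    customer_id: str
    corpus_id: str
    api_key: str


# Hypothetical environment variable names, chosen for this sketch.
_REQUIRED = ("VECTARA_CUSTOMER_ID", "VECTARA_CORPUS_ID", "VECTARA_API_KEY")


def load_credentials(env: Mapping[str, str] = os.environ) -> VectaraCredentials:
    """Read credentials from the environment and fail early if any is missing."""
    missing = [name for name in _REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return VectaraCredentials(
        customer_id=env["VECTARA_CUSTOMER_ID"],
        corpus_id=env["VECTARA_CORPUS_ID"],
        api_key=env["VECTARA_API_KEY"],
    )
```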

Now we define our main RAG evaluation function.

Note that we use a regular expression

regex_pattern = r'\[\d+(,\s*\d+)*\]'

to strip citation markers such as [1] or [2, 3] from the Vectara response.
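A small sketch of how this pattern can be applied is shown below. The pattern is written as a raw string so Python does not try to interpret the backslashes; the remove_citations helper name and the sample sentence are ours.

```python
import re

# Matches bracketed numeric citations such as [1] or [2, 3].
regex_pattern = r'\[\d+(,\s*\d+)*\]'


def remove_citations(text: str) -> str:
    """Strip numeric citation markers from a generated response."""
    return re.sub(regex_pattern, '', text)


print(remove_citations("Hybrid search is supported[1]. MMR is available too[2, 3]."))
```

Bracketed text that is not purely numeric, for example [note], is left untouched by this pattern.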

The eval_rag function takes in a data frame and applies the get_response function to call Vectara, get the response and context, and store them for evaluation. Then the RAGAs evaluate() function computes the metrics we want, in this case: faithfulness, answer_relevancy, answer_similarity and answer_correctness.
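The overall shape of such a function might look like the sketch below. It assumes ragas 0.1.x and the Hugging Face datasets package, and it takes get_response as a parameter: a hypothetical callable that sends one question to Vectara and returns the generated answer together with the list of retrieved text chunks. For simplicity it works on a list of row dictionaries (for example, df.to_dict("records")) rather than directly on the data frame.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: question -> (answer, list of context chunks)
GetResponse = Callable[[str], Tuple[str, List[str]]]


def build_eval_records(rows: List[Dict[str, str]],
                       get_response: GetResponse) -> Dict[str, list]:
    """Query the RAG pipeline for each question and collect the columns RAGAs expects."""
    records: Dict[str, list] = {
        "question": [], "answer": [], "contexts": [], "ground_truth": [],
    }
    for row in rows:
        answer, contexts = get_response(row["question"])
        records["question"].append(row["question"])
        records["answer"].append(answer)
        records["contexts"].append(contexts)
        records["ground_truth"].append(row["ground_truth"])
    return records


def eval_rag(rows: List[Dict[str, str]], get_response: GetResponse):
    """Run the four RAGAs metrics over the collected responses (assumes ragas 0.1.x)."""
    # Imported lazily so the helper above can be used without these packages.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_correctness,
        answer_relevancy,
        answer_similarity,
        faithfulness,
    )

    dataset = Dataset.from_dict(build_eval_records(rows, get_response))
    return evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, answer_similarity, answer_correctness],
    )
```

Keeping the Vectara call behind a plain callable makes it easy to re-run the same evaluation with different retrieval and generation settings.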

We ran this code on the Vectara docs corpus mentioned above. An initial naive run resulted as follows:

With the results:

{'faithfulness': 0.9772, 'answer_relevancy': 0.9427, 'answer_similarity': 0.9422, 'answer_correctness': 0.5045}

For this run, we chose to enable MMR, disable hybrid search (lambda=0), and use the basic Vectara prompt with GPT-3.5.

If we enable hybrid search and use a lambda value of 0.025, we get:

{'faithfulness': 0.9493, 'answer_relevancy': 0.9469, 'answer_similarity': 0.9487, 'answer_correctness': 0.5640}

This is where optimization can happen. Re-running the same evaluation with other values to control retrieval and generation, including an “advanced” prompt that replaces GPT-3.5 with GPT-4-Turbo (available to Vectara Scale customers), we can see how the metrics change based on the settings:

Run	Lambda	MMR	Prompt	Faithfulness	Answer Relevancy	Answer Similarity	Answer Correctness
1	0	True	Basic	0.9772	0.9427	0.9422	0.5045
2	0	False	Basic	0.9580	0.9395	0.9436	0.5135
3	0.025	False	Basic	0.9493	0.9469	0.9487	0.5640
4	0.025	False	Advanced	0.9744	0.9200	0.9550	0.5941

Table 1: Faithfulness, answer relevancy, answer similarity, and answer correctness based on the choice of retrieval and generation parameters in Vectara.

As we can see, changing retrieval and generation parameters in Vectara can improve our answer correctness from 0.5045 to 0.5941 – a relative improvement of roughly 18%.
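The size of this improvement can be checked with a few lines of arithmetic. The settings and answer correctness scores below are copied from Table 1; the dictionary keys and the helper function are our own naming.

```python
# Settings and answer correctness for each run, copied from Table 1.
runs = [
    {"run": 1, "lambda": 0.0, "mmr": True, "prompt": "Basic", "answer_correctness": 0.5045},
    {"run": 2, "lambda": 0.0, "mmr": False, "prompt": "Basic", "answer_correctness": 0.5135},
    {"run": 3, "lambda": 0.025, "mmr": False, "prompt": "Basic", "answer_correctness": 0.5640},
    {"run": 4, "lambda": 0.025, "mmr": False, "prompt": "Advanced", "answer_correctness": 0.5941},
]


def relative_improvement(baseline: float, new: float) -> float:
    """Improvement of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


baseline = runs[0]
best = max(runs, key=lambda r: r["answer_correctness"])
gain = relative_improvement(baseline["answer_correctness"], best["answer_correctness"])

print(f"Best run: {best['run']} (lambda={best['lambda']}, MMR={best['mmr']}, "
      f"prompt={best['prompt']}), relative gain over run 1: {gain:.1%}")
```

The same pattern extends to the other metrics if you want to pick a configuration on something other than answer correctness.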

Conclusions

Vectara’s serverless RAG-as-a-service provides application developers with an easy-to-use platform for building RAG applications.

In this blog post, we’ve seen how to use RAGAs, an open-source RAG metrics framework, to evaluate your Vectara RAG pipeline and measure its performance using metrics like faithfulness, answer correctness, and others. We have made the full code used in this blog available in this notebook.

To get started with Vectara you can sign up for a free Vectara account and inspect our documentation or look at some example demos. If you need help you can find us in the Vectara discussion forum or on Discord.