Retrieval Augmented Generation (RAG) Done Right: Retrieval
In RAG, retrieving the right facts from your data is crucial, and choosing the right embedding model to power your retrieval matters!
In part 1 of “RAG Done Right,” we covered the importance of text chunking and how this apparently simple operation during data ingestion can have significant implications for the performance of your RAG pipeline.
We learned that text chunks should be small enough to carry focused semantic meaning without too much noise (often one or a few complete sentences), and that the sentences before and after a matching chunk can be added back in during LLM generation to strike the right balance.
In that post, we assumed that retrieving the best-matching text chunk at query time is a black box that just works. As it turns out, the embedding model used in neural (vector) search during the retrieval step can have a significant impact on overall RAG performance, and not all embedding models are created equal.
In this blog post, we introduce Vectara’s new embedding model Boomerang and compare it against other embedding models, such as those provided by OpenAI and Cohere.
Let’s dig in.
What is an Embedding model?
An embedding model is a type of model that converts words, phrases, or even entire sentences into fixed-size vectors of numbers (floats). These vectors capture the semantic meaning of the words or phrases, making it easier for machine learning algorithms to understand and process natural language.
In the context of RAG, we often focus on the use of embedding models to power semantic search as part of the “retrieval” in retrieval-augmented-generation.
Specifically, embedding models are used as follows:
- During data ingestion, every chunk of text (see chunking) is converted into a vector and stored in a vector store alongside the text itself
- At query time, the user query (or prompt) is itself converted into a vector embedding, and a similarity search retrieves the stored chunks whose vectors are closest in meaning (see the sketch below)
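To make this concrete, here is a minimal sketch of embedding-based retrieval. It uses the open-source sentence-transformers library purely for illustration (it is not one of the models compared in this post), and the chunk texts are made up:

```python
# Minimal illustration of embedding-based retrieval with an open-source model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Ingestion: embed each text chunk and keep the vectors alongside the text.
chunks = [
    "Llama 2 is a collection of pretrained and fine-tuned large language models.",
    "The models are released for research and commercial use.",
]
chunk_vectors = model.encode(chunks, convert_to_tensor=True)

# Query time: embed the query and rank chunks by cosine similarity.
query = "Which models are released for commercial use?"
query_vector = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_vector, chunk_vectors)[0]
print(chunks[int(scores.argmax())])
```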
Commercially available embedding models include OpenAI’s Ada2 and Cohere’s co:embed, and there are also publicly available models such as SBERT, USE-QA, and mContriever.
At Vectara, we’ve been hard at work to improve the performance of our embedding model, and have recently announced our new embedding model: Boomerang.
Our retrieval benchmarks demonstrate significant gains for Boomerang across multiple benchmarks, including English and non-English datasets.
Retrieval benchmarks are important, but how does a good retrieval engine impact overall end-to-end results in RAG? Let’s explore this with some example implementations.
Embedding model benchmarks
To compare some end-to-end RAG implementations, we’ll construct a question-answering pipeline using the contents of the Llama 2 paper, which is 77 pages long.
We consider the following queries:
- What learning rate was used for pre-training?
- Was RLHF used?
- Which models are released for commercial use?
- Was red teaming used?
We start with an implementation of RAG using LLamaIndex and the SentenceWindowNodeParser, which, together with MetadataReplacementPostProcessor, provides a great RAG implementation for LlamaIndex users. For embedding we will use both the OpenAI and Cohere models.
First, we extract the text from the PDF file using the unstructured.io PDF extractor (during early experiments we noticed that PyPDF did rather poorly in extracting text from the PDF):
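A sketch of this step is below; the local filename llama2.pdf is an assumption, and the exact imports may vary with your version of the unstructured library:

```python
from unstructured.partition.pdf import partition_pdf

# Extract text elements from the Llama 2 paper PDF (filename is an assumption).
elements = partition_pdf("llama2.pdf")
texts = [el.text for el in elements if el.text]
```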
Then, we create a VectorStoreIndex using the SentenceWindowNodeParser node parser with a window size of 3 (in the example below using the OpenAIEmbeddings):
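A sketch of this step, roughly following the LlamaIndex API at the time of writing (newer releases move these imports under llama_index.core):

```python
from llama_index import Document, ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.node_parser import SentenceWindowNodeParser

# Split the text into single-sentence nodes, keeping a 3-sentence window as metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
service_context = ServiceContext.from_defaults(
    embed_model=OpenAIEmbedding(),  # swap in a Cohere embedding model to compare
    node_parser=node_parser,
)
documents = [Document(text=t) for t in texts]
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```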
Now that the index is constructed, we can create a query engine while using the MetadataReplacementPostProcessor post processor to ensure surrounding information is provided to the LLM for generation:
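A sketch of the query step under the same assumptions (the top-k value of 5 is an assumption):

```python
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Replace each retrieved sentence with its surrounding window before generation.
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
response = query_engine.query("Which models are released for commercial use?")
print(response)
```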
Next, we construct a similar RAG pipeline with Vectara. Here we simply generate a VectaraIndex and provide the same extracted document text as input. The text is indexed into Vectara and embedded using Boomerang.
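A sketch of this step, assuming the Vectara customer ID, corpus ID, and API key are already set in the environment variables the LlamaIndex Vectara integration reads:

```python
from llama_index import Document
from llama_index.indices.managed.vectara import VectaraIndex

# Index the same extracted text into Vectara; embedding with Boomerang
# happens on the Vectara side.
vectara_index = VectaraIndex.from_documents([Document(text=t) for t in texts])
```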
Then we can directly query the Vectara index:
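For example (again, the top-k value is an assumption):

```python
query_engine = vectara_index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Which models are released for commercial use?")
print(response)
```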
We run each of our queries in 3 languages: English, Hebrew, and Turkish, and the results are shown in tables 1, 2, and 3, respectively. See the full code in this notebook for additional languages.
Table 1: Results in English
Table 2: Results in Hebrew
Table 3: Results in Turkish
As we can see, in English the responses to each of the 4 questions are reasonably good across the board, even though they differ slightly between Cohere, OpenAI, and Vectara. The only exception is OpenAI’s response to “What learning rate was used for pre-training?”, which is not as good as the responses with Cohere or Vectara. This reflects the fact that all of these embedding models work quite well in English and retrieve relevant information from the paper.
For Hebrew and Turkish – things are different.
Let’s look first at the query “Which models are released for commercial use?”. In Hebrew, with both the OpenAI and Cohere embedding models, the RAG pipeline is unable to answer this question, whereas Vectara’s Boomerang model does quite well, producing the accurate response “Llama 2, Llama 2-Chat, and their variants with 7B, 13B, and 70B parameters are released for commercial use”.
When we look at the retrieved text we see why this is happening (see notebook for details).
For both OpenAI and Cohere, the top matching text chunks are:
“Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. “
“Kilem L. Gwet.”
“Yarden Tal, Inbal Magar, and Roy Schwartz.”
“Noam Shazeer. “
“Noam Shazeer.”
Clearly, none of these chunks are relevant for the question posed. In contrast, for Vectara, the first most relevant chunk is:
“We are releasing the following models to the general public for research and commercial use.”
…which is clearly relevant.
As another example, let’s look at the question “Was red teaming used?” in Turkish.
Here the RAG setup with Cohere fails, responding:
“There is no information in the given context about whether the red team was used or not.”
In this case, both OpenAI and Vectara successfully pull the relevant information and respond correctly to the question.
Conclusions
Creating a robust RAG pipeline that provides good responses in multiple languages is often more complicated than it initially appears: it requires a good chunking strategy, a state-of-the-art embedding model, and a solid implementation of the various steps involved in putting it all together.
In this blog post we saw how Vectara’s Boomerang model, integrated into Vectara’s “RAG as a service” architecture, helps our users build effective GenAI applications. Compared to OpenAI’s and Cohere’s embedding models, Boomerang is on par in English but appears to outperform them in some of our examples in other languages such as Hebrew and Turkish.
To try Boomerang with Vectara:
- Sign up for a free account if you don’t have one already
- Follow the quickstart guide to create a corpus and API key. Boomerang is enabled by default for new corpora.
- Ingest your data into the corpus using Vectara’s Indexing API or use the open source vectara-ingest project.
If you need help, check out our forums and Discord server.
The full code for this blog is available here.