
Retrieval Augmented Generation: Everything You Need to Know


Retrieval Augmented Generation, also known as RAG, is a methodology and flow for building Generative AI applications over private or custom datasets. It is becoming increasingly common in the enterprise and is used for a variety of use cases such as chatbots, question answering, and research and analysis. In this deep dive we will cover what RAG is, walk through the mechanics involved in building RAG pipelines, and discuss some of the benefits of this approach.

Understanding Retrieval Augmented Generation

LLMs are trained on a vast amount of textual data, and their capabilities are based on the knowledge they acquire from this data. This means that if you ask them a question about data that is not part of their training set, they will not be able to respond accurately, resulting in either a refusal (where the LLM responds with “I don’t know”) or worse, a hallucination.

So, how can you build a GenAI application that would be able to answer questions using a custom or private dataset that is not part of the LLM’s training data?

RAG is one of the best ways to accomplish this task.

The main idea behind RAG is to augment the information of the LLM with additional facts. Whether our data is a set of documents (e.g. PDFs or DOC/PPT files), JSON data, or data extracted from a database or a data lake, the RAG flow allows the LLM to craft responses to user queries that are grounded in facts from this data.

When the RAG flow is built with a retrieval engine that is highly accurate in matching facts to the user query, the ability to augment the LLM with relevant facts becomes a good way to use the LLM to answer questions on your own data.

RAG Flow: A Step by Step Representation

Figure 1 demonstrates the steps involved in building a RAG pipeline:

[Figure 1: RAG flowchart]

The blue arrows demonstrate the data-ingestion flow, wherein the data used for Retrieval Augmented Generation is processed and prepared for querying.

This data may originate from various sources such as databases or cloud stores, S3, Google Drive, a local folder, or enterprise applications like Notion, JIRA, or other internal applications.

Since we are talking about text-based GenAI applications, we need to translate any input data into an appropriate text document format. If the source data is in some binary format like PDF, PPT, or DOCX, then we first extract the actual text from these documents. If the data is in a database, the text is derived from one or more columns or documents in the database. This kind of document processing is often application-specific and depends on the format of the source data.
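
As a minimal sketch of the database case described above, the following Python snippet derives one text document per row by joining a chosen set of columns. The row layout, the column names, and the `row_to_text` helper are hypothetical and serve only to illustrate the idea; real document processing depends on the format of your source data.

```python
def row_to_text(row: dict, text_columns: list[str]) -> str:
    """Build one text document from selected columns of a database row.

    Columns that are missing or empty are skipped so they add no noise.
    """
    parts = []
    for column in text_columns:
        value = row.get(column)
        if value is None or str(value).strip() == "":
            continue
        parts.append(f"{column}: {str(value).strip()}")
    return "\n".join(parts)


# Hypothetical rows, as they might come back from a SQL query.
rows = [
    {"id": 1, "title": "Refund policy",
     "body": "Refunds are issued within 14 days.", "notes": None},
    {"id": 2, "title": "Shipping",
     "body": "Orders ship in 2 business days.", "notes": "EU only"},
]

# One text document per row, keyed by the row id.
documents = {row["id"]: row_to_text(row, ["title", "body", "notes"]) for row in rows}
```

Prefixing each value with its column name is one possible design choice: it keeps some of the table structure visible in the text that is later chunked and embedded.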

Once the text is ready, it is split into “chunks” (or segments) that are appropriate for retrieval. Using an embedding model (like Vectara’s Boomerang model) a “vector embedding” is computed for each chunk of text, and the text and embedding are both stored, to be used for efficient semantic retrieval later on.
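
The chunk-and-embed step can be sketched in Python as follows. This is an illustration under stated assumptions, not Vectara's implementation: `chunk_text` uses a simple fixed-size window with overlap, and `toy_embed` is a hypothetical stand-in that hashes words into a small vector. A real pipeline would call an actual embedding model in its place.

```python
import hashlib
import math


def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in embedding (assumption): hashed bag of words, L2-normalized.

    It captures word overlap only, not meaning; swap in a real model.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


# Store the text and its embedding together, ready for retrieval later.
text = " ".join(f"word{i}" for i in range(120))
store = [{"text": c, "embedding": toy_embed(c)} for c in chunk_text(text)]
```

The overlap between consecutive chunks is a common way to avoid cutting a fact in half at a chunk boundary; the specific sizes used here are arbitrary.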

The green arrows demonstrate the query-response flow, whereby a user issues a query and expects a response based on the most relevant information available in the ingested data.

First, we encode the query itself with the same embedding model used at ingestion, and then use an approximate nearest neighbor (ANN) search algorithm to retrieve a ranked list of the most relevant chunks of text available in the vector store.
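
To make the retrieval step concrete, here is a small Python sketch that ranks stored chunks by cosine similarity to a query vector. Note the swap: it uses exact brute-force search rather than ANN, which gives the same kind of ranking on a tiny store but does not scale; an ANN index is what you would use for large corpora. The vectors and the `retrieve` helper are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: list[float],
             store: list[tuple[str, list[float]]],
             k: int = 2) -> list[tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-written 3-dimensional vectors standing in for real embeddings.
store = [
    ("refund policy", [1.0, 0.0, 0.0]),
    ("shipping times", [0.0, 1.0, 0.0]),
    ("refund deadlines", [0.9, 0.1, 0.0]),
]
results = retrieve([1.0, 0.0, 0.0], store, k=2)
```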

With the most relevant chunks in hand, we construct a comprehensive prompt for the LLM, including the user question and all the relevant information. That complete prompt is sent to a generative LLM, either a hosted model from a provider like OpenAI, Cohere, or Anthropic, or one of the open-source LLMs like Llama2.

With the user question and the relevant facts, the LLM can now ground its response to the question in the facts provided and thus avoid a hallucination.
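
Prompt construction can be sketched as below. The wording of the instructions and the `build_prompt` helper are assumptions made for illustration; production prompts are usually tuned for the specific LLM and use case.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks and the user question into one prompt.

    Chunks are numbered so the LLM can refer back to them in its answer.
    """
    facts = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the numbered facts below. "
        "If the facts are not sufficient, say that you do not know.\n\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 14 days.", "Orders ship in 2 business days."],
)
```

The resulting string is what would be sent to the LLM. Telling the model to rely only on the supplied facts is what ties its answer to the retrieved data.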

Once a response is generated, it can optionally be sent to a “validation” service (like Nvidia’s Nemo Guardrails), and finally back to the user.

The red arrow depicts a final optional step: taking action based on the response via enterprise automation and integration. If the response generated is trusted to be correct, we can use the response to take action on our behalf – for example, send an email to a colleague or add a task to JIRA.

RAG vs Fine-Tuning

When people consider building a GenAI application with their data, in addition to RAG one technique that is often mentioned is fine-tuning. The common question is “can I use fine-tuning with my custom data to adapt the base LLM to work better for my use-case?”

Fine-tuning in machine learning refers to the process of taking a pre-trained model (a model trained on a large, general dataset) and continuing its training on a smaller, specific dataset to adapt its knowledge and optimize its performance for a particular task or domain.

As an analogy, imagine you are learning to play the guitar. At first, you learn basic chords, scales, and maybe a few simple songs. The initial pre-training phase is similar to that: you learn a broad set of guitar-playing skills. Fine-tuning would be akin to learning to play jazz guitar. Your basic skills are essential, but you would also need to learn jazz techniques, rhythms, and nuances.

Practitioners tend to think that this can help improve their LLM results: I can just take my data, press the “fine-tuning” button and the model quality improves.

If only it were that simple.

Fine-tuning does help with adapting the LLM to perform a whole different task (like learning how to classify a tweet into positive, negative, or neutral sentiment), but it’s not as good at learning new information from your data. This is where RAG is a much better choice.

Let’s look at some of the challenges of fine-tuning and how it compares to RAG:

  1. Overfitting and catastrophic forgetting: when you fine-tune on a specific dataset, there’s a risk that the model will “memorize” the smaller dataset rather than “understand” it. A related risk called catastrophic forgetting is that at the fine-tuning stage the model can forget tasks it previously knew how to solve in favor of new ones.
  2. Hallucinations: one of the key issues of LLMs is hallucinations. When you fine-tune a base model with new data, even if it integrates this new data without overfitting, hallucinations remain a key challenge for the fine-tuned model.
  3. No Explainability: in the same way that it’s hard to explain the outputs of a general LLM like Llama2, it’s just as hard to explain the outputs of a fine-tuned model. With RAG, part of the process includes providing references/citations from the retrieved facts that help explain the output of the RAG pipeline.
  4. Requires MLE expertise: fine-tuning involves continued training of the model with a new (often smaller) dataset. It does require significant expertise in deep learning and transformer models to get right. For example, how do you decide on the number of epochs to train, so that you avoid overfitting?
  5. High Cost: Fine-tuning is expensive and relatively slow. Let’s say your data changes daily – are you going to fine-tune the base model on the new version of the data every day? That might become expensive really fast.
  6. No Access Control: because in RAG the set of relevant facts is retrieved from the source documents and included in real-time in the LLM prompt, it is possible to apply access controls. For example, if one of the facts comes from a document that an employee does not have access to, it can be removed from the set of facts before those are sent to the LLM. This is impossible to do with fine-tuning.
  7. Data Privacy: when you fine-tune an LLM with your data, all the data included in the dataset you use for fine-tuning is integrated into the output model as part of its weights, including any confidential information or intellectual property you own. It’s impossible to separate out the confidential from the non-confidential data – it’s just a single updated set of weights. With RAG, similar to how access control works, you have fine control over what facts are used in the process.
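
The access-control point in the list above can be illustrated with a short Python sketch: retrieved facts are filtered against the user's permissions before any of them reach the LLM prompt. The `Fact` structure, the group names, and the sample facts are hypothetical; a real system would take permissions from its own identity and document-management layers.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    text: str
    source: str
    # Groups allowed to read the source document (hypothetical metadata).
    allowed_groups: set[str] = field(default_factory=set)


def filter_by_access(facts: list[Fact], user_groups: set[str]) -> list[Fact]:
    """Keep only facts whose source document the user is allowed to read."""
    return [fact for fact in facts if fact.allowed_groups & user_groups]


retrieved = [
    Fact("Q3 revenue grew 12%.", "finance/q3.pdf", {"finance"}),
    Fact("The office closes at 6pm.", "handbook.pdf", {"finance", "engineering"}),
]

# An engineer never sees the finance-only fact in the prompt.
visible = filter_by_access(retrieved, {"engineering"})
```

Because the filter runs at query time, a change in permissions takes effect on the next query, with no retraining involved.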

Why Use RAG?

RAG is quickly becoming the predominant methodology for building GenAI-based applications for the enterprise.


There are a number of benefits – let’s look at those in more detail:

  1. It all but eliminates hallucinations: specifically, those hallucinations that result from the core LLM not having access to your data. By accurately retrieving the most relevant facts from the data and feeding those to the LLM at run-time, the RAG pipeline gives the LLM the most useful data to answer the question. This works extremely well in practice.
  2. Low cost: RAG does not require any training or fine-tuning, which means there is no high cost associated with it, and it does not require specialized machine learning expertise.
  3. Explainability: LLM responses generated with RAG are highly explainable – Vectara’s RAG implementation provides citations along with the response, so that the reader can understand which facts were used to ground the LLM’s response, and may even go to one of these sources to investigate further.
  4. Enterprise Ready: with RAG, you can implement fine-grained permissioning on the facts retrieved, and design controls to ensure confidential material does not make it into the facts that generate the GenAI response.
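
As a rough illustration of the explainability benefit above, the sketch below maps citation markers in a generated response back to the retrieved sources. It assumes the response cites facts with 1-based markers such as [1] or [2]; that marker format and the `cited_sources` helper are hypothetical and do not describe Vectara's actual citation format.

```python
import re


def cited_sources(response: str, sources: list[str]) -> list[str]:
    """Return the sources cited in the response, in order of first mention.

    Markers that point outside the source list are ignored.
    """
    seen: list[int] = []
    for match in re.findall(r"\[(\d+)\]", response):
        index = int(match)
        if 1 <= index <= len(sources) and index not in seen:
            seen.append(index)
    return [sources[i - 1] for i in seen]


sources = ["refund-policy.pdf", "shipping.pdf"]
response = (
    "Orders ship in 2 business days [2]. "
    "Refunds are issued within 14 days [1][2]. See also [7]."
)
references = cited_sources(response, sources)
```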

As shown in Figure 1, implementing RAG on your own (DIY) involves configuring various components and carefully following multiple steps of integration. This can quickly become complex as you scale, when you have to consider not just a simple one-off demo but need to deal with low latency SLAs, enterprise-grade security, data privacy, and other enterprise readiness considerations.

Vectara: RAG as a Managed Service

Vectara implemented RAG (or as we sometimes call it “Grounded Generation”) as a managed service. This “RAG in a box” approach, shown in Figure 2, makes building applications easy and scalable, and significantly reduces the complexity of managing enterprise-ready RAG:

[Figure 2: RAG flow with Vectara]

With Vectara, as a developer, you can focus on the specifics of your application, namely ingesting the data using the Indexing API, and then building a user experience that queries the data with the Query API.

All the complexity underneath is handled by Vectara for you, helping you be successful from the initial prototype to any enterprise scale.

Vectara’s RAG implementation chooses the best defaults for you in terms of chunking strategy, document pre-processing, embedding model, retrieval and summarization, which results in an accelerated starting point for most projects. This efficiency helps developers avoid experimenting with the almost unlimited number of combinations available in the DIY approach, which is prone to pitfalls, lackluster results, time-consuming lessons, and other costs in money and time.

But there is also the option to customize your RAG implementation with Vectara, for example:

  • You can ask Vectara to pre-process and chunk your input data, but it’s also possible to do this on the client side (before indexing) to better control how data is ingested into your corpus
  • During retrieval, you can opt to use Hybrid search (which combines the strengths of Semantic search with keyword search), or apply max-marginal-relevance (MMR) reranking
  • You can control how many matching results and how much surrounding text around each matching text is provided to the LLM for summarization
  • Scale customers can choose whether to use GPT-3.5-turbo or a more powerful LLM like GPT-4
  • Scale customers can customize their prompt to generate responses that fit their needs, like generating the response in the form of an email or as a set of bullet points.
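
For readers unfamiliar with max-marginal-relevance (MMR) reranking, mentioned in the list above, here is a generic Python sketch of the technique. It is not Vectara's implementation: the `mmr` function, the `lambda_` parameter name, and the sample vectors are assumptions for illustration. At each step MMR picks the candidate that maximizes `lambda_ * relevance - (1 - lambda_) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec: list[float],
        candidates: list[tuple[str, list[float]]],
        k: int,
        lambda_: float = 0.5) -> list[str]:
    """Rerank (text, vector) candidates, trading relevance for diversity."""
    remaining = list(candidates)
    selected: list[tuple[str, list[float]]] = []

    def score(item: tuple[str, list[float]]) -> float:
        _, vec = item
        relevance = cosine(query_vec, vec)
        # Redundancy is 0.0 while nothing has been selected yet.
        redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
        return lambda_ * relevance - (1 - lambda_) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return [text for text, _ in selected]


candidates = [
    ("refund policy", [1.0, 0.0]),
    ("refund policy (near copy)", [0.99, 0.14]),
    ("shipping times", [0.6, 0.8]),
]
diverse = mmr([1.0, 0.0], candidates, k=3, lambda_=0.3)
by_relevance = mmr([1.0, 0.0], candidates, k=3, lambda_=1.0)
```

With `lambda_=1.0` the ranking is by relevance alone, so the near-duplicate comes second. Lowering `lambda_` pushes the near-duplicate down in favor of the less similar result, which helps avoid filling the LLM prompt with repeated facts.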

All said, Vectara provides a best-in-class RAG implementation that works well from small demos to large-scale enterprise deployments.


RAG is quickly becoming the standard framework to implement enterprise applications powered by large language models.

Implementing RAG yourself requires a significant level of knowledge and expertise, and a continued investment in DevOps, MLOps, and keeping up to date with all the latest innovation in LLMs and RAG.

Vectara provides “RAG in a box” – an API-first system empowering developers to build enterprise-ready LLM applications. Both at ingest time, and at query time, Vectara’s trusted GenAI platform does all the heavy lifting for you – from data pre-processing, chunking, and embedding to managing the text and vector databases, managing prompts, and calling the LLM to create a response. All done while ensuring enterprise-grade security and data privacy, low latency and high availability of the service, so you don’t have to worry about it.

As RAG continues to evolve, new research introduces novel ways to improve the performance of RAG applications. At Vectara we keep monitoring all those innovations and integrating them into our API-based offering, so you can benefit by accelerating the value your GenAI app delivers in the least amount of time.
