The blue arrows demonstrate the data-ingestion flow, wherein the data used for Retrieval Augmented Generation is processed and prepared for querying.
This data may originate from a variety of sources: databases, cloud stores such as S3 or Google Drive, a local folder, or enterprise applications like Notion, JIRA, and other internal tools.
Since we are talking about text-based GenAI applications, we need to translate any input data into an appropriate text document format. If the source data is in some binary format like PDF, PPT, or DOCX, then we first extract the actual text from these documents. If the data is in a database, the text is derived from one or more columns or documents in the database. This kind of document processing is often application-specific and depends on the format of the source data.
Once the text is ready, it is split into “chunks” (or segments) that are appropriate for retrieval. Using an embedding model (like Vectara’s Boomerang model), a “vector embedding” is computed for each chunk of text, and the text and embedding are both stored, to be used for efficient semantic retrieval later on.
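To make the ingestion step concrete, here is a minimal sketch in Python. The chunk sizes, the overlap, and the letter-frequency “embedding” are all illustrative stand-ins: a real pipeline would call an actual embedding model (such as Vectara’s Boomerang) and write to a real vector store.

```python
import math

def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character-based chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def toy_embed(text, dim=26):
    """Stand-in embedding: a normalized letter-frequency vector.
    A real system would call an embedding model here instead."""
    vec = [0.0] * dim
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# The "vector store": each entry keeps the chunk text alongside its embedding,
# so both can be returned together at retrieval time.
document = "RAG pipelines split documents into chunks before retrieval. " * 10
store = [{"text": c, "embedding": toy_embed(c)} for c in chunk_text(document)]
```

The key design point is that text and embedding are stored side by side: the embedding drives the similarity search, while the original text is what eventually goes into the LLM prompt.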
The green arrows demonstrate the query-response flow, whereby a user issues a query and expects a response based on the most relevant information available in the ingested data.
First, we encode the query itself with the (same) embedding model and use an approximate nearest neighbor (ANN) search to retrieve a ranked list of the most relevant chunks of text available in the vector store.
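The retrieval step can be sketched as follows. For clarity this uses a brute-force cosine-similarity scan over a tiny hand-made index with toy 2-dimensional embeddings; a production system would use an ANN index (e.g. HNSW or a managed vector store) over real model embeddings, but the ranking idea is the same.

```python
def cosine(a, b):
    """Cosine similarity; assumes vectors are already normalized."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_embedding, index, top_k=3):
    """Return the top_k chunks most similar to the query embedding."""
    ranked = sorted(index,
                    key=lambda e: cosine(query_embedding, e["embedding"]),
                    reverse=True)
    return ranked[:top_k]

# Tiny illustrative index; embeddings here are made up for the example.
index = [
    {"text": "Paris is the capital of France.", "embedding": [1.0, 0.0]},
    {"text": "The 2024 Olympics were held in Paris.", "embedding": [0.8, 0.6]},
    {"text": "Photosynthesis converts light to energy.", "embedding": [0.0, 1.0]},
]
hits = retrieve([1.0, 0.0], index, top_k=2)
```

ANN libraries exist precisely because this linear scan does not scale; they trade a tiny amount of recall for sub-linear search time over millions of vectors.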
With the most relevant chunks in hand, we construct a comprehensive prompt for the LLM, including the user question and all the relevant information. That complete prompt is sent to a generative LLM from a provider like OpenAI, Cohere, or Anthropic, or to one of the open-source LLMs like Llama 2.
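A minimal sketch of the prompt-construction step is shown below. The template wording is purely illustrative, not any particular vendor’s format; the point is that the retrieved facts and the user question are assembled into a single grounded prompt.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks and the user question."""
    facts = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the facts below. "
        "Cite facts by number.\n\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "Where were the 2024 Olympics held?",
    ["The 2024 Olympics were held in Paris.",
     "Paris is the capital of France."],
)
```

Numbering the facts also makes it easy for the LLM to emit citations, which supports the explainability benefits discussed later.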
With the user question and the relevant facts, the LLM can now ground its response to the question in the facts provided and thus avoid hallucinations.
Once a response is generated, it can optionally be sent to a “validation” service (like NVIDIA’s NeMo Guardrails), and finally back to the user.
The red arrow depicts a final optional step: taking action based on the response via enterprise automation and integration. If the response generated is trusted to be correct, we can use the response to take action on our behalf – for example, send an email to a colleague or add a task to JIRA.
RAG vs Fine-Tuning
When people consider building a GenAI application with their data, in addition to RAG one technique that is often mentioned is fine-tuning. The common question is “can I use fine-tuning with my custom data to adapt the base LLM to work better for my use case?”
Fine-tuning in machine learning refers to the process of taking a pre-trained model (a model trained on a large, general dataset) and continuing its training on a smaller, specific dataset to adapt its knowledge and optimize its performance for a particular task or domain.
As an analogy, imagine you are learning to play the guitar. At first, you learn basic chords, scales, and maybe a few simple songs. The initial pre-training phase is similar to that – you learn a broad set of guitar-playing skills. Fine-tuning would be akin to learning to play jazz guitar – your basic skills are essential, but you would need to learn jazz-specific techniques, rhythms, and nuances on top of them.
Practitioners tend to assume this is a shortcut to better LLM results: “I can just take my data, press the ‘fine-tune’ button, and the model’s quality improves.”
If only it were that simple.
Fine-tuning does help with adapting the LLM to perform a different kind of task (like learning how to classify a tweet into positive, negative, or neutral sentiment), but it is not as effective at teaching the model new information from your data. This is where RAG is a much better choice.
Let’s look at some of the challenges of fine-tuning and how it compares to RAG:
- Overfitting and catastrophic forgetting: when you fine-tune on a specific dataset, there’s a risk that the model will “memorize” the smaller dataset rather than “understand” it. A related risk, catastrophic forgetting, is that during fine-tuning the model can forget tasks it previously knew how to solve in favor of the new ones.
- Hallucinations: hallucination remains one of the key problems with LLMs. Even when a base model integrates new fine-tuning data without overfitting, the fine-tuned model is just as prone to hallucinating as the original.
- No Explainability: just as it is hard to explain the outputs of a general LLM like Llama 2, it is equally hard to explain the outputs of a fine-tuned model. With RAG, part of the process includes providing references/citations to the retrieved facts, which help explain the output of the RAG pipeline.
- Requires MLE expertise: fine-tuning involves continued training of the model on a new (often smaller) dataset, and getting it right requires significant expertise in deep learning and transformer models. For example, how do you decide on the number of epochs to train so that you avoid overfitting?
- High Cost: fine-tuning is expensive and relatively slow. Let’s say your data changes daily – are you going to fine-tune the base model on the new version of the data every day? That can become expensive very quickly.
- No Access Control: because in RAG the set of relevant facts is retrieved from the source documents and included in the LLM prompt in real time, it is possible to apply access controls. For example, if one of the facts comes from a document that an employee does not have access to, it can be removed from the set of facts before those are sent to the LLM. This is impossible to do with fine-tuning.
- Data Privacy: when you fine-tune an LLM with your data, all the data included in the dataset you use for fine-tuning is integrated into the output model as part of its weights, including any confidential information or intellectual property you own. It’s impossible to separate out the confidential from the non-confidential data – it’s just a single updated set of weights. With RAG, similar to how access control works, you have fine control over what facts are used in the process.
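The access-control advantage described above can be sketched in a few lines: retrieved facts are filtered against the requesting user’s permissions before they ever reach the LLM prompt. The field names (`allowed_groups`) and the group model are hypothetical, standing in for whatever permission metadata your document store carries.

```python
def filter_facts(facts, user_groups):
    """Keep only facts whose source document is visible to the user.
    `allowed_groups` is a hypothetical permission field on each fact."""
    return [f for f in facts if f["allowed_groups"] & user_groups]

facts = [
    {"text": "Q3 revenue grew 12%.", "allowed_groups": {"finance", "exec"}},
    {"text": "Layoffs planned for Q4.", "allowed_groups": {"exec"}},
]

# A user in the "finance" group only ever sees the first fact; the second
# is dropped before the prompt is constructed.
visible = filter_facts(facts, {"finance"})
```

Because filtering happens per query, per user, the same index serves everyone while each response is grounded only in facts that user is allowed to see – something a single set of fine-tuned weights cannot offer.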
Why Use RAG?
RAG is quickly becoming the predominant methodology for building GenAI-based applications for the enterprise.
There are a number of benefits – let’s look at those in more detail:
- All but eliminates hallucinations: RAG eliminates the hallucinations that result from the core LLM not having access to your data. By accurately retrieving the most relevant facts from the data and feeding them to the LLM at run-time, the RAG pipeline ensures that the LLM has the most useful data available to answer the question. This works extremely well in practice.
- Low cost: RAG does not require any training or fine-tuning, so it avoids both the high compute cost and the need for specialized machine learning expertise.
- Explainability: LLM responses generated with RAG are highly explainable – Vectara’s RAG implementation provides citations along with the response, so that the reader can understand which facts were used to ground the LLM’s response, and may even go to one of these sources to investigate further.
- Enterprise Ready: with RAG, you can implement fine-grained permissioning on the facts retrieved, and design controls to ensure confidential material does not make it into the facts that generate the GenAI response.
As shown in Figure 1, implementing RAG on your own (DIY) involves configuring various components and carefully following multiple integration steps. This can quickly become complex as you scale: beyond a simple one-off demo, you have to deal with low-latency SLAs, enterprise-grade security, data privacy, and other enterprise-readiness considerations.
Vectara: RAG as a Managed Service
Vectara implemented RAG (or as we sometimes call it “Grounded Generation”) as a managed service. This “RAG in a box” approach, shown in Figure 2, makes building applications easy and scalable, and significantly reduces the complexity of managing enterprise-ready RAG: