The blue arrows demonstrate the data-ingestion flow, wherein the data used for Retrieval Augmented Generation is processed and prepared for querying.
This data may originate from a variety of sources: databases, cloud stores such as S3 or Google Drive, a local folder, or enterprise applications like Notion, JIRA, and other internal tools.
Since we are talking about text-based GenAI applications, we need to translate any input data into an appropriate text document format. If the source data is in some binary format like PDF, PPT, or DOCX, then we first extract the actual text from these documents. If the data is in a database, the text is derived from one or more columns or documents in the database. This kind of document processing is often application-specific and depends on the format of the source data.
Once the text is ready, it is split into “chunks” (or segments) that are appropriate for retrieval. Using an embedding model (like Vectara’s Boomerang model), a “vector embedding” is computed for each chunk of text, and the text and embedding are both stored, to be used for efficient semantic retrieval later on.
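To make the ingestion step concrete, here is a minimal sketch in Python. The chunk sizes, the overlap, and the letter-frequency “embedding” are all illustrative stand-ins: a real pipeline would call an actual embedding model (such as Vectara’s Boomerang) and write to a real vector store.

```python
import math

def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character-based chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def toy_embed(text, dim=26):
    """Stand-in embedding: a normalized letter-frequency vector.
    A real system would call an embedding model here instead."""
    vec = [0.0] * dim
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# The "vector store": each entry keeps the chunk text alongside its embedding,
# so both can be returned together at retrieval time.
document = "RAG pipelines split documents into chunks before retrieval. " * 10
store = [{"text": c, "embedding": toy_embed(c)} for c in chunk_text(document)]
```

The key design point is that text and embedding are stored side by side: the embedding drives the similarity search, while the original text is what eventually goes into the LLM prompt.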
The green arrows demonstrate the query-response flow, whereby a user issues a query and expects a response based on the most relevant information available in the ingested data.
First, we encode the query itself with the (same) embedding model and use an approximate nearest neighbor (ANN) search to retrieve a ranked list of the most relevant chunks of text available in the vector store.
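The retrieval step can be sketched as follows. For clarity this uses a brute-force cosine-similarity scan over a tiny hand-made index with toy 2-dimensional embeddings; a production system would use an ANN index (e.g. HNSW or a managed vector store) over real model embeddings, but the ranking idea is the same.

```python
def cosine(a, b):
    """Cosine similarity; assumes vectors are already normalized."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_embedding, index, top_k=3):
    """Return the top_k chunks most similar to the query embedding."""
    ranked = sorted(index,
                    key=lambda e: cosine(query_embedding, e["embedding"]),
                    reverse=True)
    return ranked[:top_k]

# Tiny illustrative index; embeddings here are made up for the example.
index = [
    {"text": "Paris is the capital of France.", "embedding": [1.0, 0.0]},
    {"text": "The 2024 Olympics were held in Paris.", "embedding": [0.8, 0.6]},
    {"text": "Photosynthesis converts light to energy.", "embedding": [0.0, 1.0]},
]
hits = retrieve([1.0, 0.0], index, top_k=2)
```

ANN libraries exist precisely because this linear scan does not scale; they trade a tiny amount of recall for sub-linear search time over millions of vectors.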
With the most relevant chunks in hand, we construct a comprehensive prompt for the LLM, including the user question and all the relevant information. That complete prompt is sent to a generative LLM from a provider like OpenAI, Cohere, or Anthropic, or to one of the open-source LLMs like Llama 2.
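A minimal sketch of the prompt-construction step is shown below. The template wording is purely illustrative, not any particular vendor’s format; the point is that the retrieved facts and the user question are assembled into a single grounded prompt.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks and the user question."""
    facts = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the facts below. "
        "Cite facts by number.\n\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "Where were the 2024 Olympics held?",
    ["The 2024 Olympics were held in Paris.",
     "Paris is the capital of France."],
)
```

Numbering the facts also makes it easy for the LLM to emit citations, which supports the explainability benefits discussed later.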
With the user question and the relevant facts, the LLM can now ground its response to the question in the facts provided and thus avoid hallucinations.
Once a response is generated, it can optionally be sent to a “validation” service (like NVIDIA’s NeMo Guardrails), and finally back to the user.
The red arrow depicts a final optional step: taking action based on the response via enterprise automation and integration. If the response generated is trusted to be correct, we can use the response to take action on our behalf – for example, send an email to a colleague or add a task to JIRA.
RAG vs Fine-Tuning
When people consider building a GenAI application with their data, in addition to RAG one technique that is often mentioned is fine-tuning. The common question is “can I use fine-tuning with my custom data to adapt the base LLM to work better for my use case?”
Fine-tuning in machine learning refers to the process of taking a pre-trained model (a model trained on a large, general dataset) and continuing its training on a smaller, specific dataset to adapt its knowledge and optimize its performance for a particular task or domain.
As an analogy, imagine you are learning to play the guitar. At first, you learn basic chords, scales, and maybe a few simple songs. The initial pre-training phase is similar to that – you learn a broad set of guitar-playing skills. Fine-tuning would be akin to learning to play jazz guitar – your basic skills are essential, but you would need to learn jazz-specific techniques, rhythms, and nuances on top of them.
Practitioners tend to assume this is a shortcut to better LLM results: “I can just take my data, press the ‘fine-tune’ button, and the model’s quality improves.”
If only it were that simple.
Fine-tuning does help with adapting the LLM to perform a different kind of task (like learning how to classify a tweet into positive, negative, or neutral sentiment), but it is not as effective at teaching the model new information from your data. This is where RAG is a much better choice.
Let’s look at some of the challenges of fine-tuning and how it compares to RAG:
- Overfitting and catastrophic forgetting: when you fine-tune on a specific dataset, there’s a risk that the model will “memorize” the smaller dataset rather than “understand” it. A related risk, catastrophic forgetting, is that during fine-tuning the model can forget tasks it previously knew how to solve in favor of the new ones.
- Hallucinations: hallucination remains one of the key problems with LLMs. Even when a base model integrates new fine-tuning data without overfitting, the fine-tuned model is just as prone to hallucinating as the original.
- No Explainability: just as it is hard to explain the outputs of a general LLM like Llama 2, it is equally hard to explain the outputs of a fine-tuned model. With RAG, part of the process includes providing references/citations to the retrieved facts, which help explain the output of the RAG pipeline.
- Requires MLE expertise: fine-tuning involves continued training of the model on a new (often smaller) dataset, and getting it right requires significant expertise in deep learning and transformer models. For example, how do you decide on the number of epochs to train so that you avoid overfitting?
- High Cost: fine-tuning is expensive and relatively slow. Let’s say your data changes daily – are you going to fine-tune the base model on the new version of the data every day? That can become expensive very quickly.
- No Access Control: because in RAG the set of relevant facts is retrieved from the source documents and included in the LLM prompt in real time, it is possible to apply access controls. For example, if one of the facts comes from a document that an employee does not have access to, it can be removed from the set of facts before those are sent to the LLM. This is impossible to do with fine-tuning.
- Data Privacy: when you fine-tune an LLM with your data, all the data included in the dataset you use for fine-tuning is integrated into the output model as part of its weights, including any confidential information or intellectual property you own. It’s impossible to separate out the confidential from the non-confidential data – it’s just a single updated set of weights. With RAG, similar to how access control works, you have fine control over what facts are used in the process.
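The access-control advantage described above can be sketched in a few lines: retrieved facts are filtered against the requesting user’s permissions before they ever reach the LLM prompt. The field names (`allowed_groups`) and the group model are hypothetical, standing in for whatever permission metadata your document store carries.

```python
def filter_facts(facts, user_groups):
    """Keep only facts whose source document is visible to the user.
    `allowed_groups` is a hypothetical permission field on each fact."""
    return [f for f in facts if f["allowed_groups"] & user_groups]

facts = [
    {"text": "Q3 revenue grew 12%.", "allowed_groups": {"finance", "exec"}},
    {"text": "Layoffs planned for Q4.", "allowed_groups": {"exec"}},
]

# A user in the "finance" group only ever sees the first fact; the second
# is dropped before the prompt is constructed.
visible = filter_facts(facts, {"finance"})
```

Because filtering happens per query, per user, the same index serves everyone while each response is grounded only in facts that user is allowed to see – something a single set of fine-tuned weights cannot offer.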
Why Use RAG?
RAG is quickly becoming the predominant methodology for building GenAI-based applications for the enterprise.
There are a number of benefits – let’s look at those in more detail:
- All but eliminates hallucinations: RAG eliminates the hallucinations that result from the core LLM not having access to your data. By accurately retrieving the most relevant facts from the data and feeding them to the LLM at run-time, the RAG pipeline ensures that the LLM has the most useful data available to answer the question. This works extremely well in practice.
- Low cost: RAG does not require any training or fine-tuning, so it avoids both the high compute cost and the need for specialized machine learning expertise.
- Explainability: LLM responses generated with RAG are highly explainable – Vectara’s RAG implementation provides citations along with the response, so that the reader can understand which facts were used to ground the LLM’s response, and may even go to one of these sources to investigate further.
- Enterprise Ready: with RAG, you can implement fine-grained permissioning on the facts retrieved, and design controls to ensure confidential material does not make it into the facts that generate the GenAI response.
As shown in Figure 1, implementing RAG on your own (DIY) involves configuring various components and carefully following multiple integration steps. This can quickly become complex as you scale: beyond a simple one-off demo, you have to deal with low-latency SLAs, enterprise-grade security, data privacy, and other enterprise-readiness considerations.
Vectara: RAG as a Managed Service
Vectara implemented RAG (or as we sometimes call it “Grounded Generation”) as a managed service. This “RAG in a box” approach, shown in Figure 2, makes building applications easy and scalable, and significantly reduces the complexity of managing enterprise-ready RAG: