Building a RAG Pipeline is Difficult

Why you should consider using RAG-as-a-service instead of doing it yourself

Overview

The best RAG systems combine several different types of models (an embedding model, a generative LLM, and more) to achieve the highest-quality results.

When you build a small RAG proof of concept (POC), the complexity involved in building, maintaining, and tuning a RAG pipeline is not always obvious. It is not surprising, therefore, that after the initial POC demonstrates the value of RAG, the technical team responsible for it discovers that building a scalable and secure RAG pipeline is not only harder than it looks but also requires ongoing expertise in LLMs, retrieval, specialized MLOps, and much more.

In this blog post, we dive into some of the challenges one often faces when creating RAG pipelines.

The RAG Pipeline

Figure 1: Typical RAG system flow

As shown in Figure 1, there are two major flows in any RAG pipeline:

  • Ingest flow: this flow (in blue) describes the steps involved in ingesting your data into the RAG system, including text extraction, chunking, encoding, and storage of your text and vector data.
  • Query flow: this flow (in green) describes the steps involved in responding to a user query, including encoding, retrieval, reranking, calling the generative LLM, and hallucination detection.

Ingest Flow

RAG works primarily on text data: it could come from files in S3 or Box, documents on Google Drive, text stored in a database like Snowflake or Redshift, data inside SaaS applications like SharePoint, Notion, JIRA, or Confluence, or even text from web pages.

If the input is in the form of files (e.g., PDF or PPT), we first want to extract the relevant text, along with metadata, from those files. Otherwise, the input already arrives as text.

The next step is called chunking, where text is broken down into reasonably sized and semantically coherent chunks. Those chunks are then stored in a text database, and at the same time encoded (using an embedding model like Boomerang) as dense vectors and stored in the vector database.
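To make this concrete, here is a minimal sketch of the chunk-and-embed step in Python. The fixed-size, overlapping chunker and the open-source embedding model are illustrative stand-ins for production components like Boomerang, and the input file name is hypothetical.

```python
# Minimal chunk-and-embed sketch (illustrative only).
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping, fixed-size chunks (a stand-in for a
    semantically aware chunker)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start : start + max_chars])
        start += max_chars - overlap
    return chunks

# Open-source embedding model used here as a stand-in for a production model.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = chunk_text(open("document.txt").read())  # hypothetical input file
embeddings = model.encode(chunks)  # dense vectors, one per chunk

# In a real pipeline, `chunks` go to the text store and `embeddings` to the
# vector database, keyed so they can be joined at query time.
```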

All of these steps may seem easy, but in practice, there are many engineering challenges:

  • Extracting text from arbitrary documents is not a completely “solved problem,” since every file format has its own idiosyncrasies. This complexity is sometimes compounded when you need to handle non-English languages (at Vectara we support more than 100 languages). There are many open-source document parsers (see the short example after this list), but they are far from perfect and are often hard to deploy at scale.
  • The ingestion data pipeline requires coordination between many different types of databases to achieve the best quality and low-latency responses.
  • Implementing advanced parsing techniques for dealing with tables inside a document, and making sure the information in those tables is available at query time, involves careful design due to the dependencies between machine-learning components and real-time serving systems.
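As a small illustration of the text-extraction step, the sketch below pulls text and metadata out of a PDF with the open-source pypdf library. The file name is hypothetical, and a production pipeline would need far more robust handling of layout, tables, scanned pages, and non-English text.

```python
# Minimal PDF text and metadata extraction with pypdf (illustrative only).
from pypdf import PdfReader

reader = PdfReader("report.pdf")  # hypothetical input file

# Document-level metadata; many PDFs leave these fields empty.
meta = reader.metadata
print("Title:", meta.title if meta else None)
print("Author:", meta.author if meta else None)

# Extract raw text page by page and join it for downstream chunking.
pages_text = [page.extract_text() or "" for page in reader.pages]
full_text = "\n".join(pages_text)
```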

Query Flow

When a user issues a query to your RAG pipeline, the pipeline first processes the query to retrieve the highest-quality results (or facts), and then those facts are sent to the generative LLM (such as GPT-4, Anthropic's Claude, or Llama 3.2), along with an appropriate prompt, to generate the final response grounded in those facts.

In its simplest form, the first step is to encode the query with the embedding model, turning it into a vector that can be used to retrieve the most relevant facts from the vector database. In some use cases, however, it's useful to first send the query to an LLM for rephrasing or other query pre-processing, and then proceed with the encoding. It can also be useful to apply some light text processing (for example, handling special characters) to get the best results.
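As a minimal sketch of this step, the snippet below applies light cleanup to the query and encodes it with the same (illustrative) embedding model used at ingest time. The cleanup rules are assumptions, and an optional LLM-based rephrasing step is omitted.

```python
# Minimal query pre-processing and encoding sketch (illustrative only).
import re
from sentence_transformers import SentenceTransformer

def preprocess(query: str) -> str:
    """Light cleanup: drop stray special characters and collapse whitespace."""
    query = re.sub(r"[^\w\s?'-]", " ", query)
    return re.sub(r"\s+", " ", query).strip()

# The query must be encoded with the same model used during ingestion.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vector = model.encode(preprocess("What were Q3 revenues?"))
```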

With the vector representation of the query in hand, you can run a pure vector (aka semantic) search operation (essentially a similarity search between the vector encoding of the query and that of every ingested chunk) and get a list of the top N most relevant search results. However, in many cases a much more refined search mechanism is required for the best quality and relevance, which may include capabilities like hybrid search (combining semantic and keyword matching), reranking of results, and multilingual retrieval.
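Here is a minimal sketch of that pure vector search step, ranking chunk vectors by cosine similarity to the query vector. It assumes the `chunks`, `embeddings`, and `query_vector` variables from the earlier sketches; a real system would use a vector database rather than in-memory NumPy arrays.

```python
# Minimal top-N semantic search by cosine similarity (illustrative only).
import numpy as np

def top_n(query_vector, embeddings, chunks, n=5):
    q = query_vector / np.linalg.norm(query_vector)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                          # cosine similarity per chunk
    best = np.argsort(-scores)[:n]          # indices of the N best chunks
    return [(chunks[i], float(scores[i])) for i in best]

results = top_n(query_vector, np.asarray(embeddings), chunks)
```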

The point here is that building an accurate retrieval engine in your RAG pipeline is often more complex than it initially seems. You not only need to integrate semantic search with keyword search, support various reranking strategies, and handle multiple languages, but you also need to do all of this with very low latency (under 100ms is common), eliminate potential choke points, and ensure everything still works well as the number of documents scales to fit the needs of your organization.

When the retrieval step is done, the query flow has two more steps. First, you craft a RAG prompt and call a generative LLM; this might be GPT-4o, Anthropic's Claude, Llama-3.2, or Vectara's Mockingbird. Then, once the generative response is returned, you run it through a validation step (which typically involves calling a hallucination detection model like HHEM, PII detection, and bias reduction) to ensure low hallucination rates and a final response that complies with organizational policies.
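The sketch below shows these last two steps in simplified form, assuming an OpenAI-compatible chat API and the retrieval results from the earlier sketch. The prompt template is an assumption, and the validation function is only a placeholder for a real hallucination-detection model such as HHEM plus PII and policy checks.

```python
# Minimal generation and validation sketch (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(query: str, facts: list[str]) -> str:
    """Build a grounded RAG prompt from the retrieved facts and call the LLM."""
    context = "\n".join(f"- {fact}" for fact in facts)
    prompt = (
        "Answer the question using ONLY the facts below. "
        "If the facts are insufficient, say so.\n\n"
        f"Facts:\n{context}\n\nQuestion: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def validate(answer: str, facts: list[str]) -> str:
    """Placeholder validation step: a real pipeline would run a hallucination
    detection model (e.g., HHEM), PII detection, and policy checks here."""
    if not answer or not answer.strip():
        raise ValueError("Empty response from the generative LLM")
    return answer

facts = [chunk for chunk, _score in results]
final_answer = validate(generate_answer("What were Q3 revenues?", facts), facts)
```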

Smaller models in RAG

Since the beginning of 2024, we've seen increasing research demonstrating that specialized smaller models can be created to achieve superior performance. This is best exemplified by OpenAI's GPT-4o mini and Google's Gemini-1.5 Flash (which are most likely smaller than their full-scale counterparts), as well as open-source models like Phi-3.5 or Llama-3.2 3B and 11B.

At Vectara, we recently launched Mockingbird, a small and fast model specialized for RAG. A state-of-the-art RAG pipeline can combine a few different specialized (smaller) models with highly tuned data orchestration to retrieve the best context as input to the LLM generating the response.

These smaller models and algorithms are comparable in performance to large LLMs on their specialized tasks, but they are much faster and allow the generative LLM to spend its compute on the critical context.

Summary

In this blog post, we've shown what's under the hood in a RAG pipeline and highlighted a common initial expectation that RAG is relatively easy to build. This could not be further from the truth.

Enterprise use of RAG raises the stakes even further: all of these steps add up to a RAG stack that requires careful design, constant tuning, and frequent updates and maintenance.

Like many other types of applications, building a RAG system is a journey, not a one-time effort. It requires continuous investment in systems engineering, a deep understanding of the complex and constantly changing landscape of embedding models, generative LLMs, and search and retrieval, as well as continuous updates and improvements.

Vectara provides an end-to-end RAG platform, where all this complexity is abstracted behind an easy-to-use API. We continuously manage this complexity so you don’t have to.

To build your own RAG application with Vectara, simply sign up for a free Vectara account, upload your data, and get started in minutes. If you need help, you can find us in the Vectara discussion forum or on Discord.
