Retrieval Augmented Generation: Making Generative AI Safe, Trustworthy, and More Relevant

Generative Artificial Intelligence, or Gen AI, is the technology that sits behind many of the amazing systems, such as ChatGPT and Stable Diffusion, that have captured attention in recent months. It creates content based on simple instructions from users. This is revolutionizing the way humans interact with computer systems.

But Gen AI has serious drawbacks once you take a deeper look at the data privacy, trust, and cost implications. Vectara addresses these drawbacks with an approach called “Retrieval Augmented Generation,” which we also call “Grounded Generation.” It lets you offer your users the benefits of Gen AI in a more trustworthy and cost-effective manner.

What Is Generative AI?

Gen AI systems generate media in response to prompts from users. They use generative models, such as large language models (LLMs), to output data based on the input training data set that was used to create them.

Thanks to the popularity of tools like ChatGPT and Bing Chat, most people are familiar with the text generation flavor of Gen AI, whereby emails, summaries, or simply chatbot responses are generated. While there are other Gen AI tools that create images, source code, audio, or video, in this article we’re going to focus on the usage of Gen AI to create textual answers, summaries, or conversation responses.

Drawbacks of Generative AI

There are three main drawbacks with using generative AI to provide information and answers to your users:

Hallucinations
Data Leakage
High Costs

Hallucinations

While Gen AI is a powerful technique, it frequently creates content that is false, inconsistent, or semantically incorrect. This is called hallucination, and it comes from the underlying model providing a confident response using incorrect, incomplete, or outdated facts.

The effects of this range from humorous to annoying, and could even be dangerous if the user blindly accepts the response as true without looking at it with a critical eye.

Data Leakage

There is also a very real risk of data leakage, because the models powering the Gen AI systems are regularly trained on the prompts that users issue. For example, what if the Gen AI vendor trains the next version of their model on data you provide in a prompt, then later on another user issues a similar prompt? The system will potentially include information learned from your data in the results sent to the other user. Some companies, such as Samsung, have already experienced data leaks.

Many others, including Citigroup, Bank of America, Deutsche Bank, Goldman Sachs, and Wells Fargo, have reportedly banned usage of ChatGPT due to this and other risks. Even Italy has banned ChatGPT, at least for the time being.

Costs

Finally, Gen AI systems are relatively costly to use, due to the heavy compute requirements associated with a single generation operation, as well as the need to amortize the extremely high cost of model training across all subsequent model users.

Introducing Retrieval Augmented Generation (RAG), the Solution to Generative AI’s Drawbacks

Grounded Generation, sometimes referred to as Retrieval Augmented Generation, is an approach that avoids these drawbacks associated with generation of textual answers, summaries, etc. It does this by first retrieving the facts from your data (and only your data) that are most relevant to your prompt, and then summarizing only those facts.

Therefore it “grounds” the generated text in facts from your data. This gives your user the best of both worlds – an easy to digest summarized response and also a detailed list of the relevant facts that were retrieved and crafted into that response.
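
To make the retrieve-then-summarize flow concrete, here is a minimal sketch in Python. It is not Vectara's implementation: the bag-of-words cosine similarity stands in for a real retrieval model, and the function names (retrieve, build_grounded_prompt) and the sample documents are hypothetical. The prompt it builds would then be handed to whichever generative model you use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=2):
    """Return up to top_k documents most similar to the query (zero scores dropped)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:top_k] if score > 0]


def build_grounded_prompt(query, facts):
    """Assemble a prompt that restricts the generator to the retrieved facts."""
    numbered = "\n".join(f"[{i}] {fact}" for i, fact in enumerate(facts, start=1))
    return (
        "Answer the question using only the facts below.\n"
        f"Facts:\n{numbered}\n"
        f"Question: {query}"
    )


# Hypothetical source data: only the refund-related lines should be retrieved.
documents = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Refunds require the original receipt.",
]
query = "How do refunds work?"
facts = retrieve(query, documents)
prompt = build_grounded_prompt(query, facts)
print(prompt)
```

The generative step only ever sees the numbered facts, which is what keeps the response tied to your data rather than to whatever the model remembers from training.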

The following diagram shows the key components and data flows in this architecture.

Figure 1. Vectara’s Grounded Generation Architecture

Grounded Generation is different from pure Gen AI systems, such as ChatGPT and Bard, and generative LLMs like PaLM and LLaMA. In these systems the text generation solely uses information learned during the training phase and stored within the underlying model’s “brain”.

Grounded Generation only works if the retrieval of the relevant information is incredibly effective. All the relevant facts should be retrieved, they should be precise, and false positives should be avoided. This must hold true regardless of the specific words used in the source data and the user prompt.

Furthermore, it should hold true regardless of the language used in the source data and the user prompt, as seen below with the cross-language summarization example in Figure 2. This gives the system a robustness that allows it to work well regardless of how the user interacts with it and what the source data looks like.

It is for this reason that Vectara invests so much into its core retrieval capability, which is optimized both for semantic understanding and for the exact, non-semantic matches that keyword search systems have long focused on.
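
One common way to combine the two kinds of matching is a weighted blend of a semantic score and a keyword score. The sketch below illustrates that general idea only; it does not describe how Vectara's retrieval works. The hand-picked two-dimensional word vectors, the alpha weight, and all function names are assumptions made up for the example.

```python
import math

# Hypothetical 2-d word vectors, hand-picked so that synonyms point the same way.
# A real system would use a trained embedding model instead.
TOY_VECTORS = {
    "car": (1.0, 0.0),
    "automobile": (1.0, 0.0),
    "invoice": (0.0, 1.0),
    "bill": (0.0, 1.0),
}


def embed(text):
    """Average the toy vectors of the known words; unknown words are ignored."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0)
    return (sum(v[0] for v in vecs) / len(vecs), sum(v[1] for v in vecs) / len(vecs))


def semantic_score(query, doc):
    """Cosine similarity between the toy embeddings of query and document."""
    q, d = embed(query), embed(doc)
    norm = math.hypot(*q) * math.hypot(*d)
    return (q[0] * d[0] + q[1] * d[1]) / norm if norm else 0.0


def keyword_score(query, doc):
    """Fraction of query words that appear verbatim in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def hybrid_score(query, doc, alpha=0.5):
    """Blend semantic and exact-match evidence; alpha is an assumed tuning weight."""
    return alpha * semantic_score(query, doc) + (1 - alpha) * keyword_score(query, doc)


documents = [
    "automobile repair manual",
    "car model X200 recall notice",
    "invoice template",
]
query = "car X200"
ranked = sorted(documents, key=lambda d: hybrid_score(query, d), reverse=True)
print(ranked)
```

In this toy setup the semantic score lets "automobile" match a query about a "car", while the keyword score rewards the document that contains the exact part number "X200", something a purely semantic model could easily miss.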

Having top notch information retrieval makes our Grounded Generation incredibly effective.

Figure 2. Example of Vectara’s Cross-Language Answer Summarization

Why Retrieval Augmented Generation Matters

Widespread adoption of Gen AI requires the trust of end users. Grounded Generation is the safest way to introduce such a powerful technique into your organization. Indeed, there will soon be many regulatory and legal requirements related to how Gen AI is used, which will force architectures like Grounded Generation to be adopted.

We’ll see rules like:

Organizational and customer data cannot be used to train a third party’s model
GDPR – if a user’s data has been used to train a model, how can that be communicated to a user and how can that data be proven to have been deleted?
Any data that was generated by an AI system must be identified as such, with provenance information provided for context or so it can be fact checked

How Retrieval Augmented Generation Solves Generative AI’s Drawbacks

Retrieval Augmented Generation solves the drawbacks of Gen AI by:

Reducing risk of data leaks
Reducing hallucinations
Keeping up with the volume of your data
Lowering cost
Keeping up with the speed of your data

Reduce Risk of Data Leaks

Perhaps the most important quality of Grounded Generation is that the underlying generative model does not need to be trained on your data in order for you to do generation using your data. The LLM that Vectara uses for retrieval is a zero-shot model – it is trained only on publicly available data, and never on user data. So there is absolutely no risk of your data leaking out to other users or systems via the model. Providing this guarantee is table stakes in most organizations.

Aside from the data protection aspects, there is a considerable strategic benefit as well. Your data is an asset, and it is valuable. This is why Reddit is now charging money to use their data to train models. By preventing your data from being used to train another organization’s models, you do not let them benefit from your data. Also, you preserve your ownership of that data should you want to monetize it at some point in the future.

Increase Trust by Reducing Hallucinations

The models that power Gen AI systems have vast knowledge about the world embedded in the weights and the connections in the underlying neural networks. But they do not know everything, and they can sometimes get confused or respond with incorrect assumptions, at which point they hallucinate.

Grounded Generation significantly reduces the probability of hallucination because it relies only on the relevant facts that were retrieved from your data based on the user’s query/question/prompt. It is not relying on the long term memory built up solely from the generative model’s training data when it responds. This keeps the system dialed in on your data. View the Avoiding hallucinations in LLM-powered Applications blog post to get more details.

Additionally, we provide transparency into the generated response because citations are provided to explain which facts were used for the different parts of the answer. Users can always fact-check by viewing the source data that is fed into the generated content. Bing Chat takes this approach, seen in Figure 3 below, and so does Vectara, as seen in Figure 2 above.
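
As a rough illustration of how citations can be tied back to sources, the sketch below assumes the generated answer marks its supporting facts with bracketed numbers such as [1]. That marker format, the resolve_citations helper, and the sample answer are all hypothetical; they are not the actual response format of Vectara or Bing Chat.

```python
import re


def resolve_citations(answer, sources):
    """Map each [n] marker in the answer to the numbered source it refers to.

    Markers that point outside the source list are ignored.
    """
    cited = {}
    for marker in re.findall(r"\[(\d+)\]", answer):
        index = int(marker)
        if 1 <= index <= len(sources):
            cited[index] = sources[index - 1]
    return cited


# Hypothetical retrieved facts and a hypothetical generated answer.
sources = [
    "Refunds are issued within 14 days of purchase.",
    "Refunds require the original receipt.",
]
answer = "You can get a refund within 14 days [1] if you have the receipt [2]."

for number, text in sorted(resolve_citations(answer, sources).items()):
    print(f"[{number}] {text}")
```

With a mapping like this, a user interface can show the exact source passage next to each claim so that readers can check it for themselves.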

Figure 3. Example of Citations in Bing Chat’s Search [Image courtesy of Microsoft]

Keep Up with the Volume of your Data

Because pure Gen AI systems rely on the generative model’s knowledge that was established at training time, in order to incorporate current data the system must let users provide additional information as context within a prompt. While this is commonly allowed, only a relatively small amount of contextual data can be supported, even with GPT-4 which supports up to about 50 pages worth of text.

This limits the amount of data that can be used, forces the user or client application to select which data to provide as context, and increases cost because more tokens are processed in the prompt (see below for more details).

The end result is a dramatic reduction in the scope and quality of what you can achieve with pure Gen AI.

Allowing more data to be provided in the prompt helps the system to generate more acceptable responses, but only to a certain extent. What if there is a large amount of data available as context – e.g. many MBs or GBs or even TBs? What if the user or client application is not able to identify exactly which subset of the available data should be provided as context?

Because Grounded Generation places so much emphasis on first retrieving the facts that are most relevant to the prompt, then summarizing only those specific facts, there is essentially no upper limit on the volume of data that the system can work with. In fact, Vectara regularly tests on TB-scale raw data sets, maintaining high retrieval metrics (precision, recall, F1, etc.) with low latency (p50 < 100ms).
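
For readers less familiar with the retrieval metrics mentioned above, here is a small sketch of how precision, recall, and F1 are computed for a single query. The document IDs are invented for the example and do not come from any Vectara test set.

```python
def retrieval_metrics(retrieved, relevant):
    """Return (precision, recall, f1) for one query.

    precision: share of retrieved items that are relevant
    recall:    share of relevant items that were retrieved
    f1:        harmonic mean of precision and recall
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: four results returned, three documents truly relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall, f1 = retrieval_metrics(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found. High precision means few false positives reach the generative step, and high recall means few relevant facts are left out.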

This means that with Grounded Generation, you can use orders of magnitude more data in the application to which you will be applying generative AI capabilities.

Lower Costs

The costs associated with using Gen AI fall into two categories: cost of inference at runtime and cost of training. Grounded Generation helps reduce both types of cost.

Cost of Inference

The amount of compute required at inference time to do generation tends to be very high, due to the size of the generative models. This is exacerbated when a large amount of data is provided in the prompt as context, as mentioned in the previous section.

Because Grounded Generation first whittles down the entire source data set to the very small percentage of relevant facts, the amount of data provided to the generative model for summarization is very small. That means that the unit cost per generation is much lower relative to doing a similar summarization operation using pure Gen AI, where you have to provide a larger amount of data as context in the prompt.
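
A back-of-the-envelope calculation shows why prompt size matters. Every number in the sketch below is hypothetical (the per-token price and both token counts are invented for illustration and are not quotes from any vendor), and it assumes cost scales linearly with the number of prompt tokens.

```python
def prompt_cost(prompt_tokens, price_per_1k_tokens):
    """Cost in dollars of processing a prompt, assuming linear per-token pricing."""
    return prompt_tokens / 1000 * price_per_1k_tokens


# All values below are made up for illustration.
PRICE_PER_1K_TOKENS = 0.01     # hypothetical price in dollars
FULL_CONTEXT_TOKENS = 30_000   # hypothetical: pasting many pages into the prompt
GROUNDED_TOKENS = 1_500        # hypothetical: only the retrieved facts

full_cost = prompt_cost(FULL_CONTEXT_TOKENS, PRICE_PER_1K_TOKENS)
grounded_cost = prompt_cost(GROUNDED_TOKENS, PRICE_PER_1K_TOKENS)

print(f"Full context: ${full_cost:.3f} per request")
print(f"Grounded:     ${grounded_cost:.3f} per request")
print(f"Ratio:        {full_cost / grounded_cost:.0f}x")
```

With these made-up numbers the grounded prompt is 20 times smaller and therefore 20 times cheaper per request. The real ratio depends entirely on your data and your provider's pricing.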

Cost of Training

Models used for Gen AI require a vast amount of data and compute resources to train. That equates to very high training costs. For point of reference, GPT-3 is estimated to have cost $4.6M to train. This cost is ultimately passed along to users when they consume the model.

Thankfully, with Grounded Generation the generative model does not need to be trained on your specific data in order to provide relevant responses. As a result there is no need for the end user to eat their share of the cost of training the model on their data.

Keep Up with the Speed of your Data

Because pure Gen AI relies solely on the knowledge within the underlying model, if new data arrives then the model must be retrained or extended using that data before it can incorporate that data into a response. This is a time-consuming process, requiring weeks if not months for the largest models, so there is a lag time between when your new data is available and when it can be used in generation.

Vectara’s Grounded Generation approach does not suffer from this limitation because our retrieval model does not need to be trained on new data in order to be able to effectively find the facts relevant to what you are looking for. We support Instant Indexing, so new info is reflected in the results, answers, conversations, etc. within just a few seconds.

This property also increases the trustworthiness of Gen AI – in order to have widespread adoption, users must believe that the system has a completely up to date view of the world. Grounded Generation engenders that trust.

An important final point about freshness of data deals with removing data from the system. When data is no longer relevant or correct, or if a user opts to have their data deleted (e.g. a GDPR right to erasure request), the system must immediately remove it.

With pure Gen AI this is a very time-consuming process that requires removing the data in question from the training set and then retraining the model.

With Grounded Generation this is trivial – the data to be removed is simply deleted from the search index that is used in the initial retrieval step. In less than one second the system is compliant.
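
The sketch below shows why adding and deleting data is cheap when knowledge lives in an index rather than in model weights. It uses a toy in-memory dictionary, not Vectara's actual index, and the FactIndex class and its methods are hypothetical names chosen for the example.

```python
class FactIndex:
    """Toy in-memory index: documents become searchable or unsearchable immediately."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, text):
        """Index a document; it is visible to the very next search."""
        self._docs[doc_id] = text

    def delete(self, doc_id):
        """Remove a document; no model retraining is involved."""
        self._docs.pop(doc_id, None)

    def search(self, query):
        """Return IDs of documents sharing at least one word with the query."""
        terms = set(query.lower().split())
        return [
            doc_id
            for doc_id, text in self._docs.items()
            if terms & set(text.lower().split())
        ]


index = FactIndex()
index.add("u1", "alice lives in paris")
found_before = index.search("alice")   # the new document is found right away

index.delete("u1")                      # e.g. after an erasure request
found_after = index.search("alice")    # the document can no longer be retrieved

print(found_before, found_after)
```

Once a document is gone from the index it can no longer be retrieved, so it can no longer reach the generative step. No retraining is needed because the generative model never learned the data in the first place.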

Conclusion

The art of the possible just took a quantum leap forward with the advent of Gen AI. It will forever change how humans interact with their computer systems and their data. But organizations can only achieve what’s possible if they focus on trustworthiness, cost, and data security. Because Grounded Generation places the retrieval step before the generative step, it is able to provide these all-important attributes, and your end users benefit from the result.

We are in the midst of incredible times. Not only can you access the massive technological muscle of Gen AI through some of the tools we’ve mentioned, you actually have the opportunity to take the next step and experience the power of Vectara’s Grounded Generation by signing up for our “free forever” Growth plan. This offers generous data limits that reset every month (at no cost). Join us as we build the future of AI responsibly together.