A Reference Architecture for Grounded Generation

Introduction

Grounded Generation (GG) powers an increasingly common category of GenAI applications, with use cases like question answering and chatbots. Also known as retrieval-augmented generation (RAG), these applications derive answers to user queries by combining the strength of pre-trained large language models (LLMs) with a strong retrieval engine that picks the most relevant contextual text to answer each specific question.

Grounded Generation greatly reduces hallucinations, and allows developers to build GenAI applications that are based not only on publicly available data, but also on data that is internal to the company and specific to the use case.

With the recent explosion in GG applications, an obvious question to ask is: “what components do I need in this emerging stack, and how do I use them effectively?”

In this blog post we share our reference architecture for “Grounded Generation”, and describe how GenAI platforms like Vectara reduce the complexity of this architecture, allowing developers to focus on creating their business application using LLMs.

Grounded Generation Reference Architecture

Ok, let’s jump right into it.

Figure 1 shows the reference architecture for Grounded Generation, highlighting two distinct flows: the data-ingestion flow and the query-response flow.

Figure 1: Grounded Generation Reference Architecture

The arrows demonstrate the data-ingestion flow, wherein the data used for Grounded Generation is processed and prepared for querying.

This data may originate from:

Cloud data stores such as Amazon Redshift, Google BigQuery, Azure Cosmos DB, Databricks, Snowflake, MongoDB, DataStax, CouchDB and many others.
Files stored in S3, Dropbox, Google Drive, Microsoft OneDrive or simply on local storage.
Data stored in SaaS enterprise applications like Notion, Coda, Salesforce, Workday, HubSpot, Asana, Monday, Jira, Confluence, Docusaurus, or internal wikis.

Since GenAI applications deal with text, we need to translate any input data into an appropriate text document format. If the source data is in some binary format like PDF, PPT or DOCX, then we first extract the actual text from these documents. If the data is in a database, the text is derived from one or more columns or documents in the database. This kind of document processing is often application specific and depends on the format of the source data.
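As a rough illustration of the database case, the sketch below builds one text document per row by joining selected columns. It uses an in-memory SQLite database from Python's standard library; the `tickets` table, its columns and the `rows_to_documents` helper are made up for this example, and a real pipeline would depend on your own schema.

```python
import sqlite3


def rows_to_documents(conn, table, id_column, text_columns):
    """Build one text document per row by joining the chosen text columns."""
    cols = ", ".join([id_column, *text_columns])
    # Identifiers are interpolated directly, so they must come from trusted
    # configuration, never from user input.
    cursor = conn.execute(f"SELECT {cols} FROM {table}")
    docs = {}
    for row in cursor:
        doc_id, *fields = row
        # Skip NULL or empty columns so they do not add blank lines.
        docs[str(doc_id)] = "\n".join(str(f) for f in fields if f)
    return docs


# Hypothetical example data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, title TEXT, body TEXT)")
conn.execute("INSERT INTO tickets VALUES (1, 'Login fails', 'Users see a 500 error.')")
conn.execute("INSERT INTO tickets VALUES (2, 'Slow search', NULL)")
docs = rows_to_documents(conn, "tickets", "id", ["title", "body"])
```

Binary formats such as PDF or DOCX would need a dedicated extraction library in place of the SQL query, which is why this step tends to be application specific.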

Once a text document is ready, we need to split the text into reasonably-sized “chunks” (or segments) that are appropriate for retrieval. These can simply be chunks of a fixed size (e.g. 2000 characters) or, in more advanced implementations, actual sentences.
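The two chunking strategies can be sketched in a few lines of Python. This is a minimal sketch, not a production chunker: the function names are ours, and the sentence splitter uses a naive punctuation rule that will mis-split abbreviations.

```python
import re


def chunk_by_size(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into consecutive chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def chunk_by_sentence(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace.

    This simple rule is an assumption for illustration; it will wrongly
    split text such as "e.g. this".
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Fixed-size chunks are trivial to produce but can cut a sentence in half, while sentence-based chunks keep each unit of meaning intact at the cost of a more careful splitter.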

Using an embedding model, we then compute a “vector embedding” representation for each chunk of text, and store both the vector and text in a vector store, which allows for efficient semantic retrieval later on.
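To make the idea concrete, here is a toy sketch of this step. The `toy_embed` function is only a stand-in for a real embedding model (it hashes words into a fixed-size vector and captures no semantics), and `InMemoryVectorStore` is a hypothetical class rather than any particular vector database.

```python
import hashlib
import math
from dataclasses import dataclass, field

DIM = 64  # toy dimensionality; real embedding models use hundreds or more


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class InMemoryVectorStore:
    """Keeps each chunk's text next to its vector."""
    vectors: list[list[float]] = field(default_factory=list)
    texts: list[str] = field(default_factory=list)

    def add(self, text: str) -> int:
        """Embed and store one chunk; return its position in the store."""
        self.vectors.append(toy_embed(text))
        self.texts.append(text)
        return len(self.texts) - 1


store = InMemoryVectorStore()
for chunk in ["Invoices are due in 30 days.", "Refunds take 5 business days."]:
    store.add(chunk)
```

Storing the text alongside the vector matters because the retrieval step returns text, not vectors, to the generative LLM.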

This process of data ingestion is often performed once, when we deploy the application and all data is processed for the first time. After that, incremental updates provide an efficient mechanism to update the stored vector embeddings, or add new ones.
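One common way to decide what an incremental update has to touch is to compare content hashes. The sketch below assumes we recorded a hash per document at the last ingestion; the function names are illustrative, and handling of deleted documents is our addition rather than something described above.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint a document so that changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(indexed: dict[str, str], current: dict[str, str]):
    """Compare stored hashes with the current documents.

    indexed: doc_id -> hash recorded at the last ingestion
    current: doc_id -> current document text
    Returns (to_upsert, to_delete) as sorted lists of doc_ids.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current.items()
        if indexed.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in current)
    return to_upsert, to_delete
```

Only the documents in `to_upsert` need to be re-chunked and re-embedded, which keeps the cost of an update proportional to what actually changed.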

The green arrows demonstrate the query-response flow, whereby a user issues a query and expects a response based on the most relevant information available in the ingested data.

First, we encode the query itself with the same embedding model, and use an approximate nearest neighbor (ANN) search algorithm to retrieve a ranked list of the most relevant chunks of text available in the vector store.
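The retrieval step can be sketched as follows. Note the assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and the search is an exact brute-force scan rather than ANN. ANN indexes approximate this same ranking while avoiding a comparison against every stored vector.

```python
import hashlib
import heapq
import math


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def top_k(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k chunks with the highest cosine similarity to the query.

    Vectors are unit length, so the dot product equals cosine similarity.
    The query and the chunks must go through the same embedding function.
    """
    q = toy_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, toy_embed(chunk))), chunk)
        for chunk in chunks
    ]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


chunks = [
    "refunds take five business days",
    "invoices are due in thirty days",
    "our office is closed on sunday",
]
results = top_k("when are invoices due", chunks, k=2)
```

In a real system the chunk vectors would be computed once at ingestion time and read from the vector store, not re-embedded on every query as this sketch does.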

It is important to get this step right: if your implementation is not able to retrieve the most relevant information, then all downstream operations will suffer in quality, since the Generative LLM won’t have the right facts to base its response on. In other words: garbage-in equals garbage-out.

With the most relevant chunks in hand, we construct a comprehensive prompt for the LLM, including the user question and all the relevant information. That complete prompt is sent to a generative LLM from a provider like OpenAI, Cohere or Anthropic, or to one of the open source LLMs like Llama 2. Once a response is generated, it can optionally be sent to a “validation” service (like Nvidia’s NeMo Guardrails), and finally back to the user.
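Prompt construction might look something like the sketch below. The instruction wording, the numbered-passage layout and the character budget are all assumptions made for illustration; real applications tune these for their chosen LLM.

```python
def build_prompt(question: str, chunks: list[str], max_context_chars: int = 6000) -> str:
    """Assemble a grounded prompt from ranked chunks, most relevant first."""
    context_lines = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        line = f"[{i}] {chunk.strip()}"
        if used + len(line) > max_context_chars:
            break  # stop before exceeding the context budget
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered passages below. "
        "If the passages do not contain the answer, say you do not know.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "When are invoices due?",
    ["Invoices are due in 30 days.", "Refunds take 5 business days."],
)
```

Numbering the passages makes it possible to ask the model to cite which passage supports its answer, and the budget keeps the prompt within the model's context window.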

Finally, it’s important to highlight another optional step, depicted by the red arrow: the ability to take action based on the response via enterprise automation and integration. If the response generated is trusted to be correct, we can use the response to take action on our behalf, for example by sending an email or adding a task to Jira. This often involves integration with enterprise/SaaS applications like Jira, Notion, Asana, email or Google Drive.
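A minimal sketch of such a gate is shown below. Everything here is hypothetical: the `ValidatedResponse` type, the `trusted` flag (which we assume is set by an earlier validation step) and the `create_task` handler, which only appends to a list where a real integration would call an external service.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ValidatedResponse:
    text: str
    trusted: bool  # assumed to be set by a validation step
    action: str    # hypothetical action name, e.g. "create_task"


def dispatch(
    response: ValidatedResponse,
    handlers: dict[str, Callable[[str], int]],
) -> Optional[int]:
    """Run the matching handler only for trusted responses."""
    if not response.trusted:
        return None  # never act on a response that failed validation
    handler = handlers.get(response.action)
    return handler(response.text) if handler else None


created_tasks: list[str] = []


def create_task(text: str) -> int:
    """Stand-in for a real integration; returns the new task count."""
    created_tasks.append(text)
    return len(created_tasks)


handlers = {"create_task": create_task}
```

Keeping the trust check in one place means no integration can be triggered by a response that skipped validation.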

Simplifying the Grounded Generation Architecture with Vectara

Creating a GenAI application using the do-it-yourself (DIY) stack for Grounded Generation shown in Figure 1 may seem easy at first. It is, if your goal is a first working prototype.

However, moving from a prototype to a scalable production deployment of a GenAI application requires learning and understanding multiple components as well as specialized expertise in retrieval engines and embedding models, generative LLMs, prompt engineering, and vector databases. And then there is additional effort needed to productionize the system, as well as continuous DevOps and SRE (Site Reliability Engineering) efforts to keep the system up and running.

This makes the whole experience quite challenging.

What’s more, the GenAI space continues to evolve at breakneck speed, with more components and choices, new types of models, and new benchmarks every week. Hugging Face alone has more than 270,000 models!

How can one keep track of all these things, while focusing on their business application?

Frameworks like LangChain or LlamaIndex have recently gained popularity since they provide a programming framework that helps integrate all these components together.

This helps with prototyping, but is not enough.

First, users of such frameworks need to be well-versed in ML to know which models to use, and have to continuously keep up with state-of-the-art models to know when to upgrade or change models. Second, moving applications built with such frameworks to production is difficult, and requires integration with other production tools for monitoring, alerts, and change management.

This is where GenAI platforms like Vectara really shine: they encapsulate a lot of the functionality of the GG stack in a single platform. With a team of experts in machine learning, Vectara builds and uses the best components for various parts of the pipeline, handles all upgrades and keeps up with the state of the art. This is shown in Figure 2:

Figure 2: Building “Grounded Generation” Applications with the Vectara Platform

Vectara’s platform provides two easy-to-use API endpoints – one for indexing documents and the other for running queries against the data previously indexed.

Data still originates from enterprise data stores, and as an application developer you don’t have to implement and test the complex (and often language-specific) pre-processing and data extraction. Instead you call the indexing API, and Vectara processes the input files or text data, encodes each chunk of text with our state-of-the-art embedding model and stores the results in the internal vector store.

To respond to a user query – you just call the query API. Vectara then encodes the query into its vector embedding, retrieves the most relevant chunks of text from the retrieval engine, creates the prompt, and calls the generative LLM to construct the actual response to the user query.

The Emergence of LLM Platforms

The emergence of GenAI platforms like Vectara is not surprising. In fact, if we look at the history of technology, we’ve seen quite a few cases where complex technology stacks are simplified with end-to-end platforms.

A good example of this is Heroku, which had a significant impact on the development community. Before Heroku, developers or organizations had to spend significant resources on system administration to manage their servers, databases, and networking, both for deployment and maintenance. Deploying a web application would often involve tasks such as setting up a server, configuring the operating system, installing a web server like Apache or Nginx, setting up the database, and configuring network settings.

With Heroku, developers simply push their application code (written in any of several supported languages) to the platform using Git. Heroku then automatically handles deployment, from provisioning and managing servers, to deploying the web application and database, to handling scaling and load balancing. Developers don’t have to worry about system administration or infrastructure, and can focus on developing their applications.

The similarity with Grounded Generation is striking. Instead of having to deal with the complexity of setting up an end-to-end GenAI application stack, GenAI application developers can use platforms like Vectara to just focus on developing their application, knowing they can easily deploy it into production safely and at scale, with security and privacy built in.

Let’s inspect some of the critical tasks that the Vectara platform handles:

Data processing: Vectara supports various file types for ingestion including markdown, PDF, PPT, DOC, HTML and many others. At ingestion time, the text is automatically extracted from the files, and chunked into sentences. Then a vector embedding is computed for each chunk, so you don’t need to call any additional service for that.

Vector and text storage: Vectara hosts and manages the vector store (where the document embeddings are stored) as well as the associated text. Developers don’t need to go through a long and expensive process of evaluating and choosing a vector database. Nor do they have to worry about setting up that vector database, managing it in their production environment, re-indexing, and many other DevOps considerations that become important when you scale your application beyond a simple prototype.

Query flow: When a query is issued, calculating the embedding vector for that query and retrieving the resulting text segments (based on similarity match) is fully managed by Vectara. Vectara also provides a robust implementation of hybrid search and re-ranking out of the box, which together with a state-of-the-art embedding model helps ensure the most relevant text segments are returned in the retrieval step.

Response Generation: Vectara constructs the right prompt to use for the generation step, and calls the generative summarization LLM to return the response to the user’s query.

Security and Privacy: Data sent through Vectara’s API is encrypted in transit and at rest, and the platform supports customer-managed keys (CMK). We never train on your data, so you can be sure your data is safe from privacy leaks.
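To give a feel for what the hybrid search mentioned under “Query flow” involves in a DIY stack, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion (RRF). This is one well-known fusion technique chosen for illustration; we are not claiming it is how Vectara implements hybrid search, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Ties are broken by document ID.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["a", "b", "c"]   # e.g. from a keyword search
semantic_ranking = ["b", "d", "a"]  # e.g. from vector similarity
fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
# fused == ["b", "a", "d", "c"]
```

Documents that rank well in both lists rise to the top, which is the basic intuition behind combining keyword and semantic retrieval.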

Summary

“Grounded Generation” is quickly becoming a prominent category of GenAI applications, and with it a new reference stack is emerging.

As application developers experiment with this new stack, they often realize that the do-it-yourself approach is complex and requires a lot of expertise.

GenAI platforms like Vectara provide a powerful yet easy-to-use set of APIs that allow developers to focus on building their application, instead of having to specialize in the increasingly complex and constantly evolving set of skills required to build such applications on their own.

To demonstrate the power of the Vectara platform, we have built a few sample applications like AskNews and AskHBS.

To see how easy it is to build your own app, sign up for a free starter account today, and create a robust, scalable and enterprise-ready GenAI application within hours.