
Retrieval Augmented Generation (RAG) Done Right: Database Data

Ingesting data from your database into Vectara: a step-by-step guide.

    Introduction

    Retrieval Augmented Generation (or RAG) pipelines are increasingly becoming the common way to implement question answering and chatbot applications using Large Language Models (LLMs) with your own data.

    Some of your data may reside on S3 or Google Drive as files in PDF format or MS Office documents, but in many cases your data is stored in structured tables within a database like Snowflake, Redshift, or Postgres, or in document databases like MongoDB or CouchDB.

    This data is often critical to your business, and applying RAG to it allows your users to interact with it in entirely new ways, such as question answering, chatbots, or summarization. The end result is increased productivity, higher sales conversions, or improved user engagement.

    How do I build a RAG pipeline with this kind of structured or semi-structured data?

    In this blog post we’ll cover the process to do just that, and review some best practices. We will focus on structured data commonly stored in RDBMS systems, as reflected in the code, but the approach described here works just as well for document databases.

    Using GenAI with Databases

    Most business-critical data in the enterprise is organized and stored in relational databases, and SQL is still the primary way people query this data to gain insights.

    When we are discussing databases in the context of Generative AI, one of the first questions that always comes to mind is: “can’t I just tell the database what I need, instead of having to craft a complicated (and often multi-page) SQL query?”

    This is a very important use-case for LLMs. It turns out that existing LLMs like OpenAI’s GPT-4 or Anthropic’s Claude-2 can do a pretty decent job on simple Text-to-SQL tasks, but often fail with more complex use-cases. This is unfortunate, although I truly expect that Text-to-SQL will improve dramatically in the coming years and we will have a good way to run analytics by simply issuing natural language queries.

    In this blog post I’m going to focus on a different use-case, namely building RAG applications using data that resides in the database.

    For example, consider the following publicly available dataset of Airbnb listings for the city of Barcelona, which I’ve uploaded into a Snowflake instance, in two tables: listings and reviews.

    The listings table has 75 different columns, describing various aspects of each Airbnb listing in Barcelona, such as description, number of beds, and neighborhood, as well as the geographic coordinates: longitude and latitude.

    The reviews table has only 6 columns, including listing-ID (mapping the review to the listing it refers to in the listings table), reviewer name, date of review and “comments” (the main review text).

    User comments tend to include extremely valuable information that can be used to help future visitors to Barcelona. However, it is notoriously difficult for RDBMS systems and keyword-based indexes to tap into this type of user-generated content. We are going to build a RAG application, using Vectara, to allow users to ask questions such as:

    • What is the best museum for kids?
    • Which neighborhood has the best Tapas places?
    • Where can I find night clubs?

    You could imagine that Airbnb conversion rates would get much better if users could tap into this qualitative information as part of their property search process. It would be a win for Airbnb and a win for its users.

    Let’s get started.

    Ingesting Data into Vectara

    Our first step is to ingest the data from Snowflake into Vectara.

    Data within a database table is structured into columns, and when preparing data for use in generative AI, one has to consider the data schema and decide how to best prepare it for use in the context of RAG.

    So before we do any ingestion of data we need to design a “document construction plan”, whereby we decide how to translate each entity of interest in the database into a Vectara JSON document to be ingested.

    For example, in our case we will construct such a JSON document from each review (i.e. each row in the reviews table), which will include a title and a few text sections, and then add metadata fields to support filtering.

    Specifically:

    • We will construct a title for this document as follows: “Review by <REVIEWER_NAME> on <DATE>”
    • The first section will be a sentence we construct from some fields about the property: “<ROOM_TYPE> property in <NEIGHBOURHOOD_CLEANSED>, described as <DESCRIPTION>”
    • The second section will include the <COMMENTS> text (the actual review) about the property.

    There are many fields in the listings table that we can use as metadata, and for this demo we chose to include a few:

    • LATITUDE
    • LONGITUDE
    • DATE
    • NEIGHBOURHOOD_CLEANSED

    The plan we just outlined for handling each column demonstrates a common pattern when ingesting database tables for use within a RAG application:

    1. Some columns which are textual in nature, such as the “COMMENTS” column, are used directly, in this case as a section unto itself.
    2. One can construct text by creating an “artificial sentence” from one or more columns and their values. For example, the title is constructed this way, as is the second section.
    3. Some fields are used as metadata (like LONGITUDE and LATITUDE). This allows us to filter results using these fields.
    4. Some fields are just ignored and not used in the RAG flow.


    Although we’re dealing here with a database system like Snowflake or Redshift, it’s worth mentioning that this strategy of following a “document construction plan” applies equally if your data resides in CSV files or any other format that behaves similarly to structured data in a database.

    To ingest the data, we first create a new corpus in Vectara and define each of the metadata fields (date, longitude, latitude, and neighborhood) as filter attributes.
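
    The corpus and its filter attributes can be defined in the Vectara console, or via the admin API. As a rough illustration, here is a minimal sketch assuming Vectara's v1 create-corpus endpoint and an API key with admin permissions; the exact attribute field names and enum values below are assumptions, so verify them against the API documentation:

    import json
    import requests

    customer_id = "YOUR_CUSTOMER_ID"      # placeholder: your Vectara customer ID
    admin_api_key = "YOUR_ADMIN_API_KEY"  # placeholder: API key with admin permissions

    # Assumed v1 admin endpoint and header names; verify against the Vectara docs
    headers = {
        "customer-id": customer_id,
        "x-api-key": admin_api_key,
        "Content-Type": "application/json",
    }

    corpus = {
        "corpus": {
            "name": "airbnb-barcelona-reviews",
            "description": "Airbnb reviews for Barcelona, one document per review",
            # Attribute type/level enum values are assumptions based on the v1 admin API
            "filterAttributes": [
                {"name": "date", "indexed": True, "level": "FILTER_ATTRIBUTE_LEVEL__DOCUMENT", "type": "FILTER_ATTRIBUTE_TYPE__TEXT"},
                {"name": "longitude", "indexed": True, "level": "FILTER_ATTRIBUTE_LEVEL__DOCUMENT", "type": "FILTER_ATTRIBUTE_TYPE__REAL"},
                {"name": "latitude", "indexed": True, "level": "FILTER_ATTRIBUTE_LEVEL__DOCUMENT", "type": "FILTER_ATTRIBUTE_TYPE__REAL"},
                {"name": "neighborhood", "indexed": True, "level": "FILTER_ATTRIBUTE_LEVEL__DOCUMENT", "type": "FILTER_ATTRIBUTE_TYPE__TEXT"},
            ],
        }
    }

    res = requests.post("https://api.vectara.io/v1/create-corpus",
                        headers=headers, data=json.dumps(corpus))
    print(res.json())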

    Next, we use Snowflake’s Python connector to download the data from the tables into a pandas DataFrame:

    # Snowflake's Python connector provides connect(); fetch_pandas_all() requires
    # the pandas extras (pip install "snowflake-connector-python[pandas]")
    from snowflake.connector import connect

    con = connect(user=sf_user, password=sf_password, account=sf_account)
    cursor = con.cursor()
    cursor.execute("USE DATABASE AIRBNB;")

    # Join each review with its listing so every row carries the listing attributes we need
    query = '''
    SELECT
        DATE, REVIEWER_NAME, COMMENTS, R.ID as REVIEW_ID,
        LATITUDE, LONGITUDE, ROOM_TYPE, DESCRIPTION,
        NEIGHBOURHOOD_CLEANSED as NEIGHBORHOOD
    FROM REVIEWS AS R JOIN LISTINGS as L
    ON R.LISTING_ID = L.ID
    '''
    cursor.execute(query)
    df = cursor.fetch_pandas_all()

    Here we join the two tables REVIEWS and LISTINGS to get all the information we need for each review.

    And this is how we create one Vectara JSON document for each review:

    def add_chunk(df_chunk, bar) -> None:
        """Ingest a chunk of the reviews DataFrame into Vectara, one document per row."""
        session = requests.Session()
        adapter = requests.adapters.HTTPAdapter(max_retries=5)
        session.mount("https://", adapter)
        for row in df_chunk.to_dict(orient='records'):
            metadata = {
                'date': row['DATE'],
                'reviewer': row['REVIEWER_NAME'],
                'latitude': row['LATITUDE'],
                'longitude': row['LONGITUDE'],
                'neighborhood': row['NEIGHBORHOOD']
            }
            doc = {
                "documentId": f"Review {row['REVIEW_ID']}",
                "metadataJson": json.dumps(metadata),
                "title": f"Review by {row['REVIEWER_NAME']} on {row['DATE']}",
                "section": [
                    {'text': f"{row['ROOM_TYPE']} property in {row['NEIGHBORHOOD']}, described as {clean_md(row['DESCRIPTION'])}"},
                    {'text': clean_md(row['COMMENTS'])}
                ],
            }
            res = index_doc(session, doc)
            bar.update.remote(1)

    Note here how for each row in df_chunk we do the following:

    1. Create metadata including date, reviewer, latitude, longitude and neighborhood
    2. Create the doc variable, including documentId, title, metadata, and a section with two parts: the artificial sentence discussed above and the review comments.
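
    The clean_md() helper used in the snippet above isn't shown in this post; its role is simply to strip HTML tags and markdown markup from the listing description and review text before indexing. A minimal sketch of such a helper (the implementation in the full code may differ) could be:

    import re

    def clean_md(text) -> str:
        """Strip HTML tags and markdown characters, returning plain text (illustrative sketch)."""
        if not text:
            return ""
        text = re.sub(r"<[^>]+>", " ", str(text))    # drop HTML tags such as <br/>
        text = re.sub(r"[*_#>`~]+", "", text)        # drop common markdown characters
        return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace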

    The index_doc() function then indexes the document in Vectara by calling the indexing API; you can see the full code here.
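
    For reference, here is a minimal sketch of what index_doc() might look like, assuming Vectara's v1 indexing endpoint (https://api.vectara.io/v1/index) with the usual customer-id and x-api-key headers; the implementation in the linked code may differ slightly:

    import json

    # customer_id, corpus_id and api_key are assumed to be defined elsewhere
    # (e.g. loaded from environment variables or a config file)
    def index_doc(session, doc):
        """Index a single JSON document into the Vectara corpus (illustrative sketch)."""
        payload = {
            "customerId": customer_id,
            "corpusId": corpus_id,
            "document": doc,
        }
        headers = {
            "customer-id": str(customer_id),
            "x-api-key": api_key,
            "Content-Type": "application/json",
        }
        res = session.post("https://api.vectara.io/v1/index",
                           headers=headers, data=json.dumps(payload))
        return res.json()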

    An example of the resulting JSON document is:

    {
        "documentId": "Review 886755474280381042",
        "metadataJson": '{"date": "2023-05-08", "reviewer": "Mathias", "latitude": 41.37249, "longitude": 2.13603, "neighborhood": "Sants"}',
        "title": "Review by Mathias on 2023-05-08",
        "section": [
            {
                "text": "Private room property in Sants, described as The spaceBreakfast buffet not included 24 hour ReceptionHigh-speed Wi-Fi throughout the buildingRelax ZoneLuggage storage lockersMeeting RoomTablet with tourist informationAir conditioning and heatingSmart TV in all roomsDisabled accessible roomsNon smoking hotelGuest accessComfortable, modern, functional room with its own private bathroom, hairdryer, towels, ecological shampoo and soap and a smart safe. The room is equipped with quality comfortable beds, a 28-inch Smart TV, functional wardrobe (closet), a chair and desk, as well as two bedside tables with reading lights. It has high speed Wi-Fi, individually controlled air conditioning & heating as well as daily room cleaning.Registration numberHB-004659"
            },
            {"text": "Veldig bra oppholf"},
        ],
    }

    The add_chunk() function receives a set of rows from the DataFrame and creates a single JSON document per row. This way we can use Ray to easily parallelize and speed up the data ingest:

    import numpy as np
    import ray
    from ray.experimental import tqdm_ray   # Ray-friendly progress bar

    ray_workers = 10            # or choose your number of cores
    print(f"Using Ray with {ray_workers} workers")
    ray.init(num_cpus=ray_workers, ignore_reinit_error=True)
    remote_tqdm = ray.remote(tqdm_ray.tqdm)
    bar = remote_tqdm.remote(total=len(df))
    ray_add_chunk = ray.remote(add_chunk)

    # Split the DataFrame into one chunk per worker and ingest the chunks in parallel
    chunks = np.array_split(df, ray_workers)
    futures = [ray_add_chunk.remote(chunk, bar) for chunk in chunks]
    _ = ray.get(futures)

    And that’s it.

    After finishing this ingest process, we can now use Vectara to build an application for question-answering or a chatbot using this data.

    Asking Questions about Barcelona

    Okay, now that all the data has been ingested, we can try a few interesting queries. Airbnb reviews tend to include not just feedback on the properties themselves but also a lot of valuable information about the neighborhood, things to do, and much more.

    We have provided the full code for querying the data here.
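
    As a rough illustration of what the query code does, here is a minimal sketch using Vectara's v1 query API with summarization enabled; the request structure follows the v1 format, but treat the specific parameter values (and the ask() helper name) as assumptions and see the linked notebook for the real implementation:

    import json
    import requests

    # customer_id, corpus_id and api_key are assumed to be defined as before
    def ask(question: str, metadata_filter: str = "") -> str:
        """Run a summarized query against the corpus and return the generated answer (sketch)."""
        body = {
            "query": [{
                "query": question,
                "start": 0,
                "numResults": 10,
                "corpusKey": [{
                    "customerId": customer_id,
                    "corpusId": corpus_id,
                    "metadataFilter": metadata_filter,   # e.g. "doc.date < '2020-01-01'"
                }],
                "summary": [{"responseLang": "en", "maxSummarizedResults": 7}],
            }]
        }
        headers = {
            "customer-id": str(customer_id),
            "x-api-key": api_key,
            "Content-Type": "application/json",
        }
        res = requests.post("https://api.vectara.io/v1/query",
                            headers=headers, data=json.dumps(body))
        return res.json()["responseSet"][0]["summary"][0]["text"]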

    Let’s try a few questions:

    The first question is: “What is the best museum for kids?”

    Vectara’s query API responds with:

    “The best museum for kids in Barcelona appears to be the science museum, as mentioned by a visitor. It was the highlight of their trip according to their 6 and 8-year-old children. Additionally, the Recinte Modernista St Pau, MACBA, and Picasso Museum are also recommended. These museums offer engaging and educational experiences for children. Moreover, there are parks and playgrounds near the beach where kids can enjoy outdoor activities. Overall, these attractions provide a range of options for entertaining and educating children in Barcelona.”

    When we run this query: “Which neighborhood has the best Tapas places?”

    We get:

    “The neighborhood of Gracia in Barcelona is highly recommended for its excellent tapas places. Visitors have praised the great bars and restaurants in the area. Some specific tapas recommendations include El 58 and a place in Plaza del Sol. Gracia provides a charming and comfortable atmosphere, with easy access to attractions like Sagrada Familia and Parc Guell. The Estrella market nearby offers fresh fruit and traditional Catalan products. Overall, Gracia stands out as a vibrant neighborhood with fantastic tapas options and a convenient location.”

    This sounds so cool and yummy that I asked DALL-E 3 to generate an image of what this might look like. Here’s the result:

    How fun!

    Our final query is: “Where can I find night clubs?”

    With the response:

    “Night clubs can be found in various locations. One search result mentioned a night club called Club Pipos located under an apartment, which can cause noise disturbance. Another result mentioned a night club near a certain apartment, where noise from lively visitors could be heard until around 1-2 AM. Additionally, another search result mentioned a night club called APOLLO located near an apartment, causing potential noise issues until 5:30 AM on Thursdays, Fridays, and Saturdays. However, one review stated that with windows and shutters closed, the noise from the nearby nightclub was not a problem. It is important to consider the proximity of night clubs to potential accommodations and personal preferences for a peaceful stay.”

    Of course, we may be looking for recommendations from pre-pandemic times, so what if we add the filtering criterion:

    doc.date < '2020-01-01'
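
    With the ask() sketch above, this is simply passed as the metadataFilter string on the corpus key:

    # Restrict retrieval to pre-2020 reviews (continuing the hypothetical ask() helper)
    print(ask("Where can I find night clubs?", metadata_filter="doc.date < '2020-01-01'"))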

    With the same query we now get:

    “Based on the search results, there are several areas in Barcelona where you can find night clubs. One option is to stay in Sant Antoni, which has shops, restaurants, tapas bars, and nightclubs nearby. Another area is Example district, which is a 15-minute walk from the famous pedestrian boulevard, Rambla, and offers a vibrant nightlife scene. However, it’s worth noting that some accommodations near these areas mentioned that there might be some noise at night due to the proximity of nightclubs. Overall, these locations provide convenient access to nightlife options in Barcelona.”

    Conclusion

    A lot of enterprise data resides in structured database tables. In this blog post we looked at how to ingest this type of data into Vectara, and specifically at a common methodology for creating a Vectara “document” object from each row of a table to enable powerful semantic search, question answering, and conversational AI.

    The full code for this blog is available on GitHub: ingest (ingest-db.ipynb) and query (query-db.ipynb).

    To try Vectara with your own data, here are some helpful links to get started:

    1. Sign up for a free account if you don’t have one already
    2. Follow the quickstart guide to create a corpus and API key. Boomerang is enabled by default for new corpora.
    3. Ingest your data into the corpus using Vectara’s Indexing API or use the open source vectara-ingest project.
    4. If you need help, check out our forums or join our Discord server.