Retrieval Augmented Generation (RAG) Done Right: Document Stores

Introduction

Data has historically been synonymous with structured, numerical information. This paradigm, deeply ingrained in the foundations of data science and analytics, with SQL as the primary language to query such data, viewed text data as an outlier.

Over time there was a need to store such text data in a database. A good example is a text review of a product or a comment left by a guest about an AirBnB property.

With that, the so-called NoSQL or document databases, like Elasticsearch (or its open-source fork OpenSearch) and MongoDB, grew in popularity, addressing the need to store large-scale text efficiently and to process it for mission-critical enterprise applications.

In 2023, as Large Language Models (LLMs) surged in popularity, this changed almost overnight, and text became the new “primary” data of interest. Text is what LLMs consume during training, fine-tuning, and inference, and it is what drives retrieval-augmented generation (RAG) pipelines. Many organizations are starting to understand that they can finally realize the value embedded in their text data assets.

In a previous blog post we’ve shown how you can take text data from a cloud database like Snowflake and ingest specific text fields into Vectara.

Here we will show how to do this when the data is stored in Elasticsearch, and we will use our new Airbyte integration to perform the data movement.

The AirBnB Dataset

In the previous blog post we used the following publicly available dataset of AirBnB listings in Barcelona – so to make it interesting, this time we’ll use Paris instead.

The dataset includes two tables: listings and reviews. For this demo we focus on the reviews table only, which has 6 columns, including:

Listing ID (mapping the review to the listing it refers to)
Reviewer name (name of the person leaving the comments)
Date of review
Comments (review text)

Our first step is to upload this data into an instance of Elasticsearch and create an index called “reviews”.
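
As a rough illustration of this upload step, the sketch below converts CSV rows into the newline-delimited JSON body that Elasticsearch's _bulk endpoint accepts, targeting the “reviews” index. The column name listing_id, the helper name build_bulk_payload, and the two sample rows are made up for illustration; the real export may use different column names. Sending the payload to your cluster (for example with an HTTP POST to /_bulk) is left out.

```python
import csv
import io
import json


def build_bulk_payload(csv_text, index_name="reviews"):
    """Turn CSV rows into an NDJSON body for the Elasticsearch _bulk endpoint.

    Each document is preceded by an action line naming the target index.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(row, ensure_ascii=False))
    # The _bulk body must end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""


# Illustrative rows only; not taken from the actual dataset.
sample_csv = (
    "listing_id,reviewer_name,date,comments\n"
    "1001,Marie,2023-05-01,Lovely flat near the Louvre\n"
    '1002,Tom,2023-06-12,"Quiet street, great host"\n'
)

payload = build_bulk_payload(sample_csv)
print(payload)
```

For a large file you would typically send the rows in batches rather than as one payload.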

Ingesting Data into Vectara with Airbyte

Airbyte is an open-source tool for data movement, with 350+ connectors, including of course connectors for many popular document stores such as MongoDB and Elasticsearch.

Besides providing the connectivity to those databases out-of-the-box, Airbyte solves common data integration problems like incremental syncs and schema evolution in a single place in a consistent manner, so that you, the GenAI developer, don’t need to do this on your own.

We first install Airbyte on an EC2 instance (follow this guide), and then continue to set up our Elasticsearch-to-Vectara connection as follows:

Log in to Airbyte at localhost:8000, and click “Create your first connection” (if you don’t have any connections set up yet), or just the “New connection” button.
Select the Elasticsearch source connector, and set it up as follows:
- Under source name enter “elastic-demo” (or you can pick your own name).
- Under authentication method pick the Elastic authentication method that is appropriate for your case.
- Under Server endpoint enter the URL of your elastic server endpoint.
Select the Vectara destination and set it up as follows:
- Under customer ID, enter your Vectara customer ID.
- Under corpus name, pick “my-elastic-corpus” or another name that you would like to use for your Vectara corpus.
- Turn on the parallelize button to accelerate the data movement using multi-threading.
- Under Authentication, enter your Vectara OAuth2 client ID and client Secret.
- For fields to store as metadata enter: reviewer_name and date. By specifying these as metadata fields, filtering can be later done using these fields when querying Vectara.
- For text fields to index with Vectara, we simply choose the field comments, which contains the comments left by AirBnB reviewers.
Configure the connection:
- We will keep the default “scheduled” sync with “daily” update periods.
- For the stream, choose “incremental / append” for sync-mode. This choice ensures the connection performs incremental updates as new data is added on the Elasticsearch side.
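
To make the “incremental / append” idea concrete, here is a conceptual sketch of a cursor-based sync: each run appends only the records whose cursor value is newer than the one saved from the previous run. This is not Airbyte's implementation. The function name incremental_append, the use of date as the cursor field, and the sample records are assumptions made for illustration.

```python
def incremental_append(source_records, destination, cursor_field, last_cursor=None):
    """Append records newer than last_cursor and return the updated cursor.

    Conceptual illustration only; not how Airbyte is implemented.
    """
    new_cursor = last_cursor
    for record in source_records:
        value = record[cursor_field]
        if last_cursor is None or value > last_cursor:
            destination.append(record)
            if new_cursor is None or value > new_cursor:
                new_cursor = value
    return new_cursor


# Illustrative records; ISO dates compare correctly as plain strings.
source = [
    {"id": 1, "date": "2023-05-01", "comments": "first"},
    {"id": 2, "date": "2023-06-12", "comments": "second"},
]
dest = []

# First sync copies everything and remembers the latest date seen.
cursor = incremental_append(source, dest, "date")

# A new review arrives; the second sync appends only that record.
source.append({"id": 3, "date": "2023-07-03", "comments": "third"})
cursor = incremental_append(source, dest, "date", cursor)

print(len(dest), cursor)
```

Because the mode is append-only, a record that is edited in place without its cursor value advancing would not be picked up by a sync like this one.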

Once this configuration is finished, you can enable the connection and just wait for Airbyte to do its work – indexing all of your Elasticsearch data from the reviews index into Vectara. Because Vectara provides a full end-to-end RAG solution, this includes all the steps needed for ingest: chunking, embedding, storage in the Vectara platform, and making data ready for retrieval at query time.
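
The metadata and text field settings chosen in the destination setup can be pictured as a simple split of each Elasticsearch record into a text part (to be indexed) and a metadata part (to be available for filtering). The sketch below shows that idea only; the function name split_record, the output shape, and the sample record are hypothetical and do not reflect the connector's internal code.

```python
def split_record(record, text_fields, metadata_fields):
    """Separate a source record into indexable text and filterable metadata.

    Hypothetical helper for illustration; not the connector's actual code.
    """
    text = "\n".join(str(record[f]) for f in text_fields if record.get(f))
    metadata = {f: record[f] for f in metadata_fields if f in record}
    return {"text": text, "metadata": metadata}


# Illustrative record using the field names from the setup steps.
review = {
    "listing_id": 1001,
    "reviewer_name": "Marie",
    "date": "2023-05-01",
    "comments": "Lovely flat near the Louvre",
}

doc = split_record(
    review,
    text_fields=["comments"],
    metadata_fields=["reviewer_name", "date"],
)
print(doc)
```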

Asking Questions About Paris AirBnB Reviews

Now that the data is in the Vectara corpus, let’s see how we can use Vectara’s RAG to ask questions based on the data in the reviews.

I can run example queries in the Vectara console, query directly using the Vectara API, or use the vectara-answer tool to build a stand-alone application.

Our first question is “is there a good vegetarian restaurant near champs elisee?”

Notice that here I intentionally misspelled “Champs-Élysées” as “champs elisee”, but this does not deter Vectara’s powerful neural-based GenAI engine: it understands what I mean and provides a very helpful response, which in particular mentions La Bonne Heure:

Yes, there are good vegetarian and vegan restaurants near the Champs-Élysées. One recommended place is “Le Bonne Heure,” which has been praised for its excellent vegetarian food and friendly staff. Additionally, “Jah Jah By Le Tricycle” is noted for its fantastic vegan food and is located a few doors down from another recommended spot. These options suggest that you will find quality vegetarian dining in the vicinity of the Champs-Élysées.

Next, let’s try the French translation of the query “which museum is best for children?”, namely “quel musée est le meilleur pour les enfants?”. Vectara’s multilingual capabilities and support for 100+ languages kick in, with the following response:

Based on the search results, several museums in Paris are appreciated by children. The Musee D’Orsay and Versailles are mentioned as places that kids loved. The Picasso Museum is highlighted as very suitable for a family with a child. The evolution and dinosaur museum, presumably the Gallery of Paleontology and Comparative Anatomy in the Jardin des Plantes, is noted for being especially liked by kids. The Ateliers des Lumières is recommended for its mesmerizing art installations for children. Additionally, the Cité des Sciences et de l’Industrie at Parc de la Villette is suggested for its child-friendly exhibits and activities.

In this case, the response is in English, as I requested in the query (using the responseLang parameter), but you can also specify any other language in the query API, or specify “auto” – asking Vectara to detect the language of the query and respond in that language.
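
As a hedged sketch of what such a query request might look like, the snippet below builds a JSON body that sets responseLang and, optionally, a metadata filter on one of the fields stored during ingest. Only responseLang comes from this post; the other field names (corpusKey, customerId, corpusId, metadataFilter, numResults, summary, maxSummarizedResults), the filter expression syntax, and the placeholder IDs are assumptions modeled on Vectara's query API and should be checked against the current API reference before use. The request is only constructed here, not sent.

```python
import json


def build_query_body(question, customer_id, corpus_id, response_lang="en",
                     metadata_filter=None, num_results=10):
    """Assemble a query request body (assumed field names; verify in the API docs)."""
    corpus_key = {"customerId": customer_id, "corpusId": corpus_id}
    if metadata_filter:
        corpus_key["metadataFilter"] = metadata_filter
    return {
        "query": [
            {
                "query": question,
                "numResults": num_results,
                "corpusKey": [corpus_key],
                # responseLang controls the language of the generated answer.
                "summary": [
                    {"responseLang": response_lang, "maxSummarizedResults": 5}
                ],
            }
        ]
    }


# Placeholder IDs and an assumed filter expression, for illustration only.
body = build_query_body(
    "quel musée est le meilleur pour les enfants?",
    customer_id=1234567,
    corpus_id=1,
    response_lang="en",
    metadata_filter="doc.date >= '2023-01-01'",
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Passing response_lang="auto" instead would correspond to asking for a response in the detected language of the query.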

Conclusion

Text data is becoming increasingly important, driving powerful semantic search and GenAI applications using enterprise text data resources.

In this blog post, we looked at how simple it is to ingest textual data, specifically AirBnB reviews for the city of Paris, from an Elasticsearch instance into Vectara, using the new Airbyte Vectara connector.

To try Vectara with your own Elasticsearch instance:

Sign up for a free account if you don’t have one already.
Follow the quickstart guide to create a corpus and API key.
Use Airbyte to ingest your data into the corpus using the Vectara Destination Component.

If you need help, check out our forums and Discord server.