Ingesting Data into Vectara Using PyAirbyte

Introduction

In the previous blog post about the integration between Vectara and Airbyte, we learned how to use the Vectara destination connector for Airbyte to make data ingestion into Vectara from over 360 data sources simple, scalable, and reliable.

While this integration is easy to use for no-code developers, it does make some assumptions about how source data is mapped into Vectara’s document JSON structure. When you use the Vectara destination in the Airbyte dashboard, you map fields from the input source into Vectara document text elements or metadata fields, but you cannot apply any transformations along the way.

What do I mean by transformations? It could be editing the text coming from the source, concatenating two fields together, or even aggregating multiple records into a single Vectara document.

Here’s an example. Consider a dataset of hotel reviews, currently residing in Airtable. I would like to use the Airtable source connector to ingest this data into Vectara. The dataset has a rich list of fields describing each hotel review, such as the hotel’s name and address, as well as the actual review text.

Each review is in its own Airtable record (or row), and when I ingest this into Vectara I would like to aggregate all reviews of the same hotel into a single document.

How can I do that?

This is where PyAirbyte, the new open-source Python package from Airbyte, comes into play. It provides access to the source data from within a Python environment, giving you maximum flexibility in ingesting the data into Vectara while performing any kind of transformation you need.

To see how this works, let’s dive into this example in more detail.

Ingesting Hotel Review Data from Airtable into Vectara

To get started with PyAirbyte, we first install it in our local Python environment (the package is published on PyPI as airbyte, so pip install airbyte does the job).

Using the PyAirbyte Python package and the Airtable source connector, we can pull all the hotel review data into our Python environment.

Following the Airbyte instructions, we create an access token in Airtable that includes the needed permission scopes and gives it access to our “hotel review” table.

Then reading data using PyAirbyte is quite simple:
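As a minimal sketch of this step: the code below connects to Airtable through PyAirbyte and lists the available streams. The config shape (personal access token passed as credentials) is an assumption about the Airtable connector's spec, and the function names build_airtable_config and list_airtable_streams are hypothetical helpers introduced here for illustration.

```python
from typing import Any


def build_airtable_config(access_token: str) -> dict[str, Any]:
    """Build the connector config (assumed shape: personal access token auth)."""
    return {"credentials": {"auth_method": "api_key", "api_key": access_token}}


def list_airtable_streams(access_token: str) -> list[str]:
    """Connect to Airtable via PyAirbyte and return the available stream names."""
    import airbyte as ab  # PyAirbyte; imported lazily so this module loads without it

    source = ab.get_source(
        "source-airtable",
        config=build_airtable_config(access_token),
        install_if_missing=True,  # install the connector on first use
    )
    source.check()  # verify that the credentials and connection work
    return source.get_available_streams()
```

Calling list_airtable_streams() with a valid token would return the stream names that the connector exposes for the tables the token can access.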

We see that the Airtable source connector stream is:

['hotel-reviews/imported_table/tblGpuhPqCt1T7vKE']

Now let’s create a Pandas dataframe from the input data. Here we use the get_documents() function of the source, with the stream name. Each returned Document object has a metadata field which contains a dictionary of the actual column names and values from the data, so we simply use this Python one-liner:

And print the columns we have for each record:
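The two steps above (building the dataframe and listing its columns) can be sketched as follows. To keep the example runnable without an Airtable connection, it uses stand-in objects in place of the Document objects returned by get_documents(); the column names and values are hypothetical, not the real dataset's.

```python
from types import SimpleNamespace

import pandas as pd

# Stand-ins for the Document objects yielded by source.get_documents(stream_name).
# Only the .metadata attribute matters here; names and values are hypothetical.
docs = [
    SimpleNamespace(metadata={"name": "Hotel A", "city": "Springfield", "rating": 5}),
    SimpleNamespace(metadata={"name": "Hotel B", "city": "Shelbyville", "rating": 3}),
]

# The one-liner: one dataframe row per record, one column per metadata key.
df = pd.DataFrame([doc.metadata for doc in docs])

# List the columns available for each record.
print(df.columns.tolist())
```

With the real source, docs would be replaced by the iterable returned from get_documents() for the stream name shown earlier.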

Now comes the critical step: we want to choose only the columns we are interested in, and aggregate reviews for the same hotel in a single Vectara document before ingesting that data into our corpus. We will use Pandas’ agg() functionality as follows:
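A sketch of what such an aggregation might look like, using pandas named aggregation on a tiny made-up dataframe. The column names (name, city, country, title, review, rating) are assumptions for illustration and may differ from the actual Airtable fields.

```python
import pandas as pd

# Hypothetical rows; the real table has many more columns and records.
df = pd.DataFrame(
    {
        "name": ["Hotel A", "Hotel A", "Hotel B"],
        "city": ["Springfield", "Springfield", "Shelbyville"],
        "country": ["US", "US", "US"],
        "title": ["Great stay", "Noisy room", "Fine"],
        "review": ["Loved the pool.", "Thin walls.", "Nothing special."],
        "rating": [5, 3, 4],
        "unused_column": ["x", "y", "z"],
    }
)

# Keep only the columns of interest, then collapse to one row per hotel.
columns = ["name", "city", "country", "title", "review", "rating"]
hotels = (
    df[columns]
    .groupby("name", as_index=False)
    .agg(
        city=("city", "first"),         # hotel-level fields: take the first value
        country=("country", "first"),
        titles=("title", list),         # review-level fields: collect into lists
        reviews=("review", list),
        ratings=("rating", list),
        avg_rating=("rating", "mean"),  # average rating across the hotel's reviews
    )
)

print(hotels)
```

Collecting the review-level fields into lists keeps each review's title, text, and rating aligned by position, which makes it straightforward to turn them into document sections later.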

Now let’s construct a single Vectara document for each hotel:
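One way this construction could look is sketched below. The field names documentId, title, metadataJson, and section follow Vectara's structured document format as I understand it, and the input keys, the helper name build_hotel_document, and the sample values are hypothetical.

```python
import json


def build_hotel_document(hotel: dict) -> dict:
    """Turn one aggregated hotel record into a Vectara-style document dict."""
    name, city = hotel["name"], hotel["city"]

    # Hotel-level metadata attached to the whole document.
    metadata = {
        "hotel_name": name,
        "city": city,
        "country": hotel["country"],
        "avg_rating": round(hotel["avg_rating"], 2),
    }

    # One section per review, with a descriptive title alongside the review text.
    sections = [
        {"title": f"{name} review: {title} (rating: {rating})", "text": text}
        for title, text, rating in zip(
            hotel["titles"], hotel["reviews"], hotel["ratings"]
        )
    ]

    return {
        "documentId": f"reviews-{name}-{city}".lower().replace(" ", "-"),
        "title": f"Reviews of {name} in {city}",
        "metadataJson": json.dumps(metadata),
        "section": sections,
    }


# Hypothetical aggregated record for a single hotel.
example = build_hotel_document(
    {
        "name": "Hotel A",
        "city": "Springfield",
        "country": "US",
        "titles": ["Great stay", "Noisy room"],
        "reviews": ["Loved the pool.", "Thin walls."],
        "ratings": [5, 3],
        "avg_rating": 4.0,
    }
)
print(json.dumps(example, indent=2))
```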

Here’s what this Vectara document looks like (for one of the hotels):

We’ve done a few custom operations here:

We’ve aggregated all the reviews for a single hotel, and used the “section” part of the Vectara document to hold all these reviews in a single document.
We’ve created a custom title for the whole document that clarifies these are reviews and includes the hotel name and city.
We’ve added a title to each text section, alongside the actual review, that includes the hotel name, the title of the review and the rating.
We’ve added hotel name, city, country and average rating as metadata.

Now comes the final step: ingesting those documents into our corpus. We will be using a separate helper function called index_document() which is available in this Vectara repository.
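The helper itself lives in the repository and is not reproduced here. As a rough, hypothetical sketch of what such a helper might do, the code below prepares and sends a POST request to Vectara's indexing endpoint. The endpoint URL, header names, and request body fields are assumptions based on Vectara's v1 REST API, and build_index_request and index_document_sketch are illustrative names, not the repository's actual functions.

```python
import json
import urllib.request

# Assumed: Vectara's v1 indexing endpoint.
VECTARA_INDEX_URL = "https://api.vectara.io/v1/index"


def build_index_request(
    document: dict, customer_id: int, corpus_id: int, api_key: str
) -> urllib.request.Request:
    """Prepare (but do not send) the HTTP request that indexes one document."""
    body = {
        "customer_id": customer_id,
        "corpus_id": corpus_id,
        "document": document,
    }
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "customer-id": str(customer_id),
        "x-api-key": api_key,
    }
    return urllib.request.Request(
        VECTARA_INDEX_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def index_document_sketch(
    document: dict, customer_id: int, corpus_id: int, api_key: str
) -> int:
    """Send the indexing request and return the HTTP status code."""
    request = build_index_request(document, customer_id, corpus_id, api_key)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Looping over the per-hotel documents and calling a function like this once per document is enough to ingest the whole dataset; in practice you would also want error handling and retries, which the sketch leaves out.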

And that’s it. All documents are now ingested into the corpus in the form we want them in (aggregated into a single document per hotel) and ready for querying.

Chat with Hotel Reviews

Now that all the data is uploaded to the Vectara corpus, let’s use the open-source react-chatbot project to generate a chatbot interface.

React-chatbot has a demo page that allows users to quickly test how it works. We use that page and edit the configuration to enter our customer ID, corpus ID and API key, then proceed to try the chat interface. Here is a screenshot from our chatbot demonstrating the multi-turn aspect of chat:

As shown in this screenshot, we asked “what is best about the Alexander autograph collection?” Even though the hotel name was misspelled as “Alexander”, Vectara’s NLP engine identifies the intention of the query and provides the correct result – “Alexandrian, Autograph collection”.

The follow-up question “what is worst about it?” refers to the Alexandrian hotel, and as you can see, Vectara’s chat capability correctly identifies that from the context and provides the correct response.

Conclusions

In this blog post, we’ve seen how you can use PyAirbyte to ingest data from Airtable into Vectara while applying custom transformations before the data reaches the Vectara corpus, so that it is optimized for a chatbot, question-answering, or any other type of RAG application.

In our example, we read hotel reviews from a table in Airtable (although clearly, this use case is quite similar in any tabular data store like Redshift, Snowflake, or Postgres), transformed the data using Python, and ingested the final data into Vectara. Then, we used Vectara’s RAG functionality to chat with this data.

If you are interested in a robust and easy-to-use path to ingesting data from any of Airbyte’s 360+ sources while having full control over how the data is ingested, try this exciting new option with PyAirbyte.

To get started, sign up for a free Vectara account if you don’t already have one. Then, follow the steps outlined in this blog post (full code is available in this Jupyter notebook) or modify them to fit your specific data source and application needs. If you have any questions, please feel free to join the Vectara Discord server or our discussion forums and ask us there.