Ingesting Data into Vectara Using PyAirbyte

How to run custom transformations on data from any Airbyte data source ingested into Vectara

Introduction

In the previous blog post about the integration between Vectara and Airbyte, we learned how to use the Vectara destination connector for Airbyte to make data ingestion into Vectara from over 360 data sources simple, scalable, and reliable.

While this integration is easy to use for no-code developers, it does make some assumptions about how source data is mapped into Vectara’s document JSON structure. When you use the Vectara destination in the Airbyte dashboard, you map fields from the input source into Vectara document text elements or metadata fields, but you cannot transform the data along the way.

What do I mean by transformations? It could be editing the text coming from the source, concatenating two fields together, or even aggregating multiple records into a single Vectara document.

Here’s an example. Consider this dataset of hotel reviews, currently residing in Airtable. I would like to use the Airtable source connector to ingest this data into Vectara. The dataset has a rich list of fields describing each hotel review, such as the hotel’s name and address, as well as the actual review text.

Each review is in its own Airtable record (or row), and when I ingest this into Vectara I would like to aggregate all reviews of the same hotel into a single document. 

How can I do that?

This is where PyAirbyte, the new open-source Python package from Airbyte, comes into play. It provides access to the source data from within a Python environment, giving you maximum flexibility in ingesting the data into Vectara while performing any kind of transformation you need.

To see how this works, let’s dive into this example in more detail.

Ingesting Hotel Review Data from Airtable into Vectara

To get started with PyAirbyte, we first install it in our local Python environment:

pip install airbyte

Using the PyAirbyte Python package and the Airtable source connector, we can pull all the hotel review data into our Python environment.

Following the Airbyte instructions, we create an access token in Airtable that includes the needed permission scopes and point it to our “hotel review” table.

Then reading data using PyAirbyte is quite simple:

import airbyte as ab

# Replace the placeholder with your own Airtable API key
src = ab.get_source('source-airtable', config={'api_key': '<YOUR AIRTABLE API KEY>'})
streams = src.get_available_streams()
print(streams)

We see that the Airtable source connector stream is:

['hotel-reviews/imported_table/tblGpuhPqCt1T7vKE']

Now let’s create a Pandas dataframe from the input data. Here we use the get_documents() function of the source, with the stream name. Each returned Document object has a metadata field which contains a dictionary of the actual column names and values from the data, so we simply use this Python one-liner:

import pandas as pd
df = pd.DataFrame([x.metadata for x in src.get_documents(streams[0])])

And print the columns we have for each record:

 

print(df.columns)
['_airtable_id', '_airtable_created_time', '_airtable_table_name', 'address', 'categories', 'city', 'country', 'latitude', 'longitude', 'name', 'postalcode', 'province', 'reviews.date', 'reviews.dateadded', 'reviews.dorecommend', 'reviews.id', 'reviews.rating', 'reviews.text', 'reviews.title', 'reviews.usercity', 'reviews.username', 'reviews.userprovince']

Now comes the critical step: we want to choose only the columns we are interested in, and aggregate reviews for the same hotel into a single Vectara document before ingesting that data into our corpus. We will use Pandas’ agg() functionality as follows:

agg_df = df.groupby('name').agg({
    'city': 'first',  # Assuming city and country are the same for all entries of the same name
    'country': 'first',
    'reviews.rating': lambda values: list(values),
    'reviews.text': lambda texts: list(texts),
    'reviews.title': lambda titles: list(titles),
}).reset_index()
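
To make the effect of this aggregation concrete, here is a small sketch on made-up data. The three rows and the hotel names below are hypothetical stand-ins for the Airtable records, not part of the real dataset; the groupby and agg() call follows the same pattern as above.

```python
import pandas as pd

# Hypothetical sample rows, shaped like the hotel review records.
toy_df = pd.DataFrame([
    {"name": "Hotel A", "city": "Austin", "country": "US",
     "reviews.rating": 4.0, "reviews.text": "Clean rooms.", "reviews.title": "Good"},
    {"name": "Hotel B", "city": "Boston", "country": "US",
     "reviews.rating": 2.0, "reviews.text": "Noisy at night.", "reviews.title": "Loud"},
    {"name": "Hotel A", "city": "Austin", "country": "US",
     "reviews.rating": 5.0, "reviews.text": "Friendly staff.", "reviews.title": "Great"},
])

# Same aggregation pattern: one row per hotel, reviews collected into lists.
toy_agg = toy_df.groupby("name").agg({
    "city": "first",
    "country": "first",
    "reviews.rating": lambda values: list(values),
    "reviews.text": lambda texts: list(texts),
    "reviews.title": lambda titles: list(titles),
}).reset_index()

# Print each hotel with its collected ratings.
for row in toy_agg.to_dict(orient="records"):
    print(row["name"], row["reviews.rating"])
```

The two reviews for the first hotel collapse into a single row whose rating, text, and title columns each hold a list, which is the shape we need for building one document per hotel.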

Now let’s construct a single Vectara document for each hotel:

import json
import numpy as np

docs = [{
    'documentId': record['name'],
    'title': f"Reviews for the hotel {record['name']} in {record['city']}.",
    # Metadata is passed as a JSON string; the rating is the hotel's average
    'metadataJson': json.dumps({
        'city': record['city'],
        'country': record['country'],
        'rating': np.mean(record['reviews.rating']),
        'name': record['name']
    }),
    # One section per review
    'section': [
        {
            'title': f'Review for {record["name"]} with rating {rating} titled {title}: ',
            'text': text
        }
        for title, text, rating in zip(record['reviews.title'], record['reviews.text'], record['reviews.rating'])
    ],
} for record in agg_df.to_dict(orient='records')]

Here’s what this Vectara document looks like (for one of the hotels):

{'documentId': 'Ambassador Inn Albuquerque',
 'title': 'Reviews for the hotel Ambassador Inn Albuquerque in Albuquerque.',
 'metadataJson': '{"city": "Albuquerque", "country": "US", "rating": 3.2, "name": "Ambassador Inn Albuquerque"}',
 'section': [
   {'title': 'Review for Ambassador Inn Albuquerque with rating 1.0 titled Disappointing: ',
    'text': 'Cheap-quality room in industrial area. No restaurants or other services nearby. Hard bed, sloppy cleaning, few electrical outlets, lots of road noise. Breakfast is stale donuts and usually empty orange-juice dispenser. Bathroom has no towel bars or even hooks, and little space for toiletries. Window is cracked and hard to open or close. TV is old, with pocked glass. On... More'},
   {'title': 'Review for Ambassador Inn Albuquerque with rating 4.0 titled Small but clean Every thing we needed: ',
   'text': 'We enjoyed our stay. We were looking for a cheaper room during the Fiestea. We had a frige and microWave-you need to bring your own coffee pot and clock which we did. Our room had only shower no tub but that worked for us. The staff was very helpful. They had coffee and rolls in the lobby at 6:00 am... More'},
   {'title': 'Review for Ambassador Inn Albuquerque with rating 1.0 titled Scary Hotel: ',
    'text': 'I arrived late evening and pulled into the hotel parking lot. I was greeted by two homeless men sitting in the parking lot. The check In was quick and I had asked about the men in the parking lot. I was assured they lived in the hotel. I was not aware when I made my reservation that the Ambassador Inn... More'},
   {'title': 'Review for Ambassador Inn Albuquerque with rating 5.0 titled Excellent: ', 
    'text': 'We stayed there for one night during August 2013. After more than 2 weeks on the road sleeping every night in a different low budget motel we came upon this little gem which was by far the best we have encountered so far and at a very reasonable rate as well. The room was spotlessly clean and everything worked as... More'},
   {'title': 'Review for Ambassador Inn Albuquerque with rating 5.0 titled Great people and Great staff: ',
    'text': "Great people ! Great staff ! Great service . Very clean ! nothing in our room was taken or missing when we left . They clean very well housekeeping people are so nice . The A/C is so awesome its sooo nice and cold top of all that there's a fan that helps circulate the air . mini fridge with... More"}
  ]
}

We’ve done a few custom operations here:

  1. We’ve aggregated all the reviews for a single hotel and used the “section” part of the Vectara document to hold all these reviews in a single document.
  2. We’ve created a custom title for the whole document that clarifies these are reviews and includes the hotel name and city.
  3. We’ve added a title to each text section, alongside the actual review, that includes the hotel name, the title of the review and the rating.
  4. We’ve added hotel name, city, country and average rating as metadata.
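
Before ingesting, it can be worth running a quick sanity check over the constructed documents. The sketch below is one possible way to do that; check_vectara_doc() is a hypothetical helper written for this post, and its rules are our own assumptions about what a sensible document looks like, not Vectara’s validation logic.

```python
import json

def check_vectara_doc(doc):
    """Return a list of problems found in a document dict (empty if none)."""
    problems = []
    # These are the four keys used when building the documents above.
    for key in ("documentId", "title", "metadataJson", "section"):
        if key not in doc:
            problems.append(f"missing key: {key}")
    if problems:
        return problems
    if not str(doc["documentId"]).strip():
        problems.append("documentId is empty")
    # metadataJson should be a JSON string that decodes to an object.
    try:
        if not isinstance(json.loads(doc["metadataJson"]), dict):
            problems.append("metadataJson is not a JSON object")
    except (TypeError, ValueError):
        problems.append("metadataJson is not valid JSON")
    if not doc["section"]:
        problems.append("document has no sections")
    for i, section in enumerate(doc["section"]):
        if not str(section.get("text") or "").strip():
            problems.append(f"section {i} has empty text")
    return problems

# Made-up example documents, for illustration only.
good_doc = {
    "documentId": "Hotel A",
    "title": "Reviews for the hotel Hotel A in Austin.",
    "metadataJson": json.dumps({"city": "Austin", "rating": 4.5}),
    "section": [{"title": "Review for Hotel A", "text": "Clean rooms."}],
}
bad_doc = dict(good_doc, metadataJson="{'city': 'Austin'}", section=[])

print(check_vectara_doc(good_doc))
print(check_vectara_doc(bad_doc))
```

With the real data, you could run something like this over every entry in docs and skip or fix any document that reports problems.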

Now comes the final step: ingesting those documents into our corpus. We will be using a separate helper function called index_document(), which is available in this Vectara repository:

customer_id = "<VECTARA-CUSTOMER-ID>"
corpus_id = "<VECTARA-CORPUS-ID>"
api_key = "<VECTARA-API-KEY>"
for doc in docs:
    index_document(customer_id, corpus_id, api_key, doc)

And that’s it. All documents are now ingested into the corpus in the form we want them in (aggregated into a single document per hotel) and ready for querying.
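
If you are curious what such a helper might do under the hood, here is a rough sketch using only the Python standard library. It is not the code from the repository. The endpoint URL, the header names, and the request body layout are our assumptions about Vectara’s REST indexing API, and build_index_request() and index_document_sketch() are names made up for this illustration, so check the current API documentation before relying on any of it.

```python
import json
import urllib.request

# Assumption: the REST indexing endpoint; verify against the Vectara API docs.
VECTARA_INDEX_URL = "https://api.vectara.io/v1/index"

def build_index_request(customer_id, corpus_id, api_key, doc):
    """Build (but do not send) an HTTP request that indexes one document."""
    # Assumption: the body wraps the document together with customer and corpus IDs.
    body = {"customerId": customer_id, "corpusId": corpus_id, "document": doc}
    return urllib.request.Request(
        VECTARA_INDEX_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,             # assumed header name
            "customer-id": str(customer_id),  # assumed header name
        },
        method="POST",
    )

def index_document_sketch(customer_id, corpus_id, api_key, doc, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_index_request(customer_id, corpus_id, api_key, doc)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting the request construction from the network call keeps the first part easy to inspect and test without touching a real corpus.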

Chat with Hotel Reviews

Now that all the data is uploaded to the Vectara corpus, let’s use the open-source react-chatbot project to generate a chatbot interface. 

React-chatbot has a demo page that allows users to quickly test how it works. We use that page and edit the configuration to enter our customer ID, corpus ID and API key, then proceed to try the chat interface. Here is a screenshot from our chatbot demonstrating the multi-turn aspect of chat:

 

As shown in this screenshot, we asked “what is best about the Alexander autograph collection?” Even though the hotel name was misspelled as “Alexander”, Vectara’s NLP engine identifies the intention of the query and provides the correct result: “Alexandrian, Autograph collection”.

The follow-up question “what is worst about it?” refers to the Alexandrian hotel, and as you can see, Vectara’s chat capability correctly identifies that from the context and provides the correct response.

Conclusions

In this blog post we’ve seen how you can use PyAirbyte to ingest data from Airtable into Vectara while applying custom transformations before the data reaches the Vectara corpus, so that it is optimized for chatbot, question-answering, or any other type of RAG application.

In our example, we read hotel reviews from a table in Airtable (although clearly, this use case is quite similar in any tabular data store like Redshift, Snowflake, or Postgres), transformed the data using Python, and ingested the final data into Vectara. Then, we used Vectara’s RAG functionality to chat with this data.

If you are interested in a robust and easy-to-use path to ingesting data from any of Airbyte’s 360+ sources while having full control over how the data is ingested, try this exciting new option with PyAirbyte.

To get started, sign up for a free Vectara account if you don’t already have one. Then, follow the steps outlined in this blog post (full code is available in this Jupyter notebook) or modify them to fit your specific data source and application needs. If you have any questions, please feel free to join the Vectara Discord server or our discussion forums and ask us there.
