Building GenAI Applications with Vectara and Unstructured

Introduction

Vectara’s serverless RAG (Retrieval Augmented Generation) platform provides an easy-to-use API for building RAG pipelines and chat applications that are enterprise-ready and scalable to any number of documents.

Out of the box, Vectara’s API supports two ways to ingest data into a Vectara corpus.

You can extract data from any enterprise data source (such as a database or an enterprise application) and index the relevant text directly into your corpus via the Standard Indexing API.
You can also use Vectara’s FILE_UPLOAD API to ingest files such as PDF, PPT, DOC, markdown, or HTML documents (see the full list of file types supported) directly into the Vectara corpus.
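To make the first option more concrete, here is a minimal sketch of the kind of JSON body one might send to the Standard Indexing API. The field names (customerId, corpusId, document, documentId, section) are assumptions based on our reading of Vectara's v1 REST API and should be checked against the current API reference. build_index_request is a hypothetical helper of our own, not part of any SDK, and all the values are placeholders.

```python
import json


def build_index_request(customer_id: int, corpus_id: int, doc_id: str,
                        title: str, paragraphs: list[str]) -> dict:
    """Assemble a JSON-serializable body describing one document.

    Assumption: the field names mirror Vectara's v1 indexing API;
    verify them against the API reference before relying on this.
    """
    return {
        "customerId": customer_id,
        "corpusId": corpus_id,
        "document": {
            "documentId": doc_id,
            "title": title,
            # One section per non-empty paragraph of extracted text.
            "section": [{"text": p.strip()} for p in paragraphs if p.strip()],
        },
    }


if __name__ == "__main__":
    body = build_index_request(
        customer_id=1234567,        # placeholder value
        corpus_id=1,                # placeholder value
        doc_id="cfpb-annual-2023",  # hypothetical document ID
        title="CFPB annual report 2023",
        paragraphs=["First paragraph of text.", "", "Second paragraph."],
    )
    print(json.dumps(body, indent=2))
```

The sketch only builds the request body; sending it requires an authenticated HTTP POST, which is omitted here.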

Unstructured is a Python library that provides advanced preprocessing for a wide range of file types. It transforms complex natural language data into text, which simplifies data ingestion in RAG pipelines. The recent Vectara connector to Unstructured provides yet another robust alternative for ingesting text from files into Vectara.
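To get a feel for what Unstructured does on its own, before anything is sent to Vectara, here is a small sketch that uses the library's partition function to turn a file into plain text. The helper names (elements_to_text, extract_text) are our own and purely illustrative. The sketch assumes the unstructured package is installed, and it imports the library lazily so the text-joining helper can be read and tried in isolation.

```python
def elements_to_text(elements) -> str:
    """Join the text of each element, skipping empty ones."""
    return "\n\n".join(
        el.text.strip() for el in elements if el.text and el.text.strip()
    )


def extract_text(path: str, strategy: str = "auto") -> str:
    """Partition a file with Unstructured and return its text."""
    # Imported here so elements_to_text works without the library installed.
    from unstructured.partition.auto import partition

    elements = partition(filename=path, strategy=strategy)
    return elements_to_text(elements)


# Example call (the path is hypothetical):
# text = extract_text("reports/example.pdf", strategy="hi_res")
```

The CLI used in the rest of this post performs this partitioning step for every file in the input folder, so you do not need to call the library directly.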

For this blog post, we will use seven reports by the Consumer Financial Protection Bureau (CFPB), a regulatory agency of the US government responsible for consumer protection in the financial sector. These reports, from 2023 and 2024, include the full 2023 CFPB annual report and specific reports about student loans and the mortgage market.

We will use the Unstructured Ingest CLI and build a question-answering demo based on this data using create-ui.

Using the Unstructured Ingest CLI

We placed our PDF files in a local folder for simplicity, but what we show below works just as easily with other sources, such as an AWS S3 bucket, Azure Blob Storage, or Google Cloud Storage (GCS).

Before we start, we need two items from Vectara’s OAuth 2.0 configuration: the Client ID and the Client Secret. You can copy those from your Vectara console, under “API access”.

Clicking the “Copy” button copies the Client ID to your clipboard, and the dropdown menu on the right lets you copy the Client Secret. You also need your Vectara Customer ID, so copy that from your account console as well.
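You do not need to write any authentication code to run the CLI, but it can help to see what a Client ID and Client Secret are used for. Below is a minimal sketch of a standard OAuth 2.0 client-credentials token request, built with Python's standard library. The token URL is passed in as a parameter because we are not assuming any particular endpoint here; build_token_request is a hypothetical helper, and the example values are placeholders.

```python
import urllib.parse
import urllib.request


def build_token_request(token_url: str, client_id: str,
                        client_secret: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth 2.0 client-credentials request."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Sending the request would look like this (placeholder URL and values):
# req = build_token_request("https://auth.example.com/oauth2/token",
#                           "<CLIENT-ID>", "<CLIENT-SECRET>")
# with urllib.request.urlopen(req) as resp:
#     token_response = resp.read()
```

Keeping the secret out of source code (for example, in an environment variable) is a good habit regardless of how the request is sent.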

First, we install Unstructured. To do a clean install, you can use Conda:

conda create -n unst python=3.11
conda activate unst
pip install "unstructured[local-inference]" httpx

Our PDF files are in the “/Users/ofer/dev/data/cfpb-reports” folder, so all we have to do to ingest them into Vectara is run the following command:

unstructured-ingest \
  local \
  --input-path "/Users/ofer/dev/data/cfpb-reports" \
  --strategy "hi_res" \
  vectara \
  --oauth-client-id "<VECTARA-OAUTH-CLIENT-ID>" \
  --oauth-secret "<VECTARA-OAUTH2-SECRET>" \
  --customer-id "<VECTARA-CUSTOMER-ID>" \
  --corpus-name "GenAI-demo"

This runs the unstructured-ingest command with the following arguments:

The source connector is specified as local, and the “--input-path” argument points to the folder path. The “--strategy” argument selects Unstructured’s “hi_res” partitioning strategy. Note that the source can also be a folder on S3 or GCS, as pointed out above, in which case you would use the corresponding source connector instead of local.
The destination connector is specified as vectara with four arguments: the OAuth client ID and secret, the Vectara Customer ID, and the name of the corpus you want to use in Vectara. If a corpus by that name already exists in your account, it will be used; otherwise, a new corpus with that name will be created for you automatically.
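If you prefer to launch the ingest from a script rather than typing secrets on the command line, the same command can be assembled in Python. This is only a sketch: the flags are exactly the ones used in the command above, while the environment variable names (VECTARA_OAUTH_CLIENT_ID, VECTARA_OAUTH_SECRET, VECTARA_CUSTOMER_ID) and the build_ingest_command helper are our own assumptions for illustration.

```python
import os
import subprocess
from typing import Mapping


def build_ingest_command(input_path: str, corpus_name: str,
                         env: Mapping[str, str] = os.environ) -> list[str]:
    """Assemble the unstructured-ingest argument list.

    Credentials are read from environment variables whose names are
    hypothetical; pick whatever naming fits your setup.
    """
    return [
        "unstructured-ingest",
        # Source connector and its options.
        "local",
        "--input-path", input_path,
        "--strategy", "hi_res",
        # Destination connector and its options.
        "vectara",
        "--oauth-client-id", env["VECTARA_OAUTH_CLIENT_ID"],
        "--oauth-secret", env["VECTARA_OAUTH_SECRET"],
        "--customer-id", env["VECTARA_CUSTOMER_ID"],
        "--corpus-name", corpus_name,
    ]


if __name__ == "__main__":
    cmd = build_ingest_command("/Users/ofer/dev/data/cfpb-reports", "GenAI-demo")
    # check=True raises an error if the CLI exits with a non-zero status.
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and a missing environment variable fails early with a KeyError instead of sending an empty credential.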

Querying the data

Now that the data is ingested into Vectara, we can issue queries and chat with the data using Vectara’s Query API, using a tool like vectara-answer or create-ui.
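For readers curious about what such tools send under the hood, here is a rough sketch of a query request body. The field names (query, numResults, corpusKey, summary, maxSummarizedResults, responseLang) are assumptions based on our reading of Vectara's v1 Query API and should be verified against the API reference. build_query_request is a hypothetical helper, and the IDs are placeholders.

```python
import json


def build_query_request(question: str, customer_id: int, corpus_id: int,
                        num_results: int = 10,
                        max_summarized: int = 5) -> dict:
    """Assemble a JSON-serializable query body with a summary request.

    Assumption: the field names follow Vectara's v1 Query API.
    """
    return {
        "query": [{
            "query": question,
            "numResults": num_results,
            "corpusKey": [{"customerId": customer_id, "corpusId": corpus_id}],
            # Ask for a generated summary over the top results.
            "summary": [{
                "maxSummarizedResults": max_summarized,
                "responseLang": "en",
            }],
        }]
    }


if __name__ == "__main__":
    body = build_query_request(
        "What are the risks with student loans?",
        customer_id=1234567,  # placeholder value
        corpus_id=1,          # placeholder value
    )
    print(json.dumps(body, indent=2))
```

In practice, vectara-answer and create-ui build and send these requests for you, which is why we use create-ui below.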

Let’s try an example with create-ui. To use it, we first make sure Node.js and npm are installed, and then install and run create-ui with:

npx @vectara/create-ui

This installs and runs the package, walking you through a 3-step installation process.

First, you can select the type of GenAI UI you want to use. In this case we choose the “Question answering” variant.
Then we choose “Use my own data” in order to point the create-ui application to the CFPB documents already ingested into Vectara.
After providing a name for the application (we named it “cfpb”) and entering the customer ID, corpus ID, and API key, you can define a few questions that are pre-populated in the application, like “What are the risks with student loans?” or “What is the CFPB?”.

That’s it! create-ui generates the application in the “cfpb” folder and all you have to do is run it in that folder:

cd cfpb
npm install
npm run start

Now let’s try “What are the risks with student loans?”

The output is a generative summary based on the CFPB documents ingested, along with the list of citations that the application used to generate this summary, a great example of Retrieval Augmented Generation at its best.

Conclusions

Vectara provides a serverless RAG-as-a-service platform for building trusted and scalable GenAI applications. Using Unstructured’s capabilities to ingest data from various data sources allows data engineers and software developers to focus on building their GenAI application without being mired in the complexity of data movement.

Want to try it with your own data? It’s super easy.

First, sign up for a free Vectara account if you don’t have one already. Then follow the instructions in this blog post and explore the power and wonder of building an application to chat with your own data in minutes.

For more details about using Vectara, check out the documentation or join our forums or Discord community. You can visit unstructured.io to learn more about its data ingest capabilities, or join the Unstructured community Slack.