Vectara-ingest: Data Ingestion made easy
A collection of crawlers for the Vectara community, making crawling and indexing documents quick and easy
Introduction
Vectara provides an easy-to-use, managed platform for building LLM-powered conversational search applications with your data.
Developing such an application typically includes the following steps:
- Retrieving the data or content from its source (a website, Notion, Jira, etc.)
- Using Vectara’s indexing API to ingest that content into a Vectara corpus
- Building a search user interface that calls Vectara’s Search API with a user query and displays the results to the user.
Indexing documents is very straightforward, and thanks to our Instant Index capability, indexed data is available to query within seconds.
But how does one extract all this content from those sources in the first place?
For a website, you need to understand the latest in web crawling and HTML extraction; for sources with an API, like Jira, Notion, or Discourse, you need to learn the detailed nuances of each of those APIs. This can quickly become overwhelming.
This is why I’m excited to announce the release of vectara-ingest, an open source project that includes a set of reusable code for crawling data sources and indexing the extracted content into Vectara corpora.
Getting Started with vectara-ingest
So how do you index a data source with vectara-ingest?
Let’s start with an example: we will crawl and index the content of sf.gov – the website for the city and county of San Francisco.
First, we set up our environment (instructions shown for macOS; for other environments, see here):
- Install Docker if it’s not yet installed
- Install Python 3.8 or above if it’s not yet installed
- Install the yq command if it's not yet installed (brew install yq)
- Clone the repo: git clone https://github.com/vectara/vectara-ingest, then cd vectara-ingest
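On macOS, the last two steps boil down to:

```bash
brew install yq
git clone https://github.com/vectara/vectara-ingest
cd vectara-ingest
```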
Next let’s open the Vectara console and set up a corpus for this crawl job.
Figure 1: Create new corpus
We call our corpus “sf” and provide a simple description. We can then click the “Create” button and the corpus is ready.
In the new corpus view, we generate an API key for indexing and searching:
Figure 2: Create an API key for the corpus
Now that the corpus is ready, we can create a new YAML configuration file, config/sf.yaml, for our crawl job.
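Based on the settings reviewed below, a minimal sf.yaml looks roughly like this (a sketch: the key names follow the post's description, and the values are placeholders, so check the vectara-ingest docs for the exact format):

```yaml
vectara:
  customer_id: 1234567890      # your Vectara customer ID
  corpus_id: 7                 # the corpus_id shown on the "sf" corpus page
crawling:
  crawler_type: website        # use the website crawler
website_crawler:
  website_homepage: "https://www.sf.gov"   # the target website
  pages_source: sitemap        # discover pages via the site's sitemap
  extraction: pdf              # render each page to PDF before indexing
  delay: 1                     # seconds to wait between URL extractions
```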
Let’s review the contents of this configuration file:
- The vectara section provides the information about our Vectara account and corpus – in this case the corpus_id (found in the top right of this specific corpus page in the console) and the customer_id.
- The crawling section has a single parameter, crawler_type. In our case we select the "website" crawler type (see here for the list of all available crawlers).
- The website_crawler section provides specific parameters for this crawl job:
  - We specify the target website URL with the website_homepage parameter.
  - In the pages_source parameter we choose the sitemap crawling technique.
  - We choose the PDF method for rendering website content by setting the extraction parameter to pdf.
  - We specify a 1-second delay between URL extractions to make sure we don't overload the sf.gov website.
Vectara-ingest uses a secrets.toml file to hold secrets that are not part of the code base, such as API keys. In this case we add a specific profile called "sf", and store under this profile the Vectara auth_url and the api_key we created earlier.
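As an illustration (key names as described above, placeholder values rather than real credentials), the relevant section of secrets.toml would look something like:

```toml
[sf]
auth_url = "<YOUR_VECTARA_AUTH_URL>"   # authentication URL from the Vectara console
api_key  = "<YOUR_API_KEY>"            # the API key created earlier
```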
To run the crawl job, we use the run.sh script.
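Per the repo's README, the script takes the configuration file and the name of the secrets profile as arguments, so for this example the command would look something like:

```bash
bash run.sh config/sf.yaml sf
```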
This builds the Docker image, creates a Docker container (called vingest), runs that container with the sf.yaml file we provided, and kicks off the crawl job. If you want to track progress, you can follow the log messages from the running Docker container with:
docker logs -f vingest
Once the job is finished, we can use the Vectara console to explore the results by trying out some search queries:
Figure 3: Searching sf.gov from the Vectara console
How does a vectara-ingest crawler work?
Now that we’ve seen how to run a crawl job using vectara-ingest, let’s look at an example crawler (the RSS crawler) to better understand how it works internally.
The RSS crawler retrieves a list of URLs from an RSS feed and ingests the documents pointed to by these URLs into a Vectara corpus.
This crawler has the following parameters:
- source: the name of the RSS source.
- rss_pages: a list of one or more RSS feed URLs.
- days_past: specifies a filtering condition; URLs from the RSS feed will be included only if they have been published in the last N days.
- delay: the number of seconds to wait between indexing operations (to avoid overloading servers).
- extraction: determines the way we want to extract content from each URL (valid values are pdf or html).
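Putting these together, the crawler-specific part of a config file might look roughly like this (the rss_crawler section name and the example values are illustrative assumptions; the vectara section is the same as in the sf.gov example):

```yaml
crawling:
  crawler_type: rss                # select the RSS crawler
rss_crawler:
  source: example-news             # hypothetical name for the RSS source
  rss_pages:
    - "https://www.example.com/feed.xml"   # one or more RSS feed URLs
  days_past: 90                    # only index items published in the last 90 days
  delay: 1                         # seconds to wait between indexing operations
  extraction: pdf                  # pdf or html
```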
Every crawler in vectara-ingest is a subclass of the Crawler base class and has to implement the crawl() method, and RSSCrawler is no exception.
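In outline, the class looks like this (a sketch; the import path is an assumption about the repo layout):

```python
from core.crawler import Crawler   # base-class location is an assumption


class RSSCrawler(Crawler):
    """Retrieves URLs from RSS feeds and indexes their content into a Vectara corpus."""

    def crawl(self) -> None:
        # Step 1: collect the URLs listed in the configured RSS feeds (filtered by days_past)
        # Step 2: render each URL to a file and index it with the Indexer
        ...
```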
Take a look at the implementation of this method and notice there are two major steps:
In the first step, we collect a list of all URLs from the RSS feeds that are within the time period specified by days_past:
    feed = feedparser.parse(rss_page)
    for entry in feed.entries:
        # Some entries have no publication date; include them with a None date
        if "published_parsed" not in entry:
            urls.append([entry.link, entry.title, None])
            continue
        # Keep only entries published within the last days_past days
        entry_date = datetime.fromtimestamp(mktime(entry.published_parsed))
        if entry_date >= days_ago and entry_date <= today:
            urls.append([entry.link, entry.title, entry_date])
In the second step, for each URL we call the url_to_file() helper method to render the content into a PDF file, and the index_file() method (part of the Indexer object) to index that content into the Vectara corpus.
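As a simplified sketch of that loop (the exact signatures and the self.indexer attribute are assumptions based on the description above, not the repo's precise code):

```python
import logging
import time

for url, title, pub_date in urls:
    try:
        filename = "rss_item.pdf"          # hypothetical temporary file name
        self.url_to_file(url, filename)    # render the URL's content into a PDF file
        self.indexer.index_file(filename)  # index the rendered file into the Vectara corpus
    except Exception as e:
        logging.error(f"Error while indexing {url}: {e}")
    time.sleep(delay)                      # respect the configured delay between operations
```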
Make your own crawler!
We saw how to use vectara-ingest to run a website crawl job, and then looked at the code of the RSSCrawler for a detailed example of how the internals of a crawler work.
The vectara-ingest project has many other crawlers implemented that might come in handy:
- MediaWiki: crawl a website powered by MediaWiki, such as Wikipedia
- Notion: crawl content from your company’s Notion instance
- Jira: crawl your company’s Jira instance, indexing the issues and comments
- Docusaurus: crawl a documentation site powered by Docusaurus
- Discourse: crawl a public forum powered by Discourse
- S3: crawl files in an S3 bucket
- Folder: crawl all files in a certain local folder
- PMC: crawl scientific papers from PubMed Central
- GitHub: crawl a GitHub repository, indexing all issues and comments
- Hacker News: crawl top stories from Hacker News
- Edgar: crawl 10-K annual reports from the SEC EDGAR website
I invite you to contribute to this project – whether it’s to improve an existing crawler implementation, contribute a new crawler type, or even add a small improvement to the project documentation – every contribution is appreciated.
Please see the contribution guidelines for additional information, and submit a PR.
Summary
Vectara-ingest provides code samples that make data source crawling and indexing easier, and a framework to easily run “crawl” jobs to ingest data into Vectara.
Vectara community members are using this codebase to build their LLM-powered applications, making the whole process simpler.
I am excited to see this project continue to evolve, support additional types of data ingestion flows, and power new and innovative LLM-powered applications.