
Vectara-ingest: Data Ingestion made easy

A collection of crawlers for the Vectara community, making crawling and indexing documents quick and easy

Introduction

Vectara provides an easy-to-use, managed platform for building LLM-powered conversational search applications with your data.

Developing such an application typically includes the following steps:

  • Retrieving the data or content from its source (a website, Notion, Jira, etc.)
  • Using Vectara’s indexing API to ingest that content into a Vectara corpus
  • Building a search user interface that calls Vectara’s Search API with a user query and displays the results to the user.

Indexing documents is very straightforward, and thanks to our Instant Index capability indexed data is available to query within seconds.
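For example, indexing a single document directly through the REST indexing API takes just a few lines of Python. This is only a rough sketch: the document ID, title and text below are made up, the customer and corpus IDs are the example values used later in this post, and the authoritative request schema lives in Vectara’s API docs.

import requests

# Rough sketch: index one document into a Vectara corpus via the v1 REST indexing API.
# Replace <VECTARA-API-KEY> with your own API key; the IDs match the example used later in this post.
customer_id = 1169579801
corpus_id = 30

document = {
    "documentId": "example-doc-1",                       # made-up document ID
    "title": "Example document",
    "section": [{"text": "The content we want to make searchable."}],
}

response = requests.post(
    "https://api.vectara.io/v1/index",
    headers={
        "customer-id": str(customer_id),
        "x-api-key": "<VECTARA-API-KEY>",
        "Content-Type": "application/json",
    },
    json={"customerId": customer_id, "corpusId": corpus_id, "document": document},
)
print(response.json())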

But how does one extract all this content from those sources in the first place?

For a website, you need to understand the latest in web crawling and HTML extraction; for sources with an API, like Jira, Notion or Discourse, you need to learn the detailed nuances of each of those APIs. This can quickly become overwhelming.

This is why I’m excited to announce the release of vectara-ingest, an open source project that includes a set of reusable code for crawling data sources and indexing the extracted content into Vectara corpora.

Getting Started with vectara-ingest

So how do you index a data source with vectara-ingest?

Let’s start with an example: we will crawl and index the content of sf.gov – the website for the city and county of San Francisco.

First, we set up our environment (instructions shown for macOS; for other environments see here):

  • Install Docker if it’s not yet installed
  • Install Python 3.8 or above if it’s not yet installed
  • Install the yq command, if it’s not yet installed (brew install yq)
  • Clone the repo: git clone https://github.com/vectara/vectara-ingest
  • cd vectara-ingest

Next, let’s open the Vectara console and set up a corpus for this crawl job.

Figure 1: create new corpus

We call our corpus “sf” and provide a simple description. We can then click the “Create” button and the corpus is ready.

In the new corpus view, we generate an API key for indexing and searching: 

Figure 2: create an API key for the corpus

Now that the corpus is ready, we can create a new YAML configuration file config/sf.yaml for our crawl job:

vectara:
   corpus_id: 30
   customer_id: 1169579801

crawling:
   crawler_type: website

website_crawler:
   website_homepage: https://www.sf.gov
   delay: 1
   pages_source: sitemap
   extraction: pdf

Let’s review the contents of this configuration file:

  1. The vectara section provides the information about our Vectara account and corpus – in this case the corpus_id (found in the top-right of this specific corpus page in the console) and customer_id
  2. The crawling section has a single parameter crawler_type. In our case we select the “website” crawler type (see here for the list of all available crawlers).
  3. The website_crawler section provides specific parameters for this crawl job:
    • We specify the target website URL with the website_homepage parameter
    • In the pages_source parameter we choose the sitemap crawling technique
    • We choose PDF rendering of the website content by setting the extraction parameter to pdf.
    • We specify a 1-second delay between URL extractions to make sure we don’t overload the sf.gov website.

Vectara-ingest uses a secrets.toml file to hold secrets that are not part of the codebase, such as API keys. In this case we add a specific profile called “sf”, and store under this profile the Vectara auth_url and the api_key we created earlier.

[sf] 
auth_url="https://vectara-prod-1169579801.auth.us-west-2.amazoncognito.com" 
api_key="<VECTARA-API-KEY>"

 

To run the crawl job we use the run.sh script:

bash run.sh config/sf.yaml sf

This builds the Docker image, creates a Docker container (called vingest), and runs that container with the sf.yaml configuration we provided, kicking off the crawl job. If you want to track progress, you can follow the log messages from the running Docker container with:

docker logs -f vingest

Once the job is finished, we can use the Vectara console to explore the results by trying out some search queries: 

Figure 3: Searching sf.gov from the Vectara console
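The same corpus can also be queried programmatically. Below is a minimal sketch that calls the REST query API with the customer_id and corpus_id from our configuration; the query text is just an example, and the exact request and response schema is described in Vectara’s API docs.

import requests

# Minimal sketch: run a search query against the "sf" corpus via the v1 REST query API.
customer_id = 1169579801
corpus_id = 30

response = requests.post(
    "https://api.vectara.io/v1/query",
    headers={
        "customer-id": str(customer_id),
        "x-api-key": "<VECTARA-API-KEY>",
        "Content-Type": "application/json",
    },
    json={
        "query": [
            {
                "query": "how do I renew a parking permit?",   # example query
                "numResults": 10,
                "corpusKey": [{"customerId": customer_id, "corpusId": corpus_id}],
            }
        ]
    },
)

# Print the top matching text snippets and their scores.
for result in response.json()["responseSet"][0]["response"]:
    print(f"{result['score']:.3f}  {result['text']}")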

How does a vectara-ingest crawler work?

Now that we’ve seen how to run a crawl job using vectara-ingest, let’s look at an example crawler (the RSS crawler) to better understand how it works internally.

The RSS crawler retrieves a list of URLs from an RSS feed and ingests the documents pointed to by these URLs into a Vectara corpus.

This crawler has the following parameters (an example configuration is shown after the list):

  • source: the name of the RSS source.
  • rss_pages: a list of one or more RSS feed URLs.
  • days_past: specifies a filtering condition; URLs from the RSS feed will be included only if they have been published in the last N days.
  • delay: number of seconds to wait between indexing operations (to avoid overloading servers).
  • extraction: determines the way we want to extract content from the URL (valid values are pdf or html).
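Putting these parameters together, a configuration file for the RSS crawler might look roughly like the sketch below. It follows the same layout as config/sf.yaml, reuses the corpus and customer IDs from that example, and assumes the crawler type is named rss to match the rss_crawler section referenced in the code; the source name and feed URL are placeholders.

vectara:
   corpus_id: 30
   customer_id: 1169579801

crawling:
   crawler_type: rss

rss_crawler:
   source: example-news
   rss_pages: ["https://www.example.com/feed.xml"]
   days_past: 90
   delay: 1
   extraction: pdf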

Every crawler in vectara-ingest is a subclass of the Crawler base class and must implement the crawl() method; RssCrawler is no exception:

class RssCrawler(Crawler):
   def crawl(self):
       """
       Crawl RSS feeds and upload to Vectara.

       """

Take a look at the implementation of this method and notice there are two major steps:

In the first step, we collect a list of all URLs from the RSS feeds that are within the time period specified by days_past:

urls = []
for rss_page in rss_pages:
    feed = feedparser.parse(rss_page)
    for entry in feed.entries:
        if "published_parsed" not in entry:
            urls.append([entry.link, entry.title, None])
            continue
        entry_date = datetime.fromtimestamp(mktime(entry.published_parsed))
        if entry_date >= days_ago and entry_date <= today:
            urls.append([entry.link, entry.title, entry_date])

In the second step, for each URL we call the url_to_file() helper method to render the content into a PDF file, and the index_file() method (part of the Indexer object) to index that content into the Vectara corpus:

try:
    filename = self.url_to_file(url, title=title, extraction=self.cfg.rss_crawler.extraction)
except Exception as e:
    logging.error(f"Error while processing {url}: {e}")
    continue

# index document into Vectara
try:
    if pub_date:
        pub_date_int = int(str(pub_date.timestamp()).split('.')[0])
    else:
        pub_date_int = 0        # unknown published date
        pub_date = 'unknown'
    crawl_date_int = int(str(today.timestamp()).split('.')[0])
    metadata = {
        'source': source, 'url': url, 'title': title, 
        'pub_date': str(pub_date), 'pub_date_int': pub_date_int,
        'crawl_date': str(today),
        'crawl_date_int': crawl_date_int
    }
    succeeded = self.indexer.index_file(filename, uri=url, metadata=metadata)
    if succeeded:
        crawled_urls.add(url)
    else:
        logging.info(f"Indexing failed for {url}")
    if os.path.exists(filename):
        os.remove(filename)
except Exception as e:
    logging.error(f"Error while indexing {url}: {e}")
time.sleep(delay_in_secs)

Make your own crawler!

We saw how to use vectara-ingest to run a website crawl job, and then looked at the code of the RssCrawler for a detailed example of how the internals of a crawler work.

The vectara-ingest project has many other crawlers implemented that might come in handy:

  • Mediawiki: crawl a website powered by MediaWiki, such as Wikipedia
  • Notion: crawl content from your company’s Notion instance
  • Jira: crawl your company’s Jira instance, indexing issues and comments
  • Docusaurus: crawl a documentation site powered by Docusaurus
  • Discourse: crawl a public forum powered by Discourse
  • S3: crawl files in an S3 bucket
  • Folder: crawl all files in a local folder
  • PMC: crawl scientific papers from PubMed Central
  • GitHub: crawl a GitHub repository, indexing all issues and comments
  • Hacker News: crawl top stories from Hacker News
  • Edgar: crawl 10-K annual reports from the SEC Edgar website

I invite you to contribute to this project – whether it’s to improve an existing crawler implementation, contribute a new crawler type, or even add a small improvement to the project documentation – every contribution is appreciated.
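If you would like to try writing one, a new crawler is at its core just another subclass of Crawler with a crawl() method that hands content to the Indexer. The sketch below indexes every text file in a local folder; the my_crawler config section, its folder parameter, and the import path are illustrative assumptions rather than the exact vectara-ingest layout.

import logging
import os

from core.crawler import Crawler   # adjust the import to where Crawler lives in the repo


class MyCrawler(Crawler):
    def crawl(self):
        """
        Index every .txt file found in a local folder (illustrative example).
        """
        folder = self.cfg.my_crawler.folder        # hypothetical config parameter
        for filename in os.listdir(folder):
            if not filename.endswith(".txt"):
                continue
            path = os.path.join(folder, filename)
            metadata = {"source": "my_crawler", "title": filename, "url": path}
            succeeded = self.indexer.index_file(path, uri=path, metadata=metadata)
            if not succeeded:
                logging.info(f"Indexing failed for {path}")

Following the pattern of the website and RSS crawlers, you would then point a config file at it via the crawling section’s crawler_type setting and a matching my_crawler section.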

Please see the contribution guidelines for additional information, and submit a PR. 

Summary

Vectara-ingest provides a set of crawlers that make data-source crawling and indexing easier, and a framework for running “crawl” jobs that ingest data into Vectara.

Vectara community members are already using this codebase to build their LLM-powered applications, making the whole ingestion process simpler.

I am excited to see this project continue to evolve, support additional types of data ingestion flows, and power new and innovative LLM-powered applications. 
