Large Language Models
A collection of crawlers for the Vectara community, making crawling and indexing documents quick and easy
May 16, 2023 by Ofer Mendelevitch
Vectara provides an easy-to-use, managed platform for building LLM-powered conversational search applications with your data.
Developing such an application typically starts with ingesting that data: extracting content from your data sources and indexing it into a Vectara corpus.
Indexing documents is very straightforward, and thanks to our Instant Index capability indexed data is available to query within seconds.
But how does one extract all this content from those sources in the first place?
For a website, you need to understand the latest in web crawling and HTML extraction; for sources with an API like Jira, Notion, or Discourse, you need to learn the detailed nuances of each of those APIs. This can quickly become overwhelming.
This is why I’m excited to announce the release of vectara-ingest, an open source project that includes a set of reusable code for crawling data sources and indexing the extracted content into Vectara corpora.
So how do you index a data source with vectara-ingest?
Let’s start with an example: we will crawl and index the content of sf.gov – the website for the city and county of San Francisco.
First, we set up our environment (instructions shown for Mac; for other environments see here):
brew install yq
git clone https://github.com/vectara/vectara-ingest
cd vectara-ingest
Next let’s open the Vectara console and set up a corpus for this crawl job.
We call our corpus “sf” and provide a simple description. We can then click the “Create” button and the corpus is ready.
In the new corpus view, we generate an API key for indexing and searching:
Now that the corpus is ready, we can create a new YAML configuration file config/sf.yaml
for our crawl job:
vectara:
  corpus_id: 30
  customer_id: 1169579801
crawling:
  crawler_type: website
website_crawler:
  website_homepage: https://www.sf.gov
  delay: 1
  pages_source: sitemap
  extraction: pdf
Let’s review the contents of this configuration file:
- The vectara section provides the information about our Vectara account and corpus – in this case the corpus_id (found in the top-right of this specific corpus page in the console) and the customer_id.
- The crawling section has a single parameter, crawler_type. In our case we select the “website” crawler type (see here for the list of all available crawlers).
- The website_crawler section provides specific parameters for this crawl job: the website_homepage parameter points at https://www.sf.gov, with the pages_source parameter we choose the sitemap crawling technique, and we choose PDF in the extraction parameter.
Vectara-ingest uses a secrets.toml file to hold secrets that are not part of the code-base, such as API keys. In this case we add a specific profile called “sf”, and store under this profile the Vectara auth_url and the api_key we created earlier.
[sf]
auth_url="https://vectara-prod-1169579801.auth.us-west-2.amazoncognito.com"
api_key="<VECTARA-API-KEY>"
To run the crawl job we use the run.sh
script:
bash run.sh config/sf.yaml sf
This builds the Docker image, creates a Docker container (called vingest), and then runs that container with the sf.yaml file we provided, kicking off the crawl job. If you want to track progress, you can follow the log messages from the running Docker container with:
docker logs -f vingest
Once the job is finished, we can use the Vectara console to explore the results by trying out some search queries.
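Beyond the console, you can also try a query programmatically. Here is a minimal sketch (not part of the original walkthrough, and the query text is just an illustration) that searches the new corpus through Vectara’s REST query API, using the customer_id and corpus_id from sf.yaml and the API key we generated earlier:
import requests

# Minimal sketch: query the "sf" corpus via Vectara's v1 REST query API.
# customer_id (1169579801) and corpus_id (30) come from sf.yaml above;
# the API key is the one generated in the console earlier.
response = requests.post(
    "https://api.vectara.io/v1/query",
    headers={
        "customer-id": "1169579801",
        "x-api-key": "<VECTARA-API-KEY>",
        "Content-Type": "application/json",
    },
    json={
        "query": [{
            "query": "how do I renew a parking permit?",   # illustrative query
            "numResults": 5,
            "corpusKey": [{"customerId": 1169579801, "corpusId": 30}],
        }]
    },
)
print(response.json())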
Now that we’ve seen how to run a crawl job using vectara-ingest
, let’s look at an example crawler (the RSS crawler) to better understand how it works internally.
The RSS crawler retrieves a list of URLs from an RSS feed and ingests the documents pointed to by these URLs into a Vectara corpus.
This crawler has the following parameters (a sketch of an example configuration follows the list):
- source: the name of the RSS source.
- rss_pages: a list of one or more RSS feed URLs.
- days_past: specifies a filtering condition; URLs from the RSS feed will be included only if they have been published in the last N days.
- delay: number of seconds to wait between indexing operations (to avoid overloading servers).
- extraction: determines the way we want to extract content from the URL (valid values are pdf or html).
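To make these parameters concrete, here is a hedged sketch of what an RSS crawl job configuration might look like, following the structure of the sf.yaml example above. The source name and feed URL are made up for illustration, and the rss_crawler section name is inferred from the cfg.rss_crawler reference in the crawler code below:
vectara:
  corpus_id: 30            # your corpus id, as in the earlier example
  customer_id: 1169579801
crawling:
  crawler_type: rss
rss_crawler:
  source: sf-news                                  # hypothetical source name
  rss_pages: ["https://www.example.com/rss.xml"]   # hypothetical feed URL
  days_past: 90
  delay: 1
  extraction: pdf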
Every crawler in vectara-ingest is a subclass of the Crawler base class and has to implement the crawl() method, and RssCrawler is no exception:
class RssCrawler(Crawler):
    def crawl(self):
        """
        Crawl RSS feeds and upload to Vectara.
        """
Take a look at the implementation of this method and notice there are two major steps:
In the first step, we collect a list of all URLs from the RSS feeds that are within the time period specified by days_past:
urls = []
for rss_page in rss_pages:
    feed = feedparser.parse(rss_page)
    for entry in feed.entries:
        if "published_parsed" not in entry:
            urls.append([entry.link, entry.title, None])
            continue
        entry_date = datetime.fromtimestamp(mktime(entry.published_parsed))
        if entry_date >= days_ago and entry_date <= today:
            urls.append([entry.link, entry.title, entry_date])
In the second step, for each URL we call the url_to_file()
helper method to render the content into a PDF file, and the index_file()
method (part of the Indexer
object) to index that content into the Vectara corpus:
try:
    filename = self.url_to_file(url, title=title, extraction=self.cfg.rss_crawler.extraction)
except Exception as e:
    logging.error(f"Error while processing {url}: {e}")
    continue
# index document into Vectara
try:
    if pub_date:
        pub_date_int = int(str(pub_date.timestamp()).split('.')[0])
    else:
        pub_date_int = 0    # unknown published date
        pub_date = 'unknown'
    crawl_date_int = int(str(today.timestamp()).split('.')[0])
    metadata = {
        'source': source, 'url': url, 'title': title,
        'pub_date': str(pub_date), 'pub_date_int': pub_date_int,
        'crawl_date': str(today),
        'crawl_date_int': crawl_date_int
    }
    succeeded = self.indexer.index_file(filename, uri=url, metadata=metadata)
    if succeeded:
        crawled_urls.add(url)
    else:
        logging.info(f"Indexing failed for {url}")
    if os.path.exists(filename):
        os.remove(filename)
except Exception as e:
    logging.error(f"Error while indexing {url}: {e}")
time.sleep(delay_in_secs)
We saw how to use vectara-ingest to run a website crawl job, and then looked at the code of RssCrawler for a detailed example of how the internals of a crawler work.
The vectara-ingest project has many other crawlers implemented that might come in handy (for example, for API-based sources like the Jira, Notion, and Discourse sources mentioned earlier).
I invite you to contribute to this project – whether it’s to improve an existing crawler implementation, contribute a new crawler type, or even add a small improvement to the project documentation – every contribution is appreciated.
Please see the contribution guidelines for additional information, and submit a PR.
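To give a sense of what contributing a new crawler type involves, here is a hypothetical minimal sketch. The class name and its configuration section are made up for illustration, but the Crawler base class, self.cfg, self.url_to_file(), and self.indexer.index_file() all appear in the RSS crawler code above:
import logging

class MyListCrawler(Crawler):   # Crawler is the vectara-ingest base class described above
    """
    Hypothetical crawler that ingests a fixed list of URLs taken from the job's
    YAML configuration (the my_list_crawler section is invented for this sketch).
    """
    def crawl(self):
        for url in self.cfg.my_list_crawler.urls:
            try:
                # render the page to a local file, using PDF extraction as in the RSS crawler
                filename = self.url_to_file(url, title=url, extraction="pdf")
                # index the rendered file into the Vectara corpus, with minimal metadata
                self.indexer.index_file(filename, uri=url, metadata={'url': url})
            except Exception as e:
                logging.error(f"Error while processing {url}: {e}")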
Vectara-ingest provides code samples that make data-source crawling and indexing easier, and a framework for running “crawl” jobs that ingest data into Vectara.
Vectara community members are already using this codebase to build their LLM-powered applications, simplifying the whole process.
I am excited to see this project continue to evolve, support additional types of data ingestion flows, and power new and innovative LLM-powered applications.