Example Solutions - Web Crawler

Web Crawler

A command-line web crawler that can crawl a single page, a sitemap, an RSS feed, or a site recursively, and then ingest the discovered content into Vectara.

Turn a website or webpage into searchable content in Vectara

The web crawler currently has four modes of operation:

  1. Single URL: provide the crawler with a URL and it will ingest it into Vectara.
  2. Sitemap: provide the crawler with a root page, and it will retrieve the sitemap(s) and index all links from the sitemap.
  3. RSS: provide the crawler with an RSS feed URL and it will find all of the direct links on the feed. It can be used to periodically sync content published there.
  4. Recursive: the most comprehensive mode; the crawler recursively attempts to find links on its own.
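
To make the difference between the modes concrete, here is a minimal Python sketch of how link discovery could work in the sitemap, RSS, and recursive cases. It is not the crawler's actual implementation: the function names (`links_from_sitemap`, `links_from_rss`, `links_from_html`, `crawl_recursive`), the `fetch` callable, the `max_depth` limit, and the same-host restriction are all assumptions made for illustration, and the Vectara ingestion step is left out entirely.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def links_from_sitemap(xml_text):
    """Sitemap mode (sketch): return every <loc> URL in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


def links_from_rss(xml_text):
    """RSS mode (sketch): return the <link> of each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iterfind("./channel/item/link") if el.text]


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_html(html_text, base_url):
    """Return absolute URLs for all anchors found in an HTML page."""
    parser = _HrefCollector()
    parser.feed(html_text)
    return [urljoin(base_url, href) for href in parser.hrefs]


def crawl_recursive(start_url, fetch, max_depth=2):
    """Recursive mode (sketch): breadth-first walk over same-host links.

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    Returns the URLs in the order they were visited.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for link in links_from_html(fetch(url), url):
            link = link.split("#", 1)[0]  # drop fragments to avoid duplicates
            if link not in seen and urlparse(link).netloc == host:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

In this sketch, single-URL mode would simply ingest the given URL with no discovery step. The `fetch` argument is injected so the traversal can be exercised against an in-memory dictionary of pages before pointing it at a real HTTP client.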