How to crawl websites for search (part 2)
This is part 2 of a 2-part blog on how to crawl websites. Part 1 is available here.
After talking about some of the good and bad practices to crawl websites in part 1, now let’s look at how Vectara’s web crawler sample application performs a crawl. This crawler is intended to handle the “worst possible scenario” – the one where you don’t have access to upstream semi-structured data and where you don’t know/can’t use rendered tags to extract it – in as graceful a way as possible.
Finding all links
The crawler has 4 modes of link discovery:
- Single URL: you provide a single URL for the crawler to pull. This is the simplest and best for testing or if you have a small handful of URLs you want to index.
- Sitemap: most websites have sitemaps of the pages they do want crawled. When you place the crawler into this mode, it attempts to find sitemaps in a variety of ways: standard sitemaps linked from robots.txt, sitemaps at common URLs that aren’t linked from robots.txt, RSS/Atom-format sitemaps, and even text-file sitemaps (a minimal sketch of this discovery follows the list). This is what most people will want to use.
- RSS: you provide an RSS feed and the crawler crawls all of the URLs from that RSS feed. This can be useful if, e.g., you’re looking at a news aggregator feed periodically.
- Recursive: the most “intense” mode of operation, and the one most people think of as a “crawler.” You provide a starting URL, plus optional restrictions on the crawl, and the crawler attempts to discover and index every link it can. This mode takes several command-line options, such as a regular expression the discovered links must match and a maximum crawl depth.
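As a rough illustration of the sitemap mode, here’s a minimal sketch (with assumed helper names, not the crawler’s actual code) that checks robots.txt for Sitemap: entries, falls back to a couple of common locations, and pulls the <loc> entries out of an XML sitemap:

```python
import requests
from urllib.parse import urljoin
from xml.etree import ElementTree

def discover_sitemaps(site: str) -> list[str]:
    """Look for sitemaps: robots.txt first, then a few common unlinked locations."""
    candidates = []
    robots = requests.get(urljoin(site, "/robots.txt"), timeout=10)
    if robots.ok:
        for line in robots.text.splitlines():
            if line.lower().startswith("sitemap:"):
                candidates.append(line.split(":", 1)[1].strip())
    if not candidates:
        # Fall back to common locations that often exist even when unlinked.
        candidates = [urljoin(site, path)
                      for path in ("/sitemap.xml", "/sitemap_index.xml", "/sitemap.txt")]
    return candidates

def urls_from_sitemap(sitemap_url: str) -> list[str]:
    """Pull the <loc> entries out of an XML sitemap."""
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ElementTree.fromstring(resp.content)
    # Namespaces vary between sitemaps, so match on the tag's local name only.
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]

print(urls_from_sitemap(discover_sitemaps("https://vectara.com")[0]))
```

A sitemap index would additionally require recursing into each child <loc> that itself points at a sitemap, and the sample crawler also handles RSS/Atom and plain-text sitemaps, which this sketch skips.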
The recursive mode is worth a bit of extra explanation, because it’s not what most people will want to use. If you do decide to use it, there are several caveats to be aware of:
- Individual page timeouts, JavaScript rendering timeouts/quirks, etc. can all mean links don’t show up during a single rendering pass, so the URLs they point to may never be discovered. Sitemaps and RSS feeds, by contrast, are intended to provide canonical lists of links, at least as of a particular point in time.
- It’s difficult for the crawler to determine which links to treat as unique. For example, is vectara.com/foo different from vectara.com/foo?bar? The only way to tell is to render both, whereas a sitemap generally provides “just” the right set of URLs to crawl.
- Currently, the crawler operates purely in memory (except for saving PDFs), which means it tries to keep every URL it has seen in memory. This is fast, but if the crawl discovers a lot of unique URLs, it can become memory intensive. The crawler mitigates this with a Bloom filter that tracks visited URLs (a simplified sketch follows this list); in the future it may use a lightweight embedded database such as SQLite, but those options come with their own tradeoffs.
- There may be corners of the website that a crawl would never discover. For example, if some content is never linked to along any path starting from the initial crawl URL, it won’t be indexed. You can always run the crawler multiple times with different seed URLs, but sitemaps are generally the intended solution to this problem.
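To make the deduplication and memory points above concrete, here is a simplified sketch (the normalize helper and BloomFilter class are assumptions for the example, not the crawler’s actual data structures) of canonicalizing URLs and tracking visited ones in a small Bloom filter:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Canonicalize a URL so trivially-different forms dedupe to the same key.
    Whether to keep the query string is a judgment call: vectara.com/foo and
    vectara.com/foo?bar may or may not be the same page."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/") or "/", parts.query, ""))  # drop the #fragment

class BloomFilter:
    """Tiny Bloom filter: fixed memory, no false negatives, rare false positives."""
    def __init__(self, size_bits: int = 8 * 1024 * 1024, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
url = normalize("https://vectara.com/foo?bar")
if url not in seen:
    seen.add(url)
    # ... render and index the page here ...
```

The tradeoff is that a Bloom filter can report a false positive, i.e. very occasionally skip a URL it has never actually visited, in exchange for a small, fixed memory footprint.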
The recursive mode uses Chrome to render pages for link discovery.
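To illustrate that step, the sketch below drives headless Chrome via Selenium (an assumption made for this example; the crawler’s own plumbing differs) to render a page, JavaScript included, and collect the absolute links it finds:

```python
from urllib.parse import urljoin, urldefrag
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def extract_links(url: str) -> set[str]:
    """Render `url` in headless Chrome and return the absolute links on the page."""
    options = Options()
    options.add_argument("--headless=new")  # headless flag name varies across Chrome versions
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        links = set()
        for anchor in driver.find_elements(By.TAG_NAME, "a"):
            href = anchor.get_attribute("href")
            if href:
                # Resolve relative links and drop #fragments before queueing them.
                links.add(urldefrag(urljoin(url, href))[0])
        return links
    finally:
        driver.quit()
```

Each discovered link would then be filtered against the regular expression and depth limit mentioned above before being added to the crawl frontier.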
Rendering the page
Once the crawler has found a link, it needs to render it before submitting it to Vectara. The crawler has two built-in renderers: Chrome and Qt WebKit. You can toggle between them with the --pdf-driver parameter, which can be either chrome for Chrome or wkhtmltopdf for WebKit. Neither is “universally better”; there are tradeoffs:
- wkhtmltopdf generally renders complex pages well for search purposes, because it renders many elements that are slightly tucked away in fragments, e.g. content revealed by a small JavaScript transition that isn’t a full page transition. However, it has some security limitations (the process should be sandboxed), and if/when you actually want URL fragments in your search engine, it won’t handle them.
- chrome generally positions elements more accurately, the way most users are used to seeing the page. However, it doesn’t always do what you might expect when rendering the print media type, and that layout accuracy in print can sometimes lead to worse results for search purposes.
It’s recommended you try both and see what’s best for your website(s) as long as you trust the data source enough to run wkhtmltopdf.
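If you want a feel for what each renderer does under the hood, both can be driven as standalone commands. The sketch below simply shells out to each one; the binary names and flags are assumptions that may differ across platforms and versions:

```python
import subprocess

def render_pdf(url: str, out_path: str, driver: str = "chrome") -> None:
    """Render a URL to a PDF with either headless Chrome or wkhtmltopdf.
    The binary names and flags here are assumptions; adjust for your setup."""
    if driver == "chrome":
        cmd = ["google-chrome", "--headless", "--disable-gpu",
               f"--print-to-pdf={out_path}", url]
    else:
        # wkhtmltopdf should be sandboxed if you don't fully trust the pages you feed it.
        cmd = ["wkhtmltopdf", "--quiet", url, out_path]
    subprocess.run(cmd, check=True, timeout=120)

render_pdf("https://vectara.com", "vectara-chrome.pdf", driver="chrome")
render_pdf("https://vectara.com", "vectara-webkit.pdf", driver="wkhtmltopdf")
```

Comparing the two PDFs side by side is a quick way to judge which renderer captures more of the content you actually want searchable.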
Submit to Vectara’s File Upload API
Once the crawler has found and rendered a URL, it will automatically send it to Vectara’s file upload API. That API handles these PDFs gracefully, and generally you should see good search results after having crawled your website. Make sure to set the authentication and corpus parameters correctly to ensure the content ends up in the right place.
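For reference, a minimal upload looks roughly like the sketch below. It assumes the v1 file-upload endpoint with API-key authentication and uses placeholder credentials, so check Vectara’s current API documentation for the exact endpoint and parameter names:

```python
import requests

def upload_pdf(pdf_path: str, customer_id: str, corpus_id: str, api_key: str) -> None:
    """Send a rendered PDF to Vectara's file-upload API (endpoint and params assumed here)."""
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://api.vectara.io/v1/upload",
            params={"c": customer_id, "o": corpus_id},  # customer and corpus IDs
            headers={"x-api-key": api_key},
            files={"file": f},
            timeout=60,
        )
    response.raise_for_status()

upload_pdf("vectara.pdf", customer_id="1234567890", corpus_id="1", api_key="zqt_...")
```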
More details and feedback
The crawler takes a number of parameters that we didn’t go over in this blog, but you can find them in the table here. If you have other questions, feel free to ping us at https://discuss.vectara.com/, and if you find a problem or would like a feature implemented, open an issue or pull request on the GitHub repo. Happy crawling!