
How to crawl websites for search

This is part 1 of a two-part blog series on how to crawl websites.

Suppose you have a website, and on that website is a bunch of content that you want your users to find and use. How do you make it available for search? The first answer many readers will gravitate to is "crawl it!" In this blog, we're going to talk about methods for crawling your website, when it's a good idea, when there are better alternatives, and how best to crawl.

When not to crawl

First, let's talk about when you shouldn't crawl a website and why. The best reason not to crawl is having a more machine-readable source of semi-structured documents, such as raw JSON documents that represent the (important) contents of each page. The reason is that websites weren't made for machines: they were made for humans. Web pages have headers, footers, sidebars, and other navigational elements that appear on nearly every page but typically add no value to search results – and can even hurt them. A common example is that every page might have text like "privacy policy" in the footer. If someone searches for "privacy policy," you probably don't want every webpage to be returned.

Some web crawlers have gotten more sophisticated at employing machine learning algorithms to try to extract the "good parts" of web pages and ignore the irrelevant parts, but any such solution can still produce false positives and negatives: it may filter out text from genuinely useful parts of a page while leaving some irrelevant text in. Similarly, some search engines – particularly neural systems like Vectara – have gotten better at surfacing relevant content and ignoring the irrelevant parts.

Even with the improvements in crawling, extraction, and indexing, having a "real" source of truth is still best: most web pages these days are HTML built to render beautifully in our browsers, but typically they're backed by JSON, Markdown, or other "more raw" data that's generally better for ingesting into search engines.
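
For illustration, a hypothetical "raw" record behind a product page might look something like the following (the field names and values are invented for this example); a record like this is far easier to ingest than the rendered HTML around it:

```python
# A hypothetical "raw" record behind a product page. Field names are invented
# for illustration; indexing a record like this avoids the navigation, footer,
# and styling noise that comes with crawling the rendered HTML.
import json

product_record = {
    "title": "Acme Anvil",
    "price": "19.99",
    "description": "A sturdy anvil, suitable for dropping from great heights.",
    "url": "https://example.com/products/acme-anvil",
}

print(json.dumps(product_record, indent=2))
```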

If crawling is the right choice, these are some of the essential recommendations we would offer:

1. Use a bespoke crawler when possible

If you don't have access to the raw JSON or similar data, but you know that the product name is always in <title>, the price is always in a particular <meta> tag, and the description is always in <div id="description"> and those are the fields that matter, it's generally best to build or use a crawler that pulls this specific information out into JSON, capturing just the things that matter to your use case.

If you're used to automated front-end testing, the process here is typically similar; in fact, the same tools used for automated front-end testing are great as crawlers too! Selenium and Playwright have bindings in many languages and form the basis of a lot of great crawlers in use today.
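
As a minimal sketch of that idea (using Playwright's Python bindings, with a placeholder URL and placeholder selectors for the price and description), a bespoke crawler might pull just those fields out into JSON:

```python
# A minimal bespoke crawler sketch: render the page with headless Chromium,
# pull out only the fields that matter, and emit them as JSON.
# The URL and the selectors below are placeholders for your site's structure.
import json
from playwright.sync_api import sync_playwright

def crawl_product(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        record = {
            "title": page.title(),  # contents of <title>
            "price": page.locator('meta[name="price"]').get_attribute("content"),
            "description": page.locator("#description").inner_text(),
            "url": url,
        }
        browser.close()
    return record

if __name__ == "__main__":
    print(json.dumps(crawl_product("https://example.com/products/acme-anvil"), indent=2))
```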

2. Use a real browser

For a long time, it's been common to use HTML parsers to extract the contents from websites (or worse, regular expressions). Some of these have worked well, and on some sites they still work well. However, websites have gotten a lot more dynamic: it's not uncommon to see heavy use of JavaScript and CSS to create new elements on a page or to change the order or visibility of elements. These dynamic elements often escape traditional HTML parsers.

The solution is to use a real browser to render the page and programmatically access the content that you need. Fortunately, headless browsers have matured significantly: Chromium, Firefox, and Safari can all be driven programmatically. Selenium, Playwright, and Puppeteer can all be used to automate headless browsing for crawling, and can extract specific rendered elements (i.e. after JavaScript and CSS have been applied) or even take screenshots of the whole page.
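
As a sketch of that approach (again using Playwright's Python bindings, with a placeholder URL and selector), you can wait for JavaScript to build the content before extracting it or taking a screenshot:

```python
# Render a JavaScript-heavy page in headless Chromium, wait for a dynamically
# created element, then extract its rendered text and a full-page screenshot.
# The URL and the "#content" selector are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dynamic-page")
    page.wait_for_selector("#content")  # wait until the JS-built element exists
    rendered_text = page.locator("#content").inner_text()
    page.screenshot(path="page.png", full_page=True)
    browser.close()

print(rendered_text)
```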

3. Extract the content as a rendering

In some cases, you'll be in the least optimized situation: you don't have access to structured data upstream, and the web content itself is either too complicated or too dynamic to parse, or doing so simply doesn't make business sense. What should you do then? The next best thing, and where our journey ends, is rendering the document out, e.g. as a PDF.

Rendering a page as a PDF has a few advantages over alternatives like "just" saving the HTML. As an example of a few problems that PDFs can help avoid, look at this raw HTML and compare it to how your browser renders it. Comments are provided in the raw HTML explaining why these can be problems, but in short: modern browsers can insert and remove elements, rearrange them, rotate them, render text backwards, and apply many other formatting changes that alter the final output. By using a rendered PDF, we "get what the user gets." Unfortunately, this is still an area that needs more maturity in the market:

  • Currently PDF generation is only available for Chromium and as special WebKit executables.  This means automated PDF rendering isn’t yet available for, say, Firefox.
  • Modern CSS provides different rendering modes (aka media types) for websites and the act of printing a PDF in automated frameworks may produce unexpected renderings.  You might need to try both Chromium and WebKit to see which produces better renderings for the website(s) you’re trying to crawl.
  • Some automated frameworks don't provide a lot of options for how to render PDFs (e.g. screen width, adding/removing "print" headers such as when the document was printed, etc.), even if those options are available in the underlying browser.

In many cases, crawling a website will still work fine despite these limitations, but they are good to be aware of.
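
As a sketch of the rendering approach with Playwright's Python bindings (placeholder URL; note that page.pdf() is only supported in headless Chromium, echoing the first limitation above):

```python
# Print a rendered page to PDF with headless Chromium. page.pdf() is
# Chromium-only; emulating "screen" media keeps the screen stylesheet rather
# than the print stylesheet, which is often closer to what the user sees.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/some-page")  # placeholder URL
    page.emulate_media(media="screen")
    page.pdf(path="page.pdf", format="A4", print_background=True)
    browser.close()
```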

In the next post, we’ll explore how Vectara’s web crawler sample application tries to provide a generalized crawling framework taking these into account.
