How to crawl websites for search
February 09, 2023 by Shane Connelly
This is part 1 of a 2 part blog of how to crawl websites.
Suppose you have a website, and on that website is a bunch of content that you want your users to find and use. How do you make it available for search? The first place many readers will gravitate to is “crawl it!” In this blog, we’re going to talk about methods for crawling your website, when it’s a good idea, when there are better alternatives, and how to best crawl.
When not to crawl
Some web crawlers have gotten more sophisticated, employing machine learning algorithms to extract the “good parts” of web pages and ignore the irrelevant parts. But any solution like this can still have false positives/negatives: it may filter out the text of some good parts and leave some bad parts in. Similarly, some search engines – particularly neural systems like Vectara – have gotten better at surfacing relevant content and ignoring the irrelevant parts.
Even with the improvements in crawling, extraction, and indexing, having a “real” source of truth is still best: most web pages these days are HTML designed to render beautifully in our browsers, but typically they’re backed by JSON, Markdown, or other “more raw” data that’s generally better for ingesting into search engines.
If crawling is the right choice, these are some of the essential recommendations we would offer:
1. Use a bespoke crawler when possible
If you don’t have access to the raw JSON or similar data, but you know that the product name is always in <title>, the price is always in a particular <meta> tag, and the description is always in <div id="description"> – and those are the fields that matter – it’s generally best to build or use a crawler that pulls this specific information out into JSON, capturing just the things that matter to your use case.
If you’re used to automated front-end testing, the process here is typically similar, and in fact the tools that can be used for automated front-end testing are great for using as crawlers as well! Selenium and Playwright have bindings in many languages and form the basis of a lot of great crawlers in use today.
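As a sketch of what such a bespoke crawler’s extraction step might look like, here is a minimal example using only Python’s standard-library HTML parser. The field locations (<title>, a price <meta> tag, and <div id="description">) follow the hypothetical page structure described above; a real crawler would typically first fetch the fully rendered HTML via a tool like Selenium or Playwright, then run an extractor like this over it:

```python
import json
from html.parser import HTMLParser

# Hypothetical page structure: product name in <title>, price in a
# <meta name="price"> tag, description in <div id="description">.
# A bespoke crawler pulls just these fields out into JSON.
class ProductExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._capture = None  # which field we're currently reading text into

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._capture = "name"
        elif tag == "meta" and attrs.get("name") == "price":
            self.fields["price"] = attrs.get("content")
        elif tag == "div" and attrs.get("id") == "description":
            self._capture = "description"

    def handle_endtag(self, tag):
        if tag in ("title", "div"):
            self._capture = None

    def handle_data(self, data):
        if self._capture and data.strip():
            self.fields[self._capture] = (
                self.fields.get(self._capture, "") + data.strip()
            )

def extract_product(html: str) -> str:
    """Return the fields that matter for search, as a JSON string."""
    parser = ProductExtractor()
    parser.feed(html)
    return json.dumps(parser.fields)
```

This sketch ignores nesting inside the description <div>; a production extractor would handle that, but the shape of the output – a small JSON document with exactly the fields your search engine needs – is the point.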
2. Use a real browser
3. Extract the content as a rendering
In some cases, you’ll be in the least optimized situation: you don’t have access to structured data upstream, and the web content itself is either too complicated or too dynamic to parse, or doing so simply doesn’t make business sense. What should you do then? The next best thing, and where our journey ends, is rendering the document out, e.g. as a PDF.
Rendering a page as a PDF has a few advantages over alternatives like “just” saving the HTML. As an example of a few problems that PDFs can help avoid, look at this raw HTML and compare it to how your browser renders it. Comments are provided in the raw HTML for why these can be problems, but in short: modern browsers can insert or remove content, rearrange it, rotate it, render text backwards, and apply many other formatting tricks that can change the final output. By using a rendered PDF, we “get what the user gets.” Unfortunately, this is still an area that needs more maturity in the market:
- Currently PDF generation is only available for Chromium and as special WebKit executables. This means automated PDF rendering isn’t yet available for, say, Firefox.
- Modern CSS provides different rendering modes (aka media types) for websites and the act of printing a PDF in automated frameworks may produce unexpected renderings. You might need to try both Chromium and WebKit to see which produces better renderings for the website(s) you’re trying to crawl.
- Some automated frameworks don’t provide a lot of options on how to render PDFs (e.g. screen width, adding/removing “print” headers like the date the document was printed, etc.), even if those options are available in the underlying browser.
In many cases, these limitations will still work fine for crawling a website, but they are good to be aware of.
In the next post, we’ll explore how Vectara’s web crawler sample application tries to provide a generalized crawling framework taking these into account.