LLM-Powered Search
February 22, 2023 by Shane Connelly
This is part 2 of a two-part blog on how to crawl websites. Part 1 is available here.
After talking about some of the good and bad practices to crawl websites in part 1, now let’s look at how Vectara’s web crawler sample application performs a crawl. This crawler is intended to handle the “worst possible scenario” – the one where you don’t have access to upstream semi-structured data and where you don’t know/can’t use rendered tags to extract it – in as graceful a way as possible.
The crawler has 4 modes of link discovery:
The recursive mode of operation is worth a bit of extra explanation, because it’s not what most people will want to use. There are several reasons for this that you should be aware of if you do decide to use it:
Recursive mode uses Chrome to render pages for link discovery.
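To make the idea of recursive link discovery concrete, here is a minimal sketch in Python. It is not the crawler’s code: it parses static HTML with the standard library instead of a Chrome-rendered page, so it will miss links that JavaScript inserts. The names `discover_links`, `crawl`, and the `fetch` callable are hypothetical and exist only for this illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_links(page_url, html):
    """Return absolute http(s) links on the same host as page_url, deduplicated."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    seen = set()
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https") or parsed.netloc != host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; fetch(url) is a caller-supplied function returning HTML."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in discover_links(url, fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The `max_pages` cap and the same-host filter are there because an unbounded recursive crawl can easily wander far beyond the content you meant to index.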
Once the crawler has found a link, it needs to render it to submit it to Vectara. The crawler has 2 built-in renderers: Chrome and Qt WebKit. You can choose which renderer to use by passing the --pdf-driver parameter, which can be either chrome for Chrome or wkhtmltopdf for WebKit. There isn’t a “universally better” one here, but there are tradeoffs:
We recommend trying both to see which works best for your website(s), as long as you trust the data source enough to run wkhtmltopdf.
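As a rough illustration of what switching renderers means in practice, the sketch below builds the command line each tool typically accepts for turning a URL into a PDF. This shows how the two programs are commonly invoked on their own, not necessarily how the crawler calls them. The function names and the `google-chrome` binary name are assumptions, and the binary name varies by platform.

```python
import subprocess


def build_render_command(pdf_driver, url, output_pdf):
    """Return the argv list for rendering url to output_pdf with the chosen driver."""
    if pdf_driver == "chrome":
        # Headless Chrome prints the page to PDF; the binary name is an assumption.
        return [
            "google-chrome",
            "--headless",
            "--disable-gpu",
            f"--print-to-pdf={output_pdf}",
            url,
        ]
    if pdf_driver == "wkhtmltopdf":
        # wkhtmltopdf takes the input URL followed by the output path.
        return ["wkhtmltopdf", url, output_pdf]
    raise ValueError(f"unknown pdf driver: {pdf_driver!r}")


def render(pdf_driver, url, output_pdf, timeout_seconds=60):
    """Run the renderer; raises if it fails or exceeds the timeout."""
    command = build_render_command(pdf_driver, url, output_pdf)
    subprocess.run(command, check=True, timeout=timeout_seconds)
```

Passing the arguments as a list rather than a shell string keeps a crawled URL from being interpreted by the shell, which matters when the URLs come from pages you do not control.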
Once the crawler has found and rendered a URL, it will automatically send it to Vectara’s file upload API. That API handles these PDFs gracefully, and generally you should see good search results after having crawled your website. Make sure to set the authentication and corpus parameters correctly to ensure the content ends up in the right place.
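For readers curious what an upload step like this looks like, here is a minimal sketch that packages a rendered PDF as a multipart form request using only the Python standard library. It is an illustration under assumptions, not Vectara’s client code: `upload_url`, the `x-api-key` header, and the `file` form field are placeholders, so check the file upload API documentation for the real endpoint, authentication scheme, and corpus parameters.

```python
import uuid
from urllib.request import Request


def build_upload_request(upload_url, api_key, filename, pdf_bytes):
    """Build (but do not send) a multipart/form-data POST carrying one PDF."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # The form field name "file" is an assumption for this sketch.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        # Hypothetical auth header; the real API may expect a different one.
        "x-api-key": api_key,
    }
    return Request(upload_url, data=body, headers=headers, method="POST")
```

To send the request you would pass the result to `urllib.request.urlopen`. Building the request separately from sending it makes it easy to inspect what would be uploaded before pointing it at a real corpus.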
The crawler takes a number of parameters that we didn’t go over in this blog, but that you can find in the table here. If you have other questions, feel free to ping us over on https://discuss.vectara.com/. If you find any problems or would like a feature implemented, feel free to open an issue or pull request on the GitHub repo. Happy crawling!