From Keywords to LLMs – The History of Search, and How ChatGPT Has Changed Search Forever

Introduction

From wall paintings and medieval manuscripts to books about history and news articles about current events, data has played an important role in each and every aspect of human lives. The invention of the internet made this data more accessible to people than ever before. But finding the right information relevant to one’s information needs in the endless sea of data is a challenging task.

This is the reason search engines came into existence.

Over the years, search systems have evolved, and so have users’ expectations. This article provides a brief overview of that evolution.

The Beginnings of Search as We Know It

The earliest search engines were based on basic keyword matching, where the system would try to retrieve the pages or documents that most closely matched the words in the user query. These techniques are often called Boolean Matching.
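To make the idea concrete, here is a minimal sketch of Boolean matching over a tiny inverted index. The function names (`build_index`, `boolean_and`, `boolean_or`) and the three sample documents are illustrative assumptions, not part of any particular search engine.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

def boolean_or(index, query):
    """Return ids of documents that contain at least one query term."""
    terms = query.lower().split()
    return set().union(*(index.get(term, set()) for term in terms))

# Hypothetical toy collection.
docs = {
    1: "classic cheesecake recipes",
    2: "cheesecake history",
    3: "easy bread recipes",
}
index = build_index(docs)
print(boolean_and(index, "cheesecake recipes"))
print(boolean_or(index, "cheesecake recipes"))
```

Note that both functions return unordered sets: the system can say which documents match, but not which match is best.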

Although effective at finding relevant documents, these methods could not rank them. For example, the query “cheesecake recipes” is likely to match many documents that contain the words “cheesecake” or “recipes,” but the user needs to see those documents ordered by how relevant the system judges them to be to the user’s information needs.

This led to the development of the notion of importance of words in a query and document. From there, approaches such as TF-IDF, BM25, and Indri were developed.
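As an illustration of term weighting, the sketch below scores documents with one common formulation of BM25. The parameter defaults (`k1=1.5`, `b=0.75`), the smoothed IDF variant, the whitespace tokenizer, and the sample documents are all assumptions chosen for brevity; production systems differ in these details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query (assumes a non-empty collection)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical toy collection.
docs = [
    "classic cheesecake recipes",
    "cheesecake history and origins",
    "easy bread recipes for beginners",
    "gardening tips",
]
scores = bm25_scores("cheesecake recipes", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Unlike pure Boolean matching, the output is a score per document, so the results can be sorted.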

All of these helped improve the performance of simple keyword-based search. Although the idea of keyword-based search sounds simple, it has stood the test of time. BM25, for example, remains a competitive baseline in state-of-the-art information retrieval research.

Drawbacks of Keyword-based Search

Even with their enormous success, there are some clear scenarios where keyword-based search fails.

For example, the user query “dog toys” should be able to fetch a document that refers to “puppy toys,” but a purely keyword-based search system would fail to do this because it has no notion of semantic matching. To remedy this issue, researchers came up with multiple approaches, such as Pseudo-Relevance Feedback. The idea behind this approach is to expand the user query with the concepts that occur most often in the top results of an initial search.

This might help in cases where, let’s say, the top “dog toys” documents frequently contain the word “puppy.” But it can also lead to an issue commonly referred to as “query drift” if the top documents are biased.
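The query expansion step can be sketched as follows. This is a simplified illustration that assumes an initial ranking is already available; the `expand_query` helper, the tiny stopword list, and the sample documents are hypothetical, and real systems typically weight expansion terms rather than just counting them.

```python
from collections import Counter

# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "the", "of", "to", "with"}

def expand_query(query, ranked_docs, top_k=3, n_terms=2):
    """Add the most frequent new terms from the top-ranked documents to the query."""
    query_terms = query.lower().split()
    counts = Counter()
    for text in ranked_docs[:top_k]:
        for term in text.lower().split():
            if term not in query_terms and term not in STOPWORDS:
                counts[term] += 1
    expansion = [term for term, _ in counts.most_common(n_terms)]
    return query_terms + expansion

# Hypothetical initial ranking for the query "dog toys".
ranked_docs = [
    "durable dog toys for a teething puppy",
    "puppy chew toys and dog balls",
    "dog toys for puppy training",
    "cat scratching posts",
]
expanded = expand_query("dog toys", ranked_docs, n_terms=1)
print(expanded)
```

Because the expansion terms come only from the top results, whatever bias those results carry is fed straight back into the query.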

Take the example of the query “David Beckham.” The search system does not know whether the user wants information about David Beckham the English footballer or David Beckham the American television producer. In this case, if the top documents are more related to David Beckham the footballer, the system would bias the results towards football.

This led to the development of another aspect of information retrieval known as diversification. The idea behind this approach is to make the results as diverse as possible when the system is not sure about the user’s information needs. Approaches to achieve this include MMR, xQuAD, and PM-2.
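As one example of diversification, here is a minimal sketch of Maximal Marginal Relevance (MMR), which greedily picks the document that best balances relevance against similarity to what has already been selected. The Jaccard word-overlap similarity, the relevance scores, and the `lam` trade-off value are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Word-overlap similarity between two texts (a stand-in for a real similarity)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def mmr(docs, relevance, k, lam=0.5):
    """Greedily select k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to any already selected document.
            redundancy = max(
                (jaccard(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical results and relevance scores for the query "David Beckham".
docs = [
    "david beckham football career",
    "david beckham football goals",
    "david beckham television producer",
]
relevance = [0.9, 0.8, 0.6]

diverse = mmr(docs, relevance, k=2, lam=0.5)
relevance_only = mmr(docs, relevance, k=2, lam=1.0)
print(diverse, relevance_only)
```

With `lam=1.0` the selection reduces to a plain relevance ranking, so both top results are about football. Lowering `lam` penalizes the near-duplicate football document and lets the television producer result into the top two.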

Deep Learning’s Impact on Search

Even with significant advancements in keyword-based search, it was extremely difficult to build complete natural language understanding into the system.

The development of transformer-based models such as BERT helped the natural language processing community make significant advancements toward embedding natural language understanding into models. However, the adoption of these techniques in information retrieval was quite delayed, mainly due to the scale at which these systems had to operate.

To find relevant documents for a query, the system had to introduce some kind of interaction mechanism between the query and all possible sets of documents. One approach that was most commonly adopted was to learn query and document representations, also called embeddings, using these natural language models.

These embeddings are learned in such a way that, if we represent them as vectors in an n-dimensional space, the embeddings of related queries and documents lie closer together than the embeddings of unrelated ones. Even though advancements in deep learning made it possible to learn this n-dimensional space effectively, finding the closest neighbors at scale was still a challenge.
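The sketch below illustrates the geometry with cosine similarity and an exhaustive (brute-force) neighbor search. The three-dimensional vectors are made up by hand for illustration; real embeddings come from a trained model and typically have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_vec, doc_vecs, k=1):
    """Return the k document names most similar to the query, by exhaustive scan."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hand-made toy embeddings (assumptions, not model output).
doc_vecs = {
    "puppy toys": [0.9, 0.8, 0.1],
    "dog food": [0.8, 0.1, 0.2],
    "cheesecake recipes": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.9, 0.0]  # stands in for the query "dog toys"

top_two = nearest(query_vec, doc_vecs, k=2)
print(top_two)
```

Here “puppy toys” ranks first even though it shares only one word with the query. The exhaustive scan compares the query against every document, so its cost grows linearly with the size of the collection, which is what makes exact search impractical at web scale.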

The development of Approximate Nearest Neighbor search libraries such as FAISS helped remedy this problem. Once this issue was solved, the search community focused on two main areas: finding all relevant documents (measured by recall) and identifying the optimal order for the relevant documents (measured by precision). Together, these goals aim to present the end user with the best possible list of documents for satisfying their information needs.
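For reference, recall and precision are often computed over the top k results of a ranking. A minimal sketch follows; the document ids and relevance judgments are invented for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical system output and ground-truth judgments.
ranked = ["d1", "d4", "d2", "d7", "d3"]
relevant = {"d1", "d2", "d3"}

print(precision_at_k(ranked, relevant, k=4))
print(recall_at_k(ranked, relevant, k=4))
```

In this example two of the top four results are relevant, and two of the three relevant documents have been found. Raising k to five finds the third relevant document, which improves recall.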

ChatGPT and Changes in User’s Search Experience

However, a more seismic change in the world of information retrieval happened when ChatGPT was introduced in late 2022. ChatGPT completely changed what a user expects from a search system. Users now want a direct answer in a sentence or a paragraph instead of browsing through a list of candidate answers.

And to a great extent, systems like ChatGPT are successful at doing this.

But this leaves us with the question, “Are traditional search systems obsolete?” Probably not. Every solution has its own drawbacks, and one of the major drawbacks of LLM-based search systems is hallucination. Here at Vectara, we aim to resolve this issue once and for all using a technique we call Grounded Generation.

Conclusion and Next Steps – Experience Grounded Generation

The information retrieval techniques discussed above form the foundation of our Grounded Generation-based system, which then uses LLMs to bring the best of both worlds to the end user.

To see Vectara’s approach to search, visit our sample site AskNews at https://asknews.vectara.com. There, “hybrid search” paired with best-in-class retrieval and generative summarization answers your questions about recent news events and lists citations, so you can verify that the summarized answer is grounded in the sources rather than hallucinated. If you want to take it a step further, you can sign up for your own free Vectara plan and pioneer more advances in search.