Retrieval and Search
Search by Meaning
December 22, 2022 by Amr Awadallah
Seeking knowledge about our world is something that all of us need. It is a task that goes back to the dawn of human civilization, ever since we figured out how to capture our knowledge and share it (via drawings on the wall of a cave, or spoken words).
With the advent of information technology, we gained a way to easily search the massive amounts of knowledge we are accumulating. However, the main techniques that evolved over the last few decades have focused on keyword matching, i.e. using a few words that describe what you are looking for to locate the information around them. The keyword approach has a number of drawbacks, but primarily it suffers from Semantic Loss. By that we mean that keywords on their own, without comprehension, don’t capture the true meaning of what we are looking for, the true essence of what we seek. Keyword search tried to improve on that by “faking” semantic understanding of our queries. That process involves building a sophisticated knowledge graph and thousands of language rules that try to capture:
(1) all the different ways a given keyword can be written (stemming, lemmatization, spell correction, acronyms, emojis, …), and all the different meanings of that word based on context
(2) all the different words that have the same meaning or refer to the same concept
(3) part-of-whole relationships between words (the USA is in North America, a hand is part of an arm, …)
(4) cross-language words, i.e. words that sneak in from another language, e.g. Touché
(5) word associations that change the meaning, e.g. “table tennis” has nothing to do with food
(6) and many more rules
Building and, more importantly, maintaining these language rules is a very laborious task that overwhelms even the most sophisticated search teams. It becomes significantly more complicated if you have a multi-language audience (e.g. English, Spanish, and Chinese): you now have to update these rules for every language separately, and refine them further to support mixed-language queries, for example: “should the Le hors-d’œuvre portion of a meal include fromage?”
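To make that maintenance burden concrete, here is a minimal sketch of rule-based keyword matching in Python. The `STEMS` and `SYNONYMS` tables and the `keyword_match` function are hypothetical names invented for this illustration, not part of any real search engine; a production system would carry thousands of such rules per language.

```python
import re

# Hypothetical hand-maintained rule tables (illustrative only).
STEMS = {"cars": "car", "automobiles": "automobile"}
SYNONYMS = {
    "car": {"car", "automobile"},
    "automobile": {"car", "automobile"},
}


def tokenize(text):
    """Lowercase, split into words, and apply the stemming rules."""
    words = re.findall(r"[a-z]+", text.lower())
    return [STEMS.get(word, word) for word in words]


def keyword_match(query, document):
    """True if every query term, or one of its listed synonyms, is in the document."""
    doc_terms = set(tokenize(document))
    return all(SYNONYMS.get(term, {term}) & doc_terms for term in tokenize(query))


doc = "Used automobile for sale"
print(keyword_match("cars", doc))     # matched through a stemming rule plus a synonym rule
print(keyword_match("vehicle", doc))  # no rule covers "vehicle", so the document is missed
```

The second query fails only because nobody wrote a rule linking “vehicle” to “automobile”. Every such gap has to be found and patched by hand, and again for each additional language.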
To summarize the current state of affairs: we tried to make keyword search fake semantic understanding, which worked to a degree but is extremely difficult to build and maintain, leaving mastery of this capability to only the most well-funded search organizations.
Enter the 2010s, which brought a number of advancements in natural language understanding via neural networks and large language models (LLMs). These led to computers being able to comprehend human language at the adult level, and in some domains at the graduate level. LLM-powered search, or Neural Search, is truly the proper way to “Search by Meaning”, as compared to legacy keyword search techniques, which could hardly fake it.
There are three seminal innovations that took place, which then led to a cascading effect of exponentially accelerating advancements:
- Word Vectors: This innovation enabled neural networks that map our keywords from human language space to a meaning/concept multi-dimensional space where words with similar meaning end up close to each other. For example: King and Queen will be close in that space; they are not equal, but the difference between them is just gender, so they are very similar.
- Transformers: This innovation enabled feed-forward neural networks to efficiently take into account the order/sequencing of words and the impact of that on meaning. For example “Man kills bull” has a very different meaning than “Bull kills man”. The attention mechanism in transformers also allows the system to look at longer-range dependencies: for example, looking at words in a previous sentence to determine who “they” is referring to.
- BERT: This technique enabled neural networks to take into account the pragmatics of our language, and not just the vocabulary and grammar. For example: “Felicia cooks eggs at the supermarket” is grammatically correct, i.e. proper sequence of word tenses, but it isn’t pragmatically correct (Felicia is most likely to cook eggs in the kitchen, and buy them at the supermarket). A key breakthrough of BERT was to achieve this by pre-training with an unsupervised language task, then subsequently fine-tuning on small amounts of task-specific data. Because the model does most of its training on the unsupervised task, which presents a nearly unlimited amount of training data, the size of the model is no longer constrained by the amount of task-specific data available. This set the foundation for the rise of Large Language Models (LLMs), of which BERT was one of the first.
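The word-vector idea from the first item above can be sketched with a few lines of Python. The three-dimensional vectors below are invented by hand purely for illustration (the axes loosely stand for royalty, gender, and food); real word vectors are learned from text and have hundreds of dimensions. Closeness is measured here with cosine similarity, one common choice.

```python
import math

# Toy 3-dimensional vectors invented for illustration.
# Axes, loosely: royalty, gender, food.
VECTORS = {
    "king": [0.9, 0.3, 0.1],
    "queen": [0.9, -0.3, 0.1],
    "paella": [0.0, 0.0, 1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# "king" and "queen" point in nearly the same direction; "paella" does not.
print(cosine_similarity(VECTORS["king"], VECTORS["queen"]))
print(cosine_similarity(VECTORS["king"], VECTORS["paella"]))

# The two royal words differ only along the (toy) gender axis.
difference = [k - q for k, q in zip(VECTORS["king"], VECTORS["queen"])]
print(difference)
```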
The result of these rapid advancements was that by the end of 2018, neural networks became capable of comprehending the meaning of human language at the adult level. This was demonstrated by BERT beating state-of-the-art accuracy on a number of Natural Language Processing (NLP) benchmarks, including the Stanford Question Answering Dataset (SQuAD).
So now imagine you have at your disposal a graduate student whom you can ask any sophisticated question, for example “how does James Webb work?”, and not only do they understand that you mean NASA’s new telescope, but they then go and read all the information about this topic and give you back a few paragraphs that truly answer the key essence of your inquiry! That is what we have today. Neural Search can read your accessible knowledge, map all of that knowledge from language space into a meaning space, and then, when you perform a search, return the most relevant knowledge snippets to you.
Did you know that there are more than 7,000 human languages in the world? Human knowledge, our history, and our culture are being captured across all of these languages (at different scales). But today, when we search with legacy keyword techniques, we only get back information written in the same language as the keywords. In the new world of Neural Search, all of these words, regardless of language, are converted from “word space” into a shared “meaning space” (a lingua franca space, or more appropriately a significatio franca space). This means that regardless of the language your query is in, or the language(s) the target knowledge is in, the Neural Search engine can return the most relevant answer to your question. For example, if I search in English for “How to cook the perfect paella?”, the perfect recipe will most likely come from a Spanish source, since Spain is where paella was invented. As another example, if you are a global phone manufacturer and one of your technicians in South Korea documented the perfect solution to a customer issue in Korean, the corresponding technician in Brazil who searches for it in Portuguese will still find that perfect answer.
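The flow described above (map documents into a meaning space, map the query into the same space, return the closest documents) can be sketched as follows. This is a toy under loud assumptions: the `WORD_VECTORS` table is hand-written, with three invented concept axes (cooking, rice dish, phone repair), and the `embed` and `search` helpers are hypothetical names. A real Neural Search engine would obtain its vectors from a trained multilingual language model, not a lookup table.

```python
import math
import re

# Hypothetical toy "meaning space": words from different languages that
# share a concept get the same (or a nearby) hand-written vector.
WORD_VECTORS = {
    "cook": [1.0, 0.0, 0.0],       # English
    "cocinar": [1.0, 0.0, 0.0],    # Spanish
    "paella": [0.2, 1.0, 0.0],     # same word in both
    "arroz": [0.1, 0.9, 0.0],      # Spanish for rice
    "phone": [0.0, 0.0, 1.0],      # English
    "telefone": [0.0, 0.0, 1.0],   # Portuguese
    "repair": [0.0, 0.0, 1.0],     # English
    "consertar": [0.0, 0.0, 1.0],  # Portuguese
}


def embed(text):
    """Sum the vectors of known words, then scale the result to unit length."""
    total = [0.0, 0.0, 0.0]
    for word in re.findall(r"\w+", text.lower()):
        vector = WORD_VECTORS.get(word)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm else total


def search(query, index, k=1):
    """Return the k (score, document) pairs closest to the query in meaning space."""
    query_vector = embed(query)
    scored = [
        (sum(q * d for q, d in zip(query_vector, doc_vector)), doc)
        for doc, doc_vector in index
    ]
    return sorted(scored, reverse=True)[:k]


documents = [
    "Cómo cocinar paella con arroz",      # Spanish
    "Como consertar a tela do telefone",  # Portuguese
]
index = [(doc, embed(doc)) for doc in documents]

print(search("How to cook the perfect paella?", index))
print(search("phone repair", index))
```

Note that neither English query shares a keyword rule with its best match; the Spanish and Portuguese documents are found only because their words land near the query in the shared space.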
In conclusion, Neural Search is the most effective way to do Semantic Search. We tried to fake Semantic Search via laborious hacks on top of Keyword Search, and that clearly was both suboptimal and expensive to achieve. Neural Search allows us to effectively search by the true essence, the true meaning, of the concepts we seek rather than by the keywords used to describe those concepts. Neural Search achieves significantly better search results, removes language as a barrier to seeking knowledge, and significantly reduces the costs of building and maintaining the next generation of search systems. Our mission at Vectara is to help the world find meaning. We started by enabling that through plug-n-play Neural Search, and we will be extending that to many other semantic capabilities in the future.