Stopwords have been used in keyword search systems for decades. In semantic search contexts, however, they are unreliable sources of information on their own, and treating them as noise often diminishes both search relevance and system performance.
December 08, 2022 by Shane Connelly
Intro: What are stopwords?
Stopwords (or sometimes “stop words”) are a staple in search systems dating back decades: they are words considered to provide little value to the search or to the document. In English, these are usually words like “the,” “a,” “of,” and so on: in practice, it is the most common words in a language that are most often added as stopwords.
Removing these words from search engines is often justified on the grounds that, in information-theoretic terms, they generally carry little information on their own. As an anecdotal example, compare “I had a cup of coffee and a muffin” with “I had cup coffee muffin”: the second obviously isn’t grammatically correct, but it mostly gets the point across. Examples like this are sometimes used to argue that stopwords carry little information and can be safely removed.
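To make the idea concrete, here is a minimal sketch of stopword removal (the stopword list is illustrative, not taken from any particular engine):

```python
# Toy stopword list for illustration only.
STOPWORDS = {"a", "an", "of", "and", "the"}

def remove_stopwords(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

print(remove_stopwords("I had a cup of coffee and a muffin"))
# → ['i', 'had', 'cup', 'coffee', 'muffin']
```

Real tokenizers handle punctuation, casing, and language-specific rules, but the filtering step itself is this simple.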
Why do many search engines encourage stopwords?
Search engines have relied on stopwords for a variety of reasons, but the most common one has been saving resources. In English, the top 10 most common words make up about 25% of the words in most text, and the top 100 make up about 50%; many other languages have a similar distribution. The original thinking behind introducing stopwords was that if you removed the top 100 most common words from your corpus when indexing content, you might cut the on-disk size of the index by about 50%.
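These rules of thumb are consistent with Zipf’s law, under which a word’s frequency is roughly proportional to 1/rank. A quick sketch (the vocabulary size here is an assumption for the demo, not a measured figure):

```python
# Zipf's law: frequency of the r-th most common word ∝ 1/r.
V = 50_000  # assumed vocabulary size (illustrative)
weights = [1 / r for r in range(1, V + 1)]
total = sum(weights)

top10 = sum(weights[:10]) / total
top100 = sum(weights[:100]) / total
print(f"top 10 words: {top10:.0%} of all text")    # roughly a quarter
print(f"top 100 words: {top100:.0%} of all text")  # roughly half
```

The exact percentages shift with the assumed vocabulary size, but the shape of the distribution (a few words dominating the text) holds across corpora.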
For most systems today, the cost of disk has dropped so dramatically since search engines became popular (and compression has improved so much) that the disk footprint of text search is negligible relative to overall system cost, which is dominated by RAM and compute. So disk savings are no longer a good reason for most systems to use stopwords.
So if disk costs aren’t driving stopword usage, why do search engines still use them?
Caught holding the bag
Keyword systems tend to implement roughly a “bag of words” approach. A document containing our example “I had a cup of coffee and a muffin” would typically be indexed by counting how many times each word shows up: “a” appears 2 times, “coffee” appears once, and so on. Keyword search systems also track how many documents across the entire corpus contain each word; the inverse of this document frequency gives the “Inverse Document Frequency,” or IDF, one half of the famous TF/IDF model. And while newer approaches like BM25 have come after TF/IDF, many of them (including BM25) still use IDF in the final score calculation.
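A toy sketch of this bag-of-words bookkeeping (not any particular engine’s implementation; the classic log(N/DF) form of IDF is used here, while real systems typically use smoothed variants):

```python
import math
from collections import Counter

# Toy corpus; tokenization is just lowercase + whitespace split.
docs = [
    "I had a cup of coffee and a muffin",
    "a muffin recipe",
    "the coffee shop had a line",
]
tokenized = [d.lower().split() for d in docs]

# Term frequency (TF): how often each word appears within one document.
tf = [Counter(toks) for toks in tokenized]
print(tf[0]["a"], tf[0]["coffee"])  # → 2 1

# Document frequency (DF): how many documents contain each word at all.
df = Counter()
for toks in tokenized:
    df.update(set(toks))

# Classic IDF = log(N / DF): ubiquitous words score ~0, rare words score high.
N = len(docs)
idf = {w: math.log(N / df[w]) for w in df}
print(round(idf["a"], 3), round(idf["recipe"], 3))  # → 0.0 1.099
```

Note that “a,” appearing in every document, earns an IDF of zero: by this measure it contributes nothing to ranking, which is exactly the intuition behind treating it as a stopword.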
And here’s where keyword systems pay a performance penalty on the query side, regardless of whether they use BM25 or TF/IDF: if a user searches for “a coffee,” the system needs to consider every document that contains either “a” or “coffee” before calculating document scores. While only a small number of documents may contain “coffee,” the vast majority probably contain at least one occurrence of “a,” so you may need to calculate final scores for a huge number of documents even though only a few are really relevant to the important terms. That can be a significant performance burden, both in disk reads and in CPU cost.
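The cost is easy to see with a toy inverted index (illustrative corpus, made up for this sketch):

```python
from collections import defaultdict

# Toy corpus and inverted index: each term maps to the set of documents
# containing it (its "posting list").
docs = {
    1: "a cup of coffee",
    2: "a muffin",
    3: "a long walk",
    4: "a quiet morning",
}
postings = defaultdict(set)
for doc_id, text in docs.items():
    for tok in text.lower().split():
        postings[tok].add(doc_id)

# An OR query must consider the union of its terms' posting lists.
candidates = postings["a"] | postings["coffee"]
print(len(postings["coffee"]))  # → 1: only one document mentions "coffee"
print(len(candidates))          # → 4: "a" drags in every document
```

One document is actually about coffee, but the presence of “a” in the query forces the engine to visit all four.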
More modern approaches to keyword search address this by, in effect, detecting stopwords dynamically at query time and cutting off the scoring of terms that are too saturated relative to the other terms in the query. For example, algorithms like Block-Max WAND are sometimes added to keyword search systems to improve performance by simply skipping past the scoring of some (“noncompetitive”) documents.
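A highly simplified sketch of the core idea (real Block-Max WAND operates over blocks of compressed postings and is considerably more involved; the postings and scores below are made up for illustration): each term carries an upper bound on its possible score contribution, and a document is only fully scored if the sum of bounds for its matching terms could beat the current k-th best score.

```python
import heapq

# term -> {doc_id: precomputed score contribution} (illustrative values)
postings = {
    "a":      {1: 1, 2: 1, 3: 1, 4: 1},  # ubiquitous, low score everywhere
    "coffee": {1: 20, 4: 18},            # rare, high score
}
# Upper bound on each term's possible contribution to any document's score.
upper = {t: max(p.values()) for t, p in postings.items()}

k = 2          # we only want the top-2 documents
heap = []      # min-heap holding the best k scores seen so far
scored = 0     # how many documents we fully scored

for doc in sorted({d for p in postings.values() for d in p}):
    terms = [t for t in postings if doc in postings[t]]
    bound = sum(upper[t] for t in terms)
    threshold = heap[0] if len(heap) == k else float("-inf")
    if bound <= threshold:
        continue  # provably noncompetitive: skip scoring entirely
    scored += 1
    score = sum(postings[t][doc] for t in terms)
    heapq.heappush(heap, score)
    if len(heap) > k:
        heapq.heappop(heap)

print(scored, sorted(heap, reverse=True))  # document 3 is never scored
```

Here document 3 matches only the saturated term “a,” whose upper bound cannot beat the running threshold, so its score is never computed: the term has effectively been treated as a stopword at query time.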
When is a stopword not a stopword?
One of the oft-ignored problems is that words added as stopwords sometimes do carry information: it is just that some specific context is driving that information. For example, consider “Las Vegas,” “The Museum of History,” “El Chapo,” and “Johnson and Johnson.” The words “The,” “Las,” “of,” “El,” and “and” all carry significant information here, even though each would often be treated as a stopword in one language or another. Similarly, a word that typically carries little meaning in one language can be highly meaningful in another.
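A blanket filter makes the damage obvious (toy stopword list for the demo, mixing common English and Spanish entries):

```python
# Toy multilingual stopword list: "el"/"las" appear in Spanish lists,
# "the"/"of"/"and" in English ones.
STOPWORDS = {"the", "of", "and", "el", "las"}

def remove_stopwords(text: str) -> list[str]:
    return [t for t in text.lower().split() if t not in STOPWORDS]

print(remove_stopwords("Las Vegas"))              # → ['vegas']
print(remove_stopwords("El Chapo"))               # → ['chapo']
print(remove_stopwords("Johnson and Johnson"))    # → ['johnson', 'johnson']
print(remove_stopwords("The Museum of History"))  # → ['museum', 'history']
```

In each case the filter strips exactly the token that distinguishes the proper name, leaving a query that matches far more loosely than the user intended.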
These are occasions where whether a word provides little information is contextual in nature: you need to understand the language(s) being used, whether a proper name is involved, whether there is a borrowed word or an acronym, and so on. In other words: without semantic context, stopwords are unreliable as standalone words. Understanding and acting on this is where pure neural retrieval systems like Vectara really shine.
How LLM-powered search makes the experience better
Neural retrieval systems like Vectara can fully understand the context of a word or phrase. They can recognize when a capitalized word is an acronym, when a proper name is being used, or when the language has changed, and provide contextualized relevance scoring. An important caveat, though, is that the search system needs to use neural retrieval throughout all query steps, not just for reranking: if the initial retrieval is keyword-only, relevant results may already have been filtered out before the reranker ever sees them.
Let’s stop stopping, and let’s get going.