Introducing Boomerang – Vectara’s New and Improved Retrieval Model
TL;DR In this blog post we cover the importance of retrieval models for search and generative AI use cases, and we announce the release of our new state-of-the-art multilingual retrieval model, named Boomerang, which is available for use in the Vectara platform. We compare Boomerang against several other embedding APIs and models and demonstrate its high-quality retrieval and generalization capabilities. Finally, we discuss real-life quality improvements seen by a Vectara design partner for their use case.
*** Update: This article is highly technical. If you want an overview of Boomerang, check out the complementary benefits article, “How Vectara’s New Boomerang Model Takes Retrieval Augmented Generation to the Next Level via Grounded Generation.” ***
Introduction
Large Language Models (LLMs) have become synonymous with generative models, and a menagerie of them have been released at breakneck speed by a range of companies, including Meta, OpenAI, and Google. However, there is another, oftentimes overlooked, category of Natural Language Processing (NLP) models that is important for a variety of use cases – retrieval models.
Retrieval Models
Retrieval models are a critical component of neural search (also known as semantic search or vector search). These systems can find content based on meaning, as opposed to traditional keyword systems, which merely perform string (also called “lexical”) matching.
Generative Model Limitations
Two of the most notable weaknesses of current generative models are:
- Hallucinations, where the model produces output that is not supported by its training data or is factually inconsistent.
- Their inability, without retraining or fine-tuning, to ground their output in your data.
Grounded Generation (also known as Retrieval-Augmented-Generation) provides a promising method to address these limitations. First, a retrieval model fetches a subset of data that is relevant to the input prompt or user query. This data comes from a potentially huge corpus. Next, the data is added as additional context for the generative model, which then generates a response that we say is grounded in this data. The better the quality of the retrieval model, the more relevant the input to the generative model, which means a better final output for the user.
Fine-Tuning vs. Grounded Generation (aka RAG)
Fine-tuning is another technique that is sometimes recommended for updating an LLM with custom data or enabling it to learn about new facts. In this approach, an existing LLM is further trained on the set of data you want it to learn about.
Compared with fine-tuning, RAG has some distinct advantages:
- Fine-tuning requires both setting up a training pipeline and using an accelerator (such as a GPU) with enough processing power to fine-tune a large language model. On the other hand, RAG avoids the need for expensive compute by not modifying the generative LLM.
- Introducing new data to an LLM via fine-tuning is orders of magnitude slower than with RAG-based systems. Whereas fine-tuning takes minutes or even hours to complete, new data can be added to RAG systems with sub-second latency. This distinction is an important consideration in systems dealing with a regular influx of new information.
- Fine-tuning does not have a mechanism for access-control list (ACL) restrictions: any and all data you add to the model is potentially available as output for any user of your LLM. On the other hand, RAG-based systems allow ACLs to be attached directly to documents as metadata, effectively restricting what data is available for generation based on a user’s role.
Simply fine-tuning a model on new data doesn’t stop hallucinations, especially for data that the model has “preconceived notions” about. For example, in this article about fine-tuning a generative model, the author demonstrates that “fine-tuning is for form, and RAG is for knowledge”: using a modified copy of Shakespeare’s famous play, Romeo and Juliet, where all instances of Romeo’s name have been replaced by “Bob,” fine-tuning fails to teach the model that Juliet’s lover is, in fact, Bob, and not Romeo. And yet, without the complexity of model training, the same copy of “Bob and Juliet” uploaded into Vectara’s RAG system (<2 minutes of work!) correctly identifies Bob as Juliet’s lover.
Introducing Boomerang
This brings us to Boomerang, Vectara’s new multilingual retrieval model, which can embed text in hundreds of languages and has strong generalization capabilities. (Most retrieval models are also embedding models, so we’ll use the terms interchangeably.) The quality of results from a RAG pipeline is highly dependent on having a performant embedding model, so it’s important to assess this quantitatively. We benchmarked Boomerang on several public datasets and summarize below how it compares with several commercial and open-source embedding models, including Cohere (multilingual-v2.0), OpenAI Ada (ada-002), and GTR-XXL.
English Language Performance
We begin by examining performance on English datasets, starting with the popular BEIR benchmark. As shown in Figure 1, we find that while the performance of all three models is generally in the same ballpark, OpenAI Ada tends to produce the strongest results on average, likely due to its much larger embedding size (1,536 dimensions for OpenAI vs. 768 for Cohere and Boomerang). There are also domain-specific differences, such as trec-covid, where Boomerang produces stronger performance, and fiqa, where its performance is notably weaker.
We also compare Boomerang against widely used, publicly available models, shown in Figure 2.
This evaluation also shows a varied performance profile, with Boomerang trading places with ME5 base and GTR-XXL. Note that while Boomerang is optimized for low-latency performance, models like GTR-XXL, which weighs in at 4.8 billion parameters, are very challenging to productionize.
It’s also important to note that BEIR has, since its release, become a very popular IR benchmark, and performance on its datasets has steadily improved. Unfortunately, part of this improvement is the result of overfitting, an issue we have tried hard to avoid in the development of Boomerang by excluding the training splits of the BEIR datasets we report results on.
To better gauge the true performance of these models on a broader set of domains, we also evaluated on another multi-domain benchmark with datasets in shopping, news, Wikipedia, and social forums (Figure 3). Note: We are withholding the name of the dataset to retain its value as a reliable proxy for zero-shot performance.
In contrast to the BEIR datasets, this benchmark paints Boomerang in a more competitive light: it achieves performance as good as or better than both Cohere and OpenAI Ada in all domains except Wiki. This shift in performance suggests that, due to overfitting, BEIR may have lost some of its potency as a reliable measure of zero-shot performance.
We see a similar trend when comparing strong open-source models in Figure 4: Boomerang provides significantly better performance than most models, including the GTR-XXL juggernaut. The exception is ME5, which is exceptionally competitive and points to the rapid pace of innovation in the field of neural information retrieval.
Multilingual and Cross-lingual Performance
This section extends Boomerang’s performance evaluation to multilingual and cross-lingual settings. We start by indexing data in all 11 languages from the XQuAD-R evaluation set (English, Spanish, Arabic, German, Greek, Hindi, Russian, Thai, Turkish, Vietnamese, and Chinese). The relevant answers are then retrieved using queries from each of these 11 languages. Finally, the Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) numbers are reported, aggregated across all the languages.
In Figure 5, the MRR comparison reveals that Boomerang’s performance is shoulder-to-shoulder with the paid models from Cohere and OpenAI. However, MAP paints a very different picture.
The differences can be attributed to the fact that while MRR considers the rank of the first, and presumably only, correct response, MAP is a more appropriate metric in this case because it accounts for the ranks of all relevant results, which are spread across eleven different languages. What the discrepancy in MAP scores indicates is that while all models are equally good at returning some relevant answer high in the results list, only Boomerang, and to a lesser extent, Cohere, are good at returning all correct answers high in the results list. This is a critical feature in any situation where the cost of missing relevant information is high. In the words of the US legal system, we want not just “the truth,” but “the whole truth.”
Next, we present the performance of Boomerang against a number of competitive open-source models (Figure 6), and we observe that the same conclusions hold. In fact, this language bias (ranking answers in the query’s own language well while missing relevant answers in other languages) appears even more pronounced in the open-source models.
To achieve a more comprehensive analysis of Boomerang’s performance, we also report results on a mixture of high- and low-resource languages from the MIRACL benchmark (Figure 7). The MIRACL benchmark is also much larger, which gives us a better estimate of performance on a large corpus.
For high-resource languages such as Chinese and Japanese, we observe that Boomerang performs on par with Cohere, with Arabic being the exception. It also consistently outperforms OpenAI’s ada-002. For low-resource languages such as Swahili and Bengali, we observe that although Boomerang performs ever so slightly below Cohere in aggregate, the difference is negligible in most cases.
Additionally, when comparing Boomerang’s performance with open-source models on the MIRACL benchmark (Figure 8), we observe that ME5 is the only model that consistently outperforms Boomerang.
While ME5’s performance is undoubtedly impressive, it’s important to note that Boomerang’s numbers on all the reported benchmarks are zero-shot, meaning the model is not trained on any split of any of the datasets used for evaluation. Furthermore, the training regime adopted for Boomerang balances multilingual and cross-lingual performance to mitigate the language bias issue discussed earlier. The importance of maintaining this balance is further emphasized by the case study described below.
Design Partner Case Study
So far, we’ve described the performance of Boomerang on academic datasets. However, improving Vectara’s performance on academic benchmarks isn’t the end goal, or even the main goal. Instead, we want the gains to translate into improved retrieval performance for our customers’ use cases. For verification, we teamed up with one of our design partners to further evaluate Boomerang on their workload.
Vectara, under no circumstances, trains publicly available models like Boomerang on customers’ data. The partner evaluation therefore also represents a zero-shot evaluation; in this case, it includes a mix of nearly 2,000 multilingual queries, as well as documents in both English and Arabic.
Compared to Vectara’s legacy retrieval model, the new Boomerang model provides massive gains on the partner’s key metrics (Figure 9), including a 54% relative improvement in Precision@1 and a 39% relative improvement in Recall@20.
As part of our commitment to providing trusted AI capabilities, we always perform extensive testing both internally and with design partners prior to the release of new models. This increases our confidence that the model will perform well across the wide range of domains in which our customers deploy Vectara.
Want To Try Out Boomerang?
We’ve rolled out Boomerang to all new and existing accounts. If you’re a new user of the Vectara platform, there’s nothing you need to do to take advantage of Boomerang: just create a new corpus and it will automatically use Boomerang. If you have an existing account, we’re continuing to maintain our legacy encoder for the time being, so you’ll need to manually select Boomerang as the encoder when creating a new corpus. With Vectara, we handle the vector database, indexing, embedding, and hosting end-to-end, so all you need to do is upload your data and start using it!
Acknowledgments
Contributions to the development of Boomerang were made by Suleman Kazi (datasets, data training pipelines, model architecture, and modeling experiments), Vivek Sourabh (training pipeline, modeling experiments, model optimization and deployments), Amin Ahmad (datasets, baseline models, technical guidance and advice), Jack Lin (modeling experiments, training pipelines, model architecture), and Adel Elmahdy (multilingual datasets, training pipelines, modeling experiments, multilingual and cross-lingual retrieval evaluation).