Deep Dive Into Vectara Multilingual Reranker v1, State-of-the-Art Reranker Across 100+ Languages

This blog post focuses on the technical details of the latest iteration of our Multilingual Reranker v1, which enables state-of-the-art retrieval performance across 100+ languages. The model delivers quality on par with industry leaders like Cohere while surpassing the best open-source models, and it offers blazing-fast inference at minimal cost, along with the ability to reject irrelevant text using score-based cut-offs.

Table of Contents

  1. Introduction
  2. Cross Encoders
  3. Vectara Multilingual Reranker v1
    1. Performance improvements over Boomerang
    2. Benchmarking against Open Source and Enterprise models
    3. Design Partner Case Study
    4. Special Features

Introduction

  • What is RAG?
  • Why is retrieval important to RAG?
  • How does reranking improve retrieval?

Retrieval Augmented Generation (RAG) is a method for applying generative AI to organizational data. It uses a retrieval system to select relevant information, which is then analyzed by a large language model (LLM) to generate the final answer. Unlike approaches that rely on fine-tuning LLMs, RAG enables the rapid integration of new information sources into LLM-based applications, often within milliseconds. A recent Microsoft study found that pairing LLMs with a retrieval system boosted question-answering performance by 5% in the agricultural domain. Given the limited focus on agriculture in AI research, this result underscores RAG’s potential to enhance AI system performance across diverse fields.

Prior evidence indicates that the performance of RAG systems is heavily influenced by the performance of the retrieval system. A 2024 study evaluated RAG systems by switching between different retrieval models and found that overall performance is directly proportional to the retrieval model’s effectiveness. Thus, maximizing the performance of RAG systems critically depends on optimizing the retrieval component.

The purpose of a retrieval system is to find relevant documents and rank them according to their relevance to the user query. This task typically utilizes embedding models, which represent text as vectors mapped in a multi-dimensional space. Texts with similar meanings are positioned closer together in this space. Retrieval using embedding models involves vector databases, and the ranking is based on cosine or dot-product similarity between the query and document vectors. While a good embedding model can ensure solid retrieval performance, incorporating re-rankers can further enhance this performance. Re-rankers assign new relevance scores to the top results retrieved by the embedding model, creating a more accurate ranking of the documents.
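
To make this concrete, here is a minimal sketch of embedding-based retrieval with cosine similarity. It uses a public open-source embedding model purely for illustration; it is not Vectara's Boomerang, and the Vectara platform performs this step server-side against a vector index rather than with a brute-force loop.

```python
# A minimal sketch of first-stage retrieval with an embedding model.
# The model name is a public example, not Vectara's Boomerang.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

documents = [
    "Retrieval Augmented Generation grounds LLM answers in your own data.",
    "Rerankers rescore the top candidates returned by the embedding model.",
    "The capital of France is Paris.",
]
query = "How do rerankers improve RAG retrieval?"

doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity.
scores = doc_vecs @ query_vec
ranking = np.argsort(-scores)
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {documents[idx]}")
```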

Cross Encoders

  • What are cross encoders and how are they used to improve search performance?

To facilitate reranking, we use a class of models known as cross-encoders. Architecturally similar to BERT, cross-encoders derive their name from their ability to compute token-to-token attention between, or across, a query and a document. The role of a cross-encoder is to assign a relevance score to a document given a specific query. This score is typically the logit value of the [CLS] token, a special token added to the start of each query-document pair to capture sentence-level semantic information.

[Figure: Cross-encoder architecture¹]

1. Image from https://arxiv.org/abs/2004.12832
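
As a concrete illustration of cross-encoder scoring, the sketch below uses a small public English-only cross-encoder from the sentence-transformers library. It is not Vectara's reranker, just an example of the same model class.

```python
# A minimal sketch of cross-encoder scoring with a public open-source model
# (not Vectara's reranker).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the capital of France?"
candidates = [
    "Paris is the capital and most populous city of France.",
    "Berlin is the capital of Germany.",
]

# Each (query, document) pair is scored jointly by the model; the score
# comes from the classification head over the [CLS] representation.
scores = reranker.predict([(query, doc) for doc in candidates])
print(scores)  # higher score = more relevant
```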

For a given set of documents, a cross-encoder model scores each document individually against the query. As a result, the runtime cost for reranking increases linearly with the number of candidate documents. Due to the high computational costs, reranking is typically applied as a precision-improving step to the top candidate documents retrieved by a retrieval or embedding model.

In summary, the goal of a retrieval pipeline is, first, to find relevant documents (recall) and then to arrange them in the best order (precision). A state-of-the-art pipeline typically uses an embedding model to retrieve most of the relevant documents, enhancing recall, and then applies a re-ranker to achieve the best possible ranking, enhancing precision.
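
Putting the two stages together, here is a sketch of a retrieve-then-rerank pipeline under the same assumptions as above: public open-source models standing in for Boomerang and Vectara's reranker. On the Vectara platform, both stages run server-side behind a single query API.

```python
# A sketch of a two-stage retrieve-then-rerank pipeline, reusing the
# illustrative open-source models from the previous snippets.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_then_rerank(query, corpus, k_retrieve=50, k_final=10):
    # Stage 1 (recall): embed everything and keep the top-k_retrieve
    # candidates by cosine similarity.
    doc_vecs = embedder.encode(corpus, normalize_embeddings=True)
    q_vec = embedder.encode(query, normalize_embeddings=True)
    top = np.argsort(-(doc_vecs @ q_vec))[:k_retrieve]

    # Stage 2 (precision): rescore only those candidates with the cross-encoder.
    scores = reranker.predict([(query, corpus[int(i)]) for i in top])
    order = np.argsort(-scores)[:k_final]
    return [(corpus[int(top[i])], float(scores[i])) for i in order]
```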

If you’d like to learn more about this topic, be sure to check out Vectara’s “Deep Dive into RAG Architectures” on YouTube.

Introducing Vectara’s Multilingual Reranker v1

We are excited to introduce our latest model, Vectara Multilingual Reranker v1, which supports over 100 languages in both multilingual and cross-lingual settings². Trained on diverse datasets across multiple languages and domains, Vectara Multilingual Reranker v1 ensures impressive zero-shot performance on unseen data and domains. Importantly, Vectara never trains its models on customer data. To fully assess the capabilities of this new model, we report the performance improvements it provides over our state-of-the-art embedding model, Boomerang, and compare it to other open-source and enterprise rerankers.

2. A multilingual model is one that supports multiple languages but is tuned for the scenario in which both query and document are in the same language. A cross-lingual model, on the other hand, can perform equally well even when the query and retrieved document are in different languages.

Combining Vectara’s Multilingual Reranker with Boomerang

In the following experiments, we index datasets into the Vectara platform and retrieve the top 50 results for each query. The metrics for Boomerang are computed on the retrieved list as-is, while the metrics for Vectara Multilingual Reranker v1 are computed after reranking that list.
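
For readers who want to reproduce this style of evaluation, here is a minimal NDCG@10 sketch with made-up relevance labels. A full evaluation would compute the ideal DCG from the complete relevance judgments for each query rather than from the retrieved list alone.

```python
# A minimal NDCG@10 sketch illustrating the evaluation protocol: score the
# top-50 list as retrieved, then again after reranking. The graded
# relevance labels below are hypothetical.
import numpy as np

def dcg_at_k(relevances, k=10):
    rel = np.asarray(relevances, dtype=float)[:k]
    return float(np.sum((2 ** rel - 1) / np.log2(np.arange(2, rel.size + 2))))

def ndcg_at_k(relevances, k=10):
    # Simplification: the ideal DCG is taken from the supplied list rather
    # than from the full set of relevance judgments.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of the top-50 results, in ranked order (toy data).
retrieved_order = [0, 2, 0, 1, 0, 0, 2, 0, 0, 0] + [0] * 40
reranked_order  = [2, 2, 1, 0, 0, 0, 0, 0, 0, 0] + [0] * 40

print("NDCG@10 retrieval only:", round(ndcg_at_k(retrieved_order), 4))
print("NDCG@10 after reranking:", round(ndcg_at_k(reranked_order), 4))
```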

English language performance

Across a range of English-only domains, we observe that using the reranker significantly boosts NDCG@10 in every domain.

[Figure: NDCG@10 across English-only domains, Boomerang alone vs. with Vectara Multilingual Reranker v1]

Expanding our analysis to include test sets from BEIR, we find that the Vectara Multilingual Reranker improves performance across all datasets with the exception of ArguAna. Since the objective of ArguAna is to retrieve documents that present counter-arguments to a query, as opposed to supporting documents, we hypothesize that the low performance is due to the direct conflict with the training objective of the reranker model, which ranks similar documents the highest. In fact, further experiments carried out with other reranker models show that they all suffer from poor performance on the ArguAna dataset. 

[Figure: NDCG@10 on BEIR test sets, Boomerang alone vs. with Vectara Multilingual Reranker v1]

Multilingual Performance

Next, to test multilingual and cross-lingual performance, we begin by using the popular benchmark MIRACL. Here, the improvement across both high and low-resource languages is even greater than for English.

[Figure: NDCG@10 on MIRACL, Boomerang alone vs. with Vectara Multilingual Reranker v1]

While these results provide high confidence in the model’s ability to work across a range of languages, MIRACL is a multilingual, but not cross-lingual, evaluation dataset. In other words, queries and documents are always in the same language.

XQuad-R, on the other hand, presents a parallel corpus across eleven languages and is suitable for cross-lingual evaluation. We construct a test by creating a common corpus that combines answers from all languages into a single candidate pool, and then querying language-by-language, measuring NDCG@10. As we would hope, regardless of the query language, Vectara’s Multilingual Reranker provides a significant boost in search relevance.
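
As an illustration of that protocol, the sketch below pools answers from every language into one candidate set and then computes NDCG@10 per query language. It reuses the hypothetical helpers from the earlier sketches; the actual evaluation runs against the Vectara platform, not this toy harness.

```python
# A sketch of the pooled cross-lingual setup described above, reusing
# retrieve_then_rerank and ndcg_at_k from the earlier sketches. The data
# structures are hypothetical stand-ins for XQuad-R's parallel corpus.
def build_pooled_corpus(answers_by_language):
    """answers_by_language: e.g. {"en": [...], "de": [...], "zh": [...]}."""
    pooled = []
    for answers in answers_by_language.values():
        pooled.extend(answers)
    return pooled

def ndcg_by_query_language(queries_by_language, pooled_corpus, k=10):
    """queries_by_language: {"en": [(query, set_of_relevant_answers), ...], ...}."""
    results = {}
    for lang, examples in queries_by_language.items():
        per_query = []
        for query, relevant in examples:
            ranked = retrieve_then_rerank(query, pooled_corpus, k_final=k)
            relevances = [1 if doc in relevant else 0 for doc, _ in ranked]
            per_query.append(ndcg_at_k(relevances, k=k))
        results[lang] = sum(per_query) / len(per_query)
    return results
```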

[Figure: NDCG@10 on XQuad-R by query language, Boomerang alone vs. with Vectara Multilingual Reranker v1]

Comparing Vectara’s Multilingual Reranker with Open Source and Commercial Rerankers

While the preceding analysis quantifies the performance gain that Vectara customers can expect to see on their workloads, it’s also interesting to compare the Multilingual Reranker directly against popular open source and commercial offerings. Below, we focus on the following cross-attentional rerankers:

  1. rerank-multilingual-v3 from Cohere: Available through Cohere’s Platform APIs. The parameter count is unknown.
  2. baai/bge-reranker-base: A 278M parameter model from the Beijing Academy of Artificial Intelligence available on HuggingFace.
  3. maidalun1020/bce-reranker-base: A 278M parameter model from the WeChat group, also available on HuggingFace.
  4. unicamp-dl/mt5-base-mmarco-v2: A seq-to-seq model built on mT5-base with 580M parameters, also available on HuggingFace.

We chose the models above based on their size and multilingual properties. In order to maintain consistency, in all the following experiments, we rerank the same set of fifty results retrieved using Boomerang.

English language performance

In the first set of experiments, using an English-only multi-domain internal benchmark, Vectara's Multilingual Reranker and Cohere-Rerank-v3 are the top-performing models, with Vectara's Multilingual Reranker slightly edging out the competition.

[Figure: NDCG@10 on the English-only internal benchmark across rerankers]

The same trend holds for the BEIR benchmark. As noted previously, all models, with the exception of bce-reranker-base, struggle to perform well on ArguAna. In fact, even bce-reranker-base underperforms retrieval-only with Boomerang, which achieves an NDCG@10 of 0.4878 compared to 0.4292 for bce-reranker-base.

[Figure: NDCG@10 on BEIR across rerankers]

Multilingual Performance

Turning to MIRACL for multilingual performance figures, Cohere-Rerank-v3 achieves the best performance, with Vectara's Multilingual Reranker a close second and all other offerings, with the exception of Mono MT5, lagging far behind.

[Figure: NDCG@10 on MIRACL across rerankers]

Almost the same story plays out in the cross-lingual XQuad-R benchmark, except that Mono MT5 is no longer as competitive.

[Figure: NDCG@10 on XQuad-R across rerankers]

It is important to note that the performance of a reranking model is heavily dependent on the retrieval model used to obtain the first set of results. The numbers reported above use our proprietary embedding model Boomerang, which performs on par with other industry-leading embedding models. During the training of our models, we actively avoid training on any data related to our evaluation sets. Specifically, to get a true sense of zero-shot performance, we went so far as to exclude the training splits of all our evaluation sets from our training data. This gives us confidence in the model's generalization behavior on unseen customer data.

Cumulatively, the results from the preceding experiments suggest our customers will see very strong “out-of-the-box” relevance when using Vectara’s serverless RAG platform with the new Multilingual Reranker enabled.

Design Partner Case Study

At Vectara, we are well aware that academic benchmarks can paint a distorted picture of model performance: they often suffer from systematic biases and tend to exclude the sort of variation seen in real-world use cases. Therefore, before releasing models into the platform, the ML team at Vectara always validates their performance with multiple design partners.

Below, we report MRR for three different design partners operating in very different domains: application marketplace, workplace help center, and biomedical. The results suggest that, without training on customer data, Vectara’s Multilingual Reranker reliably produces large relevance improvements for most workloads. 

[Figure: MRR improvements for three design partner workloads]

Other Practical Concerns

An old adage about performance optimization is that “nobody cares how fast you can compute the wrong result”. While this remains approximately true in the era of machine learning, there are certainly other factors that must be weighed against relevance. Two of these are latency and the ability to set query-independent score thresholds that demarcate good results from bad ones.

Latency

Nobody likes a slow response from their computer; that much is clear. However, what is less obvious is that, given a fixed compute budget, a very accurate but expensive reranker model may produce worse results than a fast but less accurate one. This is because the smaller model can rerank a much larger set of results in the same amount of time, potentially discovering good results buried much deeper in the list.

For this reason, the ML and engineering teams at Vectara spent significant effort ensuring that the model runs with both low latency and low variance. Figure 10, below, shows that up to P99, reranking 25 results adds about 100ms to request latency. While that’s certainly not “free”, neither is it of much concern in most RAG systems, where the generative phase is often an order of magnitude slower.

And because Vectara operates an end-to-end platform with all key ML models co-located, our customers avoid the latency overhead often incurred by solutions like LangChain or LlamaIndex that delegate reranking to a third-party API.

[Figure 10: Request latency overhead added by reranking, by percentile]
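
If you want to sanity-check latency in your own environment, a rough client-side harness looks like the sketch below. It reuses the illustrative CrossEncoder from earlier; the figures quoted above come from Vectara's own end-to-end measurements, not from this harness.

```python
# A rough sketch of measuring reranking latency percentiles client-side.
import time
import numpy as np

def latency_percentiles(rerank_fn, query, candidates, trials=200):
    timings_ms = []
    for _ in range(trials):
        start = time.perf_counter()
        rerank_fn([(query, doc) for doc in candidates])
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {f"p{p}": float(np.percentile(timings_ms, p)) for p in (50, 90, 95, 99)}

# Example (hypothetical): latency_percentiles(reranker.predict, query, top_25_docs)
```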

Relevance Based Cut-Offs

Scores from embedding and reranking models are generally not interpretable in an absolute sense, as they primarily serve to rank documents in the context of a specific query. The actual scores mean little so long as relevant documents are ranked higher than irrelevant ones. However, in use cases such as RAG, where results are passed to an LLM’s context window, it’s better to pass only the most relevant information. To address this, we trained our model to clearly distinguish the scores of relevant and irrelevant documents in a query-independent manner. This allows users to set thresholds to filter out bad results. Based on extensive experimentation, we suggest starting with a cut-off threshold of 0.5, with documents scoring above the threshold considered relevant. Raising the threshold increases precision at the expense of recall, while, conversely, lowering it includes more relevant documents at the expense of additional noise.
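
Applying such a cut-off in application code is a one-liner. The sketch below assumes the reranked results are available as (document, score) pairs; the 0.5 starting point applies to scores from Vectara's Multilingual Reranker, while other models have different score ranges.

```python
# A minimal sketch of score-based filtering before the generation step.
# The 0.5 default follows the recommendation above; tune it per use case.
def filter_by_score(reranked_results, threshold=0.5):
    """reranked_results: list of (document, reranker_score) pairs."""
    return [(doc, score) for doc, score in reranked_results if score >= threshold]

# Example (hypothetical): pass only the surviving documents to the LLM context.
# context_docs = filter_by_score(retrieve_then_rerank(query, corpus), threshold=0.5)
```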

[Figure 11: Score distributions of relevant and non-relevant documents]

Figure 11 shows the score distribution of relevant and non-relevant documents from a large-scale evaluation. As one would wish, there is a clear distinction between good and bad scores, with very minimal overlap between the sets.

Finally, we observe that in real-world applications, precision and recall are rarely of equal importance. Therefore, we encourage you to experiment with a range of cut-off scores before settling on the optimum threshold for your use case.

Sign Up for an Account and Connect With Us! 

As always, we’d love to hear your feedback! Connect with us on our forums or on our Discord.

Sign up for a free account to see how Vectara can help you easily leverage retrieval-augmented generation in your GenAI apps.

Documentation

API Documentation: Reranking

See the API docs for Reranking
