Introducing Mockingbird 2 - Vectara’s crosslingual grounded generation model
Introducing Mockingbird 2: our latest grounded generation model optimized for RAG with advanced crosslingual support and improved performance. It runs securely in any environment—SaaS, cloud, or on-prem—delivering high-accuracy responses without data leakage risk.
4-minute read time
Introduction
We are excited to announce the release of Mockingbird 2, our latest grounded generation model, specifically fine-tuned for RAG with advanced crosslingual capabilities. Mockingbird 2 builds upon the success of its predecessor, Mockingbird 1, bringing enhanced performance and expanded language support.
Like Mockingbird 1, Mockingbird 2 works in any environment (SaaS, cloud, or on-premises) and empowers enterprises to achieve high accuracy in RAG responses without any risk of data leakage.
Key highlights of Mockingbird 2
Crosslingual functionality
Mockingbird 2 is designed to operate seamlessly across multiple languages. Users can input queries in one language and receive responses generated from documents in another language. The generation model supports crosslingual inputs and outputs in the following 7 languages: English, Spanish, French, Chinese, Japanese, Korean, and Arabic.
Here is a crosslingual RAG example on the Vectara platform:

We issue the query “什么是医师助理 (What is a physician’s assistant?)” in Chinese. While the retrieved documents are in English, we set the response to be in Arabic:
[Model response translated to English] A physician assistant is a licensed healthcare provider who is authorized to diagnose, treat, and prescribe medications for patients [1]. They are licensed medical professionals who can work independently of a primary physician, while advanced practice nurses work under the close supervision of a supervising physician [2]. A physician assistant focuses on the biological and medical aspects of health, performing assessment, diagnosis, and treatment [3].
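If you want to try this kind of crosslingual query yourself, the sketch below shows roughly how it can be issued against the Vectara query API from Python. The endpoint path, headers, and field names (for example generation_preset_name and response_language), the corpus key, and the shape of the response are illustrative assumptions; check the current Vectara API reference for the exact request and response schema.

```python
import requests

# Minimal sketch of a crosslingual RAG query against Vectara.
# NOTE: the endpoint, headers, and field names below are illustrative assumptions;
# consult the Vectara API documentation for the exact schema.
VECTARA_API_KEY = "your-api-key"
CORPUS_KEY = "healthcare-docs"  # hypothetical corpus holding English documents

resp = requests.post(
    f"https://api.vectara.io/v2/corpora/{CORPUS_KEY}/query",
    headers={"x-api-key": VECTARA_API_KEY, "Content-Type": "application/json"},
    json={
        # Query issued in Chinese: "What is a physician's assistant?"
        "query": "什么是医师助理",
        "search": {"limit": 10},
        "generation": {
            # Assumed identifier for Mockingbird 2; the actual preset name may differ.
            "generation_preset_name": "mockingbird-2.0",
            # Ask for the grounded answer in Arabic, even though the documents are English.
            "response_language": "ara",
        },
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the generated Arabic answer is part of this response payload
```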
Enhanced generation quality
Mockingbird 2 has been optimized to produce coherent, contextually appropriate, and informative text in various languages. We evaluated Mockingbird 2 using datasets encompassing diverse domains, such as Wikipedia, academia, and news, with the Open RAG Eval toolkit. More details about the evaluation metrics and methodology are available here; in short, for this evaluation we focus on the AutoNugget metric, which measures how well an LLM's output, grounded in a set of input sources, covers the key facts needed to answer the query.
If we were to evaluate all possible combinations of our supported languages (7 for the query, 7 for the documents, and 7 for the output), we would end up with 7 * 7 * 7 = 343 language combinations, which are far too many to evaluate and interpret meaningfully.
Therefore, in our evaluation, we fix the response language and randomize the languages of the query and corpus. For each of the 7 response languages, we randomly select 60 samples from the test set of MIRACL (a multilingual retrieval dataset), translate them with an LLM (Llama 3.3 70B in particular), and evaluate the samples using the Open RAG Eval toolkit.
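To make this setup concrete, the snippet below sketches how the per-sample language configurations can be drawn: the response language is fixed for each batch of 60 samples, while the query and document languages are picked at random from the 7 supported languages. The sample counts and languages follow the description above; the function and field names are only for illustration, and each configuration would then be paired with a MIRACL sample translated by Llama 3.3 70B into the chosen languages.

```python
import random

# The 7 languages supported by Mockingbird 2.
LANGUAGES = ["en", "es", "fr", "zh", "ja", "ko", "ar"]
SAMPLES_PER_RESPONSE_LANGUAGE = 60

def build_eval_configs(seed: int = 0):
    """Fix the response language; randomize the query and document languages."""
    rng = random.Random(seed)
    configs = []
    for response_lang in LANGUAGES:
        for _ in range(SAMPLES_PER_RESPONSE_LANGUAGE):
            configs.append({
                "query_language": rng.choice(LANGUAGES),
                "document_language": rng.choice(LANGUAGES),
                "response_language": response_lang,
            })
    return configs

configs = build_eval_configs()
print(len(configs))  # 7 response languages * 60 samples = 420 evaluation samples
```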

Figure 1. Nuggets Assignment score by language
The results of the experiment are shown in Figure 1: Mockingbird 2 achieves a higher Nugget Assignment score than Mockingbird 1 for all 7 languages. This means Mockingbird 2 is better at including the key facts and claims that are considered necessary for a good answer to the given queries.
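To give a feel for what the metric measures, here is a heavily simplified sketch of the nugget-assignment idea: extract the key facts (nuggets) a good answer should contain, then check how many of them the generated answer actually supports. The judge function is a stand-in for an LLM call; the actual prompts and scoring logic live in the Open RAG Eval toolkit and differ from this toy version.

```python
from typing import Callable, List

def nugget_assignment_score(
    answer: str,
    nuggets: List[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of nuggets (key facts) that the answer supports.

    `judge(nugget, answer)` stands in for an LLM judgment; it is a
    placeholder, not part of the Open RAG Eval API.
    """
    if not nuggets:
        return 0.0
    supported = sum(1 for nugget in nuggets if judge(nugget, answer))
    return supported / len(nuggets)

# Toy usage: a real judge would be an LLM prompt; here we fake one for illustration.
nuggets = [
    "A physician assistant is a licensed healthcare provider.",
    "A physician assistant can diagnose, treat, and prescribe medications.",
]
answer = "A physician assistant is a licensed provider who can diagnose and treat patients."
toy_judge = lambda nugget, ans: nugget.split()[-1].rstrip(".").lower() in ans.lower()
print(nugget_assignment_score(answer, nuggets, toy_judge))  # 0.5 with this toy judge
```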
In addition to scores from LLM-based automatic evaluation, we assessed the quality of generated summaries in crosslingual settings using ROUGE and BERTScore, which measure lexical and semantic similarity, respectively, to ground-truth summaries. As shown in Figure 2, the Mockingbird 2 model demonstrates improved performance in the crosslingual evaluations.

Figure 2. Overall BERTScore and RougeL score
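For readers who want to reproduce these lexical and semantic similarity measurements on their own outputs, here is a minimal sketch using the open-source rouge-score and bert-score packages; these are assumed tooling for illustration and not necessarily the exact setup behind Figure 2.

```python
# pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

generated = ["A physician assistant is a licensed healthcare provider who can diagnose and treat patients."]
references = ["A physician assistant is a licensed medical professional authorized to diagnose, treat, and prescribe."]

# ROUGE-L: lexical overlap (longest common subsequence) with the reference summary.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(references[0], generated[0])["rougeL"].fmeasure

# BERTScore: semantic similarity computed from contextual embeddings.
_, _, f1 = bert_score(generated, references, lang="en")

print(f"ROUGE-L F1: {rouge_l:.3f}")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```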
Hallucination mitigation with Mockingbird 2
The Vectara platform comes with two models aimed at hallucination reduction and management: our Hughes Hallucination Evaluation Model (HHEM), which detects hallucinations, and our Hallucination Correction Model (HCM), which corrects them. Combined with Mockingbird 2, they create a RAG system with strong hallucination guardrails and a low hallucination rate.
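As a rough sketch of how these pieces can fit together, the snippet below scores a grounded answer with the open-weights HHEM checkpoint on Hugging Face and routes low-scoring answers to a correction step. The correct_hallucinations callable and the 0.5 threshold are placeholders standing in for the Hallucination Correction Model, which is provided through the Vectara platform rather than via the call shown here; the HHEM usage follows the model card for vectara/hallucination_evaluation_model.

```python
from transformers import AutoModelForSequenceClassification

# Open-weights HHEM checkpoint; `predict` returns a consistency score in [0, 1]
# for each (source, answer) pair (higher means better grounded in the source).
hhem = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

def guardrailed_answer(source: str, answer: str, correct_hallucinations, threshold: float = 0.5) -> str:
    """Score the grounded answer with HHEM; send low-scoring answers for correction.

    `correct_hallucinations(source, answer)` is a placeholder for the
    Hallucination Correction Model, and the threshold is illustrative.
    """
    score = hhem.predict([(source, answer)]).item()
    if score < threshold:
        return correct_hallucinations(source, answer)
    return answer
```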
We evaluate this combined system, Mockingbird 2 with Hallucination Correction, which we call Mockingbird-2-Echo (MB2-Echo), on our popular HHEM leaderboard, which measures the hallucination rate of a variety of LLMs. MB2-Echo achieves a hallucination rate of 0.9% on the leaderboard evaluation dataset, which, as of this writing, positions it behind only three much larger models. In total, the MB2-Echo system has fewer than 10B parameters and is deployable entirely on-premises or in any VPC within the Vectara platform.

Figure 3. Hallucination rate on HHEM Leaderboard
Conclusion
Mockingbird 2 represents a significant step forward in crosslingual RAG-focused LLMs, enabling Vectara customers to improve the quality of their RAG application responses, in any mode of deployment and in many languages.
To experience Mockingbird 2 first hand, try Vectara today, or contact us to get a demo!
