Deep Dive Into Mockingbird: A RAG and Structured Output Focused LLM

Table of Contents

  1. Overview
  2. Why Another LLM
  3. Training Mockingbird
  4. Evaluation
    1. Generation Quality
    2. Citations
    3. Structured Output Metrics
    4. Human Ratings
  5. Conclusion

Overview

In this blog post, we introduce Mockingbird, Vectara’s Retrieval Augmented Generation (RAG) and structured-output-focused LLM, take a technical deep dive into its performance, and discuss its capabilities.

Why Another LLM?

With dozens of LLMs available both through APIs and with commercially usable weights, why would Vectara want its own? There are a few reasons. You can read more about them in our product-oriented blog post, but to summarize: Mockingbird is focused on, and delivers high quality on (as we will see below), the tasks Vectara’s customers care about, and it can run in Vectara’s customers’ VPCs or on-premise, so their critical data never leaves their environment.

Training Mockingbird

Mockingbird is trained primarily for RAG and structured output tasks.

RAG Training

In RAG, the LLM (the G in RAG) is given as input the search results returned by a retrieval model for a user query. In most cases these results are not perfect. Some of the issues and complexities an LLM faces when producing an answer based on these results are:

  • In the real world, search results can be noisy: some will be relevant to the query and some will not, and the LLM has to decide which ones to use in writing its answer.
  • The LLM may have to piece together output from multiple search results into one coherent summary.
  • In some cases, the retrieved results will contain no relevant information or no answer to the query.
  • For some queries the answer should be short and to the point; for others it may be descriptive, spanning multiple paragraphs.
  • The LLM’s answer also has to be grounded, citing a source from the search results for each claim, both to gain the user’s trust and to let them verify the answer by looking at the cited source.
  • The queries and search results can come from many different domains (think medical, legal, finance, business, academic, and many more) and be in many different languages.

To train an LLM that is good at this task and can handle all of the scenarios above, one of the most important ingredients is training data that reflects these complexities in its inputs and pairs them with good output summaries that include citations. More than half of the training effort for Mockingbird went towards creating and curating such RAG datasets across different domains and languages. Note that we do not train on customer data, so your data remains secure and private: Mockingbird never sees it during training.
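
To make this concrete, here is a minimal, hypothetical sketch (in Python) of what a single RAG training example could look like. The field names, citation format, and content are illustrative assumptions, not Mockingbird’s actual training format:

    # Hypothetical RAG training record (illustrative only, not Vectara's actual format).
    # The input pairs a user query with noisy search results (some irrelevant), and the
    # target is a grounded summary that cites only the results it actually used.
    rag_example = {
        "query": "What are the common side effects of drug X?",
        "search_results": [
            {"id": 1, "text": "Clinical trials of drug X reported mild nausea in some patients..."},
            {"id": 2, "text": "Drug Y is commonly prescribed for hypertension..."},  # irrelevant
            {"id": 3, "text": "Headaches were observed in about 5% of drug X patients..."},
        ],
        "target_summary": (
            "The most commonly reported side effects of drug X are mild nausea [1] "
            "and headaches, observed in about 5% of patients [3]."
        ),
    }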

Structured Output Training

The second task Mockingbird is focused on is structured output, particularly JSON: we train the LLM to produce responses that are valid JSON and that match a supplied schema. Here is a simple example:
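
(The example below is a hypothetical illustration written as a Python snippet; the schema and values are not taken from Mockingbird’s training data.)

    # Illustrative example: a simple JSON Schema and a model response that conforms to it.
    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "year": {"type": "integer"},
        },
        "required": ["name", "year"],
    }

    # Given an instruction like "Extract the product name and release year from the
    # passage, returning JSON that matches the schema", the model should emit only
    # a conforming object, for example:
    model_output = '{"name": "Mockingbird", "year": 2024}'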

Several frameworks use constrained decoding to enforce that an LLM’s generated output matches a given schema, so why train an LLM to produce structured output at all? In our evaluations we found that training an LLM for structured output and then constraining its decoding produces much better results than constraining an LLM that has not been specifically trained for the task.

To train Mockingbird to produce structured output, we created a dataset of thousands of challenging, real-world JSON outputs, matched with their schemas and with descriptions of what the LLM should produce, and used it as training data for the structured output task.

Evaluation

To get a comprehensive picture of Mockingbird’s performance, we evaluate it using a variety of metrics on multiple datasets.

Generation Quality

The first aspect of Mockingbird we test is generation quality for RAG, i.e., how good the answer/summary produced by the LLM is. To judge this, we compare a model’s generated summary to a golden ‘ground truth’ summary. Comparing two summaries is not a straightforward task, however, since they can use very different words and styles while conveying the same meaning. To capture this, we use two automated metrics for summary quality evaluation: ROUGE score and BERTScore.

Why two metrics? They measure two different aspects of summarization quality. ROUGE score focuses on lexical overlap, i.e., how many words the two summaries have in common, but it does not measure semantic similarity (‘the large cat’ and ‘the giant feline’ are semantically similar but lexically different). BERTScore, on the other hand, uses pre-trained BERT embeddings to compute a similarity score that takes meaning into account, not just lexical overlap. We report both metrics to capture both the lexical and the semantic overlap with the ground truth.
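
As a concrete illustration, here is a minimal sketch of computing both metrics with the open-source rouge-score and bert-score packages. These packages and settings are an assumption made for illustration, not necessarily the exact tooling used in our evaluation:

    # Minimal sketch: compare a generated summary to a ground-truth summary with
    # ROUGE (lexical overlap) and BERTScore (semantic similarity).
    # Assumes `pip install rouge-score bert-score`; illustrative settings only.
    from rouge_score import rouge_scorer
    from bert_score import score as bert_score

    reference = "The large cat slept on the warm windowsill all afternoon."
    generated = "The giant feline dozed on the sunny window ledge for hours."

    # ROUGE-L F1 is based on the longest common subsequence of tokens.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l_f1 = scorer.score(reference, generated)["rougeL"].fmeasure

    # BERTScore F1 compares contextual token embeddings, so paraphrases score high.
    _, _, f1 = bert_score([generated], [reference], lang="en")

    print(f"ROUGE-L F1: {rouge_l_f1:.3f}  BERTScore F1: {f1.item():.3f}")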

For our first eval, we consider English-only datasets. The datasets we use to evaluate generation quality span a variety of domains, including Wikipedia, academia, news, and search. As shown in Figure 1, Mockingbird performs better than competitive commercial models as well as open-source models, showing higher lexical and semantic overlap with the ground truth summaries.

We have trained Mockingbird to be good at multilingual RAG as well, and we evaluate it on a set of datasets containing summaries in several languages: Arabic, French, Spanish, Portuguese, Italian, German, Chinese, Dutch, Korean, Japanese, and Russian. Figure 2 shows the metrics averaged across these languages.

In the multilingual case as well, Mockingbird outperforms the competition, showing excellent performance across the languages our customers care about.

Citations

Another important aspect of LLMs used for RAG is citations. When generating responses based on retrieved reference results, we want the LLM to add citations to its output so users know which results it used to generate each part of the answer. This also helps the LLM ground its generation in only the provided references. Here is an example of what this looks like in the Vectara platform.

We measure citation quality using two metrics: citation precision and citation recall. Our ground truth summaries are annotated with citations as well. For precision, we determine how many of the citations produced by the model are actually relevant; for recall, we determine how many of the citations in the ground truth summary are also cited by the model. Intuitively, precision measures how accurately a model determines which references contain relevant information, while recall measures how much of the relevant information the model uses when synthesizing the answer.
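
As a simplified sketch, if we treat the citations in a summary as a set of reference IDs, precision and recall can be computed as below. The actual evaluation may use more involved matching, so take this as an illustration of the intuition rather than the exact implementation:

    # Simplified sketch of citation precision and recall over sets of reference IDs.
    def citation_precision_recall(predicted: set, ground_truth: set):
        if not predicted or not ground_truth:
            return 0.0, 0.0
        overlap = predicted & ground_truth
        precision = len(overlap) / len(predicted)    # how many cited results are truly relevant
        recall = len(overlap) / len(ground_truth)    # how many relevant results were cited
        return precision, recall

    # Example: the model cites results {1, 2, 4}; the ground truth cites {1, 3, 4}.
    p, r = citation_precision_recall({1, 2, 4}, {1, 3, 4})
    print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.67, recall=0.67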

Here we see Mockingbird outperforming all of the competition as well. Although GPT-4 and Gemini 1.5 Pro come close to Mockingbird on recall, Mockingbird pulls ahead, showing that you can expect high-quality grounded answers that ignore irrelevant information and include all of the relevant information in the generated summary.

Structured Output Metrics

In addition to RAG output, the other important task Mockingbird is focused on is structured output. To evaluate model performance on this task, we use real-world JSON objects as the ‘ground truth’ and ask the model to generate an output that matches the JSON schema corresponding to those objects.

Our evaluation metric here is Precision@1: essentially, given the schema and the model output, we validate that 1) the output is a valid JSON object, and 2) it matches the supplied schema. Note that we do not check whether the values in the output exactly match those of the expected JSON object, as an exact match is not always necessary.
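
A minimal sketch of these two checks in Python, using the standard json module and the jsonschema package (our choice of validator here is an assumption; the post does not specify the tooling used):

    # Sketch of the Precision@1 checks: (1) the output parses as JSON and
    # (2) the parsed object validates against the supplied schema.
    import json
    from jsonschema import ValidationError, validate

    def matches_schema(model_output: str, schema: dict) -> bool:
        try:
            obj = json.loads(model_output)           # check 1: valid JSON
            validate(instance=obj, schema=schema)    # check 2: conforms to the schema
            return True
        except (json.JSONDecodeError, ValidationError):
            return False

    # Precision@1 over a test set is then the fraction of model outputs that pass:
    # precision_at_1 = sum(matches_schema(o, s) for o, s in zip(outputs, schemas)) / len(outputs)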

* Note that in this plot we exclude Llama 3 8B and Gemini 1.5 Pro, as we were unable to get them to reliably produce valid JSON output a meaningful fraction of the time. This may be due to differences in training or in prompt expectations, which we did not explore.

In the structured output case, GPT-4 is again the only model that comes close to Mockingbird’s precision numbers, demonstrating the importance of training the model on hard, real-world examples of such data.

Human Ratings

Why not use an LLM as a judge? Another common way to evaluate summaries is LLM-as-a-judge; this blog from OpenAI outlines the steps to evaluate a summary using models like GPT-4. Although this approach seems more holistic, it has drawbacks, such as LLMs preferring LLM-generated text (especially text from the same ‘family’ of LLMs) over other summaries, and the sensitivity of the LLM’s judgment to prompting. Since our aim is to compare summaries generated by different LLMs, we opt not to use this approach.

All of the metrics described above are automated and, especially for generation quality, imperfect to some extent, with their own limitations and caveats. Ultimately, the model output in many cases has to be shown to a human, so the final judge of model quality should be human judgment. Human judgments, however, are expensive and time-consuming to collect, which is why we rely on automated metrics for a large portion of our evaluation.

For this section of the evaluation, we use Vectara’s platform to produce outputs from Mockingbird and GPT-4 on two different corpora spanning several websites. Model names are anonymized, and we ask humans to rate each summary pair side by side on a scale of 1-3 based on veracity, relevance, how well it answers the question asked, hallucinations, and overall structure.

In this case, we see Mockingbird performing at the level of GPT-4, with a very small edge in mean score that is well within the threshold of not being considered significantly different.

We also use Vectara’s latest Hallucination Evaluation Model to determine the hallucination rate for GPT-4 vs. Mockingbird, which produces a similar result: both Mockingbird and GPT-4 have essentially the same hallucination rate on this test set.
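
For reference, Vectara publishes an open version of this hallucination evaluation model on Hugging Face (vectara/hallucination_evaluation_model). A rough sketch of scoring a (source, summary) pair with it might look like the following; the exact loading code and score semantics depend on the model version, so treat this as an assumption and consult the model card, rather than as the precise setup used in this evaluation:

    # Rough sketch (assumption): scoring factual consistency of a generated claim
    # against its source with Vectara's open hallucination evaluation model.
    # Exact usage depends on the model version; see the Hugging Face model card.
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "vectara/hallucination_evaluation_model", trust_remote_code=True
    )

    pairs = [
        ("The board approved the expansion plan in March.",          # source passage
         "The expansion plan was approved by the board in March."),  # generated claim
    ]
    scores = model.predict(pairs)  # higher scores indicate better consistency with the source
    print(scores)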

Conclusion

In this blog post we went through the details of how we train and evaluate Mockingbird, showing that it is a competitive, task-focused LLM for RAG and structured output. At under 10B parameters, a much smaller size than GPT-4 or Gemini 1.5 Pro, it holds its own against these larger models, producing excellent summaries and answers that meet or exceed theirs.

Mockingbird is integrated into the Vectara platform and available for all users to try with their own data. Sign up here for an account to get started.
