Why is Metadata Important in GenAI?

The concept of metadata is simple, but it’s often ignored by vector databases and embedding models, and it’s not easy to reason about in text-focused GenAI systems. So it’s worth being explicit about the goals and benefits of having metadata as a separate form of data.

To perhaps state the obvious: metadata is simply data, usually structured/categorical text or numeric values, that’s attached to a larger piece of content. Some examples include the date a document/post was created, the price or color of an item for sale on your e-commerce site, and the permissions associated with a document (who should be able to see it). This data can be extremely important both for relevance to an end user and for security. We’ll talk about relevance first and then security.

As an end user of various websites and apps on your phone, you’re probably used to seeing something like this:


Figure 1. An eComm website example of metadata via faceted search attributes

If I want size 30 straight jeans in a particular shade of blue and have a budget of no more than $50, it can be a better user experience to let me apply these as strict filters.

The problem with moving strictly to neural models for retrieval is that most of them have a hard time handling metadata: they’re focused on turning completely unstructured data into vectors and, on the generative side, on producing the next most probable token. The difficulty is understandable, because some of these attributes are by their very nature personal: $50 for jeans might be expensive to one person and cheap to another, and users might type “blue” when they really want dark navy or a light sky blue. From a UX perspective, it’s important to realize that some things are just cognitively easier to pick from a list than to describe in words.

Even in the realm of chatbots, this is true: think of categories like “which of our SaaS products would you like to chat about?” or “what year was the device you’re asking about manufactured?” If the user knows the answers to these questions, it’s great for retrieval augmented generation (RAG)! If you can apply these as filters, you can quickly narrow down the information that you feed into the generative LLM, so it has less chance of hallucinating by looking at irrelevant data.

Metadata is also very important for security, in two ways. First, it lets you apply business rules about what can be retrieved at all (“confidential” / “secret” / “only HR members should have access”). Second, it lets you control which data within the retrieved set you give to an LLM (“even if a document has a social security number in it, do not provide it to the LLM, so that a malicious user cannot prompt hack it out.”) There was a recent incident where a RAG system provided too much data, which allowed the entire dataset to be exfiltrated.

What we’ve realized at Vectara is that to really build a successful RAG application, you need to treat the retrieval and generative sides differently, and employ the best-in-class mechanisms for each. That’s why we employ keyword, vector, and structured data in retrieval, and we merge the results that best answer the query before we pass only the most relevant information to the generative LLM to act on.
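As a rough illustration of the ideas above, the sketch below applies strict metadata filters and an access rule to a candidate set before anything is handed to a generative model. It is a minimal in-memory example, not Vectara’s implementation: the `Doc` class, the `allowed_groups` field, and the helper names are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical document: free text plus attached metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def allowed(doc, user_groups):
    """Assumption: a document without 'allowed_groups' is public."""
    groups = doc.metadata.get("allowed_groups")
    return groups is None or bool(set(user_groups) & set(groups))


def matches(doc, color, size, max_price):
    """Strict filters: every condition must hold; missing fields never match."""
    m = doc.metadata
    return (
        m.get("color") == color
        and m.get("size") == size
        and m.get("price") is not None
        and m["price"] <= max_price
    )


def build_context(docs, user_groups, color, size, max_price):
    """Return only the text of documents that pass both checks."""
    return [
        d.text
        for d in docs
        if allowed(d, user_groups) and matches(d, color, size, max_price)
    ]


docs = [
    Doc("Straight jeans, dark navy",
        {"color": "navy", "size": 30, "price": 45.0}),
    Doc("Straight jeans, sky blue",
        {"color": "sky blue", "size": 30, "price": 39.0}),
    Doc("Straight jeans, dark navy, premium",
        {"color": "navy", "size": 30, "price": 120.0}),
    Doc("Supplier cost sheet",
        {"color": "navy", "size": 30, "price": 20.0,
         "allowed_groups": ["finance"]}),
]

context = build_context(docs, {"shopper"}, color="navy", size=30, max_price=50.0)
print(context)  # ['Straight jeans, dark navy']
```

Only the surviving text would be placed in the prompt. The restricted cost sheet never reaches the model for a shopper, no matter how the question is phrased, because the access check runs before generation rather than relying on the model to withhold it.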

Vectara Filters … Now Updatable!

Vectara recognized that it was important to mix the types of operations you could do for search, which is why we released Hybrid (Semantic+Keyword+Boolean+Structured) search in April 2023. Now we’re excited to bring the next stage of Hybrid search: being able to update which fields are available for structured search and how they’re processed. Let’s walk through how it works.

Let’s say you start off with your first Vectara corpus and you have the default 2 filterable metadata fields defined. On the Console, if you look at the corpus you’ll see these 2 fields: is_title (a boolean) and lang (text, representing the detected language of the text):

Figure 2. Default two metadata fields in a Vectara corpus

But now let’s say you notice that many of your documents have a “price” field and you’d really like to let users filter by price. In many systems, this would mean a complete reindex of all existing data. However, in Vectara you can now add (or rename/remove!) a field by clicking the “Edit” button. This will lead to a fly-out:

Figure 3. Fly-out for adding a new metadata field in Vectara without requiring re-indexing

You can then go about adding your “price” filter attribute, probably as type Float.

Once you save your new filters, Vectara will automatically handle the process of updating documents and indices in the background for you, including any data type changes along the way. The “Filter attributes” section of the corpus page will indicate that the corpus is currently undergoing an update:


Figure 4. Metadata configuration screen showing the new custom metadata field

Depending on the corpus size, this background process can take a bit of time, but for many corpora it completes in a few minutes. In the meantime, we’ve designed the filter update so that you can continue to query with your “old” filters until the new filters are applied. Afterwards, you will be able to filter on your “price” field just as though it had always existed.

Filters can be created, updated, renamed, or deleted through this workflow.
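To make the behavior described above concrete, here is a small toy model of a corpus whose filter schema is updated in the background. It is only a sketch under stated assumptions: the `Corpus` class and its methods are hypothetical and are not Vectara’s API, and the choice to reject queries on a field that is not yet active is an assumption made for illustration.

```python
class Corpus:
    """Toy in-memory stand-in for a corpus with filterable metadata fields."""

    def __init__(self, docs):
        self.docs = docs
        # The two default fields named in the post.
        self.active_fields = {"is_title": bool, "lang": str}
        self.pending_fields = None  # set while an update is "in progress"

    def start_update(self, new_fields):
        """Record the requested schema; queries keep using the active one."""
        self.pending_fields = dict(new_fields)

    def finish_update(self):
        """Swap in the new schema once the background work is done."""
        self.active_fields = self.pending_fields
        self.pending_fields = None

    def query(self, **predicates):
        """Return ids of docs where every predicate holds on the typed value."""
        unknown = set(predicates) - set(self.active_fields)
        if unknown:
            raise KeyError(f"not filterable yet: {sorted(unknown)}")
        hits = []
        for doc in self.docs:
            if all(
                # Cast the raw value to the declared type before testing it.
                name in doc and pred(self.active_fields[name](doc[name]))
                for name, pred in predicates.items()
            ):
                hits.append(doc["id"])
        return hits


corpus = Corpus([
    {"id": "a", "lang": "en", "is_title": False, "price": "45"},
    {"id": "b", "lang": "en", "is_title": False, "price": "120"},
    {"id": "c", "lang": "de", "is_title": False},
])

# Request a new "price" field of type float.
corpus.start_update({**corpus.active_fields, "price": float})

# While the update is pending, the old filters still work.
during = corpus.query(lang=lambda v: v == "en")

# After the swap, "price" behaves like any other filter field.
corpus.finish_update()
after = corpus.query(lang=lambda v: v == "en", price=lambda v: v <= 50)

print(during, after)  # ['a', 'b'] ['a']
```

The point of the two-schema design in this sketch is that the active schema is never modified in place, so a query issued mid-update sees a consistent set of fields.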

Final Notes and Some Roadmap

We’ll keep these brief, but we wanted to give you a sense of some caveats and, more importantly, of why we think this lays the groundwork for many more exciting features in the future:

Vectara won’t let you issue multiple filter updates at once for the same corpus: you’ll need to wait for the previous update to complete before the new update instruction can be issued. Just be careful before kicking off a job on a huge corpus, as this will take a while. In the future, we’ll add the ability to cancel an ongoing update job to override it with an updated schema.
The update process happens through an asynchronous/background job. Part of what took us a while to get this fully baked into production was that we knew this was one of many asynchronous jobs we want to support in the future, and we wanted to build a framework for it. Some examples of asynchronous jobs we’ll be considering over time include: document enrichment, personalized fine tuning/reinforcement learning for your account/corpora (we’ll never train general models on your data – this would only be for your account), and actions/tasks that the generative LLM could perform on top of results (sending e-mails, generating Jira issues, etc).
We’re simultaneously releasing the APIs for this feature. We realize that as a developer/API focused platform, we need to make sure that you can build applications on top of our capabilities as quickly as we release them, and not be put into a situation where you must use our administrative console. In the future, we’re going to be releasing many more APIs to help you get the most out of Vectara in your applications, and do so much more quickly.

As always, we’d love to hear your feedback! Connect with us on our forums or on our Discord. If you’d like to see what Vectara can offer you for retrieval augmented generation on your application or website, sign up for an account!