Uncategorized
The Great Search Disruption
June 8, 2023 by Ofer Mendelevitch CJ Cenizal | 5 min read
Read NowBlog Post
Vectara Features
This blog post describes different patterns for using metadata fields within Vectara, gives a real world example, and provides several code samples so you can learn how to improve the discovery of your data and apply Vectara in a variety of use cases beyond the search box.
November 17, 2022 by Justin Hayes
The use of large language models in search is revolutionizing how people find information. Spend 5 minutes with an LLM-powered search platform like Vectara and you’ll get amazing search relevance – with no prior language configuration required.
That said, you can further improve your search capability by effectively using metadata fields. Doing so lets your users more quickly hone in on what they want, ensures that the best matches show up earlier in the result list, and enables you to show the results in the most engaging way.
Metadata is information that describes, or gives context to, other data. In the case of files on your computer – the file’s contents are the data, and the file name, size, permissions, create date, etc are the metadata. This approach of using metadata to provide context can be applied to any other type of data.
There are several techniques for using metadata to enhance your search. Each accomplishes a different goal, but they are all built upon the concept of attaching useful bits of information to every piece of indexed content. You can later use that information to adjust how the search is executed or to apply post-processing logic on the results to meet your requirements. The main patterns are listed below:
You can configure your searches to ignore specific results that don’t meet the criteria, specified as metadata field values. That way, you are able to narrow down the search results to find precisely what the user is looking for. You might want to filter based on a geographic area, or timestamp, or language.
One very useful example of filtering is Attribute Based Access Control. In this technique you “tag” indexed data with metadata fields describing the data and later use those to control which results can be shown to each individual user. For example, you could use a “PII” metadata field that is set if the indexed data contains personally identifiable information. Then when users run searches, you can add a filter clause that blocks any results having the PII field if the user is not permitted to see this type of sensitive data.
It is often useful to modify the search results based on specific attributes of each piece of indexed data. This is called “boosting and burying”, where you increase or decrease a given result’s confidence score based on the metadata associated with the text for that specific result. For example, if you are indexing user comments, you might adjust the scores of each result based on the average number of upvotes it got (see below for more on this example).
You may want to show additional information with the search results to make it easier for the user to understand each result or know what to do next. One approach to do this is to look up the additional information for every single search result, but that is very slow. A better approach is to “pass through” the required information by adding it as metadata fields at index-time. Then when processing the results the information is already there for you to use.
For example, when you index each document you could add a metadata field that is a comma-separated list of IDs of related documents. Then when a user does a search and for each result, you can use that field to generate a set of links to the related documents so the user can easily see what to look at next.
In some scenarios you can get significantly better results by running multiple searches in parallel and then “fusing” the results into a single result list. For example, when searching with a given query, you might first run a search with certain filters or boosting rules, then run a second search with different filters/rules, then use Reciprocal Rank Fusion to merge the results. This type of technique is only possible when leveraging metadata associated with your indexed content.
For the remainder of the post we will make these concepts tangible with a real world example. Imagine a digital media company whose users discuss consumer products. This discussion takes the form of Posts and threaded Comments, each of which can be given upvotes to indicate quality of the content.
The easier it is for users to find what they need across this vast set of user generated content, the longer they will remain engaged. So we want to use Vectara, and several of the metadata patterns explained above, to accomplish this.
We will rely on the following metadata fields:
There are three categories of metadata fields within Vectara. All are based on the concept, mentioned above, of attaching information to each piece of indexed content. And all are defined within the context of a single Corpus. But how and where they are used differs.
This post shows how to create and use metadata filters at Corpus creation time, indexing time, and querying time. In each of these lifecycle stages, there is one code snippet that uses the Vectara REST services (with cURL) and one snippet that uses the gRPC services (with Python). We also give an example showing how to use the Vectara console to create the Corpus.
How to specify additional metadata fields to be used for filtering and boosting (custom dimensions) when creating a corpus.
Via the management console:
After clicking on “Create Corpus”, click the “Add” buttons to enter the new metadata fields.
Via REST (using cURL):
curl -X POST \
-H "Authorization: Bearer ${JWT_TOKEN}" \
-H "customer-id: ${CUSTOMER_ID}" \
https://h.admin.vectara.io:443/v1/create-corpus \
-d @- <<END;
{
"corpus":
{
"name": "User Posts and Comments",
"description": "All the Posts and Comments that users add to the site.",
"custom_dimensions": [
{
"name": "upvotes",
"description": "Number of upvotes that people have given this content. Having more upvotes causes search results to be boosted.",
"serving_default": 0,
"indexing_default": 0
}
],
"filter_attributes": [
{
"name": "region",
"description": "Region the user was in when they submitted the content. Valid values are: AMER, LATAM, EU, AFRICA, APAC.",
"indexed": false,
"type": "FILTER_ATTRIBUTE_TYPE__TEXT",
"level": "FILTER_ATTRIBUTE_LEVEL__DOCUMENT"
}
]
}
}
END
Via gRPC (using Python):
# Create corpus object to specify what to create
corpus = admin_pb2.Corpus()
corpus.name = "User Posts and Comments"
corpus.description = "All the Posts and Comments that users add to the site."
# Specify custom dimensions
custom_dimension = admin_pb2.Dimension()
custom_dimension.name = "upvotes"
custom_dimension.description = "Number of upvotes that people have given this content. Having more upvotes causes search results to be boosted."
custom_dimension.serving_default = 0
custom_dimension.indexing_default = 0
corpus.custom_dimensions.extend([custom_dimension])
# Specify filter attributes
filter_attribute = admin_pb2.FilterAttribute()
filter_attribute.name = "region"
filter_attribute.description = "Region the user was in when they submitted the content. Valid values are: AMER, LATAM, EU, AFRICA, APAC."
filter_attribute.indexed = False
filter_attribute.type = admin_pb2.FILTER_ATTRIBUTE_TYPE__TEXT
filter_attribute.level = admin_pb2.FILTER_ATTRIBUTE_LEVEL__DOCUMENT
corpus.filter_attributes.extend([filter_attribute])
filter_attribute = admin_pb2.FilterAttribute()
filter_attribute.name = "lang"
filter_attribute.description = "Detected language, as an ISO 639-3 code."
filter_attribute.indexed = False
filter_attribute.type = admin_pb2.FILTER_ATTRIBUTE_TYPE__TEXT
filter_attribute.level = admin_pb2.FILTER_ATTRIBUTE_LEVEL__DOCUMENT_PART
corpus.filter_attributes.extend([filter_attribute])
# Later on, make the CreateCorpus call using the
# admin_pb2 and services_pb2_grpc libraries
...
How to add new content into the Corpus, setting metadata fields that can be used for filtering, boosting and burying, and results enrichment.
Via REST (using cURL):
curl -X POST \
-H "Authorization: Bearer ${JWT_TOKEN}" \
https://h.indexing.vectara.io:443/upload?c=${CUSTOMER_ID}&o=${CORPUS_ID} \
-d @- <<END;
{
"documentId": "237a8b63-2826-4ee1-8d83-14c2451a3357",
"parts": [
{
"text": "This peloton bike is constantly breaking.",
"custom_dims": [{
"name": "upvotes",
"value": 14
}],
"metadata_json": "{\"region\":\"AMER\", \"post_id\":13572468}"
}
]
}
END
Via gRPC (using Python):
# Create document to be indexed
document = indexing_pb2.Document()
document.document_id = "UNIQUE_DOC_ID"
document.title = "TITLE OF USER GENERATED CONTENT"
# Add the metadata fields to the document
metadata = dict()
metadata["upvotes"] = 14
metadata["region"] = "AMER"
metadata["post_id"] = 13572468
document.metadata_json = json.dumps(metadata)
# Create one section for the actual content
section = indexing_pb2.Section()
section.text = "BODY OF USER GENERATED CONTENT"
document.section.extend([section])
# Later on, make the Index call using the
# services_pb2 and services_pb2_grpc libraries
...
How to use query filtering and boosting (custom dimensions) when running a query. Additionally, how to extract metadata fields when processing search results to enrich what is shown to the user.
Note: only the metadata fields that you explicitly specify as filterable at corpus creation time can be added to the filter expression. Other fields that are added automatically (such as “Content-Encoding”) or that you provide as “pass through” fields are not filterable. Additionally, field names used in filtering are case sensitive.
Filtering Query String Example – specified in addition to a normal query string when running a search:
doc.region='AMER' and part.lang='eng'
Via REST (using cURL):
curl -X POST \
-H "Authorization: Bearer ${JWT_TOKEN}" \
-H "customer-id: ${CUSTOMER_ID}" \
https://h.serving.vectara.io:443/v1/query \
-d @- <<END;
{
"query": [
{ "query": "How durable are peloton bikes?",
"num_results": 5,
"corpus_key": [{
"customer_id": ${CUSTOMER_ID},
"corpus_id": ${CORPUS_ID},
"dim": [{
"name": "upvotes",
"value": 0.01
}],
"metadata_filter": "doc.region='AMER' and part.lang='eng'"
}]
}
]
}
END
Via gRPC (using Python):
This example also uses the “post_id” field to generate a link to view the entire Post to which this content belongs, which is an example of enriching the search results.
request = serving_pb2.BatchQueryRequest()
query = request.query.add()
query.query = "How durable are peloton bikes?"
query.num_results = 5
corpus_key = query.corpus_key.add()
corpus_key.customer_id = 189
corpus_key.corpus_id = 264
dim = corpus_key.dim.add()
dim.name = "upvotes"
dim.weight = 0.01
corpus_key.metadata_filter = "doc.region='AMER' and part.lang='eng'"
...
response = query_stub.Query(request, ...)
...
# Use the "post_id" metadata field to output a
# link to view the full Post submitted by the user
for response in response_set.response:
# Get the "post_id" field
post_id_field = [metadata for metadata in response.metadata if metadata.name == "lang"]
if post_id_field:
# Call the function to output the Post link
print_post_link(post_id_field[0].value)
To learn more about how to use metadata with Vectara, visit our Custom Dimensions and Filtering documentation. Also, there are several Getting Started examples that provide complete applications which you can start with to further explore Vectara or to build out your own application.
Good luck, and happy building!
Uncategorized
June 8, 2023 by Ofer Mendelevitch CJ Cenizal | 5 min read
Read Nowgrounded generation
by Ofer Mendelevitch | 2 min read
Read Nowgrounded generation
May 31, 2023 by Justin Hayes | 13 min read
Read Nowgrounded generation
May 30, 2023 by Shane Connelly CJ Cenizal | 3 min read
Read NowLarge Language Models
May 18, 2023 by Ofer Mendelevitch | 6 min read
Read NowLarge Language Models
May 17, 2023 by Ofer Mendelevitch | 7 min read
Read NowLarge Language Models
May 16, 2023 by Ofer Mendelevitch | 8 min read
Read Nowgrounded generation
May 2, 2023 by Ofer Mendelevitch | 10 min read
Read NowSample App
April 4, 2023 by Ofer Mendelevitch CJ Cenizal | 6 min read
Read Now