Listing Documents in Vectara
Understanding the content of your corpus with Vectara’s new document management capability.
January 09, 2024 by Shane Connelly & Ofer Mendelevitch
After creating a corpus with Vectara and indexing some documents, you will naturally want to list the documents it contains. An inventory of everything indexed so far lets you check whether a particular document is in your corpus and remove documents that are no longer needed.
That’s why today we’re excited to announce the release of Vectara’s document listing capabilities as part of our API, which is also integrated into the Console.
Listing Corpus Documents
Whether you ingest your documents to Vectara via the Console’s data upload functionality, programmatically via the Indexing API, or with a tool like vectara-ingest, you might need to inspect your corpus on occasion to understand which documents are included.
The new list-documents API makes this possible. The response to your initial request contains the first N documents (10 by default, configurable up to 1000). You can then paginate through the remaining documents in a manner similar to the list-corpora or list-users API calls.
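To make the paging pattern concrete, here is a minimal Python sketch. It is not Vectara's client code: the `fetch_page` callable, the `documents` and `next_page_key` field names, and the in-memory `make_fake_fetch` helper are all assumptions made for illustration. Only the default page size of 10 and the limit of 1000 come from this post.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Assumed page shape (hypothetical field names, not the real wire format):
#   {"documents": [{"id": ..., "metadata": {...}}, ...],
#    "next_page_key": "<opaque key>" or None}
Page = Dict[str, Any]
FetchPage = Callable[[Optional[str], int], Page]

MAX_PAGE_SIZE = 1000  # upper limit mentioned in this post


def iter_documents(fetch_page: FetchPage, page_size: int = 10) -> Iterator[dict]:
    """Yield every document by following page keys until none is returned."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    page_key: Optional[str] = None
    while True:
        page = fetch_page(page_key, page_size)
        yield from page.get("documents", [])
        page_key = page.get("next_page_key")
        if not page_key:  # no key means this was the last page
            return


def make_fake_fetch(doc_ids: List[str]) -> FetchPage:
    """In-memory stand-in for the real API call (illustration only)."""
    def fetch(page_key: Optional[str], page_size: int) -> Page:
        start = int(page_key) if page_key else 0
        end = start + page_size
        docs = [{"id": d, "metadata": {}} for d in doc_ids[start:end]]
        return {"documents": docs,
                "next_page_key": str(end) if end < len(doc_ids) else None}
    return fetch


if __name__ == "__main__":
    fetch = make_fake_fetch([f"doc-{i}" for i in range(25)])
    ids = [doc["id"] for doc in iter_documents(fetch, page_size=10)]
    print(len(ids))  # 25, gathered across three pages
```

In a real integration, `fetch_page` would wrap the HTTP call to list-documents. Keeping it injectable means the paging loop can be exercised without network access.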
The response to the list-documents request is a list of documents, each with two pieces of information:
- The document ID
- The metadata associated with that document
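As a rough illustration of working with these two fields, the sketch below models one entry of the response in Python. The raw shape it parses (an `id` string plus `metadata` as a list of name/value pairs) is an assumption for this example, and `DocumentEntry` and `parse_entry` are hypothetical names, not part of any Vectara SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DocumentEntry:
    """One item of a list-documents response: an ID plus its metadata."""
    doc_id: str
    metadata: Dict[str, str] = field(default_factory=dict)


def parse_entry(raw: dict) -> DocumentEntry:
    """Convert a raw entry into a DocumentEntry.

    Assumes (hypothetically) that metadata arrives as a list of
    {"name": ..., "value": ...} pairs; adjust to the real response shape.
    """
    attrs: List[dict] = raw.get("metadata", [])
    return DocumentEntry(
        doc_id=raw["id"],
        metadata={a["name"]: a["value"] for a in attrs},
    )


if __name__ == "__main__":
    raw = {"id": "pricing-page",
           "metadata": [{"name": "source", "value": "website"}]}
    entry = parse_entry(raw)
    print(entry.doc_id, entry.metadata["source"])
```

Flattening the metadata into a dictionary makes inventory checks, such as "is this document from the website?", a simple lookup.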
Let’s look at a real-world example. Consider a corpus that includes documents indexed from two sources: the content of our main website (https://vectara.com) and the information from our main documentation site (https://docs.vectara.com). While indexing, we’ve included a metadata field called source, which can have a value of “website” or “docs,” depending on the source of the page indexed.
Now consider the case where I want to remove all the pages from the documentation source and then reindex them. With the new API, I can list the documents with a filter on the metadata field (doc.source = 'docs') to get the IDs of all matching documents, and then use the delete-document API call to remove those documents from the corpus.
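The list-then-delete workflow can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual client code: `list_page` and `delete_document` are hypothetical callables standing in for the list-documents and delete-document API calls, the `documents` and `next_page_key` field names are assumed, and `make_fake_corpus` is an in-memory stand-in that understands only the one filter string used in this post.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

ListPage = Callable[[str, Optional[str], int], Dict[str, Any]]
DeleteDocument = Callable[[str], None]


def delete_matching(list_page: ListPage, delete_document: DeleteDocument,
                    metadata_filter: str, page_size: int = 100) -> List[str]:
    """Collect the IDs of all documents matching the filter, then delete them."""
    ids: List[str] = []
    page_key: Optional[str] = None
    while True:
        page = list_page(metadata_filter, page_key, page_size)
        ids.extend(doc["id"] for doc in page["documents"])
        page_key = page.get("next_page_key")
        if not page_key:
            break
    # Delete only after listing has finished, so paging is never
    # disturbed by documents disappearing mid-iteration.
    for doc_id in ids:
        delete_document(doc_id)
    return ids


def make_fake_corpus(
    docs: Dict[str, Dict[str, str]]
) -> Tuple[Dict[str, Dict[str, str]], ListPage, DeleteDocument]:
    """In-memory stand-in for a corpus (illustration only)."""
    store = dict(docs)
    # The fake recognizes just this one filter expression.
    predicates = {"doc.source = 'docs'": lambda m: m.get("source") == "docs"}

    def list_page(metadata_filter: str, page_key: Optional[str],
                  page_size: int) -> Dict[str, Any]:
        keep = predicates[metadata_filter]
        matches = [i for i, meta in store.items() if keep(meta)]
        start = int(page_key) if page_key else 0
        end = start + page_size
        return {"documents": [{"id": i, "metadata": store[i]}
                              for i in matches[start:end]],
                "next_page_key": str(end) if end < len(matches) else None}

    def delete_document(doc_id: str) -> None:
        del store[doc_id]

    return store, list_page, delete_document


if __name__ == "__main__":
    store, list_page, delete_document = make_fake_corpus({
        "d1": {"source": "docs"}, "w1": {"source": "website"},
        "d2": {"source": "docs"}, "w2": {"source": "website"},
    })
    removed = delete_matching(list_page, delete_document, "doc.source = 'docs'")
    print(removed, sorted(store))
```

Gathering all IDs before deleting anything is a deliberate choice in this sketch: it keeps the listing stable and gives you a chance to review the list before removing documents.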
Listing Documents in the Console
As part of this release, our Console has been updated with the ability to list documents as well. When you select a corpus and open the “Data” tab, the screen shows an initial list of documents.
You can see the first 10 documents listed and the familiar paging mechanism with the “next” and “previous” buttons.
When you click on one of these documents, a side window opens that allows you to see more details about this document as well as delete this document, if needed.
We’re excited to share the document management capability in Vectara: the ability to list documents in a corpus, available as an API call as well as in the Console.
This new capability allows you, the GenAI developer, to easily understand what documents are included in a corpus and manage the document lifecycle with more ease.
As always, we’d love to hear your feedback! Connect with us on our forums or on our Discord server. If you’d like to see what Vectara can offer you for retrieval augmented generation on your application or website, sign up for an account.