Scaling Semantic Search at Vectara: Corpora

Training a machine learning model to solve a business problem is only one part of the solution. Once the model is trained and validated, it must be deployed in a scalable and reliable manner to a production environment, and this poses its own challenges.

For instance, a complete system for neural information retrieval not only involves development of one or more machine learned models, but additionally requires an ML inference system (e.g., TensorFlow Serving ), vector indexes (e.g., Annoy ), BM25 indexes (e.g., Lucene ), blob storage infrastructure (e.g., Cloud Storage ), a document storage system (e.g., Datastore ), auto-scale serving platform (e.g., App Engine ), monitoring tools (e.g., Prometheus / Grafana ), and integration middleware (e.g., Kafka). In addition, cross-cutting concerns like authentication, authorization, logging, and encryption also require attention.

Thus, despite huge potential value to businesses, neural retrieval systems have not seen significant adoption outside a few Silicon Valley companies with gigantic budgets.

Vectara Semantic Search is a fully managed platform that seamlessly integrates all these subsystems and provides access through intuitive REST and gRPC APIs, ensuring smooth scaling to any workload. Since launching earlier this year, we have achieved over 99.9% customer uptime for our query serving infrastructure, while handling workloads as high as 40qps, and one of the keys to achieving this reliability is replication across multiple availability zones.

In the remainder of this blog post, we’ll cover some recent improvements we’ve made to the scalability of serving nodes, specifically when handling customer accounts that define a large number of corpora.

A Vectara customer account segregates searchable collections of content into corpora. Think of a corpus as a unit of aggregated content that is searched by a query. Depending on query workload, a corpus may be backed by one or, less commonly, multiple indexes.

A common blueprint for integrating semantic search into SaaS applications is to define one corpus per customer account, thereby ensuring data segregation. Vectara Semantic Search was designed with this use case in mind, and can easily scale to thousands of corpora per account. Recently, however, a potential customer approached us with a requirement for 1 million corpora in a single account. This provided us a good opportunity to test the limits of our infrastructure, and tune as needed.

By default, serving infrastructure for business accounts, which are meant for production workloads, use at least 2x replication across different availability zones. Higher replication counts can be configured at added cost. Should one replica become inaccessible, due to a network failure, for instance, the situation will be quickly detected by Vectara’s monitoring systems, and any affected customer accounts will be automatically migrated to new replicas. The faster the migration process happens, the better it is for the overall health of the serving subsystem. This is because during migration, the customer account is more vulnerable to sudden traffic spikes, or additional replica outages.

During load testing with O(1 million) corpora, we noticed a few bottlenecks that were not apparent with fewer corpora:

The customer account is not made available incrementally, but rather all at once, after all corpora have been retrieved and loaded.
Corpora retrieval from the object storage subsystem was not taking full advantage of parallelism.
Account metadata was not encoded and transmitted as efficiently as possible.

Addressing point (2) provided the largest gains in performance, providing a 10x improvement in replica startup time for large accounts! Interestingly, the fix required migrating away from the S3 Transfer Manager to a direct, low-level implementation given the file sizes were small.

Making the customer account available incrementally was an easy change to make. We have observed that the query pattern to corpora often follows a Zipfian distribution, and in the future, we can prioritize load order based on query statistics, allowing the replica to serve the majority of traffic almost instantly.

Finally, point (3) made a minor difference in load time, but reduced metadata transmission memory usage by 4x, and similar CPU usage savings, allowing the replica to serve higher volumes of traffic.

After these bottlenecks were addressed, an account replica of 400,000 corpora was brought fully online in under 5 minutes. Note again, that a customer would not experience downtime because other replicas serve traffic during this time.

Vectara Semantic Search’s ability to support hundreds of thousands of corpora differs from other semantic search solutions in the market. Many are targeted for large indexes only: AWS Kendra only supports 5 indexes, for example.

Contributed by Tallat Shafaat. Tallat is cofounder of Vectara,Inc., and specializes in design and implementation of large scale distributed systems.