grounded generation

###### A Reference Architecture for Grounded Generation

August 8, 2023 by Ofer Mendelevitch Tallat Shafaat | 10 min read

Read NowBlog Post

Vector Search

Dive deep into the mechanics of Vector Search, where data meets geometry to redefine the future of search algorithms.

August 22, 2023 by Tallat Shafaat Talip Ozturk

Neural search and Grounded Generation (aka Retrieval Augmented Generation) have recently created a lot of buzz due to their vast applications. One of the core components of such a pipeline is vector search. Given a myriad of input data, vector search retrieves the relevant data that answers an input query. The retrieved data can then be consumed directly or fed to an LLM to generate a coherent answer to the query.

In this article, we will discuss what vectors are in the context of NLP (natural language processing) and neural retrieval, and present some popular algorithms used for performing vector search.

Vectors are mathematical representations that have a magnitude and direction. Essentially, in natural language processing (NLP) vectors are arrays of floating point numbers, where the number of floats in the array is called the *dimension* of the vector.

Vectors are produced using various algorithms, like word2vec, and machine learning models, like MiniLM. For such models, the input is data like text, image, audio, video, and the output is a vector. In this context, vectors are also commonly referred to as *embeddings*, and the ML model that produces the vectors are called *embedding models*.

The vector representation of input data has several desirable properties. For a retrieval system, one of the core properties of vectors generated by the embedding model is that two input data items that are semantically similar will have vectors that are similar too, meaning the “distance” of the two vectors will be small). For example, the distance of the vectors for “dog” and “animal” will be significantly smaller than the distance between “dog” and “airplane”. Note that the input data can even be phrases, sentences, paragraphs, or even whole documents.**Note:** The definition of distance depends on the embedding model. A common distance metric employed is the inner product between the vectors.

Based on their dimensions, vectors are broadly classified as dense and sparse.

**Dense vectors **are vectors where input information is highly condensed into vectors with relatively small dimensions (e.g., 1024). Normally, most of the floating point numbers in such vectors are non-zero. Smaller dimensions are desirable as they are efficient (less memory consumed, less computational resources needed to process, etc), though it can come at the expense of the quality of the vector generated.

**Sparse vectors,** on the other hand, are high dimensional vectors (e.g., 100,000). Such vectors generally have most of their floating-point number zero. For example, in the bag-of-words model, the size of the language vocabulary is used as the dimensionality of the vector. When converting input text into a vector, each dimension (position in the array) corresponds to a word in the vocabulary, and the value at each dimension indicates the presence (or frequency) of that word in the document.

In general, dense vectors are produced by neural models and used for semantic search, while sparse vectors are produced and used for lexical search. Recently, neural models like Splade have generated sparse vectors that can cater to both semantic and lexical searches.

In this article, we focus on dense vectors as they are generally used for semantic search.

Given an input set of vectors and a query vector, *vector search* is the process of finding vectors from the input set that answer the query. These output vectors are the vectors from the input set that have the smallest distance from the query vector.

We now know that information can be converted into vectors. How can we use the vector representation of the information? In neural retrieval systems, the main use case is to employ vectors to find similar information.

For example, let’s take the example of a search system where the input is a set of documents. The desired output is, given a query, retrieve and return the documents that answer the query. The following steps are carried out:

- Convert the input documents into vectors, V, using the embedding model
- Convert the query into a vector, Q, using the same embedding model.
- From V, find the vectors, X, that have the shortest distance from Q. The documents that correspond to the vectors X are the most similar documents to the query, Q

Step 3 is what is called *vector search*. What makes vector search difficult is that (1) the number of vectors V to search over can be very large (hundreds of millions is common), (2) the latency of search should be short (most applications require sub-second latency), and (3) the hardware compute resources required should not be overly expensive.

FAISS and ANNOY are popular open source libraries that implement various techniques to perform vector search.

Vector databases are storage systems specialized in storing large amounts of vectors, and providing efficient search over the vectors. Some popular open source vector databases include Milvus, Weaviate, and Qdrant. These databases implement various strategies for indexing input vectors and efficient vector search. They also take care of other desirable properties for using vector search in production, e.g., scalability, auto-tuning, and fault tolerance.

In this section, we describe some of the popular techniques for performing vector search. Since low latency and resource requirements are fundamental, all of these techniques – barring brute force – provide approximate search results, i.e., the best result may not appear in the top N search results. Most of the search applications are fine making this tradeoff to achieve lower latency. The quality of the search results is generally measured by *recall*.

As the name suggests, in brute force search, the distance between the query vector and indexed vectors is calculated for each vector in the indexed set of vectors. The results are sorted by the distance, and returned to the caller. While this mechanism yields the best results, it is computationally very expensive and thus has high latency. Therefore, it is generally only used in cases where the number of indexed vectors is small. Similarly, since it produces the ideal results, it is frequently used as the point of reference when comparing approximate search techniques.

IVF improves the latency of vector search by apriori dividing the vectors into *partitions* (also known as *clusters*). The query vector is then compared against the vectors in a subset of the partitions. While this reduces the latency as lesser comparisons and computations need to be carried out, it also impacts quality of the results (i.e., recall) as the best result may belong to a cluster that was not selected to be in the subset of clusters searched.

Figure 1. shows a hypothetical 2-dimensional space where the input vectors (hollow circles) are divided into four partitions(clusters). Each cluster is denoted by a central point, called the *centroid* of the cluster. E.g., C1 is the centroid for the top-left partition. Clustering the input vectors and finding centroids for each cluster is done via clustering techniques like k-mean. The clustering depends on the distribution of the input data. Each vector is assigned to the cluster where the distance of the vector to the centroid is the smallest.

When a query vector q is searched, the first step is to decide (1) how many clusters, and (2) which clusters to search in. Say the index is set up to search in two clusters. The clusters where the centroids have the least distance from the query vector are selected for searching. In the figure, these would be top-left and bottom-left clusters. The closest vectors to q from these clusters are returned as the search results (e.g., y and z vectors). While this reduced the search space by half, if we had performed a brute force search, vector x would have been the top result as it is closest to q.

Some factors that affect the recall in IVF are: (1) partitioning scheme, (2) distribution of vectors, (3) total number of partitions, and (4) number of partitions searched at query time. As the size of data increases, a lot of applications are tolerant to reduced recall.

We can think of IVF as a technique to place each vector in one of the many clusters on a single layer. Hierarchical Navigable Small Worlds (HNSW), on the other hand, is a multi-layer graph. The top layer contains the least number of vectors with connections, and layer 0 (bottom layer) is very dense and has all vectors. Because it is a graph we will have to store a set of edges for each vector.

When both inserting and searching, you start from the top layer where we have the least number of connections and find the nearest node on that layer. From that node, you go down a layer below and find the nearest node again on that layer and continue going down until you reach layer 0.

Compared to IVF, HNSW requires more memory and/or disk space because for each vector, we have to store its edges too. Because it is a multi-layer graph algorithm, it is also more complicated to understand and implement.

High-dimensional vectors are expensive to store and compute. Would you give up on a tiny bit of accuracy (usually less than 1%) to gain significant cost, speed and storage efficiency? Quantization is a way to compress the vectors into much smaller vectors without significant impact on recall. Quantization techniques are orthogonal to search algorithms like IVF and HNSW; e.g., you can use quantization with IVF.

There are two main quantization techniques:

- Scalar quantization

As we know, a vector is a set of floating numbers and each number is a fixed number of bits. Scalar quantization is a technique to reduce the number of bits used to represent each value. So, if you reduce the number of bits from 32 to 8 then we reduce the storage by four times!

- Product quantization

Four times size reduction is great, but can we do better? For very high dimensional vectors, product quantization provides significantly better size reduction by dividing a vector into multiple subvectors and quantizing each subvector separately.

Say we have 768 dimensional 32-bit vectors. Each vector is divided into 8 subvectors, each with 96 dimensions. Each group of subvectors are clustered separately and then each subvector is mapped to its closest centroid. Instead of storing subvectors, we only store the ids of their closest centroid. Since we no longer keep the subvectors and work the closest centroid instead, we lose precision but we gain enormous space; we go from 768×32 bits to 8×8 bits.

Vector search is a fundamental building block for neural information retrieval systems, resulting in the development of several vector search libraries (e.g., FAISS, ANNOY), dedicated vector databases (e.g., Milvus, Weaviate, Qdrant, Pinecone), and being embedded into existing databases (e.g., pgvector for PostgreSQL, Atlas for MongoDB). An end-to-end retrieval system would also include other components, most importantly, an embedding model that generates the vectors. Even further, a generative platform will include LLMs as well. Vectara bundles all such components behind a simple and easy-to-use API, not just removing the burden of ML expertise from the users, but also taking care of production necessities like security, reliability, fault tolerance, and scalability.

grounded generation

August 8, 2023 by Ofer Mendelevitch Tallat Shafaat | 10 min read

Read Nowgrounded generation

May 2, 2023 by Ofer Mendelevitch | 10 min read

Read Nowgrounded generation

July 11, 2023 by Vivek Sourabh | 6 min read

Read Now