Vector Databases Explained for Semantic Search

What are vector databases?

Simply put, they are databases specifically designed to store very high-dimensional vectors.

Why do we need to store high-dimensional vectors, and why does a traditional database (like SQL) not suffice?

One use-case of high dimensional vectors: semantic search.

Have you ever thought about how each engine works? How does it take your search query and output a decent result?

For example:-

Let’s say that we have a search engine with the following corpus

Animal shelters near Delhi
IPL predictions
What should you do after engineering?
When will the tech recession end?

Now you come and want to make a query for “dog shelters around Gurgaon.”

What document from the corpus should the search engine return as a result of the above query?

Animal shelters near Delhi

But how would the search engine do this?

How would it know that “dog” is an “Animal” and “Gurgaon” is near “Delhi”?

To tackle this problem, we use Word embeddings.

I won’t detail how word embedding is created and how they do what they do. But essentially, an embedding algorithm takes a word and turns it into a multi-dimensional vector.

These multi-dimensional vectors have a handy property.

The vectors produced by two semantically similar words are close to each other in the vector space.

In simple words:

The vectors produced by “dog” and “animal” would be close to each other, and the vectors of “dog” and “car” would be far away.

Using this information is easy to guess what the search engine should do.

Step 1: calculate the word embeddings of each document in the corpus and store it somewhere.

Step 2: when a user submits a query to the search engine, calculate the word embeddings of the query.

Step 3: get the documents whose word embedding vector is close to the query’s vector.

Tackling the problem of storing huge vectors and retrieving K closest vectors from a query vector.

Step 2: and Step 3: were easy to type out but not easy to implement.

In reality, vectors of just a couple hundred words exceed the RAM capacity of most consumer computers. And searching for K closest vectors would far exceed the 100ms - 200ms latency limits we expect from search engines.

We will tackle these problems using Vamana Index and DiskANN.

These methods were researched by Microsoft, and the published paper can be found here :

https://suhasjs.github.io/files/diskann_neurips19.pdf

Approximate nearest neighbors (ANN)

We just discussed that searching for K closest vectors to a query vector would be too slow for our needs.

In most real-world applications, we don’t need the K closest vector to a query vector. Getting K vectors that are somewhat close to the query vector also suffices.

The ANN algorithm works on this philosophy. Rather than getting the actual K closest vectors to a query vector, it finds K vectors that are somewhat close to the query vector, and most of those vectors are also in the actual K closest vectors.

Let the set $G$ represent the actual K closest vectors to a query vector, and the set $X$ represent the Approximate nearest vector.

if $|G \ \cap \ X |$ is sufficiently high enough, then the ANN algorithm did an excellent job

Definition :

k-recall@k : - k-recall@k of an ANN algorithm is defined as $\frac{| G \cap X |}{k}$ and it is the measure of how good our ANN algorithm is for finding k neighbors of a vector.

The ANN algorithm we will use to make sembed will aim to achieve 95% + 100-recall@100.

We will use the graph data structure to “index” our vectors. Here, “indexing” means a way to cleverly store the vectors or some relevant information about the vectors so our ANN algorithm performs faster.

Graph-based ANN algorithm

Graph-based ANN works by constructing a Graph from the available dataset and then using a greedy method to find the closest neighbors to any query vector.

Hence it works in two steps:

Given Dataset $D$ construct a Graph $G = (D,E)$ here, $E$ represents the edges of the Graph.
Given a query vector $v$ , search though $G$ using a greedy method and obtain the set $X$ .

You may have noticed that our Algorithm’s search performance entirely relies on the quality of the Graph $G$ .

Much research has gone into ways to construct a Sparse graph so that the greedy method returns good enough results.

One way to achieve this is to use SNG graphs. Mentioned in this paper:

Approximate nearest neighbor queries in fixed dimensions | Proceedings of the fourth annual ACM-SIAM symposium on Discrete algorithms

The procedure is simple:

For any point $p \in D$ its out edges in Graph $G$ are determined as :

Initialize a set $S = D / \{ p\}$
while $S \ne \phi$ do
find $p' \in S$ , the closest point from $p$ .
make the edge $(p,p')$
remove all points from $S$ that are closer to $p'$ than $p$ . This Algorithm runs in $O(n^{2})$ time and becomes infeasible for large datasets.

We will modify this Algorithm when we construct our Vamana index. But first, we need to discuss two algorithms.

Greedy search

Untitled

This is the Algorithm we will use to search for our (approximate) nearest neighbors. The pseudo-code is challenging to read, so let me walk you through it.

We have two sets $L$ and $V$ , $L$ keeps track of the points we have yet to visit & have visited, and $V$ keeps track of all the points we have visited. Hence $L \backslash V$ is all the points we have yet to visit.

While we have points that we have to visit, i.e., $L \backslash V \ne \phi$ , we get the closest point $p^* \in L \backslash V$ from $x_q$ .

We update $L$ to contains all the neighbors of the point $p^*$ and add $p^*$ to $V$

the above line exists to primarily limit the size of the set $L$ to save computation time. Finally, the Algorithm returns the k closest points from $L$ and the set $V$ .

Robust pruning

Untitled

We have already discussed this Algorithm at the end of the Graph-based ANN algorithm

section. We only have two new tweaks, $\alpha$ and $R$. Both are introduced to reduce the computation time of the Algorithm.

Making the Graph: The Vamana Index

Untitled

This is the Algorithm we will use to create our Graph-based index. If you have understood Greedy Search and Robust Prune, then the pseudo-code is self-explanatory.

But in simple terms, it starts with a random graph and an entry point $s$ .

then for every point $p \in P$ in random order, runs Greedy Search from $s$ to $p$ to obtain a candidate set $V$ for the out neighbors of $p$ .

It then runs Robust Prune to optimize the candidate set $V$