Vector search is an essential tool for many modern applications, particularly for information retrieval. It allows for efficient searching and retrieval of information based on similarity between vectors. Today I will explore how to implement a naive vector search with F#. Whether you are a seasoned F# developer or just starting out, this post will provide some useful information if you want to better understand the basics of vector search for your own projects.
The impetus for this post is to dig into vector search and how it can be used in a project. It has become such a hot topic as of late that it is valuable to dig into the underlying concepts. Even though what I show here isn’t large-scale or high-performance, it will describe some of the underlying concepts, which can help you understand what the big solutions are talking about. This will be a multi-phase process. I first want to provide a breakdown of what vector search can do in the simplest way possible. The resulting code won’t be the best search algorithm, nor the fastest. But once the basics are covered, I’ll start replacing pieces until I get to a more usable result. It’s also worthwhile to note up front that there isn’t one right way to do vector search. There are lots of variations, depending on the specific use case. I’ll at least show some of them and point in directions for deeper information.
First things first: what is vector search? Vector search is a way to find similar entries based on their features, specifically represented as vectors (arrays of numbers). For example, if you have a bunch of text documents, you can represent each document as a vector of numbers. You can then compare these vectors to find documents that are similar to each other, even if they use different words. This works for anything that can be represented as a vector, so although I’ll focus mostly on text, it conceptually applies to other types of data, like images or audio. The idea is that similar things will have similar features, so their vectors will be similar too. And I know what you’re asking: how can text be represented as an array of numbers? For now, “magic”, but I’ll dig into that a bit.
The starting point is going to be comparing sentences. I’ll get into the details soon, but for my implementation I’m going to need a list of already-vectorized words. There are several sources, but I’ve decided to use GloVe from Stanford’s NLP lab. They provide embeddings in multiple vector sizes; I’m going to use a vector size of 300. You can think of larger vectors as providing a better representation, with the obvious tradeoff of being slower and larger. Once loaded, I’ll store the word -> vector mappings in a Map.
The loadGloVeModel function loads the downloaded pre-trained GloVe model from a file and returns it as a WordModel. Each key in the map is a word, and each value is a 300-dimensional float array representing that word’s vector.
```fsharp
open System
```
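As a minimal sketch, `loadGloVeModel` could look something like the following, assuming the standard GloVe text format (each line is a word followed by its space-separated vector components). The `WordModel` alias and the parsing details here are illustrative:

```fsharp
open System.IO

/// Alias for the word -> vector mapping (name taken from the text above)
type WordModel = Map<string, float[]>

/// Load a pre-trained GloVe model: each line holds a word followed by
/// its space-separated vector components.
let loadGloVeModel (path: string) : WordModel =
    File.ReadLines path
    |> Seq.map (fun line ->
        let parts = line.Split([| ' ' |])
        parts.[0], parts.[1..] |> Array.map float)
    |> Map.ofSeq
```

For the 300-dimensional file this produces a map of roughly 400,000 entries, so loading takes a moment; the upside is that lookups afterwards are cheap.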
Once I have a list of words and their associated vectors, I can use them to vectorize a sentence. In this particular case I’ve decided to add the vectors of each word together to calculate a sentence’s vector. Without going into too much detail just yet, this can be handled in several different ways. Other typical approaches include averaging the word vectors and multiplying vectors element-wise. There are more sophisticated methods, but I don’t want to get ahead of myself just yet.
```fsharp
/// Vectorize sentence
```
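A sketch of how that summation could look in F# follows; the naive lowercase tokenization and punctuation handling are assumptions, and words not present in the model are simply skipped:

```fsharp
open System

/// Vectorize a sentence by summing the vectors of its known words.
/// Tokenization here is a naive lowercase split on spaces and punctuation.
let vectorizeSentence (model: Map<string, float[]>) (sentence: string) : float[] =
    let vectors =
        sentence.ToLowerInvariant().Split(
            [| ' '; '.'; ','; ';'; '!'; '?' |],
            StringSplitOptions.RemoveEmptyEntries)
        |> Array.choose (fun word -> Map.tryFind word model)
    if Array.isEmpty vectors then
        failwith "no known words in sentence"
    else
        // Element-wise sum of all word vectors
        vectors |> Array.reduce (Array.map2 (+))
```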
Before going much further, let’s take a look at how this looks in practice. The code snippet below uses the GloVe word vectorizations to vectorize a sentence. I’ve included the resulting vector. Now that we know how to convert sentences to vectors, we can start using all sorts of comparison tricks.
```fsharp
let modelFile = "./glove.6B/glove.6B.300d.txt"
```
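To illustrate the same call shape without the full 300-dimensional model, here is a tiny hand-made stand-in; every vector below is made up purely for demonstration:

```fsharp
// Tiny illustrative model standing in for the full GloVe map;
// real vectors are 300-dimensional, these are made up.
let tinyModel : Map<string, float[]> =
    Map.ofList
        [ "the", [| 0.1; 0.2; 0.3 |]
          "cat", [| 0.5; 0.1; 0.0 |]
          "sat", [| 0.0; 0.4; 0.2 |] ]

// Sum the vectors of the known words, as the sentence vectorizer does.
let sentenceVec =
    "the cat sat".Split([| ' ' |])
    |> Array.choose (fun w -> Map.tryFind w tinyModel)
    |> Array.reduce (Array.map2 (+))

printfn "%A" sentenceVec   // roughly [| 0.6; 0.7; 0.5 |]
```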
How do comparisons between vectors work? The options here are quite varied, but a couple of common methods bubble to the top. I’m going to use cosine similarity, which is essentially a calculation of the angle between two vectors (in this case, the vectors of two sentences). The results range from -1 to 1, where 1 is very similar and -1 indicates an opposite concept. Right in the middle is 0, which indicates that the two vectors are orthogonal, and not related at all. Like so many other parts of this process, there are a variety of methods, the Pearson correlation coefficient being one. The last part just takes it a step further: given a set of sentences, which ones are closest to the “query” sentence? This is simply a map over the sentences, performing the required vectorization and comparisons.
```fsharp
/// Cosine similarity between two vectors
```
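A sketch of cosine similarity (the dot product divided by the product of the magnitudes), along with a hypothetical helper that maps it over a set of candidate sentence vectors to rank them against a query; both function names are illustrative:

```fsharp
/// Cosine similarity between two vectors: dot product divided by
/// the product of the magnitudes. Ranges from -1 to 1.
let cosineSimilarity (a: float[]) (b: float[]) : float =
    let dot = Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b
    let magnitude v = v |> Array.sumBy (fun x -> x * x) |> sqrt
    dot / (magnitude a * magnitude b)

/// Rank candidate (sentence, vector) pairs by similarity to a query vector.
let rankBySimilarity (queryVec: float[]) (candidates: (string * float[]) list) =
    candidates
    |> List.map (fun (text, vec) -> text, cosineSimilarity queryVec vec)
    |> List.sortByDescending snd
```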
Time to put it all together. Below I have a query sentence as well as a corpus of other sentences that I’ll use for comparison. The goal is to see which of those sentences are closest to the query sentence. The key here is that I’m not looking for specific word matches. I expect to match closer to sentences that carry similar meaning, possibly related to apples, fruit, food, and eating. Time to see how well a very naive implementation can do.
```fsharp
let modelFile = "./glove.6B/glove.6B.300d.txt"
```
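To show the overall shape of that pipeline, here is a self-contained miniature with a made-up two-dimensional model and corpus; the real version loads the 300-dimensional GloVe file and uses much longer sentences, but the flow is the same:

```fsharp
open System

// Made-up two-dimensional "model" standing in for the GloVe file;
// every vector and sentence here is illustrative only.
let model : Map<string, float[]> =
    Map.ofList
        [ "apples", [| 0.9; 0.1 |]
          "fruit",  [| 0.8; 0.2 |]
          "eat",    [| 0.7; 0.3 |]
          "stocks", [| 0.1; 0.9 |]
          "market", [| 0.2; 0.8 |]
          "i",      [| 0.5; 0.5 |]
          "like",   [| 0.5; 0.5 |]
          "the",    [| 0.5; 0.5 |] ]

// Sum the vectors of known words; unknown words are skipped.
let vectorize (sentence: string) =
    sentence.ToLowerInvariant().Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.choose (fun w -> Map.tryFind w model)
    |> Array.reduce (Array.map2 (+))

let cosine (a: float[]) (b: float[]) =
    let dot = Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b
    let mag v = v |> Array.sumBy (fun x -> x * x) |> sqrt
    dot / (mag a * mag b)

let query = "i like to eat apples"
let corpus = [ "fruit i like"; "the stocks market" ]

let queryVec = vectorize query

// Score and rank every corpus sentence against the query.
let ranked =
    corpus
    |> List.map (fun s -> s, cosine queryVec (vectorize s))
    |> List.sortByDescending snd

ranked |> List.iter (fun (s, score) -> printfn "%.4f  %s" score s)
```

Even with these toy numbers, the fruit-related sentence scores higher against the fruit-related query than the finance one does, which is the behavior the full model exhibits at scale.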
I said it before, but the big caveat is that this is meant to provide a general feeling for the components and methodologies involved. The results below won’t be fast, or great (honestly), but they provide a base on which better things can be built. I’ve starred the ones I would expect to match. There is a higher concentration of similar sentences at the top of the list, although there are some interesting matches. For something this simple, that’s encouraging.
```
*0.9072 Knowledge is knowing a tomato is a fruit; wisdom is not putting it in a fruit salad
```
Now, the harsh truth. This is cool, but there are some severe limitations. First, the results are better than random, but could use some boosting to be more useful. This leads to the second point: the methodologies for vectorization and similarity comparison are naive. There is room for improvement there. Third, the implementation is not necessarily the most performant. The good thing is, these are all solvable, or at least things that can be improved. Fourth, how can these techniques be applied to document search (which has its own set of nuances)? These are good topics to dig into next time. I hope you found this at least mildly informative. Until then, keep coding.