Vector search is an essential tool for many modern applications, particularly for information retrieval. It allows for efficient searching and retrieval of information based on similarity between vectors. Today I will explore how to implement a naive vector search with F#. Whether you are a seasoned F# developer or just starting out, this post will provide some useful information if you want to better understand the basics of vector search for your own projects.
The impetus for this post is to dig into vector search and how it can be used in a project. It has become such a hot topic as of late that it is valuable to dig into the underlying concepts. Even though what I show here isn’t large-scale or high-performance, it will describe some of the underlying concepts, which can help you understand what the big solutions are talking about. This will be a multi-phase process. I first want to provide a breakdown of what vector search can do in the simplest way possible. The resulting code won’t be the best search algorithm, nor the fastest. But once the basics are covered, I’ll start replacing pieces until I get to a more usable result. It’s also worthwhile to note up front that there isn’t one right way to do vector search. There are lots of variations, depending on the specific use case. I’ll at least show some of them and point in directions for deeper information.
First things first: what is vector search? Vector search is a way to find similar entries based on their features, specifically represented as vectors (arrays of numbers). For example, if you have a bunch of text documents, you can represent each document as a vector of numbers. You can then compare these vectors to find documents that are similar to each other, even if they use different words. This works for anything that can be represented as a vector, so although I’ll focus mostly on text, it conceptually applies to other types of data, like images or audio. The idea is that similar things will have similar features, so their vectors will be similar too. And I know what you’re asking: how can text be represented as an array of numbers? For now, “magic”, but I’ll dig into that a bit.
The starting point is going to be comparing sentences. I’ll get into the details soon, but for my implementation I’m going to need a list of already-vectorized words. There are several sources, but I’ve decided to use GloVe from Stanford’s NLP lab. They provide embeddings in multiple vector sizes; I’m going to use a vector size of 300. You can think of larger vectors as providing a better representation, with the obvious tradeoff of being slower and larger. Once loaded, I’ll store the word -> vector mappings in a Map.
The loadGloVeModel function loads the downloaded pre-trained GloVe model from a file and returns it as a WordModel. Each key in the map is a word, and each value is a 300-dimensional float array representing that word’s vector.
```fsharp
open System
```
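As a minimal sketch, `loadGloVeModel` could look something like the following, assuming the standard GloVe text format (each line is a word followed by its space-separated vector components). The `WordModel` alias and the parsing details here are illustrative:

```fsharp
open System.IO

/// Alias for the word -> vector mapping (name taken from the text above)
type WordModel = Map<string, float[]>

/// Load a pre-trained GloVe model: each line holds a word followed by
/// its space-separated vector components.
let loadGloVeModel (path: string) : WordModel =
    File.ReadLines path
    |> Seq.map (fun line ->
        let parts = line.Split([| ' ' |])
        parts.[0], parts.[1..] |> Array.map float)
    |> Map.ofSeq
```

For the 300-dimensional file this produces a map of roughly 400,000 entries, so loading takes a moment; the upside is that lookups afterwards are cheap.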
Once I have a list of words and their associated vectors, I can use them to vectorize a sentence. In this particular case I’ve decided to add the vectors of each word together to calculate a sentence’s vector. Without going into too much detail just yet, this can be handled in several different ways. Other typical approaches include averaging the word vectors and multiplying vectors element-wise. There are more sophisticated methods, but I don’t want to get ahead of myself just yet.
```fsharp
/// Vectorize sentence
```
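A sketch of how that summation could look in F# follows; the naive lowercase tokenization and punctuation handling are assumptions, and words not present in the model are simply skipped:

```fsharp
open System

/// Vectorize a sentence by summing the vectors of its known words.
/// Tokenization here is a naive lowercase split on spaces and punctuation.
let vectorizeSentence (model: Map<string, float[]>) (sentence: string) : float[] =
    let vectors =
        sentence.ToLowerInvariant().Split(
            [| ' '; '.'; ','; ';'; '!'; '?' |],
            StringSplitOptions.RemoveEmptyEntries)
        |> Array.choose (fun word -> Map.tryFind word model)
    if Array.isEmpty vectors then
        failwith "no known words in sentence"
    else
        // Element-wise sum of all word vectors
        vectors |> Array.reduce (Array.map2 (+))
```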
Before going much further, let’s take a look at how this looks in practice. The code snippet below uses the GloVe word vectorizations to vectorize a sentence. I’ve included the resulting vector. Now that we know how to convert sentences to vectors, we can start using all sorts of comparison tricks.
```fsharp
let modelFile = "./glove.6B/glove.6B.300d.txt"
```
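To illustrate the same call shape without the full 300-dimensional model, here is a tiny hand-made stand-in; every vector below is made up purely for demonstration:

```fsharp
// Tiny illustrative model standing in for the full GloVe map;
// real vectors are 300-dimensional, these are made up.
let tinyModel : Map<string, float[]> =
    Map.ofList
        [ "the", [| 0.1; 0.2; 0.3 |]
          "cat", [| 0.5; 0.1; 0.0 |]
          "sat", [| 0.0; 0.4; 0.2 |] ]

// Sum the vectors of the known words, as the sentence vectorizer does.
let sentenceVec =
    "the cat sat".Split([| ' ' |])
    |> Array.choose (fun w -> Map.tryFind w tinyModel)
    |> Array.reduce (Array.map2 (+))

printfn "%A" sentenceVec   // roughly [| 0.6; 0.7; 0.5 |]
```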
How do comparisons between vectors work? The options here are quite varied, but a couple of common methods bubble to the top. I’m going to use cosine similarity, which is essentially a calculation of the angle between two vectors (in this case, the vectors of two sentences). The results range from -1 to 1, where 1 is very similar and -1 indicates an opposite concept. Right in the middle is 0, which indicates that the two vectors are orthogonal, and not related at all. Like so many other parts of this process, there are a variety of methods, the Pearson correlation coefficient being one. The last part just takes it a step further: given a set of sentences, which ones are closest to the “query” sentence? This is simply a map over the sentences, performing the required vectorization and comparisons.
```fsharp
/// Cosine similarity between two vectors
```
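A sketch of cosine similarity (the dot product divided by the product of the magnitudes), along with a hypothetical helper that maps it over a set of candidate sentence vectors to rank them against a query; both function names are illustrative:

```fsharp
/// Cosine similarity between two vectors: dot product divided by
/// the product of the magnitudes. Ranges from -1 to 1.
let cosineSimilarity (a: float[]) (b: float[]) : float =
    let dot = Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b
    let magnitude v = v |> Array.sumBy (fun x -> x * x) |> sqrt
    dot / (magnitude a * magnitude b)

/// Rank candidate (sentence, vector) pairs by similarity to a query vector.
let rankBySimilarity (queryVec: float[]) (candidates: (string * float[]) list) =
    candidates
    |> List.map (fun (text, vec) -> text, cosineSimilarity queryVec vec)
    |> List.sortByDescending snd
```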
Time to put it all together. Below I have a query sentence as well as a corpus of other sentences that I’ll use for comparison. The goal is to see which of those sentences are closest to the query sentence. The key here is that I’m not looking for specific word matches. I expect to match closer to sentences that carry similar meaning, possibly related to apples, fruit, food, and eating. Time to see how well a very naive implementation can do.
```fsharp
let modelFile = "./glove.6B/glove.6B.300d.txt"
```
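To show the overall shape of that pipeline, here is a self-contained miniature with a made-up two-dimensional model and corpus; the real version loads the 300-dimensional GloVe file and uses much longer sentences, but the flow is the same:

```fsharp
open System

// Made-up two-dimensional "model" standing in for the GloVe file;
// every vector and sentence here is illustrative only.
let model : Map<string, float[]> =
    Map.ofList
        [ "apples", [| 0.9; 0.1 |]
          "fruit",  [| 0.8; 0.2 |]
          "eat",    [| 0.7; 0.3 |]
          "stocks", [| 0.1; 0.9 |]
          "market", [| 0.2; 0.8 |]
          "i",      [| 0.5; 0.5 |]
          "like",   [| 0.5; 0.5 |]
          "the",    [| 0.5; 0.5 |] ]

// Sum the vectors of known words; unknown words are skipped.
let vectorize (sentence: string) =
    sentence.ToLowerInvariant().Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.choose (fun w -> Map.tryFind w model)
    |> Array.reduce (Array.map2 (+))

let cosine (a: float[]) (b: float[]) =
    let dot = Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b
    let mag v = v |> Array.sumBy (fun x -> x * x) |> sqrt
    dot / (mag a * mag b)

let query = "i like to eat apples"
let corpus = [ "fruit i like"; "the stocks market" ]

let queryVec = vectorize query

// Score and rank every corpus sentence against the query.
let ranked =
    corpus
    |> List.map (fun s -> s, cosine queryVec (vectorize s))
    |> List.sortByDescending snd

ranked |> List.iter (fun (s, score) -> printfn "%.4f  %s" score s)
```

Even with these toy numbers, the fruit-related sentence scores higher against the fruit-related query than the finance one does, which is the behavior the full model exhibits at scale.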
I said it before, but the big caveat is that this is meant to provide a general feeling for the components and methodologies involved. The results below won’t be fast, or great (honestly), but they provide a base on which better things can be built. I’ve starred the ones I would expect to match. There is a higher concentration of similar sentences at the top of the list, although there are some interesting matches. For something this simple, that’s encouraging.
```
*0.9072 Knowledge is knowing a tomato is a fruit; wisdom is not putting it in a fruit salad
```
Now, the harsh truth. This is cool, but there are some severe limitations. First, the results are better than random, but could use some boosting to be more useful. This leads to the second point: the methodologies for vectorization and similarity comparison are naive. There is room for improvement there. Third, the implementation is not necessarily the most performant. The good thing is, these are all solvable, or at least things that can be improved. Fourth, how can these techniques be applied to document search (which has its own set of nuances)? These are good topics to dig into next time. I hope you found this at least mildly informative. Until then, keep coding.