Real World™ Machine Learning on Fly GPUs

Fly Bird in a purple spaceship with a ghost searching planets.
Image by Annie Ruygt

We’re Fly.io. We run apps for our users on hardware we host around the world. Fly.io happens to be a great place to use GPUs. Check out how to get started!

We live in truly a time of wonders; every single week, a new demo drops with a mind-blowing glimpse of what our incredible AI future might be, from generating illustrations, to front-end apps, to very convincing text. However…

If we’re being honest, those demos, while incredible, are difficult to even imagine putting into a real-world application. We’re not all living in a world where we can send our customers’ personal data off to a SaaS. Nor are all of us enthusiastic to lock into another developer SaaS platform. Most of us simply want to help our customers solve their problems using our own tools.

The good news is we have options. Models are being trained and released under open-source licenses on HuggingFace. Elixir is expanding its capability and reach far past the world of the Web and into Machine Learning. And now that Fly.io is offering GPUs to anyone who is interested, it’s as simple as choosing the correct Dockerfile and running a flyctl command to have some of the world’s most powerful devices working for us, running in the same data centers as our applications, close to our customers, and starting and stopping when we need them.

Search all of HexDocs, again

Let’s see how we can use a GPU in a real-world project. In a previous post, we went on an adventure, collecting all of the Hexdocs and building a SQLite database with them using the built-in SQLite FTS5 search index, culminating in a kinda useful website. This actually started a ball rolling to get a more useful and universal docs search going that’s available to everyone: see “Introduce search across all of HexDocs”. So let’s try our hand at building our own semantic search for all of Hex Docs!

In this post, we will build off that existing database and build a new search index. We’ll be doing the following:

  • Using a Fly.io GPU with Livebook
  • Using Bumblebee.Text.TextEmbedding.text_embedding to generate a vector for each document.
  • Indexing those vectors using hnswlib.
  • Querying the index using both Bumblebee and hnswlib.

Setup

We’ll be following the GPU Quickstart guide, and because we’re not starting from an existing Phoenix project, we’re going to start our project a little differently in an empty directory:

fly apps create --region iad

Create/modify the fly.toml:

app = "name"
primary_region = "iad"
vm.size = "a100-80gb"

# Use a volume to store LLMs or any big file that doesn't fit in a Docker image
[[mounts]]
  source = "data"
  destination = "/data"
  initial_size = "40gb"

[build]
  image = "ghcr.io/livebook-dev/livebook:latest-cuda12.1"

[env]
  ELIXIR_ERL_OPTIONS = "-proto_dist inet6_tcp +sssdio 128"
  LIVEBOOK_DATA_PATH = "/data"
  LIVEBOOK_HOME = "/data"
  LIVEBOOK_IP = "::"
  LIVEBOOK_ROOT_PATH = "/data"
  BUMBLEBEE_CACHE_DIR = "/data/cache/bumblebee"
  XLA_CACHE_DIR = "/data/cache/xla"
  XLA_TARGET = "cuda120"
  PORT = "8080"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = false
  auto_start_machines = false
  min_machines_running = 1
  processes = ["app"]

Take note of some newish options: specifically vm.size, where we call out the GPU size we’re interested in, and initial_size on our volume. We’re also referencing the official Livebook CUDA image.

We will need to set a secret for the Livebook password:

fly secrets set LIVEBOOK_PASSWORD="very secret password"

Finally, run fly deploy and that’s it! When it’s done deploying, we’ll have a fully working Livebook, and we can visit the URL and log in with our very secret password!

When we normally use one of these Large Language Models (LLMs), our experience is essentially converse(model, "Hello AI"), but what’s actually happening is as follows:

  • Your text is encoded into a list of floats, otherwise known as an Embedding Vector…
  • … then “passed through” the model, which outputs another vector
  • … which is decoded into text.

Those vectors live in a multi-dimensional vector space, and you can compare the cosine of the angle between two N-dimensional vectors to see how similar they are.
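For instance, here’s a minimal sketch of that comparison using Nx (the tensors are toy values, not real model output):

a = Nx.tensor([0.1, 0.3, 0.5])
b = Nx.tensor([0.2, 0.1, 0.4])

# Cosine similarity: the dot product divided by the product of the norms
cosine_similarity =
  Nx.divide(
    Nx.dot(a, b),
    Nx.multiply(Nx.LinAlg.norm(a), Nx.LinAlg.norm(b))
  )

The closer that value is to 1, the more similar the two vectors are; HNSWLib’s :cosine space measures distance as one minus this value, which is why smaller weights mean better matches in the results later on.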

This sounds complex, but by the end of this post, we’ll have done it. We are going to use the text_embedding function from Bumblebee with a good model for text retrieval, pipe that into a tool for indexing vectors by cosine similarity, and ideally end up with better search results than SQLite FTS5!

Choosing a model is not a small task; based on a very scientific search, I found this chart that claims to test and compare models, and the bge-large model seems to be compatible with Bumblebee and pretty high on the chart, so let’s roll with it! If you are playing along at home, the bge-small model also seems to do pretty well!

Let’s open a new Livebook and get started!

Mix.install([
  {:bumblebee, "0.4.2"},
  {:exla, "0.6.4"},
  {:hnswlib, "0.1.4"}
])

# Configure EXLA's CUDA client so computations run on the GPU
Application.put_env(:exla, :clients,
  cuda: [
    platform: :cuda,
    lazy_transfers: :never
  ]
)

Nx.global_default_backend(EXLA.Backend)

Next, we add a new cell for our model:

repo = {:hf, "BAAI/bge-large-en"}
{:ok, model_info} = Bumblebee.load_model(repo, architecture: :base)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

Here we download our model and load it up. To keep this post from becoming too long, we’re skipping the part where we massage the data into a clean format; I simply queried it from the SQLite database we built in the last post and ended up with something like this:

docs = [
  {"string-id", "doc text and title....sometimes very long"},
  ...
]

There is one small issue here: our doc text can sometimes be longer than the model’s maximum sequence_length of 512. If we don’t chunk up our documents before calculating the embeddings, anything beyond that limit will be truncated.

docs =
  Enum.flat_map(docs, fn {id, doc} ->
    doc
    |> String.codepoints()
    |> Stream.chunk_every(512)
    |> Stream.with_index()
    |> Enum.map(fn {chunk, chunk_id} ->
      doc = Enum.join(chunk)
      {"#{id}-#{chunk_id}", doc}
    end)
  end)

This chunks each string into 512-character segments and then gives each segment a new id of id-N to help us keep the order of everything later.
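For example, a hypothetical 1,100-character doc would come out of this pipeline as three entries (the id and text here are invented for illustration):

[
  {"ecto-query-0", "first 512 characters..."},
  {"ecto-query-1", "next 512 characters..."},
  {"ecto-query-2", "remaining 76 characters..."}
]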

Building our Index

Now we have everything we need to finally calculate our embeddings and create our HNSW index.

dim = 1024
space = :cosine
batch_size = 64
sequence_length = 512

Here we have some variables:

  • The embedding (for the large model) will be a vector of dimension 1024, so our index needs that size.
  • We want to use :cosine similarity when building our index.
  • Our Fly.io GPU can handle a batch_size of 64 embeddings at once; a smaller machine may need to tune that down, possibly as low as 1. Finding the right value takes experimentation: start at 1 and increase by powers of two until XLA complains about running out of memory in the next steps.
  • The sequence_length is defined by the model as the max number of tokens it was trained on.

serving =
  Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer,
    defn_options: [compiler: EXLA, lazy_transfers: :never],
    output_attribute: :hidden_state,
    output_pool: :mean_pooling,
    compile: [sequence_length: sequence_length, batch_size: batch_size]
  )

Kino.start_child(
  {Nx.Serving, serving: serving, name: BGE, batch_size: batch_size, batch_timeout: 0}
)

This will set up a process to handle our text embedding requests. Most of these options are optimizations I picked up from the Erlref Slack #machine-learning channel, hat-tip to them!

The only key bit for us is the name BGE, which we will reference as the process to send requests to. A quick smoke test of the serving is sketched below, followed by the entire indexing function, which I’ll break down bit by bit afterward.
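As a sanity check (a hypothetical smoke test, not part of the original pipeline), we can send a single string through the serving and confirm the size of the vector we get back:

# Hypothetical smoke test: embed one string via the named serving
%{embedding: embedding} = Nx.Serving.batched_run(BGE, "hello world")
# For bge-large this should be a {1024}-shaped tensor
Nx.shape(embedding)

With that working, here is the entire indexing function: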

max_elements = Enum.count(docs)
{:ok, index} = HNSWLib.Index.new(space, dim, max_elements)

indexed_docs =
  docs
  |> Stream.chunk_every(batch_size)
  |> Stream.flat_map(fn chunk ->
    chunks = Enum.map(chunk, fn {_id, doc} -> doc end)
    embeddings = Nx.Serving.batched_run(BGE, chunks)
    Enum.zip(chunk, embeddings)
    |> Enum.map(fn {{id, doc}, %{embedding: t}} ->
      {id, doc, t}
    end)
  end)
  |> Stream.with_index()
  |> Stream.map(fn {{composite_id, doc, embedding}, i} ->
    HNSWLib.Index.add_items(index, embedding, ids: [i])
    IO.puts("#{Float.round(i / max_elements * 100, 3)}%")
    {i, {composite_id, doc}}
  end)
  |> Enum.into(%{})

HNSWLib.Index.save_index(index, "/data/bge-index-large.bin")
File.write!("/data/indexed_docs.bin", :erlang.term_to_binary(indexed_docs))

First up, we set up the HNSWLib.Index and chunk the docs list into pieces the same size as our batch_size using Stream.chunk_every.

ASIDE: This is a very good example of when to use a Stream over Enum. In this case docs is not small, possibly a hundred or more megabytes, and if we used Enum it would create a copy of the list between every step.

We flat_map over the chunks, sending each chunk to Nx.Serving.batched_run(BGE, chunks). We reference our serving process by name, BGE, and the result is a list of maps with the key :embedding.

We need to associate those back with the original data so that when we index it, we can search for it, so we zip the embeddings up with the original docs and continue building our Stream pipeline.

When using HNSWLib you can give it an “id” for a document, but it’s limited to integers, so by using Stream.with_index we give ourselves an incrementing integer for each individual doc. We add each embedding to the index with its integer id and return that id paired with the doc data.

Finally, because we’ll want to quickly grab a document by its id, we dump the final results into a map. We’re also logging progress, because even on this very beefy GPU the whole run still takes about an hour.
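The resulting indexed_docs map ends up shaped roughly like this (hypothetical values):

%{
  0 => {"ecto-query-0", "A join query ..."},
  1 => {"ecto-query-1", "...the next chunk of that doc..."},
  2 => {"phoenix-liveview-0", "..."}
}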

At the end, I save the index and the map to disk, just in case I do something dumb and lose my Livebook context!
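If the Livebook context does get lost, both can be pulled back in from the volume; here’s a rough sketch, assuming HNSWLib.Index.load_index takes the same space and dim we used to build the index:

# Restore the saved index and doc map after a restart (sketch)
{:ok, index} = HNSWLib.Index.load_index(space, dim, "/data/bge-index-large.bin")
indexed_docs = "/data/indexed_docs.bin" |> File.read!() |> :erlang.binary_to_term()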

Querying!

Now to query, we’ll generate the embedding for our query, use it with HNSWLib.Index.query, and finally, we’ll grab our doc from the indexed_docs map we built above and see if we did a good job!

This model recommends a prompt when searching:

prompt = "Represent this question for searching relevant passages:"

We’ll append our question to that prompt, send it through the same Nx.Serving setup as before, and then query HNSWLib:

query = "#{prompt} How do I make an inner join using ecto?"
[%{embedding: query}] = Nx.Serving.batched_run(BGE, [query])
{:ok, results, weights} = HNSWLib.Index.knn_query(index, query, k: 3)

And finally, we need to massage the results that HNSWLib gives back:

results = Nx.to_list(results) |> hd()
weights = Nx.to_list(weights) |> hd()
unindexed_docs = Map.values(indexed_docs)

for {w, id} <- Enum.zip(weights, results) do
  # Look up the chunk for this integer id, then recover the original doc id
  {composite_id, _doc} = Map.get(indexed_docs, id)
  [doc_id, _chunk_id] = String.split(composite_id, "-")
  to_find = "#{doc_id}-"

  # Gather every chunk of that doc, sort them by id, and stitch them back together
  doc =
    unindexed_docs
    |> Enum.filter(fn {id, _doc} -> String.starts_with?(id, to_find) end)
    |> Enum.sort_by(fn {id, _} -> id end)
    |> Enum.map(fn {_id, doc} -> doc end)
    |> Enum.join()

  IO.puts("#{w}: #{doc}")
end

For each result, we get the chunk using our map, find all chunks that share its id prefix, sort them, and finally join them back into a single doc.

Results:

0.09391725063323975: Joining with fragments - Ecto.Query...
0.09523665904998779: Ecto.Query.join/5 A join query ..
0.09840273857116699: locus - Documentation Supported F...

Not bad! It found two relevant Ecto docs and a third document of questionable relevance! The query itself took under 100ms, which is well within the realm of autocomplete-level performance.
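If you want to check that kind of number yourself, a quick-and-dirty measurement (a hypothetical helper, not from the original notebook) is to wrap the two query steps in :timer.tc:

text = "#{prompt} How do I make an inner join using ecto?"

# Time the embedding + knn lookup together; :timer.tc returns microseconds
{micros, {:ok, _results, _weights}} =
  :timer.tc(fn ->
    [%{embedding: q}] = Nx.Serving.batched_run(BGE, [text])
    HNSWLib.Index.knn_query(index, q, k: 3)
  end)

IO.puts("query took #{micros / 1000} ms")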

Wrap up

I would be lying if I said any of this was simple. I had to learn about GPUs, fight Nvidia packages, find and diagnose bugs in multiple Elixir packages, use trial and error to find the correct optimized options for XLA, and also develop even more patience, because this stuff takes forever to run. Here are some of my top tips for less pain:

  • Use Nvidia’s Dockerfiles. Livebook’s Dockerfile is based on Nvidia’s, and that seems to just work. It is pure pain to deviate from this path.
  • Join the Erlref Slack #machine-learning channel. The more you push Bumblebee/Nx, the more you will want advice from the authors of those packages, and vice versa; they love getting feedback!
  • Write results to disk. I lost a couple of very long-running experiments because I didn’t write the results to disk. Everything about this process gobbles up memory, and you can quickly OOM and lose everything. Or, you know, you can have a syntax error that takes 11 hours to surface.

Getting this ready for production is as simple as spinning up an Elixir server that loads the model and the index and joins our production cluster, allowing our web servers to call it.
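As a rough sketch of what that might look like (the module names and the serving helper are hypothetical, not from this post), the search node’s application could start the serving under the same well-known name:

# Hypothetical application module for a dedicated "docs search" node
defmodule DocsSearch.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Start the embedding serving under a well-known name; other nodes in the
      # cluster can then call Nx.Serving.batched_run(BGE, texts) to get embeddings.
      {Nx.Serving, serving: DocsSearch.Embedding.serving(), name: BGE, batch_size: 64}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: DocsSearch.Supervisor)
  end
end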

Fly.io ❤️ Elixir

Fly.io is a great way to run your Phoenix LiveView apps. It’s really easy to get started. You can be running in minutes.

Deploy a Phoenix app today!