AI GPU Clusters, From Your Laptop, With Livebook

A cartoon of the Fly balloon and a Livebook roasting marshmallows over a campfire.
Image by Annie Ruygt

Livebook, FLAME, and the Nx stack: three Elixir components that are easy to describe, more powerful than they look, and intricately threaded into the Elixir ecosystem. A few weeks ago, Chris McCord (👋) and Chris Grainger showed them off at ElixirConf 2024. We thought the talk was worth a recap.

Let’s begin by introducing our cast of characters.

Livebook is usually described as Elixir’s answer to Jupyter Notebooks. And that’s a good way to think about it. But Livebook takes full advantage of the Elixir platform, which makes it sneakily powerful. By linking up directly with Elixir app clusters, Livebook can switch easily between driving compute locally or on remote servers, and it makes it easy to bring any kind of data into reproducible workflows.

FLAME is Elixir’s answer to serverless computing. By having the library manage a pool of executors for you, FLAME lets you treat your entire application as if it were elastic and scale-to-zero. You configure FLAME with some basic information about where to run code and how many instances it’s allowed to run, and then mark off any arbitrary section of code with FLAME.call. The framework takes care of the rest. It’s the upside of serverless without committing yourself to blowing your app apart into tiny, intricately connected pieces.
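
A minimal sketch of what that looks like in practice; the pool name, sizing, and do_expensive_work/0 below are made up for illustration:

```elixir
# In your app's supervision tree: a FLAME pool that scales between
# 0 and 10 runners, each handling up to 5 concurrent calls.
{FLAME.Pool,
 name: MyApp.SamplePool,
 min: 0,
 max: 10,
 max_concurrency: 5,
 idle_shutdown_after: 30_000}

# Anywhere in your code, wrap the slow part in FLAME.call/2 and it
# runs on one of the pool's runners as if it were local.
FLAME.call(MyApp.SamplePool, fn ->
  do_expensive_work()
end)
```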

The Nx stack is how you do Elixir-native AI and ML. Nx gives you an Elixir-native notion of tensor computations with GPU backends. Axon builds a common interface for ML models on top of it. Bumblebee makes those models available to any Elixir app that wants to download them, with just a couple of lines of code.
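
To make “a couple of lines of code” concrete, here’s roughly what the Bumblebee flow looks like (the model here is just an example, not the one from the talk):

```elixir
# Download a model from Hugging Face and wrap it in an Nx.Serving
# that any part of your app can call.
{:ok, model_info} = Bumblebee.load_model({:hf, "bert-base-uncased"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

serving = Bumblebee.Text.fill_mask(model_info, tokenizer)
Nx.Serving.run(serving, "Elixir is a [MASK] language.")
```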

Here is a quick video showing how to transfer a local tensor to a remote GPU, using Livebook, FLAME, and Nx:
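
If you’d rather read than watch, the core of the demo is roughly this shape; the pool name is hypothetical, and the runner is assumed to have a CUDA-enabled EXLA build:

```elixir
# Build a tensor locally on the CPU, then have a GPU-backed FLAME
# runner pull it over and finish the computation on the remote GPU.
tensor = Nx.iota({1024, 1024}, type: :f32)

FLAME.call(MyApp.GpuPool, fn ->
  tensor
  |> Nx.backend_transfer(EXLA.Backend)
  |> Nx.sum()
end)
```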

Let’s dive into the keynote.

Poking a hole in your infrastructure

Any Livebook, including the one running on your laptop, can start a runtime on a Fly Machine in Fly.io’s public cloud. That Machine will (by default) live in your default Fly.io organization, giving it networked access to all the other apps that might live there.

This is an access control situation that mostly just does what you want it to do without asking. Unless you ask it to, Fly.io isn’t exposing anything to the Internet, or to other users of Fly.io. For instance: say we have a database we’re going to use to generate reports. It can hang out in our Fly.io organization, inside a private network with no connectivity to the outside world. We can spin up a Livebook instance that can talk to it, without doing any network or infrastructure engineering to make that happen.

But wait, there’s more. Because this is all Elixir, Livebook also lets you connect to any running Erlang/Elixir application in your infrastructure to debug, introspect, and monitor it.

Check out this clip of Chris McCord connecting to an existing application during the keynote:

Running a snippet of code from a laptop on a remote server is a neat trick, but Livebook is doing something deeper than that. It’s taking advantage of Erlang/Elixir’s native facility with cluster computation and making it available to the notebook. As a result, when we do things like auto-completing, Livebook delivers results from modules defined on the remote node itself. 🤯

Elastic scale with FLAME

When we first introduced FLAME, the example we used was video encoding.

Video encoding is complicated and slow enough that you’d normally make arrangements to run it remotely, in a background job queue, or as a triggerable Lambda function. The point of FLAME is to get rid of all those steps and hand them over to the framework instead. So: we wrote our ffmpeg calls inline like normal code, as if they were going to complete in microseconds, and wrapped them in FLAME.call blocks. That was it, that was the demo.
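
In spirit, that demo boiled down to something like this; the pool name and ffmpeg arguments here are illustrative, not the exact demo code:

```elixir
# An inline ffmpeg call, written as if it were cheap, handed to a
# FLAME pool that runs it on a short-lived, beefier Machine.
def transcode(input_path, output_path) do
  FLAME.call(MyApp.FFMpegRunner, fn ->
    System.cmd("ffmpeg", ["-y", "-i", input_path, output_path])
  end)
end
```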

Here, we’re going to put a little AI spin on it.

The first thing we’re doing here is driving FLAME pools from Livebook. Livebook will automatically synchronize your notebook dependencies, as well as any modules or code defined in your notebook, across nodes. That means any code we write in our notebook can be dispatched transparently out to arbitrarily many compute nodes, without ceremony.
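
Here’s a rough sketch of what starting a pool from a notebook cell can look like, assuming FLAME’s Fly backend; the pool name, sizing, and GPU options are made up for illustration:

```elixir
# Kino.start_child/1 ties the pool's lifecycle to the notebook, so the
# runners go away when the runtime disconnects. Sizing and GPU options
# here are illustrative.
Kino.start_child(
  {FLAME.Pool,
   name: :video_runners,
   min: 0,
   max: 8,
   max_concurrency: 2,
   idle_shutdown_after: :timer.minutes(1),
   backend: {FLAME.FlyBackend, gpu_kind: "l40s", memory_mb: 16_384}}
)
```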

Now let’s add some AI flair. We take an object store bucket full of video files. We use ffmpeg to extract stills from each video at different moments. Then: we send them to Llama, running on GPU Fly Machines (still locked to our organization), to get descriptions of the stills.

All those stills and descriptions get streamed back to our notebook, in real time:

At the end, the descriptions are sent to Mistral, which builds a summary.
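
Stripped down, the whole pipeline has roughly this shape; extract_stills/1, describe/1, and summarize/1 stand in for the real ffmpeg and model-serving calls, and the pool name is hypothetical:

```elixir
# Fan each video out to a GPU-backed FLAME runner, collect the still
# descriptions, then summarize them at the end.
descriptions =
  videos
  |> Task.async_stream(
    fn video ->
      FLAME.call(:video_runners, fn ->
        video
        |> extract_stills()        # ffmpeg frame extraction
        |> Enum.map(&describe/1)   # Llama caption per still
      end)
    end,
    max_concurrency: 8,
    timeout: :infinity
  )
  |> Enum.flat_map(fn {:ok, stills} -> stills end)

summary = summarize(descriptions)  # Mistral builds the final summary
```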

Thanks to FLAME, we get explicit control over the minimum and maximum number of nodes we want running at once, as well as their concurrency settings. As each node finishes processing a video, a new one is automatically dispatched to it, until the whole bucket has been traversed. Each node shuts down automatically after an idle timeout, and the whole cluster terminates if you disconnect the Livebook runtime.

Just like your app code, FLAME lets you take your notebook code designed to run locally, change almost nothing, and elastically execute it across ephemeral infrastructure.

64-GPU hyperparameter tuning on a laptop

Next, Chris Grainger, CTO of Amplified, takes the stage.

For his work at Amplified, Chris wants to analyze a gigantic archive of patents, on behalf of a client doing edible cannabinoid work. To do that, he uses a BERT model (BERT, from Google, is one of the OG “transformer” models, optimized for text comprehension).

To make the BERT model effective for this task, he’s going to do a hyperparameter tuning run.

This is a much more complicated AI task than the Llama work we just showed. Chris is going to spin up a cluster of 64 GPU Fly Machines, each with an L40S GPU. On each of these nodes, he needs to:

  • set up its environment (including native dependencies and GPU bindings)
  • load the training data
  • compile a different version of BERT with different parameters, optimizers, etc.
  • start the fine-tuning
  • stream its results in real time to its assigned chart
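
As a very rough sketch (not Chris’s actual code), the dispatch side might look like this; build_and_train/2 and the pool name are hypothetical stand-ins for the Bumblebee/Axon fine-tuning work:

```elixir
# One FLAME call per hyperparameter configuration. Each runs on its own
# L40S-backed Machine and streams metrics back to the notebook process.
notebook = self()

hyperparams
|> Enum.map(fn params ->
  Task.async(fn ->
    FLAME.call(:trainer_pool, fn ->
      build_and_train(params, fn metrics ->
        send(notebook, {:metrics, params, metrics})
      end)
    end, timeout: :infinity)
  end)
end)
|> Task.await_many(:infinity)
```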

Here’s the clip. You’ll see the results stream in, in real time, directly back to his Livebook. We’ll wait, because it won’t take long to watch:

This is just the beginning

The suggestion of mixing Livebook and FLAME to elastically scale notebook execution was originally proposed by Chris Grainger during ElixirConf EU. Over the next four months, Jonatan Kłosko, Chris McCord, and José Valim worked part-time on making it a reality in time for ElixirConf US. Our ability to deliver such a rich combination of features in such a short period of time is a testament to the capabilities of the Erlang Virtual Machine, which Elixir and Livebook run on. Other features, such as remote dataframes and distributed GC, were implemented in a weekend. Bringing the same functionality to other ecosystems would take several additional months, sometimes accompanied by millions in funding, and oftentimes as part of a closed-source product.

Furthermore, since we announced this feature, Michael Ruoss stepped in and brought the same functionality to Kubernetes. As of Livebook v0.14.1, you can start Livebook runtimes inside a Kubernetes cluster and also use FLAME to elastically scale them. Expect more features and news in this space!

Finally, Fly.io’s infrastructure played a key role in making it possible to start a cluster of GPUs in seconds rather than minutes, and all it requires is a Docker image. We’re looking forward to seeing how other technologies and notebook platforms can leverage Fly.io to elevate their own developer experiences.
