Using LLama.cpp with Elixir and Rustler

Illustration of an Elixir droplet, a Rust crab, and a llama, with the droplet holding a magic wand up to the llama.
Image by Annie Ruygt

We’re Fly.io. We run apps for our users on hardware we host around the world. Fly.io happens to be a great place to use GPUs. Check out how to get started!

Recently I started playing with the idea of using LLama.cpp with Elixir as a NIF. Creating C/C++ NIFs in Erlang is kind of a project, and you need to be especially careful not to cause memory bugs. So I found a Rust wrapper around LLama.cpp, and since I have some experience with Rustler I thought I’d give it a go. If this is your first time hearing about Elixir and Rust(ler), you might want to go back and read my initial experiences!

Let’s use LLama.cpp in Elixir and make a new Library!

If you are not familiar with llama.cpp, it’s a project to build a native LLM application that happens to run many models just fine on a MacBook or any regular ole GPU. Today we’re going to do the following:

  • Setup a new Elixir Library!
  • Use the Rust Library rust-llama-cpp with Rustler
  • Build a proof of concept wrapper in Elixir
  • And see what we can do!

Scaffolding

Let’s start off like every great Elixir package with mix new llama_cpp. We won’t be needing a supervisor because this will be a thin shim over the Rust library, rust-llama-cpp. Opening up our mix.exs, let’s add our single dep:

defp deps do
  [
    {:rustler, "~> 0.30.0", runtime: false}
  ]
end

Next up, install our deps with mix deps.get, then run the Rustler scaffolding with mix rustler.new and follow the prompts one by one. Finally, add our last dependency to native/llamacpp/Cargo.toml:

[dependencies]
rustler = "0.30.0"
llama_cpp_rs = {version = "0.3.0", features = ["metal"]}
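
One note on that features list: metal is what lets llama.cpp use the GPU on Apple Silicon, which is what I’m running on. If you’re building elsewhere, a safe starting point is to drop the feature and run on the CPU; the snippet below is a sketch of that variant (the crate also advertises other GPU backends, but check the llama_cpp_rs README for the exact feature names your version supports):

[dependencies]
rustler = "0.30.0"
# CPU-only build: no "metal" feature, so this also compiles outside of macOS.
llama_cpp_rs = {version = "0.3.0"}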

We should be ready to go! Time to dive into some rust!

A little Rust and Elixir

Looking over the docs for the rust-llama-cpp library, the two core functions we’ll need to implement are LLama::new and llama.predict: new is the constructor for the LLama model, and predict handles the actual text prediction. Below is our LlamaCpp module in Elixir with the stubs that Rustler requires:

defmodule LlamaCpp do
  use Rustler,
    otp_app: :llama_cpp,
    crate: :llamacpp

  def new(_path), do: :erlang.nif_error(:nif_not_loaded)
  def predict(_llama, _query), do: :erlang.nif_error(:nif_not_loaded)
end

A typical workflow for exercising this code looks like this:

{:ok, llama} = LlamaCpp.new("path_to_model.gguf")
LlamaCpp.predict(llama, "Write a poem about elixir and rust being a good mix.")
# Very good poem I definitely wrote..

Now let’s stub out the Rustler code, replacing the body of native/llamacpp/src/lib.rs with:

use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};
use rustler::Encoder;
use rustler::{Env, LocalPid, NifStruct, ResourceArc, Term};

#[rustler::nif(schedule = "DirtyCpu")]
fn new(path: String) -> Result<LLama, ()> {
  LLama::new(path.into(), &ModelOptions::default())
}

#[rustler::nif(schedule = "DirtyCpu")]
fn predict(llama: LLama, query: String) -> String {
  llama.predict(query.into(), PredictOptions::default()).unwrap()
}
}

rustler::init!(
    "Elixir.LlamaCpp",
    [predict, new]
);

Right now if we try running this we will get multiple warnings and errors, but that’s okay: we’ve got our scaffolding. You might notice that our Elixir code returns a llama resource that we’re expecting to pass through to our other LlamaCpp.predict function, and while it would be awesome to say this worked “automagically”, it does not.

Rustler Resources

We’re going to have to set up a Resource that tells the BEAM virtual machine that this type is something we can hold on to, and that it should clean up once nothing holds a reference to it anymore. To do this we need to make some changes in our lib.rs:

use std::ops::Deref;

use rustler::{NifStruct, ResourceArc, Term};

pub struct ExLLamaRef(pub LLama);

#[derive(NifStruct)]
#[module = "LlamaCpp.Model"]
pub struct ExLLama {
    pub resource: ResourceArc<ExLLamaRef>,
}

impl ExLLama {
    pub fn new(llama: LLama) -> Self {
        Self {
            resource: ResourceArc::new(ExLLamaRef::new(llama)),
        }
    }
}

impl ExLLamaRef {
    pub fn new(llama: LLama) -> Self {
        Self(llama)
    }
}

impl Deref for ExLLama {
    type Target = LLama;

    fn deref(&self) -> &Self::Target {
        &self.resource.0
    }
}

unsafe impl Send for ExLLamaRef {}
unsafe impl Sync for ExLLamaRef {}

Because we cannot alter the LLama library directly without vendoring, we need to wrap it and provide the various implementations that the Rustler ResourceArc type requires. I do not fully understand why we need two types, an ExLLama and an ExLLamaRef, but I was using the Explorer library as a reference. My understanding is that in order to have the BEAM handle garbage collection you need to wrap your Rust data in a ResourceArc, which requires that your type implement the Send and Sync traits. The Deref implementation is simply a nice-to-have so we don’t need to manually dereference our reference type.
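
One detail worth spelling out: the NifStruct derive with module = "LlamaCpp.Model" means the resource crosses the NIF boundary as an Elixir struct of that name, so we also want a matching module on the Elixir side. A minimal sketch, with the :resource field mirroring the Rust struct’s resource field:

defmodule LlamaCpp.Model do
  @moduledoc false
  # Thin container for the resource handle returned by the NIF; the BEAM cleans up
  # the underlying Rust data once nothing references this struct anymore.
  defstruct [:resource]
end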

The benefit for us is that the BEAM will handle our memory for us and we only need to give it a handle to clean up when it’s done.

Now we can update our functions to look like this:

fn on_load(env: Env, _info: Term) -> bool {
    rustler::resource!(ExLLamaRef, env);
    true
}

#[rustler::nif(schedule = "DirtyCpu")]
fn new(path: String) -> Result<ExLLama, ()> {
    let model_options = ModelOptions::default();
    let llama = LLama::new(path.into(), &model_options).unwrap();
    Ok(ExLLama::new(llama))
}
// ...

rustler::init!(
    "Elixir.LlamaCpp",
    [predict, new],
    load = on_load
);

Notice that we added an on_load callback that registers our ExLLamaRef type, so the resource handling now Just Works™, and that we are using the defaults for ModelOptions.
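
If you later want something other than the defaults, Rust’s struct-update syntax is the natural fit. Here’s a sketch of what that might look like, assuming ModelOptions exposes an n_gpu_layers field as in the rust-llama-cpp README (field names can differ between crate versions, so treat this as illustrative):

#[rustler::nif(schedule = "DirtyCpu")]
fn new(path: String) -> Result<ExLLama, ()> {
    // Assumption: n_gpu_layers controls how many layers get offloaded to the GPU
    // (Metal on Apple Silicon); everything else stays at the crate defaults.
    let model_options = ModelOptions {
        n_gpu_layers: 32,
        ..Default::default()
    };
    let llama = LLama::new(path.into(), &model_options).map_err(|_| ())?;
    Ok(ExLLama::new(llama))
}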

Finally we can use that with our predict function updated like so:

#[rustler::nif(schedule = "DirtyCpu")]
fn predict(llama: ExLLama, query: String) -> String {
    let predict_options = PredictOptions::default();
    let result = llama.predict(query.into(), predict_options).unwrap();
    result
}

Here we accept the ExLLama as a parameter and simply call predict on it (the Deref impl from earlier is what lets us call the LLama method directly), unwrapping the result and returning a string!
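
One caveat: that unwrap means a failed prediction takes the whole NIF call down with it. If you’d rather hand {:error, reason} back to Elixir, something like the sketch below works, assuming the crate’s error type implements Display; the predict_safe name is purely illustrative, and it would need to be registered in rustler::init! alongside the others:

#[rustler::nif(schedule = "DirtyCpu")]
fn predict_safe(llama: ExLLama, query: String) -> Result<String, String> {
    // Rustler encodes this Result as {:ok, text} | {:error, message} on the Elixir side.
    llama
        .predict(query.into(), PredictOptions::default())
        .map_err(|e| e.to_string())
}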

Example output

Going back to our original example above, let’s try it with a real model and see how it does. Please note this is running on an M1 MacBook Pro with 16GB of RAM:

{:ok, llama} = LlamaCpp.new("openzephyrchat.Q4_K_M.gguf")
query = "Write a poem about elixir and rust being a good mix."
LlamaCpp.predict(llama, "GPT4 User: Follow the instructions below to complete the task:\n #{query}<|end_of_turn|>GPT4 Assistant:")
# a bunch of llama-cpp logs then.. and ~10s later
In the land of ancient lore, where myths entwine and roam,A tale of magic potion whispers and hums.\nElixir and rust, an unlikely pair,\nConjure up a story that leaves us in awe.\n\nBorn from the heart of alchemy's embrace,\nThe elixir, radiant as daylight's face,\nPromises youthful vitality and grace,\nA fountain of immortality to chase.\n\nYet rust, the nemesis of fine metal sheen,\nInfiltrates"

In the land of ancient lore, where myths entwine and roam, A tale of magic potion whispers and hums. Elixir and rust, an unlikely pair, Conjure up a story that leaves us in awe.

Born from the heart of alchemy’s embrace, The elixir, radiant as daylight’s face, Promises youthful vitality and grace, A fountain of immortality to chase.

Yet rust, the nemesis of fine metal sheen, Infiltrates

I don’t think it picked up that we meant the programming languages; let’s try again with: “Write a poem about the programming languages elixir and rust being a good mix.”

In the world of code, where ones and zeros dance, There’s Elixir and Rust, a pair that romances. Their union is strong, like a fortress of steel, In harmony they stand, together they heal.

Elixir, with its elegance and grace, A functional language, in the Erlang base. Concurrency and fault tolerance are its strengths, Like an actor system, it never takes rests. Rust, a systems programming powerhouse, Safe and fast, like a stealthy fox. Zero-cost
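
That prompt template is fiddly to type by hand, so a thin Elixir helper around predict/2 makes experimenting nicer. A hypothetical sketch, reusing the template from the example above:

defmodule LlamaCpp.Chat do
  # Hypothetical convenience wrapper: wraps a plain instruction in the chat
  # template used above before handing it to the NIF.
  def ask(llama, instruction) do
    prompt =
      "GPT4 User: Follow the instructions below to complete the task:\n " <>
        instruction <> "<|end_of_turn|>GPT4 Assistant:"

    LlamaCpp.predict(llama, prompt)
  end
end

# LlamaCpp.Chat.ask(llama, "Write a poem about the programming languages elixir and rust being a good mix.")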

Conclusion

As you can imagine, this is only the tip of the iceberg for what’s possible, and the rust-llama-cpp library has more API to implement. I’ve started work on that in llama_cpp_ex, and as always contributions are very welcome! I hope today we got a taste of using long-lived structures in Rustler and how we might work with them, as well as had a little bit of fun with an LLM.

If you want to run these models with LLama.cpp on a Fly GPU you can now do that too!

Fly.io ❤️ Elixir

Fly.io is a great way to run your Phoenix LiveView apps. It’s really easy to get started. You can be running in minutes.

Deploy a Phoenix app today!