LLaMA on CPUs

Image of a LLama riding with Elixir Drop through the Italian countryside.
Image by Annie Ruygt

We’re Fly.io. We run apps for our users on hardware we host around the world. Fly.io happens to be a great place to run Phoenix applications. Check out how to get started!

Large Language Model (LLM) based AI models are all the rage, and running them typically requires either a pretty beefy GPU or paying for OpenAI API access. So I wondered: can I run an LLM on a normal Fly Machine?

Locally on my M2 Mac, I’ve been able to run the incredible llama.cpp project with pretty solid performance. Let’s see how far we can get on Fly Machines!

llama.cpp

For those unfamiliar, llama.cpp is a C++ implementation of Facebook’s LLaMA models, optimized for running on normal CPUs. It will use whatever matrix-processing instructions your CPU makes available and all the RAM you can give it. It also uses quantization tricks to make the models require less memory. I’m not an expert, but my understanding is they do some math magic so that instead of storing every weight as a 32-bit float, weights are stored in much smaller 4-, 5-, or 8-bit quantized formats, and it somehow just works.
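
To make that a bit more concrete, here’s a tiny, hand-wavy sketch of the general idea: store each weight as a small integer plus a shared scale factor, and expand it back to an approximate float when you need it. The block of weights and the 5-bit-ish range here are my own simplified assumptions, not llama.cpp’s actual Q5_K_M scheme.

# Simplified illustration only -- not llama.cpp's real quantization format.
weights = [0.12, -0.83, 0.44, 0.97, -0.31, 0.05]

# Roughly 5 bits of signed range gives us integers in about -15..15,
# so scale the weights to fit that range.
max_abs = weights |> Enum.map(&abs/1) |> Enum.max()
scale = max_abs / 15

quantized = Enum.map(weights, fn w -> round(w / scale) end)
dequantized = Enum.map(quantized, fn q -> q * scale end)

IO.inspect(quantized, label: "stored as small ints")
IO.inspect(dequantized, label: "approximate floats at inference time")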

Mistral 7B Instruct

The model we’re going to use is Mistral 7B Instruct. Luckily for us, here’s one that’s been pre-quantized just for llama.cpp. Specifically, we’ll grab the Q5_K_M variant, because it requires just shy of 8GB of RAM and will fit on a normal-sized CPU Machine on Fly.
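
As a rough sanity check (my own back-of-envelope numbers, not from the model card): 7-billion-ish weights at roughly 5.5 bits apiece works out to about 5GB just for the weights, and the context (KV cache) plus runtime overhead is what pushes total usage toward that 8GB mark.

# Back-of-envelope estimate; parameter count and bits-per-weight are approximations.
params = 7_240_000_000
bits_per_weight = 5.5

weights_gb = params * bits_per_weight / 8 / 1_000_000_000
IO.puts("~#{Float.round(weights_gb, 2)}GB for the weights alone")
# The rest of the ~8GB goes to the KV cache and general runtime overhead.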

Setup Fly

On Fly.io a Docker image maxes out at around 8GB, so we won’t be able to bake our model into the image, but we can have llama.cpp ready to go. And since this is an Elixir blog, we’ll also include our dependencies for that.

ARG ELIXIR_VERSION=1.15.6
ARG OTP_VERSION=26.1.1
ARG UBUNTU_VERSION=jammy-20230126

FROM "hexpm/elixir:${ELIXIR_VERSION}-erlang-${OTP_VERSION}-ubuntu-${UBUNTU_VERSION}"

RUN apt-get update -y && apt-get install -y build-essential git libstdc++6 openssl libncurses5 locales \
    && apt-get clean && rm -f /var/lib/apt/lists/*_*

WORKDIR "/app"

COPY app.exs /app
COPY port_wrapper.sh /app

# Set the locale
RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
ENV MIX_HOME="/data/mix"
ENV MIX_INSTALL_DIR="/data/mix"

RUN mix local.hex --force && \
    mix local.rebar --force

# Setup LLaMa.cpp
RUN mkdir "/ggml"
RUN git clone --depth=1 https://github.com/ggerganov/llama.cpp.git /ggml
WORKDIR "/ggml"
RUN make

WORKDIR "/app"

# Appended by flyctl
ENV ECTO_IPV6 true
ENV ERL_AFLAGS "-proto_dist inet6_tcp"
ENV SHELL=/bin/bash
ENV GGML_PATH="/ggml"
ENV MODEL_DIR="/data/models"
ENV WRAPPER_PATH="/app/port_wrapper.sh"

CMD ["elixir", "app.exs"]

This is a fairly normal Elixir Dockerfile, with the addition of cloning llama.cpp and building it with make. We’re also going to need a volume to hold the model, so let’s add that section to our fly.toml:

app = "ggml-example"
primary_region = "ord"
swap_size_mb = 1024

[mounts]
  source = "data"
  destination = "/data"

Then create the app and its volume:

fly apps create ggml-example
fly vol create -s 10 -r ord data -y

This creates our app along with a volume that gets mounted at /data, and now we’re finally ready to write some code:

Mix.install([
  {:req, "~> 0.4.3"}
])

url =
  "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf"

dir = System.get_env("MODEL_DIR") || "./models"
path = "mistrial-instruct.gguf"
full_path = Path.join([dir, path])

File.mkdir_p!(dir)

unless File.exists?(full_path) do
  Req.get!(url, into: File.stream!(full_path))
else
  IO.puts("#{full_path} already downloaded")
end

This code sets up our project with my favorite dep, Req, and does a basic check for whether we’ve already downloaded the model file; if not, we download it to our volume.

And finally, at the bottom, we add a step to execute llama.cpp:

ggml_dir = System.get_env("GGML_PATH")
ggml_exec = Path.expand(Path.join([ggml_dir, "main"]))
model = Path.expand(Path.join([System.get_env("MODEL_DIR"), "mistral-instruct.gguf"]))

prompt = "Tell me a story!"

# System.cmd/3 takes the executable and a list of string arguments,
# and returns {output, exit_status}.
{output, _exit_status} =
  System.cmd(ggml_exec, [
    "-m", model,
    "-c", "4096",
    "--temp", "0.7",
    "--repeat_penalty", "1.1",
    "-n", "-1",
    "-p", "<s>[INST] #{prompt} [/INST]"
  ])

IO.puts(output)

Here we basically set up the command to call main with the model and prompt, following the instructions on the Hugging Face model card.
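
If you end up sending more than one prompt, it’s handy to pull that instruction format into a little helper. This is just a convenience function of my own; the `<s>[INST] ... [/INST]` wrapping is the part that comes from the model card:

# Wrap a plain question in the format Mistral Instruct expects.
format_prompt = fn prompt -> "<s>[INST] #{String.trim(prompt)} [/INST]" end

IO.puts(format_prompt.("Tell me a story!"))
# => <s>[INST] Tell me a story! [/INST]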

Finally, let’s deploy and watch the magic!

fly deploy --vm-cpu-kind=shared --vm-cpus=4 --vm-memory=8192

This command deploys on a shared-4x-8gb Machine, which is essentially the smallest Machine this model will fit on. I typically got a reply in ~5 seconds, and it’s better if you give it a bigger Machine.

If you’d like to play around with a deployed version with a little LiveView UI here you go! https://fly-ggml.fly.dev/

What’s next?

llama.cpp comes with a server mode, so for the demo above I started its server and called it using Req, streaming the results back. I can see this being useful if you need to batch process something with an LLM and don’t want to spend a ton of money.
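
Here’s a rough sketch of what that can look like. This is not the exact code behind the demo: it reuses the env vars from our Dockerfile and the Req dependency from the Mix.install above, and assumes llama.cpp’s bundled HTTP server listening on port 8080 with its /completion endpoint accepting a JSON prompt and stream: true.

ggml_server = Path.expand(Path.join([System.get_env("GGML_PATH"), "server"]))
model = Path.expand(Path.join([System.get_env("MODEL_DIR"), "mistral-instruct.gguf"]))

# Start llama.cpp's HTTP server as a port so it runs alongside the BEAM.
Port.open({:spawn_executable, ggml_server}, [
  :binary,
  args: ["-m", model, "-c", "4096", "--host", "127.0.0.1", "--port", "8080"]
])

# Crude: give the server time to load the model. A real app would wait
# until the server starts responding (and supervise the port) instead.
Process.sleep(10_000)

# Stream the completion back and print the raw chunks as they arrive.
Req.post!("http://127.0.0.1:8080/completion",
  json: %{prompt: "<s>[INST] Tell me a story! [/INST]", n_predict: 256, stream: true},
  receive_timeout: 120_000,
  into: fn {:data, chunk}, {req, resp} ->
    IO.write(chunk)
    {:cont, {req, resp}}
  end
)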

While this won’t be as fast or performant as a GPU or TPU, it’s a great way to get started with LLMs and see what they can do. I’m excited to see what people do with this, and I’m sure there are other models that will work well with this setup.

Please reach out if you find something neat!

Fly.io ❤️ Elixir

Fly.io is a great way to run your Phoenix LiveView apps. It’s really easy to get started. You can be running in minutes.

Deploy a Phoenix app today!