Scaling Large Language Models to zero with Ollama

A cartoon person gazes into an overly realistic picture of a computer.
Image by Annie Ruygt

We’re Fly.io. We have powerful servers worldwide to run your code close to your users. Including GPUs so you can self-host your own AI.

Open-source self-hosted AI tools have advanced a lot in the past 6 months. They give you new methods of expression (with QR code generation and Stable Diffusion), easy access to summarization powers that would have made Google blush a decade ago (even with untuned foundation models such as LLaMa 2 and Yi), conversational assistants that let people do more with their time, and real-time speech recognition on moderate hardware (with Whisper et al). With all these capabilities comes the need for more and more raw computational muscle to run inference on bigger and bigger models, and eventually to do things we can’t even imagine right now. Fly.io lets you put your compute where your users are so that you can do machine learning inference tasks on the edge with the power of enterprise-grade GPUs such as the Nvidia A100. You can also scale your GPU nodes to zero running Machines, so you only pay for what you actually need, when you need it.

It’s worth mentioning that “scaling to zero” doesn’t mean what you may think it means. When you “scale to zero” on Fly.io, you actually stop the running Machine. The Machine is still lying around on the same computer box it runs on; it’s just put to sleep. If there is a capacity issue, your app may be unable to wake back up. We are working on a solution to this, but for now you should be aware that scaling to zero is not the same as spinning down your Machine and spinning it back up on a new computer box when you need it.

This is a continuation of the last post in this series about how to use GPUs on Fly.io.

Why scale to zero?

Running GPU nodes on top of Fly.io is expensive. Sure, GPUs enable you to do things a lot faster than CPUs ever could on their own, but they’ll mostly sit idle between uses. This is where scaling to zero comes in. With scaling to zero, your GPU nodes shut down when you’re not using them. When your Machine stops, you aren’t paying for the GPU any more. This is good for the environment and your wallet.

In this post, we’re going to be using Ollama to generate text. Ollama is a fancy wrapper around llama.cpp that lets you run large language models of your choice on your own hardware. It also supports GPU acceleration, meaning that you can use Fly.io’s huge GPUs to run your models faster than your RTX 3060 at home ever could on its own.

One of the main downsides of using Ollama in a cloud environment is that it doesn’t have authentication by default. Thanks to the power of about 70 lines of Go, you can shim that in after the fact. This protects your server from random people on the internet using your GPU time (and spending your money) to generate text, while still letting you integrate it into your own applications.

Create a new folder called ollama-scale-to-0:

mkdir ollama-scale-to-0

Fly app setup

First, we need to create a new Fly app:

fly launch --no-deploy

After selecting a name and an organization to run it in, this command will create the app and write out a fly.toml file for you:

# fly.toml app configuration file generated for sparkling-violet-709 on 2023-11-14T12:13:53-05:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = "sparkling-violet-709"
primary_region = "ord"

[http_service]
  internal_port = 11434 # change me to 11434!
  force_https = false # change me to false!
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ["app"]

This is the configuration file that Fly.io uses to know how to run your application. We’re going to be modifying the fly.toml file to add some additional configuration to it, such as enabling GPU support:

app = "sparkling-violet-709"
primary_region = "ord"
vm.size = "a100-40gb" # the GPU size, see https://fly.io/docs/gpus/gpu-quickstart/ for more info

We don’t want to expose the GPU to the internet, so we’re going to create a flycast address, which makes the app reachable only from other services on your private network. To create a flycast address, run this command:

fly ips allocate-v6 --private

The fly ips allocate-v6 command makes a unique address in your private network that you can use to access Ollama from your other services. Make sure to add the --private flag, otherwise you’ll get a globally unique IP address instead of a private one.

Next, you may need to remove all of the other public IP addresses for the app to lock it away from the public. Get a list of them with fly ips list and then remove them with fly ips release <ip>. Delete everything but your flycast IP.
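For example (the addresses here are placeholders for whatever fly ips list shows for your app):

fly ips list                        # shows every address attached to the app
fly ips release <your-public-ipv4>  # repeat for each public address...
fly ips release <your-public-ipv6>  # ...keeping only the flycast address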

We also need to declare the Ollama image and a volume for Ollama to store models in. If you don’t do this, your models will be destroyed when you scale to zero and you’ll have to re-download them every time the server starts. This is not ideal, so we’re going to create a persistent volume to store the models in. Add the following to your fly.toml:

[build]
  image = "ollama/ollama"

[mounts]
  source = "models"
  destination = "/root/.ollama"
  initial_size = "100gb"

This creates a 100GB volume in the ord region when the app is deployed, which will be used to store the models you download from the Ollama library. You can make it smaller if you want, but 100GB is a good place to start from.
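Putting it all together, your fly.toml should now look roughly like this:

app = "sparkling-violet-709"
primary_region = "ord"
vm.size = "a100-40gb"

[build]
  image = "ollama/ollama"

[mounts]
  source = "models"
  destination = "/root/.ollama"
  initial_size = "100gb"

[http_service]
  internal_port = 11434
  force_https = false
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ["app"]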

Now that everything is set up, we can deploy this to Fly.io:

fly deploy

This will take a minute to pull the Ollama image, push it to a Machine, provision your volume, and kick everything else off with hypervisors, GPUs and whatnot. Once it’s done, you should see something like this:

 ✔ Machine 17816141f55489 [app] update succeeded
-------

Visit your newly deployed app at https://sparkling-violet-709.fly.dev/

This is a lie because we just deleted the public IP addresses for this app. You can’t access it from the internet, and by extension, random people can’t access it either. For now, you can run an interactive session with Ollama using an ephemeral Fly Machine:

fly m run -e OLLAMA_HOST=http://sparkling-violet-709.flycast --shell ollama/ollama

And then you can pull a model from the Ollama library and generate some text:

$ ollama run openchat:7b-v3.5-fp16
>>> How do I bake chocolate chip cookies?
 To bake chocolate chip cookies, follow these steps:

1. Preheat the oven to 375°F (190°C) and line a baking sheet with parchment paper or silicone baking mat.

2. In a large bowl, mix together 1 cup of unsalted butter (softened), 3/4 cup granulated sugar, and 3/4
cup packed brown sugar until light and fluffy.

3. Add 2 large eggs, one at a time, to the butter mixture, beating well after each addition. Stir in 1
teaspoon of pure vanilla extract.

4. In a separate bowl, whisk together 2 cups all-purpose flour, 1/2 teaspoon baking soda, and 1/2 teaspoon
salt. Gradually add the dry ingredients to the wet ingredients, stirring until just combined.

5. Fold in 2 cups of chocolate chips (or chunks) into the dough.

6. Drop rounded tablespoons of dough onto the prepared baking sheet, spacing them about 2 inches apart.

7. Bake for 10-12 minutes, or until the edges are golden brown. The centers should still be slightly soft.

8. Allow the cookies to cool on the baking sheet for a few minutes before transferring them to a wire rack
to cool completely.

Enjoy your homemade chocolate chip cookies!

If you want a persistent wake-on-use connection to your Ollama instance, you can set up a connection to your Fly network using WireGuard. This will let you use Ollama from your local applications without having to run them on Fly. For example, if you want to figure out the safe cooking temperature for ground beef in Celsius, you can query that in JavaScript with this snippet of code:

const generateRequest = {
  model: "openchat:7b-v3.5-fp16",
  prompt: "What is the safe cooking temperature for ground beef in celsius?"
  stream: false, // <- important for Node/Deno clients
};

let resp = await fetch("http://sparkling-violet-709.flycast/api/generate", {
  method: "POST",
  body: JSON.stringify(generateRequest),
});

if (resp.status !== 200) {
  throw new Error(`error fetching response: ${resp.status}: ${await resp.text()}`);
}

resp = await resp.json();

console.log(resp.response); // Something like "The safe cooking temperature for ground beef is 71 degrees celsius (160 degrees fahrenheit)."

Scaling to zero

The best part about all of this: when you want to scale down to zero running Machines, you do nothing. The Machine automatically shuts down when it’s idle. Wait a few minutes and then verify it with fly status:

$ fly status

...

PROCESS ID              VERSION REGION  STATE   ROLE    CHECKS  LAST UPDATED
app     3d8d7949b22089  9       ord     stopped                 2023-11-14T19:34:24Z

The app has been stopped. This means that it’s not running and you’re not paying for it. When you want it again, just make a request; the Machine will start back up automatically and you can use it as normal, with the CLI or even just arbitrary calls to the API.
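For example, over your WireGuard connection (or from another app in your Fly.io organization), any request to the flycast address will wake it. Here’s a minimal sketch that hits /api/tags, which just lists the models you’ve already pulled:

// The first request after scaling to zero may take a little longer while the Machine boots.
const resp = await fetch("http://sparkling-violet-709.flycast/api/tags");
const { models } = await resp.json();
console.log(models.map((m) => m.name)); // e.g. [ "openchat:7b-v3.5-fp16" ]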

You can also upload your own models to the Ollama registry by creating your own Modelfile and pushing it (though you will need to install Ollama locally to publish your own models). At this time, the only way to set a custom system prompt is to use a Modelfile and upload your model to the registry.
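For example, a minimal Modelfile that gives the openchat model from earlier a custom system prompt might look like this (the prompt here is just a placeholder):

FROM openchat:7b-v3.5-fp16
SYSTEM """You are a helpful research assistant."""

From there, ollama create your-username/research-assistant -f Modelfile builds the model locally and ollama push your-username/research-assistant publishes it to the registry (you’ll need an account on the Ollama registry to push under your own namespace); your-username/research-assistant is just a hypothetical name.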

Conclusion

Ollama is a fantastic way to run large language models of your choice, and the ability to use Fly.io’s powerful GPUs means you can use bigger models with more parameters and larger context windows. This lets you make your assistants more lifelike, give your conversations more context, and make your text generation more realistic.

Oh, by the way, this also lets you use the new json mode to have your models call functions, similar to how ChatGPT would. To do this, have a system prompt that looks like this:

You are a helpful research assistant. The following functions are available for you to fetch further data to answer user questions, if relevant:

{
    "function": "search_bing",
    "description": "Search the web for content on Bing. This allows users to search online/the internet/the web for content.",
    "arguments": [
        {
            "name": "query",
            "type": "string",
            "description": "The search query string"
        }
    ]
}

{
    "function": "search_arxiv",
    "description": "Search for research papers on ArXiv. Make use of AND, OR and NOT operators as appropriate to join terms within the query.",
    "arguments": [
        {
            "name": "query",
            "type": "string",
            "description": "The search query string"
        }
    ]
}

To call a function, respond - immediately and only - with a JSON object of the following format:
{
    "function": "function_name",
    "arguments": {
        "argument1": "argument_value",
        "argument2": "argument_value"
    }
}

If no function needs to be called, respond with an empty JSON object: {}

Then you can use the JSON format to receive a JSON response from Ollama (hint: --format=json in the CLI or format: "json" in the API). This is a great way to make your assistants more lifelike and more useful. You will need to use something like Langchain or manual iterations to properly handle the cases where the user doesn’t want to call a function, but that’s a topic for another blog post.
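Here’s a sketch of what that looks like against the generate API, assuming you’ve baked the system prompt above into a custom model via a Modelfile (your-username/research-assistant is the hypothetical model from earlier):

const resp = await fetch("http://sparkling-violet-709.flycast/api/generate", {
  method: "POST",
  body: JSON.stringify({
    model: "your-username/research-assistant", // hypothetical custom model with the system prompt baked in
    prompt: "Find recent papers about retrieval-augmented generation.",
    format: "json", // constrain the model's output to valid JSON
    stream: false,
  }),
});

const { response } = await resp.json();
// response is a JSON string, hopefully something like:
// { "function": "search_arxiv", "arguments": { "query": "retrieval-augmented generation" } }
console.log(JSON.parse(response));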

For the best results you may want to use a model with a larger context window such as vicuna:13b-v1.5-16k-fp16 (16k == 16,384 token window), as JSON is very token-expensive. Future advances in the next few weeks (such as the Yi models gaining ludicrous token windows on the order of 200,000 tokens at the cost of ludicrous amounts of VRAM usage) will make this less of an issue. You can also get away with minifying the JSON in the functions and examples a lot, but you may need to experiment to get the best results.

Happy hacking, y'all.