Transcribing on Fly GPU Machines

[Image: the fly bird whispering into the ear of the transcription computer. Image by Annie Ruygt]

Fly.io has GPUs! If you want to run AI (or whatever) workloads, check out how to get started with GPU Machines!

Fly.io has GPU Machines, which means we can finally play games... mine bitcoin... baghold NFTs... run AI workloads with just a few API calls.

This is exciting! Running GPU workloads yourself is useful when the community™ builds upon available models to make them faster, more useful, or less restrictive than first-party APIs.

One such tool is the Whisper Webservice, which is conveniently packaged in a way that makes it a good candidate to use on Fly GPU Machines.

Let’s see how to use Fly.io GPU by spinning up Whisper Webservice.

Whisper Webservice

Whisper is OpenAI’s speech recognition model, used for audio transcription. To run it anywhere that isn’t OpenAI’s platform, you need some Python, a few GB of storage, and (preferably) a GPU.

The aforementioned Whisper Webservice packages this up for us, while making Whisper faster, more useful, and less restricted than OpenAI’s API:

  1. It provides a web API on top of Whisper’s Python library
  2. It (optionally) integrates faster-whisper to make it, you know, faster
  3. It (optionally) uses FFmpeg to process the uploaded audio file, useful for getting audio out of video files or converting audio formats

Luckily for us (and totally not why I chose this as an example), the project provides GPU-friendly Docker images. We’ll use those to spin up Fly GPU Machines and process some audio files.

(I’ll also show examples of making your own Docker image!)

Running a GPU Machine

Spinning up a GPU Machine is much like spinning up any other Machine. The main difference is the new “GPU kind” option (--vm-gpu-kind), which takes two possible values:

  1. a100-pcie-40gb
  2. a100-sxm4-80gb

These are two flavors of Nvidia A100 GPUs; the difference worth caring about is 40 vs. 80 GB of GPU memory (here’s pricing).

We’ll create machines using a100-pcie-40gb because we don’t need 80 freakin’ GB for what we’re doing.

Using flyctl is a great way to run a GPU Machine. We’ll make an app and run the conveniently provided Whisper Webservice Docker image that supports Nvidia GPUs. The flyctl commands will default us into a performance-8x server size (8 CPUs, 16GB RAM) unless we specify something different.

One caveat: AI model files are big. Docker images ideally aren’t big - sending huge layers across the network angers the spiteful networking gods. If you shove models into your Docker images, you might have a bad time.

We suggest creating a Fly Volume and having your Docker image download the models it needs when it first spins up. The Whisper service (and, in my experience, OpenAI’s Python library) does that for us.

So, we’ll create a volume to house (and cache) the models. The Whisper project places its models in /root/.cache/whisper on first boot, so we’ll mount our disk there.

Alright, let’s create a GPU Machine. Here’s what the process looks like:

APP_NAME="whispering-zines"

fly apps create $APP_NAME -o personal

# We "hint" --vm-gpu-kind so the volume
# is provisioned on a GPU host
# We choose region ord, where most Fly GPUs
# currently live
fly volumes create whisper_zine_cache -s 10 \
    -a $APP_NAME -r ord --vm-gpu-kind a100-pcie-40gb

# Take note of the volume ID from the output ^

# Run a machine that can accept web requests
# from the public internet
fly machines run onerahmet/openai-whisper-asr-webservice:latest-gpu \
    --vm-gpu-kind a100-pcie-40gb \
    -p 443:9000/tcp:tls:http -p 80:9000/tcp:http \
    -r ord \
    -v <VOLUME_ID>:/root/.cache/whisper \
    -e ASR_MODEL=large -e ASR_ENGINE=faster_whisper \
    -a $APP_NAME

# Allocate IPs so we can view it on the web
fly ips allocate-v4 --shared -a $APP_NAME
fly ips allocate-v6 -a $APP_NAME

That’s all pretty standard for Fly Machines, except for the --vm-gpu-kind flags used both for volume and Machine creation. Volumes are pinned to specific hosts - using this flag tells Fly.io to create the volume on a GPU host. Assuming we set the same region (-r ord), creating a GPU Machine with the just-created volume will tell Fly.io to place the Machine on the same host as the volume.

Note: As my Machine started up, I saw a log line reading WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available., which turned out to be a timing issue. Once everything was running, I used fly ssh console -a $APP_NAME and ran nvidia-smi to confirm the VM had a GPU. It also listed the running web service (Python, in this case) as a GPU process.

Once everything is running, you should be able to head to $APP_NAME.fly.dev and view it in the browser.

The Whisper Webservice UI lets you try out individual API calls. This also gives you the information you need to make those calls from your own code. There’s a link to the API specification (e.g. $APP_NAME.fly.dev/openapi.json) that you can use to, say, have ChatGPT generate a client in your language of choice.
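Here’s a minimal sketch of calling the transcription endpoint from Python. The /asr path and the audio_file, task, and output parameters are what the Swagger UI showed for my deployment; verify them against your own /openapi.json:

# Sketch: transcribe an audio file via the Whisper Webservice
# running on Fly.io. Endpoint and parameter names (/asr,
# audio_file, task, output) come from the service's Swagger UI;
# check your deployment's /openapi.json if they've changed.
import requests

APP_NAME = "whispering-zines"  # your Fly app name

with open("episode.mp3", "rb") as f:  # any file FFmpeg can decode
    resp = requests.post(
        f"https://{APP_NAME}.fly.dev/asr",
        params={"task": "transcribe", "output": "json"},
        files={"audio_file": f},
        timeout=600,  # the large model takes a while on big files
    )

resp.raise_for_status()
print(resp.json()["text"])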

Automating GPU Machines

If you want to automate this, you can use the Machines API (spec here).

An easy way to get started is to spy on the API requests flyctl is making:

# Debug logs will output the API requests / responses
# made to Fly.io's API.
LOG_LEVEL=debug flyctl machine run ...

This helped me figure out why my initial API attempts failed - it turns out we need some extra parameters: a compute section in the request JSON when creating a volume, and a guest section when creating a Machine.

For both volumes and Machines, we set the gpu_kind the same way we did in our flyctl command. However, we also need cpu_kind to be set. Additionally, when creating a Machine, we need to set cpus and memory_mb to valid values for performance Machines.

APP_NAME="whispering-zines"

# Create a volume on a GPU host. Specify both
# cpu_kind and gpu_kind
curl -H "Authorization: Bearer `fly auth token`" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    https://api.machines.dev/v1/apps/$APP_NAME/volumes \
    -d '{
        "name": "whisper_zine_cache",
        "region": "ord",
        "size_gb": 10,
        "compute": {
            "cpu_kind": "performance",
            "gpu_kind": "a100-pcie-40gb"
        }
    }'

# Take note of the volume ID from the response ^

# Run a machine that can accept web requests
# from the public internet.
curl -H "Authorization: Bearer `fly auth token`" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    https://api.machines.dev/v1/apps/$APP_NAME/machines \
    -d '{
        "region": "ord",
        "config": {
            "env": {
                "ASR_ENGINE": "faster_whisper",
                "ASR_MODEL": "large",
                "FLY_PROCESS_GROUP": "app",
                "PRIMARY_REGION": "ord"
            },
            "mounts": [
                {
                    "path": "/root/.cache/whisper",
                    "volume": "<VOLUME_ID>",
                    "name": "data"
                }
            ],
            "services": [
                {
                    "protocol": "tcp",
                    "internal_port": 9000,
                    "autostop": false,
                    "ports": [
                        {
                            "port": 80,
                            "handlers": [
                                "http"
                            ],
                            "force_https": true
                        },
                        {
                            "port": 443,
                            "handlers": [
                                "http",
                                "tls"
                            ]
                        }
                    ]
                }
            ],
            "image": "onerahmet/openai-whisper-asr-webservice:latest-gpu",
            "guest": {
                "cpus": 8,
                "memory_mb": 16384,
                "cpu_kind": "performance",
                "gpu_kind": "a100-pcie-40gb"
            }
        }
    }'

After that, we can assign the app some IPs. You can use flyctl for this, or the GraphQL API - once again, run flyctl in debug mode to see what API calls it makes. Side note: eventually, the Machines REST API will include the ability to allocate IP addresses.

fly ips allocate-v4 --shared -a $APP_NAME
fly ips allocate-v6 -a $APP_NAME

If you’re doing this type of work for your business, you may want to keep these Machines inside a private network anyway, in which case you won’t assign them public IP addresses.
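If you go that route, other Machines in your organization can still reach the service over Fly’s private WireGuard network via the app’s .internal DNS name, hitting the internal port directly. A quick sketch, reusing the hypothetical app name from earlier:

# Sketch: call the service from another Machine in the same Fly
# org over the private network - no public IPs involved. We talk
# straight to the internal port (9000) via .internal DNS.
import requests

resp = requests.post(
    "http://whispering-zines.internal:9000/asr",
    params={"task": "transcribe", "output": "json"},
    files={"audio_file": open("episode.mp3", "rb")},
)
print(resp.json()["text"])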

Making Your Own Images

There is, luckily (for me, a hardware ignoramus), less dark magic to making GPU-friendly Docker images than you might think. Basically, you just need to install the correct Nvidia drivers.

One way to cheat at this is to use Nvidia’s CUDA base images, but if you’re made of sterner stuff, you can start with a base Ubuntu image and install things yourself.

While the Whisper webservice image is based on nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04, I got Whisper (plain, not the webservice) working with ubuntu:22.04:

# Base image
FROM ubuntu:22.04

# Nvidia's apt repo gives us the CUDA libraries Whisper needs
# (cuDNN, cuBLAS), plus we grab FFmpeg and Python
RUN apt update -q && apt install -y ca-certificates wget \
    && wget -qO /cuda-keyring.deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb \
    && dpkg -i /cuda-keyring.deb && apt update -q \
    && apt install -y --no-install-recommends ffmpeg libcudnn8 libcublas-12-2 \
                                              git python3 python3-pip

# Install Whisper itself (this is why we need git and pip;
# it's also published on PyPI as openai-whisper)
RUN pip3 install git+https://github.com/openai/whisper.git

WORKDIR /app
COPY audio.mp3 /app/audio.mp3
COPY run.py /app/run.py

CMD ["python3", "run.py"]

You can find a full, working version of this here.
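For reference, here’s my guess at what a minimal run.py could look like, using the openai-whisper package (treat this as illustrative; the linked repo has the real, working version):

# Sketch of a minimal run.py: load a Whisper model and transcribe
# the bundled audio file. Illustrative only - see the linked repo
# for the working version.
import whisper

model = whisper.load_model("large")          # downloads to ~/.cache/whisper on first run
result = model.transcribe("/app/audio.mp3")  # FFmpeg decodes the input
print(result["text"])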

This time it’s different, I guess

AI feels a bit different than previous trends in that it has immediately-obvious benefits. No one needs to throw around catchy phrases with a wink-wink nudge-nudge (“we like the art”) for us to find value.

Since AI workloads run most efficiently on GPUs, GPUs remain a hot commodity. Those of us who didn’t purchase enough $NVDA to retire can still bring more value to our businesses by adding in AI.

Fly Machines have always been a great little piece of tech to run “ephemeral compute workloads” (wait, do I work at AWS!?) - and this is what I like about GPU Machines. You can mix and match all sorts of AI stuff together to make a chain of useful tools!