
We’re building a public cloud, on hardware we own. We raised money to do that, and to place some bets; one of them: GPU-enabling our customers. A progress report: GPUs aren’t going anywhere, but: GPUs aren’t going anywhere.
A couple years back, we put a bunch of chips down on the bet that people shipping apps to users on the Internet would want GPUs, so they could do AI/ML inference tasks. To make that happen, we created Fly GPU Machines.
A Fly Machine is a Docker/OCI container running inside a hardware-virtualized virtual machine somewhere on our global fleet of bare-metal worker servers. A GPU Machine is a Fly Machine with a hardware-mapped Nvidia GPU. It’s a Fly Machine that can do fast CUDA.
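Concretely, "fast CUDA" means the passed-through device shows up as an ordinary local GPU inside the Machine. Here's a minimal sanity-check sketch, assuming your OCI image bundles PyTorch with CUDA support (nothing about it is Fly-specific):

```python
# Inside a GPU Machine, the hardware-mapped Nvidia device appears as a
# normal local CUDA device; there's no remote-GPU indirection involved.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"CUDA device: {props.name} ({props.total_memory / 1e9:.0f} GB)")
else:
    print("No GPU mapped into this Machine")
```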
Like everybody else in our industry, we were right about the importance of AI/ML. If anything, we underestimated its importance. But the product we came up with probably doesn’t fit the moment. It’s a bet that doesn’t feel like it’s paying off.
If you’re using Fly GPU Machines, don’t freak out; we’re not getting rid of them. But if you’re waiting for us to do something bigger with them, a v2 of the product, you’ll probably be waiting awhile.
What It Took
GPU Machines were not a small project for us. Fly Machines run on an idiosyncratically small hypervisor (normally Firecracker, but for GPU Machines Intel’s Cloud Hypervisor, a very similar Rust codebase that supports PCI passthrough). The Nvidia ecosystem is not geared to supporting micro-VM hypervisors.
GPUs terrified our security team. A GPU is just about the worst-case hardware peripheral: intense multi-directional direct memory transfers (not merely bidirectional: in common configurations, GPUs talk directly to each other), with arbitrary, end-user-controlled computation, all operating outside our normal security boundary.
We did a couple of expensive things to mitigate the risk. We shipped GPUs on dedicated server hardware, so that GPU and non-GPU workloads were never mixed. Because of that, the only reason for a Fly Machine to be scheduled on a GPU server was that it needed a PCI BDF (a bus/device/function address, the thing you pass through to a VM) for an Nvidia GPU, and there's a limited number of those on any box. Those GPU servers were drastically less utilized, and thus less cost-effective, than our ordinary servers.
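To put the utilization point in concrete terms, here's a toy back-of-the-envelope calculation; every number in it is made up purely for illustration:

```python
# Toy illustration (made-up numbers) of why dedicated GPU hosts hurt
# utilization: once only GPU-needing Machines may land on a box, the box
# is "full" when its passthrough slots are spoken for, no matter how much
# CPU and RAM are left over.
GPU_SLOTS_PER_HOST = 8        # hypothetical passthrough-capable BDFs
CPUS_PER_HOST = 128           # hypothetical
CPUS_PER_GPU_MACHINE = 8      # hypothetical typical GPU workload

cpu_used = GPU_SLOTS_PER_HOST * CPUS_PER_GPU_MACHINE
print(f"CPU spoken for once every GPU is taken: {cpu_used / CPUS_PER_HOST:.0%}")
# -> 50%; on a mixed-workload host, ordinary Fly Machines would soak up the rest.
```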
We funded two very large security assessments, from Atredis and Tetrel, to evaluate our GPU deployment. Matt Braun is writing up those assessments now. They were not cheap, and they took time.
Security wasn't the biggest direct cost we took on, but it drove costs indirectly, for a subtle reason.
We could have shipped GPUs very quickly by doing what Nvidia recommended: standing up a standard K8s cluster to schedule GPU jobs on. Had we taken that path, and let our GPU users share a single Linux kernel, we’d have been on Nvidia’s driver happy-path.
Alternatively, we could have used a conventional hypervisor. Nvidia suggested VMware (heh), but had we used QEMU, they could have gotten things working for us. We like QEMU fine, and could have talked ourselves into a security story for it, but the whole point of Fly Machines is that they take milliseconds to start. We could not have delivered the developer experience we wanted on the Nvidia happy-path.
Instead, we burned months trying (and ultimately failing) to get Nvidia’s host drivers working to map virtualized GPUs into Intel Cloud Hypervisor. At one point, we hex-edited the closed-source drivers to trick them into thinking our hypervisor was QEMU.
I’m not sure any of this really mattered in the end. There’s a segment of the market we weren’t ever really able to explore because Nvidia’s driver support kept us from thin-slicing GPUs. We’d have been able to put together a really cheap offering for developers if we hadn’t run up against that, and developers love “cheap”, but I can’t prove that those customers are real.
On the other hand, we’re committed to delivering the Fly Machine DX for GPU workloads. Beyond the PCI/IOMMU drama, just getting an entire hardware GPU working in a Fly Machine was a lift. We needed Fly Machines that would come up with the right Nvidia drivers, and our stack was built assuming that the customer’s OCI container almost entirely defined the root filesystem for a Machine; we had to engineer around that in our flyd orchestrator. And almost everything people want to do with GPUs involves efficiently grabbing huge files full of model weights. Also annoying!
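In practice, the model-weights step looks something like the sketch below: pull tens of gigabytes onto a local volume on first boot, then serve from the volume afterwards. This assumes a Hugging Face-hosted model and a volume mounted at /data; the repo id and paths are placeholders:

```python
# Sketch of the "grab huge files full of model weights" step, assuming the
# weights live on Hugging Face and a persistent volume is mounted at /data.
from pathlib import Path

from huggingface_hub import snapshot_download

WEIGHTS_DIR = Path("/data/models/my-model")  # placeholder volume path

if not WEIGHTS_DIR.exists():
    # Tens of GB on first boot; later boots hit the local volume instead.
    snapshot_download(repo_id="my-org/my-model", local_dir=str(WEIGHTS_DIR))

print("weights ready at", WEIGHTS_DIR)
```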
And, of course, we bought GPUs. A lot of GPUs. Expensive GPUs.
Why It Isn’t Working
The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs. System engineers may have smart, fussy opinions on how to get their models loaded with CUDA, and what the best GPU is. But software developers don’t care about any of that. When a software developer shipping an app comes looking for a way for their app to deliver prompts to an LLM, you can’t just give them a GPU.
For those developers, who probably make up most of the market, it doesn’t seem plausible for an insurgent public cloud to compete with OpenAI and Anthropic. Their APIs are fast enough, and developers thinking about performance in terms of “tokens per second” aren’t counting milliseconds. (You should all feel sympathy for us.)
This makes us sad because we really like the point in the solution space we found. Developers shipping apps on Amazon will outsource to other public clouds to get cost-effective access to GPUs. But then they’ll faceplant trying to handle data and model weights, backhauling gigabytes (at significant expense) from S3. We have app servers, GPUs, and object storage all under the same top-of-rack switch. But inference latency just doesn’t seem to matter yet, so the market doesn’t care.
Past that, and just considering the system engineers who do care about GPUs rather than LLMs: the hardware product/market fit here is really rough.
People doing serious AI work want galactically huge amounts of GPU compute. A whole enterprise A100 is a compromise position for them; they want an SXM cluster of H100s.
We think there’s probably a market for users doing lightweight ML work who want tiny GPUs. This is what Nvidia MIG does: it slices a big GPU into arbitrarily small virtual GPUs. But for fully-virtualized workloads it isn’t baked; near as we can tell, MIG gives you a UUID to talk to the host driver, not a PCI device we can pass through, so we can’t use it. And I’m not sure how many of those customers there are, or whether we’d get the density of customers per server that we need.
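As a rough illustration of why MIG doesn't help a passthrough-based hypervisor, here's what enumerating MIG slices looks like through NVML (using the nvidia-ml-py bindings; treat it as a sketch):

```python
# MIG slices are addressed by UUID through the host driver; they aren't
# PCI functions you can bind to vfio-pci and hand to a micro-VM.
from pynvml import (
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMaxMigDeviceCount,
    nvmlDeviceGetMigDeviceHandleByIndex,
    nvmlDeviceGetUUID,
    nvmlInit,
    nvmlShutdown,
)

nvmlInit()
try:
    parent = nvmlDeviceGetHandleByIndex(0)
    for i in range(nvmlDeviceGetMaxMigDeviceCount(parent)):
        try:
            mig = nvmlDeviceGetMigDeviceHandleByIndex(parent, i)
        except Exception:
            continue  # MIG slot not populated
        # Prints something like "MIG-xxxxxxxx-...": a driver-level handle,
        # not a PCI BDF a hypervisor could pass through.
        print(nvmlDeviceGetUUID(mig))
finally:
    nvmlShutdown()
```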
That leaves the L40S customers. There are a bunch of these! We dropped L40S prices last year, not because we were sour on GPUs but because they’re the one part we have in our inventory people seem to get a lot of use out of. We’re happy with them. But they’re just another kind of compute that some apps need; they’re not a driver of our core business. They’re not the GPU bet paying off.
Really, all of this is just a long way of saying that for most software developers, “AI-enabling” their app is best done with API calls to things like Claude and GPT, Replicate and RunPod.
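Which is to say: for most apps, the whole integration collapses into a handful of lines against a hosted API, with no GPU anywhere in the picture. A sketch, with the model name as a placeholder:

```python
# What "AI-enabling" an app looks like for most developers: an API call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever the provider currently offers
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(resp.choices[0].message.content)
```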
What Did We Learn?
A very useful way to look at a startup is that it’s a race to learn stuff. So, what’s our report card?
First off, when we started down this path in 2022, we were (like many other companies) operating in a sort of phlogiston era of AI/ML. Industry attention hadn’t yet collapsed around a small number of foundation LLMs. We expected a diversity of mainstream models, the world Elixir Bumblebee looks forward to, where people pull different AI workloads off the shelf the same way they do Ruby gems.
But Cursor happened, and, as they say, how are you going to keep ‘em down on the farm once they’ve seen Karl Hungus? It seems much clearer where things are heading.
GPUs were a test of a Fly.io company credo: as we think about core features, we design for 10,000 developers, not for 5-6. It took a minute, but the credo wins here: GPU workloads for the 10,001st developer are a niche thing.
Another way to look at a startup is as a series of bets. We put a lot of chips down here. But the buy-in for this tournament gave us a lot of chips to play with. Never making a big bet of any sort isn’t a winning strategy. I’d rather we’d flopped the nut straight, but I think going in on this hand was the right call.
A really important thing to keep in mind here, and something I think a lot of startup thinkers sleep on, is the extent to which this bet involved acquiring assets. Obviously, some of our costs here aren’t recoverable. But the hardware that isn’t generating revenue will ultimately get liquidated; as with our portfolio of IPv4 addresses, I’m much more comfortable making bets that are backed by tradable assets with durable value.
In the end, I don’t think GPU Fly Machines were going to be a hit for us no matter what we did. Because of that, one thing I’m very happy about is that we didn’t compromise the rest of the product for them. Security concerns slowed us down, so we probably learned what we needed to learn a couple of months later than we could have otherwise. But we’re scaling back our GPU ambitions without having sacrificed any of our isolation story, and, ironically, the GPUs other people run are making that story a lot more important. The same goes for the Fly Machine developer experience.
We started this company building a Javascript runtime for edge computing. We learned that our customers didn’t want a new Javascript runtime; they just wanted their native code to work. We shipped containers, and no convincing was needed. We were wrong about Javascript edge functions, and I think we were wrong about GPUs. That’s usually how we figure out the right answers: by being wrong about a lot of stuff.