We're Cutting L40S Prices In Half

A cartoon hot air balloon with a bundle of sandwiches to share with the world.
Image by Annie Ruygt

We’re Fly.io, a new public cloud with simple, developer-friendly ergonomics. And as of today, cheaper GPUs. Try it out; you’ll be deployed in just minutes.

We just lowered the prices on NVIDIA L40S GPUs to $1.25 per hour. Why? Because our feet are cold and we burn processor cycles for heat. But also other reasons.

Let’s back up.

We offer 4 different NVIDIA GPU models; in increasing order of performance, they’re the A10, the L40S, the 40G PCIe A100, and the 80G SXM A100. Guess which one is most popular.

We guessed wrong, and spent a lot of time working out how to maximize the amount of GPU power we could deliver to a single Fly Machine. Users surprised us. By a wide margin, the most popular GPU in our inventory is the A10.

The A10 is an older generation of NVIDIA GPU with fewer, slower cores and less memory. It’s the least capable GPU we offer. But that doesn’t matter, because it’s capable enough. It’s solid for random inference tasks, and handles mid-sized generative AI stuff like Mistral Nemo or Stable Diffusion. For those workloads, there’s not that much benefit in getting a beefier GPU.

As a result, we can’t get new A10s in fast enough for our users.

If there’s one thing we’ve learned by talking to our customers over the last 4 years, it’s that y'all love a peek behind the curtain. So we’re going to let you in on a little secret about how a hardware provider like Fly.io formulates GPU strategy: none of us know what the hell we’re doing.

If you had asked us in 2023 what the biggest GPU problem we could solve was, we’d have said “selling fractional A100 slices”. We burned a whole quarter trying to get MIG, or at least vGPUs, working through IOMMU PCI passthrough on Fly Machines, in a project so cursed that Thomas has forsworn ever programming again. Then we went to market selling whole A100s, and for several more months it looked like the biggest problem we needed to solve was finding a secure way to expose NVLink-ganged A100 clusters to VMs so users could run training. Then H100s; can we find H100s anywhere? Maybe in a black market in Shenzhen?

And here we are, a year later, looking at the data, and the least sexy, least interesting GPU part in the catalog is where all the action is.

With actual customer data to back up the hypothesis, here’s what we think is happening today:

  • Most users who want to plug GPU-accelerated AI workloads into fast networks are doing inference, not training.
  • The hyperscaler public clouds strangle these customers, first with GPU instance surcharges, and then with egress fees for object storage data when those customers try to outsource the GPU stuff to GPU providers.
  • If you’re trying to do something GPU-accelerated in response to an HTTP request, the right combination of GPU, instance RAM, fast object storage for datasets and model parameters, and networking is much more important than getting your hands on an H100.

This is a thing we didn’t see coming, but should have: training workloads tend to look more like batch jobs, and inference tends to look more like transactions. Batch training jobs aren’t that sensitive to networking or even reliability. Live inference jobs responding to end-user HTTP requests are. So, given our pricing, of course the A10s are a sweet spot.

The next step up in our lineup after the A10 is the L40S. The L40S is a nice piece of kit. We’re going to take a beat here and sell you on the L40S, because it’s kind of awesome.

The L40S is an AI-optimized version of the L40, which is the data center version of the GeForce RTX 4090, resembling two 4090s stapled together.

If you’re not a GPU hardware person, the RTX 4090 is a gaming GPU, the kind you’d play ray-traced Witcher 3 on. NVIDIA’s high-end gaming GPUs are actually reasonably good at AI workloads! But they suck in a data center rack: they chug power, they’re hard to cool, and they’re less dense. Also, NVIDIA can’t charge as much for them.

Hence the L40: (much) more memory, less energy consumption, designed for a rack, not a tower case. Marked up for “enterprise”.

NVIDIA positioned the L40 as a kind of “graphics” AI GPU. Unlike the super high-end cards like the A100/H100, the L40 keeps all the rendering hardware, so it’s good for 3D graphics and video processing. Which is sort of what you’d expect from a “professionalized” GeForce card.

A funny thing happened in the middle of 2023, though: the market for ultra-high-end NVIDIA cards went absolutely batshit. The huge cards you’d gang up for training jobs got impossible to find, and NVIDIA became one of the most valuable companies in the world. Serious shops started working out plans to acquire groups of L40-type cards to work around the problem, whether or not they had graphics workloads.

The only company in this space that does know what they’re doing is NVIDIA. Nobody has written a highly-ranked Reddit post about GPU workloads without NVIDIA noticing and creating a new SKU. So they launched the L40S, which is an L40 with AI workload compute performance comparable to that of the A100 (without us getting into the details of FP32 vs. FP16 models).

Long story short, the L40S is an A100-performer that we can price for A10 customers; the Volkswagen GTI of our lineup. We’re going to see if we can make that happen.

We think the combination of just-right-sized inference GPUs and Tigris object storage is pretty killer:

  • model parameters, data sets, and compute are all close together
  • everything plugged into an Anycast network that’s fast everywhere in the world
  • on VM instances with enough memory to actually run real frameworks
  • priced like we actually want you to use it.

You should use L40S cards without thinking hard about it. So we’re making it official. You won’t pay us a dime extra to use one instead of an A10. Have at it! Revolutionize the industry. For $1.25 an hour.

Here are things you can do with an L40S on Fly.io today:

  • You can run Llama 3.1 70B — a big Llama — for LLM jobs.
  • You can run Flux from Black Forest Labs for genAI images.
  • You can run Whisper for automated speech recognition.
  • You can do whole-genome alignment with SegAlign (Thomas’ biochemist kid who has been snatching free GPU hours for his lab gave us this one, and we’re taking his word for it).
  • You can run DOOM Eternal, building the Stadia that Google couldn’t pull off, because the L40S hasn’t forgotten that it’s a graphics GPU.
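Getting one is a one-line config change. Here’s a minimal sketch of what that looks like in `fly.toml` — note that the `size` preset name and the `[[vm]]` field names below are as we recall them from the Machines docs, and the app name and region are placeholders; run `fly platform vm-sizes` to see the names your CLI version expects:

```toml
# fly.toml — minimal GPU app config (field names assumed from the Machines docs)
app = "my-inference-app"   # hypothetical app name
primary_region = "ord"     # pick a region with L40S inventory

[[vm]]
  size = "l40s"            # GPU preset; attaches one L40S to the Machine
  memory = "32gb"          # enough RAM to actually load real models
```

Then `fly deploy` as usual. If you’d rather drive Machines directly, the equivalent imperative form is roughly `fly machine run <image> --vm-gpu-kind l40s` (again, flag name per our reading of the docs).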

It’s going to get chilly in Chicago in a month or so. Go light some cycles on fire!