Persistent Storage and Fast Remote Builds

Author

Name: Jerome Gravel-Niquet
@jeromegn: @jeromegn

Author

Name: Thomas Ptacek
@tqbf: @tqbf

If you’ve been keeping up with us at Fly, you may be picking up on a bit of a narrative with us.

Fly launched, in the long-long-ago, with a somewhat narrow use case. We took containers from our customers and transmogrified them into fleets of Firecracker micro-VMs connected to an anycast network that kept code running close to users. When we were talking to investors, we called this “edge computing”, and sold it as a way to speed up websites. And that works great.

But it turns out that if you build a flexible platform for edge apps, you wind up with a pretty good way to run lots of other kinds of applications. And so our users have been doing that. And we’re not going to complain! Instead, we’re working to make it easier to do that.

The Storage Problem

The biggest question mark for people thinking about hosting whole apps on Fly has always been storage.

Until somewhat recently, micro-VMs on Fly were entirely ephemeral. A worker gets an order to run one. It builds a root filesystem for it from the container image. It runs the micro-VM, maybe for a few hours, maybe for a few weeks. The VM exits, the worker cleans it up, and that’s it.

You can get a surprising amount done with a configuration like this and little else, but if you want to run an entire app, front-to-backend, you need more.

Until now, the one really good answer we had was to use external services for storage. You can, for instance, keep your data in something like RDS, and then use a fast, secure WireGuard gateway to bridge your Fly app instances to RDS. Or you can set up S3 buckets in a bunch of regions. That works fine and for some apps might be optimal.

But, obviously, developers want persistent storage. And we’re starting to roll that out.

Fly Volumes

You can, on Fly today, attach persistent storage volumes to Fly apps. It’s straightforward: you run flyctl volumes create data --region ewr --size 25 to create a 25 gig volume named “data” in Newark. Then you tell your app about it in fly.toml:

[[mounts]]
  source = "data"
  destination = "/data"

Once your app is connected to a volume, we’ll only run as many instances of it as you have matching volumes. You can create lots of volumes, all named “data”, in different regions. We modeled this UX after Docker and tried to make it predictable for people who work with containers.

What’s Happening With Volumes

I’m going to share what’s happening under the hood when you use volumes, and then rattle off some of the limitations, because they are important.

Fly runs on dedicated hardware in data centers around the world. On volume-eligible servers, that hardware includes large amounts of NVMe storage. We use Linux LVM to carve out a thin pool. We track the amount of space in all those pools.

LVM2 thin pools are interesting. Rather than preallocating disk space for new logical volumes, thin volumes allocate as data is written. They have in effect a current size and a cap. And the caps of all the volumes in a single thin pool can overlap; you can create more thin volume space than there is disk, like a bank creates money. This was a big deal in enterprise disk storage in the 2000s, is built into LVM now, and we… don’t use it for much. Our disks aren’t oversubscribed, but thin allocation makes administration and snapshot backups a little easier for us, and I just thought you’d like to know.
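To make the "current size and a cap" idea concrete, here is a toy model of thin provisioning. It is only a sketch: the `ThinPool` class and its method names are made up for illustration, sizes are whole gigabytes, and none of this is how LVM itself is implemented.

```python
class ThinPool:
    """Toy model of thin provisioning (hypothetical, not LVM's implementation)."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.volumes = {}  # name -> {"cap": int, "used": int}

    def create(self, name, cap_gb):
        # Creating a thin volume reserves nothing up front; it only records a cap.
        self.volumes[name] = {"cap": cap_gb, "used": 0}

    def write(self, name, gb):
        # Space is taken from the pool only when data is actually written.
        vol = self.volumes[name]
        if vol["used"] + gb > vol["cap"]:
            raise ValueError("write exceeds the volume's cap")
        if self.used_gb() + gb > self.physical_gb:
            raise RuntimeError("pool is out of physical space")
        vol["used"] += gb

    def used_gb(self):
        return sum(v["used"] for v in self.volumes.values())

    def promised_gb(self):
        return sum(v["cap"] for v in self.volumes.values())

    def oversubscribed(self):
        # True when the caps add up to more than the disk that exists.
        return self.promised_gb() > self.physical_gb


pool = ThinPool(physical_gb=100)
pool.create("a", cap_gb=80)
pool.create("b", cap_gb=80)  # 160 GB promised against 100 GB of disk
pool.write("a", 30)          # only now does the pool give up real space
```

In this model an operator who never oversubscribes would simply refuse any `create` that pushes `promised_gb()` past `physical_gb`.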

We encrypt volumes for users, using standard Linux block device crypto (XTS with random keys). This doesn’t mean much to you, but we do it anyways, because somebody somewhere is going to talk to a SOC 2 auditor who is going to want to know that every disk in the shop is “encrypted”, and, OK, these ones are too.

The problem of course is that for us to manage your encrypted devices, we have to have the keys. Which means that anybody who manages to own up our devices also has those keys, and, as you’d hope, people who can’t own up our devices already don’t have access to your disk. About the only problem disk encryption solves for us is somebody forklifting our servers out of their secure data centers in some weird hardware heist.

The problem is mitigated somewhat by our orchestration system. The control plane for Fly.io is HashiCorp Nomad, about which we will be writing more in the future. Nomad is in charge of knowing which piece of our hardware is running which applications. Because we have a fairly intense TLS certificate feature, we also have a deployment of HashiCorp Vault, which is basically the de facto standard on-prem secret storage system. Nomad knows how to talk to Vault, and we store drive secrets there; when a micro-VM is spun up on a piece of hardware, it gets a lease for the app’s associated secrets, and hardware not running that app doesn’t.

This is all sort of built in to Nomad and Vault (the secret storage and leasing, that is; the disk management is all us) and it’s neat, and might matter a little bit in the future when we start doing volume migrations between servers, but really: serverside full disk encryption is pretty much rubber chicken security, and I’m just taking this opportunity to get that take out there.
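The leasing rule is simple enough to sketch in a few lines. This is a toy model, not the Nomad or Vault API: `SecretStore`, `placements`, and the host and app names are all hypothetical, and a real lease would also carry an expiry.

```python
class SecretStore:
    """Toy stand-in for a secret store that checks placement before leasing."""

    def __init__(self, placements):
        # placements: host name -> set of app names scheduled on that host
        self.placements = placements
        self.keys = {}

    def put(self, app, key):
        self.keys[app] = key

    def lease(self, host, app):
        # Only hardware that is actually running the app gets its drive key.
        if app not in self.placements.get(host, set()):
            raise PermissionError(f"{host} is not running {app}")
        return self.keys[app]


store = SecretStore({"ewr-2": {"blog"}, "ord-1": set()})
store.put("blog", b"not-a-real-key")
key = store.lease("ewr-2", "blog")  # works: ewr-2 runs the app
```

Asking for the same key from `ord-1` raises `PermissionError` in this sketch, which is the whole point of the arrangement.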

Anyways.

When you create a volume with flyctl, you talk to our API server, which finds a compatible server with sufficient space. Once a match is found, we decrement the available space on the server and push a Consul update recording the existence of the new volume.

Workers listen for Consul updates for volume changes and create/remove LVM thin volumes, with ext4 filesystems, as needed.
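Roughly, that flow has two halves: a placement step on the API side and a reconcile step on each worker. The sketch below assumes a first-fit placement policy and uses a plain list as a stand-in for Consul; the function names, record fields, and ID scheme are all hypothetical.

```python
import itertools

_ids = itertools.count(1)  # hypothetical volume ID generator


def place_volume(servers, catalog, name, region, size_gb):
    """API side: pick a server, reserve the space, record the volume."""
    for server in servers:
        if server["region"] == region and server["free_gb"] >= size_gb:
            server["free_gb"] -= size_gb  # decrement available space
            record = {
                "id": f"vol_{next(_ids)}",
                "name": name,
                "host": server["host"],
                "size_gb": size_gb,
            }
            catalog.append(record)  # stand-in for the Consul update
            return record
    raise LookupError("no server in that region has enough free space")


def reconcile(host, catalog, local_ids):
    """Worker side: work out which volumes to create and which to remove."""
    wanted = {v["id"] for v in catalog if v["host"] == host}
    return sorted(wanted - local_ids), sorted(local_ids - wanted)


servers = [
    {"host": "ewr-1", "region": "ewr", "free_gb": 10},
    {"host": "ewr-2", "region": "ewr", "free_gb": 500},
]
catalog = []
vol = place_volume(servers, catalog, "data", "ewr", 25)
to_create, to_remove = reconcile("ewr-2", catalog, local_ids={"vol_stale"})
```

Here `ewr-1` is skipped because it only has 10 GB free, so the volume lands on `ewr-2`, and the worker there learns it has one volume to create and one stale volume to clean up.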

When an instance of a volume-attached app is scheduled to deploy, our orchestrator treats the volume as a constraint when finding eligible servers. Deployments are routed to (and limited to) hosts with attachable volumes. Before booting up the micro-VM for those apps, we look up the logical volume we made, recreate its block device node in the jail Firecracker runs inside of, and set up mount points.
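As a sketch of the constraint, assume each volume record carries the app it belongs to, the host it lives on, and which VM (if any) currently has it attached. Those field names are invented for this example and are not the real scheduler's data model.

```python
def eligible_hosts(volumes, app, source):
    """Hosts holding a free volume that matches the app's mount source."""
    return sorted({
        v["host"]
        for v in volumes
        if v["app"] == app and v["name"] == source and v["attached_to"] is None
    })


def max_instances(volumes, app, source):
    """An app can run at most one instance per matching volume."""
    return sum(1 for v in volumes if v["app"] == app and v["name"] == source)


volumes = [
    {"app": "blog", "name": "data", "host": "ewr-2", "attached_to": None},
    {"app": "blog", "name": "data", "host": "ord-1", "attached_to": "vm-7"},
    {"app": "other", "name": "data", "host": "ewr-3", "attached_to": None},
]

hosts = eligible_hosts(volumes, "blog", "data")
limit = max_instances(volumes, "blog", "data")
```

With this data, a new instance of `blog` can only go to `ewr-2`, and the app tops out at two instances no matter how far you try to scale it.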

And that’s pretty much all there is to it. You get a persistent filesystem to play with, which survives multiple reboots and deployments of your app. It’s performance-competitive (usually, a little faster) with EBS.

There Are Implications To This

The storage nerds in our audience are raising their hands in the air waiting to point this out: we haven’t talked about data resilience. And that’s because, right now, there isn’t much to talk about.

Right now, for raw Fly volumes, resilience is your problem. There! I said it!

This is not a storage model that’s appropriate to every application, and we want to be super clear about that. We think it makes sense in two major cases.

The first is cluster storage. In the past few months, we’ve made it easy to boot up clusters of things that talk to each other, and to create ensembles of services that work together. So one way to use theoretically-unreliable storage is with a replicating database cluster, where you’re effectively backing up the data in real time. You do resilience at the app layer.

The second, of course, is for the kinds of data where loss is an inconvenience and not a disaster: caches, metrics, and things you can back up on a slower-than-real-time cadence. This is a big class of applications! We really wanted disks for CDN workloads. Caches can go away, but minimizing cache churn and keeping “warm” caches around between deploys is handy. And big caches are great! Not having cache sizes bound by memory is great! If you’re building a CDN with Fly, go nuts with volumes.

Attaching volumes to apps changes the way they scale. Once you attach a volume, you can only have as many instances of that app as you have volumes. Again, a lot of the time, volume storage will make sense for a cluster of storage/database servers you run alongside the rest of your app.

Another subtle change is deployment. By default, Fly uses “canary” deployments: we spin up a new VM, make sure it passes health checks, and then tear down the old one. But volumes are provisioned 1:1 with instances; if you have 3 volumes for an app, we can’t make a 4th volume appear for the canary deploy to work. So apps with volumes do rolling deploys.
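The arithmetic behind that is easy to see in a sketch. The function names below are made up, and the rolling plan is a simplification that replaces one instance at a time and skips health checks.

```python
def peak_instances(strategy, count):
    """How many instances exist at once, at worst, during a deploy."""
    if strategy == "canary":
        return count + 1  # the new VM runs alongside all the old ones
    if strategy == "rolling":
        return count      # each old VM is stopped before its replacement starts
    raise ValueError(f"unknown strategy: {strategy}")


def pick_strategy(has_volumes):
    # A canary would need an extra volume that does not exist.
    return "rolling" if has_volumes else "canary"


def rolling_plan(instances):
    """One stop/start pair per instance, each reusing the same volume."""
    plan = []
    for inst in instances:
        plan.append(f"stop {inst['vm']} and release {inst['volume']}")
        plan.append(f"start replacement attached to {inst['volume']}")
    return plan


instances = [
    {"vm": "vm-1", "volume": "vol_1"},
    {"vm": "vm-2", "volume": "vol_2"},
    {"vm": "vm-3", "volume": "vol_3"},
]
plan = rolling_plan(instances)
```

With 3 volumes, a canary deploy would need 4 instances at its peak, one more than there are volumes to attach. A rolling deploy never needs more than 3.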

An aside: we don’t currently expose snapshots or volume migration in our API. But if you’re a storage nerd: we can generally do the things you’d expect to be able to do with logical volumes, like shipping LVM snapshots, and you should totally reach out if you need something like that or want to know how it would work.

This is a starting point for us, not the final destination. We’ve got more storage stuff coming. This is going to be a whole thing with us in 2021.

Store data like on a real computer from 1992

You’re less than 10 minutes away from having any container you can build running globally, with attached persistent storage.
Try Fly for free →

In Other News Remote Docker Builds Got Way Faster

Hey, one of the things you can do now that we have volumes available is drastically improve build times.

There are two ways flyctl will get a Dockerfile deployed on Fly. The first is easy: we’ll talk to your local Docker instance, have it build the image, and then push that image to our registry. Until recently, this was the “good” way to do it, because it’s fast.

The second way it can work is that flyctl can do a remote build. This works by talking to a remote Docker server instead of one running locally — which, incidentally, is how Docker works already if you’re on macOS and it’s running in a VM.

For the last couple years, to do remote builds, we used AWS CodeBuild. CodeBuild is great and all, but if you think about what Fly is, it’s a little weird that we’re using it (our internal CI/CD system runs on our hardware). And our user experience with CodeBuild has been… not the best. Until now, remote build has been the “bad” way to do it; it’s slow enough that, on our own projects, when we saw “remote builds” kicking in, we stopped the deploy, started up our local Docker instance, and started over.

No more! If you’re running recent flyctl, you may have noticed remote builds got a lot faster.

That’s because when you run a remote build now, we first match the build to a “builder” app running in an instance in our network. The builder has attached storage to cache layers and manifests, and it exposes a Docker server. flyctl authenticates to that server and requests a Docker build, and the resulting image gets pushed to our registry.

Once you do a remote build, you’ll see the builder in your app list. Say hi! We’re not billing you for it.

You can make a remote build happen if you deploy without a local Docker running, call deploy --remote-only, or are running on ARM.

Getting this working was interesting, because it adds a new kind of instance to the mix. You don’t want a builder hanging around doing nothing when you’re not deploying. Our builder instances terminate after 10 minutes of inactivity, and are created/deployed automatically by flyctl (we added a GQL call to do this) as needed. Up until now, if your instance died, our orchestration’s job was to immediately restart it; builder jobs are “ephemeral” and aren’t restarted if they exit successfully.
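The two rules in that paragraph, the idle timeout and the "ephemeral" restart policy, can be sketched like this. The names are hypothetical and timestamps are plain seconds. Restarting an ephemeral job that exits with a failure is an assumption on our part; the text above only says successful exits are left alone.

```python
IDLE_TIMEOUT_SECONDS = 10 * 60  # builders exit after 10 minutes of inactivity


class Builder:
    """Toy model of a builder instance's idle tracking."""

    def __init__(self, now):
        self.last_activity = now

    def touch(self, now):
        # Called whenever a build request arrives.
        self.last_activity = now

    def should_exit(self, now):
        return now - self.last_activity >= IDLE_TIMEOUT_SECONDS


def should_restart(ephemeral, exit_code):
    """Restart policy: regular instances always come back, ephemeral ones
    only when they failed (the failure case is an assumption)."""
    if not ephemeral:
        return True
    return exit_code != 0


builder = Builder(now=0)
builder.touch(now=120)                       # a build came in at the 2 minute mark
idle_at_10 = builder.should_exit(now=600)    # only 8 minutes idle so far
idle_at_12 = builder.should_exit(now=720)    # 10 minutes idle
```

A builder that has timed out exits with status 0, `should_restart(True, 0)` is false, and nothing brings it back until flyctl asks for one again.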

This is something we get a lot of requests for. People want permanent storage and short-lived VMs that boot on demand and exit when their job is done. We’ve got that now and are figuring out the best way to surface it. If you’re interested, let us know.

Do you have storage feature requests or questions about the new remote builders? We have a community discussion just for you.
