Making Machines Move

Two rectangles on a dark background. A cartoon bird in one. A dotted arrow arcs from the bird to the emptiness within the second rectangle.
Image by Annie Ruygt

We’re Fly.io, a global public cloud with simple, developer-friendly ergonomics. If you’ve got a working Docker image, we’ll transmogrify it into a Fly Machine: a VM running on our hardware anywhere in the world. Try it out; you’ll be deployed in just minutes.

At the heart of our platform is a systems design tradeoff about durable storage for applications. When we added storage three years ago, to support stateful apps, we built it on attached NVMe drives. A benefit: a Fly App accessing a file on a Fly Volume is never more than a bus hop away from the data. A cost: a Fly App with an attached Volume is anchored to a particular worker physical.

bird: a BGP4 route server.

Before offering attached storage, our on-call runbook was almost as simple as “de-bird that edge server”, “tell Nomad to drain that worker”, and “go back to sleep”. Attached NVMe storage cost us that drain operation, which terribly complicated the lives of our infra team. We’ve spent the last year getting “drain” back. It’s one of the biggest engineering lifts we’ve made, and if you didn’t notice, we lifted it cleanly.

The Goalposts

With stateless apps, draining a worker is easy. For each app instance running on the victim server, start a new instance elsewhere. Confirm it’s healthy, then kill the old one. Rinse, repeat. At our 2020 scale, we could drain a fully loaded worker in just a handful of minutes.

You can see why this process won’t work for apps with attached volumes. Sure, create a new volume elsewhere on the fleet, and boot up a new Fly Machine attached to it. But the new volume is empty. The data’s still stuck on the original worker. We asked, and customers were not OK with this kind of migration.

Of course, we back Volumes up with snapshots (at an interval) to off-network storage. But for “drain”, restoring backups isn’t nearly good enough. No matter the backup interval, a “restore from backup” migration will lose data, and a “backup and restore” migration incurs untenable downtime.

The next thought you have is, “OK, copy the volume over”. And, yes, of course you have to do that. But you can’t just copy, boot, and then kill the old Fly Machine. Because the original Fly Machine is still alive and writing, you have to kill first, then copy, then boot.

Fly Volumes can get pretty big. Even copying to a rack-buddy physical server, you’ll hit a point where draining incurs minutes of interruption, especially if you’re moving lots of volumes simultaneously. Kill, copy, boot is too slow.

There’s a world where even 15 minutes of interruption is tolerable. It’s the world where you run more than one instance of your application to begin with, so prolonged interruption of a single Fly Machine isn’t visible to your users. Do this! But we have to live in the same world as our customers, many of whom don’t run in high-availability configurations.

Behold The Clone-O-Mat

Copy, boot, kill loses data. Kill, copy, boot takes too long. What we needed was a new operation: clone.

Clone is a lazier, asynchronous copy. It creates a new volume elsewhere on our fleet, just like copy would. But instead of blocking, waiting to transfer every byte from the original volume, clone returns immediately, with a transfer running in the background.

A new Fly Machine can be booted with that cloned volume attached. Its blocks are mostly empty. But that’s OK: when the new Fly Machine tries to read from it, the block storage system works out whether the block has been transferred. If it hasn’t, it’s fetched over the network from the original volume; this is called “hydration”. Writes are even easier, and don’t hit the network at all.

Kill, copy, boot is slow. But kill, clone, boot is fast; it can be made asymptotically as fast as stateless migration.

There are three big moving pieces to this design.

  1. First, we have to rig up our OS storage system to make this clone operation work.
  2. Then, to read blocks over the network, we need a network protocol. (Spoiler: iSCSI, though we tried other stuff first.)
  3. Finally, the gnarliest of those pieces is our orchestration logic: what’s running where, what state it’s in, and whether it’s plugged in correctly.

Block-Level Clone

The Linux feature we need to make this work already exists; it’s called dm-clone. Given an existing, readable storage device, dm-clone gives us a new device, of identical size, where reads of uninitialized blocks will pull from the original. It sounds terribly complicated, but it’s actually one of the simpler kernel lego bricks. Let’s demystify it.

As far as Unix is concerned, random-access storage devices, be they spinning rust or NVMe drives, are all instances of the common class “block device”. A block device is addressed in fixed-size (say, 4KiB) chunks, and handles (roughly) these operations:

enum req_opf {
    /* read sectors from the device */
    REQ_OP_READ     = 0,
    /* write sectors to the device */
    REQ_OP_WRITE        = 1,
    /* flush the volatile write cache */
    REQ_OP_FLUSH        = 2,
    /* discard sectors */
    REQ_OP_DISCARD      = 3,
    /* securely erase sectors */
    REQ_OP_SECURE_ERASE = 5,
    /* write the same sector many times */
    REQ_OP_WRITE_SAME   = 7,
    /* write the zero filled sector many times */
    REQ_OP_WRITE_ZEROES = 9,
    /* ... */
};

You can imagine designing a simple network protocol that supported all these options. It might have messages that looked something like:
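
For instance, a hypothetical sketch (the field names and sizes here are invented; this isn’t any real protocol):

#include <stdint.h>

/* a made-up wire message for a toy block protocol; just the shape of
 * the idea, not something anyone actually speaks */
struct block_req {
    uint32_t op;        /* READ, WRITE, FLUSH, DISCARD, ... */
    uint64_t sector;    /* starting sector on the device */
    uint32_t nr_bytes;  /* length of the transfer */
    uint64_t tag;       /* so replies can be matched to requests */
    /* followed by nr_bytes of payload, for writes */
};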

A packet diagram; just skip down to "struct bio" below.

Good news! The Linux block system is organized as if your computer was a network running a protocol that basically looks just like that. Here’s the message structure:

I’ve stripped a bunch of stuff out of here but you don’t need any of it to understand what’s coming next.

/*
 * main unit of I/O for the block layer and lower layers (ie drivers and
 * stacking drivers)
 */
struct bio {
    struct gendisk      *bi_disk;
    unsigned int        bi_opf;
    unsigned short      bi_flags;
    unsigned short      bi_ioprio;
    blk_status_t        bi_status;
    unsigned short      bi_vcnt;          /* how many bio_vec's */
    struct bio_vec      bi_inline_vecs[]; /* (page, len, offset) tuples */
};

No nerd has ever looked at a fixed-format message like this without thinking about writing a proxy for it, and struct bio is no exception. The proxy system in the Linux kernel for struct bio is called device mapper, or DM.

DM target devices can plug into other DM devices. For that matter, they can do whatever the hell else they want, as long as they honor the interface. It boils down to a map(bio) function, which can dispatch a struct bio, or drop it, or muck with it and ask the kernel to resubmit it.
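
To get a feel for the shape of that interface, here’s a toy pass-through target’s map callback (a sketch, not a real in-tree target): it points every bio at a single underlying device and hands it back to the DM core to resubmit.

#include <linux/device-mapper.h>

/* toy pass-through target: forward every bio, unmodified, to the one
 * device this target was configured with (stashed in ti->private by
 * the target's constructor) */
static int passthrough_map(struct dm_target *ti, struct bio *bio)
{
    struct dm_dev *dev = ti->private;

    /* point the bio at the underlying device... */
    bio_set_dev(bio, dev->bdev);

    /* ...and ask the DM core to resubmit it there */
    return DM_MAPIO_REMAPPED;
}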

You can do a whole lot of stuff with this interface: carve a big device into a bunch of smaller ones (dm-linear), make one big striped device out of a bunch of smaller ones (dm-stripe), do software RAID mirroring (dm-raid1), create snapshots of arbitrary existing devices (dm-snap), cryptographically verify boot devices (dm-verity), and a bunch more. Device Mapper is the kernel backend for the userland LVM2 system, which is how we do thin pools and snapshot backups.

Which brings us to dm-clone: it’s a map function that boils down to:

    /* ... */ 
    region_nr = bio_to_region(clone, bio);

    // we have the data
    if (dm_clone_is_region_hydrated(clone->cmd, region_nr)) {
        remap_and_issue(clone, bio);
        return 0;

    // we don't and it's a read
    } else if (bio_data_dir(bio) == READ) {
        remap_to_source(clone, bio);
        return 1;
    }

    // we don't and it's a write
    remap_to_dest(clone, bio);
    hydrate_bio_region(clone, bio);
    return 0;
    /* ... */ 

a kcopyd thread runs in the background, hydrating the device in addition to (and independent of) read accesses.

dm-clone takes, in addition to the source device to clone from, a “metadata” device on which is stored a bitmap of the status of every region: either hydrated from the source, or not. That’s how it knows whether to fetch a block from the original device or serve it from the clone.
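
Conceptually, that metadata is just one bit per region; here’s a toy illustration of the idea (this is not dm-clone’s actual on-disk format):

#include <stdbool.h>
#include <stdint.h>

/* one bit per region: 1 = hydrated (safe to read locally),
 * 0 = still only on the source device */
static bool region_is_hydrated(const uint8_t *bitmap, uint64_t region)
{
    return bitmap[region / 8] & (1u << (region % 8));
}

static void mark_region_hydrated(uint8_t *bitmap, uint64_t region)
{
    bitmap[region / 8] |= (1u << (region % 8));
}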

Network Clone

flyd in a nutshell: worker physicals run a service, flyd, which manages a couple of databases that are the source of truth for all the Fly Machines running there. Conceptually, flyd is a server for on-demand instances of durable finite state machines, each representing some operation on a Fly Machine (creation, start, stop, &c), with the transition steps recorded carefully in a BoltDB database. An FSM step might be something like “assign a local IPv6 address to this interface”, or “check out a block device with the contents of this container”, and it’s straightforward to add and manage new ones.

Say we’ve got flyd managing a Fly Machine with a volume on worker-xx-cdg1-1. We want it running on worker-xx-cdg1-2. Our whole fleet is meshed with WireGuard; everything can talk directly to everything else. So, conceptually:

  1. flyd on cdg1-1 stops the Fly Machine, and
  2. sends a message to flyd on cdg1-2 telling it to clone the source volume.
  3. flyd on cdg1-2 starts a dm-clone instance, which creates a clone volume on cdg1-2, populating it, over some kind of network block protocol, from cdg1-1, and
  4. boots a new Fly Machine, attached to the clone volume.
  5. flyd on cdg1-2 monitors the clone operation, and, when the clone completes, converts the clone device to a simple linear device and cleans up.

For step (3) to work, the “original volume” on cdg1-1 has to be visible on cdg1-2, which means we need to mount it over the network.

nbd is so simple that it gets used as a sort of userland block device framework, a bit like dm-user: to prototype a new block device, don’t bother writing a kernel module, just write an nbd server.

Take your pick of protocols. iSCSI is the obvious one, but it’s relatively complicated, and Linux has native support for a much simpler one: nbd, the “network block device”. You could implement an nbd server in an afternoon, on top of a file or a SQLite database or S3, and the Linux kernel could mount it as a drive.

We started out using nbd. But we kept getting stuck nbd kernel threads when there was any kind of network disruption. We’re a global public cloud; network disruption happens. Honestly, we could have debugged our way through this. But it was simpler just to spike out an iSCSI implementation, observe that it didn’t get jammed up when the network hiccuped, and move on.

Putting The Pieces Together

To drain a worker with minimal downtime and no lost data, we turn workers into temporary SANs, serving the volumes we need to drain to fresh-booted replica Fly Machines on a bunch of “target” physicals. Those SANs — combinations of dm-clone, iSCSI, and our flyd orchestrator — track the blocks copied from the origin, copying each one exactly once and cleaning up when the original volume has been fully copied.

Problem solved!

No, There Were More Problems

When your problem domain is hard, anything you build whose design you can’t fit completely in your head is going to be a fiasco. Shorter form: “if you see Raft consensus in a design, we’ve done something wrong”.

A virtue of this migration system is that, for as many moving pieces as it has, it fits in your head. What complexity it has is mostly shouldered by strategic bets we’ve already built teams around, most notably the flyd orchestrator. So we’ve been running this system for the better part of a year without much drama. Not no drama, though. Some drama.

Example: we encrypt volumes. Our key management is fussy. We do per-volume encryption keys that are provisioned alongside the volumes themselves, so no one worker has a skeleton key for every volume.

If you think “migrating those volume keys from worker to worker” is the problem I’m building up to, well, that too, but the bigger problem is trim.

Most people use just a small fraction of the space in the volumes they allocate. A 100GiB volume with just 5MiB used wouldn’t be at all weird. You don’t want to spend minutes copying a volume that could have been fully hydrated in seconds.

And indeed, dm-clone doesn’t want to do that either. Given a source block device (for us, an iSCSI mount) and the clone device, a DISCARD issued on the clone device will get picked up by dm-clone, which will simply short-circuit the read of the relevant blocks by marking them as hydrated in the metadata volume. Simple enough.

To make that work, we need the target worker to see the plaintext of the source volume (so that it can do an fstrim — don’t get us started on how annoying it is to sandbox this — to read the filesystem, identify the unused blocks, and issue the DISCARDs where dm-clone can see them). Easy enough.
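
For the curious: fstrim itself boils down to the FITRIM ioctl against a mounted filesystem. A minimal sketch (the mountpoint here is hypothetical; error handling omitted):

#include <fcntl.h>
#include <linux/fs.h>       /* FITRIM, struct fstrim_range */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

int main(void)
{
    /* any fd on the mounted (cloned) filesystem will do */
    int fd = open("/mnt/clone", O_RDONLY);
    if (fd < 0)
        return 1;

    struct fstrim_range range = {
        .start  = 0,
        .len    = UINT64_MAX,   /* trim the whole filesystem */
        .minlen = 0,
    };

    /* the filesystem walks its free-space map and issues DISCARDs for
     * unused blocks; dm-clone sees those and skips the regions */
    if (ioctl(fd, FITRIM, &range) < 0)
        return 1;

    printf("trimmed %llu bytes\n", (unsigned long long)range.len);
    return 0;
}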

these curses have a lot to do with how hard it was to drain workers!

Except: two different workers, for cursed reasons, might be running different versions of cryptsetup, the userland bridge between LUKS2 and the kernel dm-crypt driver. There are (or were) two different versions of cryptsetup on our network, and they default to different LUKS2 header sizes — 4MiB and 16MiB. That implies two different plaintext volume sizes for what should be identical volumes: the bigger header leaves 12MiB less room for plaintext.

So now part of the migration FSM is an RPC call that carries metadata about the desired LUKS2 configuration for the target VM. Not something we expected to have to build, but, whatever.

Corrosion deserves its own post.

Gnarlier example: workers are the source of truth for information about the Fly Machines running on them. Migration knocks the legs out from under that constraint, which we were relying on in Corrosion, the SWIM-gossip SQLite database we use to connect Fly Machines to our request routing. Race conditions. Debugging. Design changes. Next!

Gnarliest example: our private networks. Recall: we automatically place every Fly Machine into a private network; by default, it’s the one all the other apps in your organization run in. This is super handy for setting up background services, databases, and clustered applications. 20 lines of eBPF code in our worker kernels keeps anybody from “crossing the streams”, sending packets from one private network to another.
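
To sketch the idea (this is not Fly’s actual program, and the address layout below is invented): a tc-attached eBPF filter that drops any packet whose source and destination addresses claim different private networks.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ipv6.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int private_net_guard(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return TC_ACT_OK;
    if (eth->h_proto != bpf_htons(ETH_P_IPV6))
        return TC_ACT_OK;

    struct ipv6hdr *ip6 = (void *)(eth + 1);
    if ((void *)(ip6 + 1) > data_end)
        return TC_ACT_OK;

    /* pretend bytes 2..5 of the address carry the private-network ID;
     * the real 6PN layout is different */
    __u32 src_net, dst_net;
    __builtin_memcpy(&src_net, &ip6->saddr.s6_addr[2], sizeof(src_net));
    __builtin_memcpy(&dst_net, &ip6->daddr.s6_addr[2], sizeof(dst_net));

    /* same network: pass it along; crossing the streams: drop it */
    return src_net == dst_net ? TC_ACT_OK : TC_ACT_SHOT;
}

char _license[] SEC("license") = "GPL";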

we’re members of an elite cadre of idiots who have managed to find designs that made us wish IPv6 addresses were even bigger.

We call this scheme 6PN (for “IPv6 Private Network”). It functions by embedding routing information directly into IPv6 addresses. This is, perhaps, gross. But it allows us to route diverse private networks with constantly changing membership across a global fleet of servers without running a distributed routing protocol. As the beardy wizards who kept the Internet backbone up and running on Cisco AGS+’s once said: the best routing protocol is “static”.

Problem: the embedded routing information in a 6PN address refers in part to specific worker servers.

That’s fine, right? They’re IPv6 addresses. Nobody uses literal IPv6 addresses. Nobody uses IP addresses at all; they use the DNS. When you migrate a Fly Machine, just give it a new 6PN address, and update the DNS.

Friends, somebody did use literal IPv6 addresses. It was us. In the configurations for Fly Postgres clusters.

It’s also not operationally easy for us to shell into random Fly Machines, for good reason.

The obvious fix for this is not complicated; given flyctl ssh access to a Fly Postgres cluster, it’s like a 30-second ninja edit. But we run a lot of Fly Postgres clusters, and the change has to be coordinated carefully to avoid getting the cluster into a confused state. We went as far as adding a feature to our init to do network address mappings, keeping old 6PN addresses reachable, before biting the bullet and burning several weeks doing the direct configuration fix fleet-wide.

The Learning, It Burns!

We get asked a lot why we don’t do storage the “obvious” way, with an EBS-type SAN fabric, abstracting it away from our compute. Locally-attached NVMe storage is an idiosyncratic choice, one we’ve had to write disclaimers for (single-node clusters can lose data!) since we first launched it.

One answer is: we’re a startup. Building SAN infrastructure in every region we operate in would be tremendously expensive. Look at any feature in AWS that normal people know the name of, like EC2, EBS, RDS, or S3 — there’s a whole company in there. We launched storage when we were just 10 people, and even at our current size we probably have nothing resembling the resources EBS gets. AWS is pretty great!

But another thing to keep in mind is: we’re learning as we go. And so even if we had the means to do an EBS-style SAN, we might not build it today.

Instead, we’re a lot more interested in log-structured virtual disks (LSVD). LSVD uses NVMe as a local cache, but durably persists writes in object storage. You get most of the performance benefit of bus-hop disk writes, along with unbounded storage and S3-grade reliability.

We launched LSVD experimentally last year; in the intervening year, something happened to make LSVD even more interesting to us: Tigris Data launched S3-compatible object storage in every one of our regions, so instead of backhauling updates to Northern Virginia, we can keep them local. We have more to say about LSVD, and a lot more to say about Tigris.

Our first several months of migrations were done gingerly. By the summer of 2024, we’d gotten to the point where our infra team could pull “drain this host” out of their toolbelt without much ceremony.

We’re still not to the point where we’re migrating casually. Your Fly Machines are probably not getting migrated! There’d need to be a reason! But the dream is fully-automated luxury space migration, in which you might get migrated semiregularly, as our systems work not just to drain problematic hosts but to rebalance workloads regularly. No time soon. But we’ll get there.

This is the biggest thing our team has done since we replaced Nomad with flyd. Only the new billing system comes close. We did this thing not because it was easy, but because we thought it would be easy. It was not. But: worth it!