2024-08-10

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents.)

  • August 7: Partially Broken flyd Deployment (13:30EST): Our Fly Machines development team rolled out a change to flyd, our orchestrator, which altered the way it stored configurations for Fly Machines: from a hand-rolled struct to the on-the-wire version of that struct the Protobuf compiler generates. The change missed a corner case: if it was deployed on a flyd that was in the process of updating metadata for a Machine, the new code could reach an unhandled error path. Additionally, the PR at issue tempted fate by leading off in its description with “We’ve seen enough foot guns with trying to maintain two different representations”. For roughly 20 minutes, some subset of Fly Machines on a subset of our hosts would have returned API errors on routine Machines API operations; a complete fix was rolled out within an hour.

  • August 7: Long Response Times From Metrics Cluster (11:00EST): A customer alerted us through a support ticket to intermittent severe slowdowns (on the order of low double-digit seconds) for queries to Fly Metrics. The problem was traced to a recent configuration change on our VictoriaMetrics backend (in one region) that was causing its internal request router component to OOM.

  • August 8: Storage Capacity Exhausted In Sweden (15:00EST): Requests to place Fly Machines with volumes in ARN began failing due to inadequate disk capacity on worker servers. The physicals in ARN were running in a nonstandard disk configuration; the workers impacted really were out of disk, annoyingly. We mitigated the issue within 40 minutes by migrating workloads to clear space for new Fly Machines, and initiated some long-term capacity planning and data center backup work for newer Fly.io regions. Note that “creating a new volume in a different region” is an operation that is allowed to fail in our API! With the exception of a few core regions, users are expected to be prepared to retry or place workloads in alternate regions.
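
For the ARN incident above, "be prepared to retry or place workloads in alternate regions" could look something like the following on the client side. This is a minimal sketch, not our API or anybody's production code: `create_with_fallback`, `PlacementError`, and the `create` callable are all hypothetical names made up for illustration, and you would plug in whatever client call you actually use to create a volume or Machine.

```python
import time
from typing import Any, Callable, Optional, Sequence, Tuple


class PlacementError(Exception):
    """Hypothetical error: the region could not place the workload."""


def create_with_fallback(
    create: Callable[[str], Any],
    regions: Sequence[str],
    attempts_per_region: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Any]:
    """Try each region in preference order, retrying briefly before moving on.

    `create` is an assumed callable that takes a region code and either
    returns the created resource or raises PlacementError.
    """
    last_error: Optional[PlacementError] = None
    for region in regions:
        for attempt in range(attempts_per_region):
            try:
                return region, create(region)
            except PlacementError as err:
                last_error = err
                # Back off exponentially between retries in the same region;
                # skip the wait when we are about to move to the next region.
                if attempt + 1 < attempts_per_region:
                    sleep(base_delay * 2 ** attempt)
    raise PlacementError(f"no capacity in any of {list(regions)}") from last_error
```

Calling `create_with_fallback(my_create, ["arn", "ams", "fra"])` would try ARN twice, then fall through to the next regions in the list. The ordering of the list is the caller's policy decision; the sketch assumes the workload tolerates landing in any of the listed regions.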

This Week In Infra Engineering

Akshit began a project to diversify our edge providers and edge routing. Recall that our production infra is broadly divided into edge hosts (that receive traffic from the Internet, terminate TLS, and route it) and worker hosts (that actually host Fly Machine VMs). We have more flexibility on which providers and datacenters we can run edge hosts on, because they don’t require chonky servers. Akshit is working out the processes and network engineering (like per-region-provider IP blocks) required for us to take better advantage of available host and network inventory for edges. Ideally, we’ll wrap this project up with same-region backup routing (via different providers) in our key regions.

Peter spent the week sick. He wants you to know he feels better now.

Steve is working on rehabilitating RAID10 hosts. This is a beast of an issue that has been taunting us since late 2022: we took delivery of a bunch of extremely chonky worker servers that would handle our workloads just fine for a period, and then lock up in unkillable LVM threads. We solved those problems for customers by migrating workloads off those machines, and now Steve is doing the storage subsystem brain surgery required to find out if we can bring them back into service.

Somtochi has moved from Pet Semetary (our Vault replacement, which she got deployed fleetwide) to Corrosion2 (which she drastically improved the performance and resiliency of) to fly-proxy, the engine for our Anycast network. She’s picking up where Peter left off, with backup routing, by extending it to raw TCP connections and not just HTTP (reminder: if it speaks TCP on any port, you can run it here).

Dusty is working with one of our upstream hardware providers to get us end-to-end control over machine provisioning, rather than having them hand off physicals with BMC connections for us to provision. Faster, tighter physical host provisioning means we can bring up capacity in regions more quickly; we’ve been O.K. there to date, but that leads us to:

John is working on “infinite capacity” burst provisioning processes, which is a shorthand for “you can ask us for 1000 16G Fly Machines” (it has happened) “and we will just automatically spin up the underlying hardware capacity needed to satisfy that request”. We’re a ways off on this but expect it to be a theme of our updates (if it pans out). Again, this is primarily of interest to people who expect to have sudden or sporadic needs for large numbers of Fly Machines in very specific places.