2024-09-21

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • September 16: Internal ElasticSearch Outage (16:00EST): Concurrent with an internal deployment of our flyd orchestrator, our internal logs stopped responding to queries for newly ingested logs. Shortly thereafter, we got alerts about ElasticSearch components crashing (we run a large ES cluster with many hosts; the alerts only impacted a few). Diagnosing ES issues is a black art; we were exceeding our field limit (or had been for some time and only just noticed), and bouncing ES components got things back into a steady state. Minimal known customer impact.

  • September 18: Depot Builders Failing (09:00EST): Most containers that run on Fly.io are built using “remote builders”: Fly Machines that run Buildkit (“Docker”, essentially) so that user laptops, many of which are ARM64, don’t have to. We’ve offered remote builders since 2021, but recently began a partnership with Depot, a team that specializes in doing fast container builds. This morning, Depot builds, which became the platform default within the last couple of weeks, stopped working; Depot’s telemetry alerted us to Fly Machines Volumes API errors. Fly’s own “native” remote builders continued to work; we posted a workaround on the status page within a few minutes, and then cut the platform over to native builders with a feature flag shift, which resolved the incident. The issue turned out to have been an API incompatibility (really, a database access bug surfaced by an API change) in our orchestrator code.

  • September 18: Internal ElasticSearch Outage (13:00EST): Roughly the same thing happened as on September 16th, but this time with the added bonus fun of certificate expiration issues that forced us to re-authenticate the cluster. Restarting ES this time involved a lengthy recovery process that stalled repeatedly and pushed some components close to their Java heap max sizes; in other words, a lot of handholding was involved in fixing the cluster. This was again primarily an internal issue (we take it seriously, because we rely on these logs for our own incident response); it would have messed up some UI features that print log lines “in passing”, but not customer logs in general.

  • September 19: Capacity Issues In East Coast Regions (12:00EST): A customer running batch jobs allocated a very large number of performance instances across IAD, BOS, and EWR. To this point, our rate and CPU limits have focused largely on shared instances (which are inexpensive and see a lot of abuse usage); this was a totally legit customer that just happened to be demanding a surprising amount of instantaneous compute. For about 30 minutes, compute jobs in these regions saw CPU steal and performance degradation, which we resolved by adjusting CPU limits. The medium-term fix, which we’ll talk about in a bit, involves improving workload scheduling to avoid concentrating these kinds of workloads on specific worker hosts.

This Week In Infra Engineering

Will did a deep dive into I/O scheduling in our LHR clusters, after Tigris nudged us about performance/reliability issues in their FoundationDB cluster running in that region. Using metrics, system configuration, and statekeeping data, Will isolated Tigris’s workloads to a concentrated cluster of SSDs with a particular make/model, which we now know to have an iffy performance envelope for the kinds of work we do. The bigger problem wasn’t so much the drives as it was the scheduling we did: because Tigris created their series of volumes for this region in rapid succession, they all got scheduled on a small subset of our storage capacity in the region. Worse, a consequence of our scheduling algorithm was that the periodic snapshot backups of these volumes were all scheduled to fire in tandem, concentrating a large amount of I/O activity on a small number of drives (the graph of what was happening looked like a stable EKG). Scheduling improvements are a theme of the infra work that we’re doing right now, but this issue in particular surfaced an unexpected (and straightforward to fix) issue: we needed to be adding jitter to the timing of our backup jobs.
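As a rough illustration of what adding jitter to backup timing can look like, here is a minimal Python sketch. It is not our actual scheduler code: the function names, the one-hour window, and the choice of a hash-derived (rather than random) offset are all assumptions made for the example. The idea is just that each volume gets its own stable offset inside a window, so volumes created back-to-back stop firing their snapshots at the same instant.

```python
import hashlib


def jittered_offset(volume_id: str, window_seconds: int) -> int:
    """Return a stable per-volume offset in [0, window_seconds).

    Hypothetical helper: hashing the volume ID gives each volume the same
    offset on every run, while spreading different volumes across the window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(volume_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


def snapshot_start(base_epoch: int, volume_id: str, window_seconds: int = 3600) -> int:
    """Shift the nominal snapshot time (Unix seconds) by the volume's jitter."""
    return base_epoch + jittered_offset(volume_id, window_seconds)


if __name__ == "__main__":
    # Volume IDs here are made up for the example.
    base = 1_726_876_800
    for vol in ("vol_a", "vol_b", "vol_c"):
        print(vol, snapshot_start(base, vol) - base, "seconds after the nominal time")
```

A deterministic offset keeps each volume's backup interval regular from run to run; a random offset drawn per run would also break up the synchronized I/O spikes, at the cost of slightly uneven intervals.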

Ben is hip-deep in a fleetwide upgrade of our worker OS distributions. This is a tricky and annoying problem. Most of what runs on a worker physical for us is software we build and ship ourselves, in some cases several times a week. Beyond that, we have an established runbook for upgrading our OS kernels; we don’t run the distro-version-native kernel version anywhere (we have fussy eBPF dependencies, among other things). But the distro itself, which in particular sticks us with a specific version of systemd, is a gigantic pain to upgrade; we have consistent OS kernels across the fleet, but not consistent distro versions. That’s changing, but it’s a complicated process, involving workload migrations and, in some cases, reprovisioning servers, which surfaces fun bugs like “the semi-random identifiers we create for Fly Machines are influenced by the provenance of the worker physical on which it was created, meaning a reprovisioned host can cause Machine ID collisions”. That bug hasn’t happened, because Ben is auditing the stack for problems like that.

Dusty has been the capacity czar for the past couple months. As you can see from this week’s BOS outage, this is an important issue for us. We’ve integrated our scheduler state with system metrics and used that to create new capacity threshold numbers for the fleet, which now inform our provisioning; a lot of the guesswork has been taken out of where we’re shipping and racking new servers. We’ve shifted some capacity (ORD gave some servers to EWR, for instance), and reallocated a backlog of servers to different regions. We now have a runbook for provisioning new capacity in existing regions that uses a lightweight version of our machine migration (for apps without storage) to rebalance as capacity is added.

JP has been working on improving Fly Machine scheduling across the fleet, which was also implicated in an incident this week. We now have stricter placement logic that ensures multiple Fly Machines for the same app created concurrently are distributed across worker physicals. Our Machines API now handles some of the retry logic that our CLI, flyctl, was using before; unlike flyctl, which is open-source and doesn’t have APIs that can specifically place a Fly Machine on a specific worker physical, our API gateways do have visibility into the physicals in a region, and we now take advantage of that.
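To make the "distribute across worker physicals" idea concrete, here is a small Python sketch of spread placement. It is an illustration under assumptions, not our scheduler: the `place_machines` function, the slot-count model of host capacity, and the tie-breaking rules are all invented for the example. Each new Machine goes to the host that currently holds the fewest Machines for the app, preferring hosts with more free capacity when there is a tie.

```python
from typing import Dict, List, Optional


def place_machines(
    count: int,
    free_slots: Dict[str, int],
    already_on: Optional[Dict[str, int]] = None,
) -> List[str]:
    """Choose a host for each of `count` new Machines belonging to one app.

    free_slots: hypothetical map of host name -> remaining capacity.
    already_on: hypothetical map of host name -> Machines this app already has there.
    """
    free = dict(free_slots)            # copy so the caller's state is untouched
    per_host = dict(already_on or {})
    placements: List[str] = []
    for _ in range(count):
        candidates = [h for h, slots in free.items() if slots > 0]
        if not candidates:
            raise RuntimeError("no capacity left in region")
        # Fewest Machines of this app first, then most free slots, then name.
        host = min(candidates, key=lambda h: (per_host.get(h, 0), -free[h], h))
        placements.append(host)
        per_host[host] = per_host.get(host, 0) + 1
        free[host] -= 1
    return placements


if __name__ == "__main__":
    print(place_machines(4, {"host-a": 5, "host-b": 5, "host-c": 5}))
```

Because the choice is made per Machine against updated counts, a burst of creates for one app fans out across hosts instead of piling onto whichever host looked emptiest when the burst began.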