A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and is a superset both of our status page events and of customer-impacting events on the platform. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact.)
March 24: FKS K8s Failure in ORD (5:00EST): Internal testing of MPG (Managed Postgres), which depends on FKS, alerted the team that something weird was up. Within a minute or two of investigation, we learned that we’d lost a decent chunk of our FKS control plane cluster (etcd, principally) in IAD, and most of that cluster in ORD. Skipping ahead to the root cause: a confluence of things happened that should not have happened: a correlated hardware failure on the equipment running the FKS control plane, along with a provisioning failure that left the root volumes on those machines in a nonstandard state. During the 40-minute acute phase of the incident, FKS in ORD was unavailable, as was the control plane for MPG (which we haven’t released yet). Two weird things about this incident: first, our provisioning is all automated, so the drive misconfiguration shouldn’t have been possible; second, an FKS control plane failure shouldn’t have disrupted MPG.

March 25: SJC Networking Issues (11:00EST): Our upstream provider took a data center deployment in SJC (the one we host worker servers in) down for maintenance, and did not bring it back smoothly (we lost LACP pairs, and some routing seemed messed up). Having a region in a half-working, half-randomly-not-working state confused some of our internal tooling, which added another couple of minutes of drama while services got bounced. The whole thing took about 20 minutes to resolve completely.
March 26: Capacity Issues In FRA (06:30EST): A spike in Fly Machines API errors in FRA woke up the Machines team, who declared an incident. Three things were happening simultaneously: our infrastructure team was migrating workloads between servers, a customer was spinning up an anomalously large number of machines in Europe, and we tripped a corner case in how LVM2 was configured on our workers, which generated disk utilization numbers that prevented our orchestrator from launching machines. Within 45 minutes, we’d mitigated the immediate LVM2 issue, but the underlying capacity issue was genuine and it took another 45 minutes or so to bring burst capacity online in the region; during this time window, launching new machines was unreliable in FRA (something most apparent to people using Depot Docker builders in the region).
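To make that failure mode concrete, here’s a minimal sketch (in Go; a hypothetical illustration, not our orchestrator’s actual code) of the kind of placement check a bad utilization figure can trip. The worker name, threshold, and numbers are all invented: if the “used” figure a worker reports is inflated, say by an LVM corner case, the scheduler refuses launches even though the disks have plenty of headroom.

```go
// Hypothetical sketch: a placement check that refuses to schedule new
// Machines on a worker whose reported disk utilization crosses a threshold.
// If the utilization figure is wrong, the check blocks launches even when
// real capacity exists.
package main

import "fmt"

// workerDisk is a made-up summary of what a worker might report.
type workerDisk struct {
	name       string
	totalBytes uint64
	usedBytes  uint64 // derived from LVM stats; garbage in, garbage out
}

// maxDiskUtilization is an example threshold, not a real Fly.io value.
const maxDiskUtilization = 0.90

func canPlaceMachine(w workerDisk, requestBytes uint64) bool {
	util := float64(w.usedBytes+requestBytes) / float64(w.totalBytes)
	return util <= maxDiskUtilization
}

func main() {
	// A worker with real headroom, but whose LVM-derived "used" figure is
	// inflated by a configuration corner case.
	w := workerDisk{
		name:       "worker-fra-01",
		totalBytes: 4_000_000_000_000, // ~4 TB of physical capacity
		usedBytes:  3_900_000_000_000, // inflated by the misreported stats
	}
	ok := canPlaceMachine(w, 50_000_000_000)        // a 50 GB request
	fmt.Printf("can place on %s: %v\n", w.name, ok) // false: launch refused
}
```

In the real incident both halves mattered: the reported numbers were wrong, and the region genuinely needed more burst capacity.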
March 28: Historical Log Availability Outage (09:00EST): The Quickwit cluster we use for “historical” logs for customer applications (as opposed to the real-time logs you see when you’re watching a deployment) runs on a set of Fly Machines, with metadata storage backed by Fly Volumes. The Volumes we used for this cluster were undersized for the amount of metadata growth we experienced, and queries to the cluster began generating 500 errors. We expanded the volume size; logs were available again 30 minutes after the incident started.
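The remediation here was just growing the volumes; the more interesting lesson is catching metadata growth before queries start failing. A minimal sketch of that kind of free-space check (Go, Linux-only; the mount path and threshold are assumptions, not our actual monitoring):

```go
// Hypothetical free-space check for a volume holding index metadata.
// Linux-only; the mount path and threshold are invented for illustration.
package main

import (
	"fmt"
	"log"
	"syscall"
)

const (
	metadataMount = "/data" // where the Fly Volume is mounted (assumed)
	warnBelowFree = 0.20    // warn when less than 20% of the volume is free
)

// freeFraction returns the fraction of the filesystem at path that is
// still available to unprivileged writers.
func freeFraction(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return float64(st.Bavail) / float64(st.Blocks), nil
}

func main() {
	free, err := freeFraction(metadataMount)
	if err != nil {
		log.Fatal(err)
	}
	if free < warnBelowFree {
		// In practice this would feed a metric or page someone, not print.
		fmt.Printf("%s is %.0f%% full; extend the volume before queries 500\n",
			metadataMount, (1-free)*100)
	}
}
```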