2025-03-01

Trailing 2 Weeks Incidents

[Diagram: two weeks of incidents. Larger boxes are longer incidents; darker boxes are higher-impact.]

  • February 25: Storage Disruption On SJC Worker (07:30): A lightly loaded worker server in SJC corrupted its LVM metadata. This is a kernel issue we’ve been wrestling with for some time; it occurs only on a specific species of worker hardware (the so-called “big bois”) and probably traces back to the metadata getting corrupted when the hardware is reset. Ordinarily our answer to this problem would be to migrate all the Fly Machines onto a different SJC server; in this case, the corruption left flyd stalled trying to set up block devices. As a result, a “rescue mode” for flyd is in progress: rather than doing that kind of initialization, it merely gets itself into a state where it can handle migrations (a minimal sketch of the idea follows this list). This generated a host status update for the customers whose workloads were resident on the machine; with the number of worker servers in our fleet, things like this are going to happen semi-regularly (the more common variant of this kind of incident is “SSD literally imploded”, which is worse).

  • February 25: Anycast Routing In Stockholm (08:30): Internal health checks flagged a fly-proxy instance in ARN erroring out (complaining of exhausted file descriptors). The proxy was bounced and the issue resolved relatively quickly; investigation turned up a probable cause: an event-loop design issue stemming from our work on regionalizing and sharding the proxy. The proxy now has a persistent “acceptor” layer that takes incoming connections and hands them off to (dynamic) full instances of the proxy over a Unix socket, and the proxy instance wasn’t keeping up with the batched connections the acceptor was sending (see the fd-passing sketch after this list). Minimal customer impact, but interesting!

  • February 26: Internal Corrosion/Consul Sync Failure (14:30): Corrosion is our distributed state-sharing system and our “replacement” for HashiCorp Consul, which we continue to run. Consul is largely out of the line of fire for customer workloads (we can start and manage Fly Machines without it), but we continue to rely on it for our own infrastructure, and we operate a bridge between the two services. Consul has a centralized source of truth, a cluster of very large servers in IAD; Corrosion is distributed, with every running instance across our fleet holding a full complement of state data. Corrosion maintains a local table (effectively a sort of cache) for some Consul data, and that table got stuck on a small number of hosts, preventing updates to our infrastructure from propagating. Clearing the cache and bouncing Corrosion resolved the problem (a sketch of that style of remediation closes this entry). No customer impact.

  • February 27: Partial Amsterdam Upstream Outage (14:00): Our upstream provider experienced a switch failure that broke Internet forwarding to about a third of our AMS worker fleet for roughly 2 hours. We retained access to the cut-off machines (they couldn’t receive traffic from the Internet, but could from our other AMS hosts not on the same switch) and began working out a migration process for the workloads on those machines; by the time a plan had been put together for this particular corner case, the switch had been replaced. Ordinarily, we’d have done host status updates for the affected hosts (since the outage was particular to a small number of them), but the fraction of the region involved was large enough that we status-paged the whole region.
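
Here’s a minimal sketch of the rescue-mode idea from the SJC storage incident, assuming a hypothetical command-line flag and stub helpers; this is not flyd’s actual code, just the shape of the fix. Startup skips the volume initialization that corrupt LVM metadata stalls on, while keeping enough of the daemon alive to evacuate Machines:

```go
// Hypothetical sketch of a flyd-style "rescue mode": skip block-device
// initialization (which corrupt LVM metadata can stall indefinitely) and
// bring up only what's needed to migrate workloads off the host.
package main

import (
	"flag"
	"log"
)

// setupBlockDevices stands in for the LVM/devmapper work a healthy boot does.
func setupBlockDevices() error {
	// ... scan volume groups, activate logical volumes, etc.
	return nil
}

// serveMigrationAPI stands in for the minimal surface needed to evacuate
// Fly Machines to another worker.
func serveMigrationAPI() {
	// ... accept migration requests, stream volumes off-host, etc.
	select {}
}

func main() {
	rescue := flag.Bool("rescue", false,
		"skip volume initialization; handle migrations only")
	flag.Parse()

	if *rescue {
		log.Println("rescue mode: skipping block device init")
	} else if err := setupBlockDevices(); err != nil {
		// Without a rescue mode, this is where startup stalls or dies
		// when the on-disk LVM metadata is corrupt.
		log.Fatalf("block device init: %v", err)
	}
	serveMigrationAPI()
}
```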
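
The Stockholm incident’s acceptor/instance split is an instance of a well-worn Unix technique: passing accepted connections between processes as file descriptors over a Unix socket (SCM_RIGHTS). fly-proxy isn’t written in Go and this isn’t its code; the sketch below, with a made-up socket path, just shows the sending side of the technique:

```go
// Minimal sketch of an "acceptor" process handing accepted TCP connections
// to a worker over a Unix socket via SCM_RIGHTS fd passing. The socket path
// is hypothetical. Requires golang.org/x/sys/unix.
package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	ln, err := net.Listen("tcp", ":8443")
	if err != nil {
		log.Fatal(err)
	}

	// Long-lived channel to the (restartable) proxy instance.
	worker, err := net.DialUnix("unix", nil,
		&net.UnixAddr{Name: "/run/proxy-handoff.sock", Net: "unix"})
	if err != nil {
		log.Fatal(err)
	}

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// File() duplicates the connection's fd; the worker gets its own copy.
		f, err := conn.(*net.TCPConn).File()
		if err != nil {
			conn.Close()
			continue
		}
		// One byte of in-band data, plus the fd as ancillary (SCM_RIGHTS)
		// data. If the worker drains these more slowly than we send them,
		// this is where things back up.
		rights := unix.UnixRights(int(f.Fd()))
		if _, _, err := worker.WriteMsgUnix([]byte{0}, rights, nil); err != nil {
			log.Fatal(err)
		}
		f.Close()
		conn.Close() // the worker owns its dup; our copies can go away
	}
}
```

On the receiving end, the worker reads each message with ReadMsgUnix, recovers the descriptor with unix.ParseSocketControlMessage and unix.ParseUnixRights, and rehydrates it into a connection with os.NewFile and net.FileConn. A worker that can’t keep up with batched handoffs fails in roughly the way the incident describes.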
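
And for the Corrosion incident, a sketch of the shape of the remediation. Corrosion keeps its state in SQLite, so “clear the stuck cache table, then bounce the service so it re-syncs” is conceptually a statement or two against the local database. The database path and table name below are invented for illustration, and this isn’t necessarily how we actually performed the fix:

```go
// Hypothetical sketch: delete the stuck rows from a local cache table in
// Corrosion's SQLite store, then restart the service so it re-syncs from
// Consul. Path and table name are made up for illustration.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "/var/lib/corrosion/state.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Drop the cached Consul rows; a restart repopulates them from upstream.
	res, err := db.Exec(`DELETE FROM consul_service_cache`)
	if err != nil {
		log.Fatal(err)
	}
	n, _ := res.RowsAffected()
	log.Printf("cleared %d cached rows; now restart corrosion to re-sync", n)
}
```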