2025-05-31

A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and is a superset both of our status page events and of customer-impacting events on the platform. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact.)

  • May 25: Machine Snapshotting In SJC Failed (09:30EST): Well, snapshotting on two of our SJC workers. What happened here is that the system partition we store snapshot data on filled up (1/3rd of all operational failures on this platform, and I assume every other platform, is some partition filling up) and, while we of course planned for this eventuality in flyd, the FSM steps we wrote to handle this contingency did not handle this contingency. They do now. We manually recovered space on the two workers to get snapshots moving again (within about 30 minutes) and deployed a fix across the fleet sometime later. (There’s a sketch of the kind of guard involved after this list.)

  • May 27: Two Hosts In Paris Rebooted (09:30EST): Sometimes they do that. No, wait, no they don’t. We are alarmed when this happens. But it happened here. We don’t have an explanation yet. If we don’t figure out why this happened, we’ll put these servers out of their misery like downer cows. Both servers rebooted relatively quickly (though one had some weird LVM2 issue at boot and needed some handholding to get logs and metrics working).

  • May 28: GRU Falls Off The Planet (03:30EST): No it didn’t. What actually happened is that our system log ingestion, which we’re migrating between storage backends, became nonfunctional, so it looked, to us, like maybe São Paulo had vanished. It had not. Fly Machines in São Paulo were fine. Logs from Fly Machines in São Paulo were fine. We were not fine; we were traumatized. But we got over it. No customer impact. Logged for completeness.

  • May 28: AMS Physicals Hang (08:00EST): In any given region, we have 1-2 steady-state hardware providers, and 1-3 “burst” providers. We’re going steady with the steady-state providers; your workloads will eventually end up on physicals we get there. But your needs are manifold and sometimes expansive, and it can take time to rack physicals at our steady-state providers. The burst providers can get resources racked quick (but they’re more expensive). We’re auditioning a new burst provider in AMS. They did something funky in their base OS image with respect to udev that confused flyd, our orchestrator. As a result, Machine starts were hanging on those AMS hosts. They(/we) no longer do that thing. It took us about 90 minutes to mitigate the (host-specific) problems in the region and then fix them everywhere in the fleet where this could have happened. (A sketch of a defensive timeout for this kind of hang follows the list.)

  • May 28: Kurt Being Annoying (18:30EST): Our first-responder on-call rota works like this: it’s 24 hours a day for a week, which you do ~1.3 times per year (alert load is pretty civilized, but you get to take the week “off” from normal work regardless). Anyways, Kurt, our intrepid CEO, came up for first-responder duty this week. Some alert pattern spooked him, like a deer startled by a discarded candy wrapper. He triggered an incident. No customer impact. Logged for completeness.
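
For the May 25 snapshot incident: flyd’s actual FSM steps aren’t shown here, so this is only a minimal sketch, assuming Linux and a hypothetical snapshot directory path. It shows the kind of guard we wanted: check free space on the snapshot partition before starting work, and surface “partition full” as an explicitly handled, retryable failure rather than an unhandled one.

```go
// Hypothetical sketch, not flyd's actual code: a snapshot step that checks
// free space on the snapshot partition up front and reports "out of space"
// as a handled, retryable failure instead of wedging partway through.
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// ErrNoSpace signals that the step failed because the partition is full;
// a real FSM would map this to a "reclaim space, then retry" transition.
var ErrNoSpace = errors.New("snapshot partition out of space")

// freeBytes reports free space on the filesystem containing path (Linux).
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

// snapshotStep refuses to start a snapshot that can't fit.
func snapshotStep(snapshotDir string, estimatedSize uint64) error {
	free, err := freeBytes(snapshotDir)
	if err != nil {
		return fmt.Errorf("statfs %s: %w", snapshotDir, err)
	}
	if free < estimatedSize {
		return fmt.Errorf("%w: need %d bytes, have %d", ErrNoSpace, estimatedSize, free)
	}
	// ... actually take the snapshot here ...
	return nil
}

func main() {
	// Assumed path and a 10 GiB size estimate, purely for illustration.
	err := snapshotStep("/var/lib/snapshots", 10<<30)
	if errors.Is(err, ErrNoSpace) {
		fmt.Println("handled: reclaim space and retry:", err)
		return
	}
	if err != nil {
		fmt.Println("unhandled failure:", err)
		return
	}
	fmt.Println("snapshot ok")
}
```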
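
For the May 28 AMS incident: the post doesn’t say exactly how the udev weirdness manifested inside flyd, so this is a hedged sketch assuming the hang was a wait on udev device events; the `udevadm settle` call and the timeout values are illustrative. The point is that any such wait during a Machine start gets a hard deadline, so a misbehaving base image fails the start loudly instead of hanging it.

```go
// Hypothetical sketch, not flyd's actual fix: bound any wait on udev during
// a Machine start with a hard deadline so a misbehaving udev setup fails the
// start quickly instead of hanging it indefinitely.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// waitForUdev asks udev to finish processing queued events, but gives up
// when ctx's deadline expires so callers never block forever.
func waitForUdev(ctx context.Context) error {
	// "udevadm settle" blocks until the udev event queue is empty; --timeout
	// is belt-and-suspenders alongside the context deadline.
	cmd := exec.CommandContext(ctx, "udevadm", "settle", "--timeout=10")
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("udev did not settle: %w", err)
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	if err := waitForUdev(ctx); err != nil {
		// In an orchestrator, this would fail the Machine start with a clear
		// error (and page someone) rather than leaving the start wedged.
		fmt.Println("machine start aborted:", err)
		return
	}
	fmt.Println("devices ready; continuing machine start")
}
```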