Trailing 2 Weeks Incidents
(Incident timeline graphic: larger boxes are longer incidents, darker boxes are higher-impact.)
April 14: API Outage (6:00EST): Internal team members notice 503s from the central Fly.io GraphQL API, which drives our web dashboard and much of flyctl, our CLI; both customer monitoring and most new deployments are unavailable while it's down. After 10 minutes of investigation, new capacity is added to the Rails API server, as a sort of best-practices move, and the situation quickly resolves. 50th-percentile metrics for GraphQL mutations show a spike for WireGuard configuration requests; it turns out a misconfigured mutex lock was bottlenecking the server (adding capacity spread out the load and relieved the bottleneck situationally; fixing the locks resolved it completely).
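To make the lock problem concrete: the Rails specifics aren't in this post, so here's a minimal Go sketch (the names and structure are ours, not the API's) of the difference between holding one mutex across an entire expensive operation and scoping it to just the shared state it actually protects:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"sync"
)

// wireGuardConfigs is a hypothetical shared registry of peer configs,
// standing in for whatever shared state the real API guards.
var (
	mu               sync.Mutex
	wireGuardConfigs = map[string]string{}
)

// addPeerCoarse holds the lock for the whole operation, including the
// expensive key generation, so concurrent requests serialize behind it.
func addPeerCoarse(name string) string {
	mu.Lock()
	defer mu.Unlock()
	key := generateKey() // slow work done while holding the lock
	wireGuardConfigs[name] = key
	return key
}

// addPeerScoped does the slow work outside the critical section and only
// locks around the map write, which is the part that needs protection.
func addPeerScoped(name string) string {
	key := generateKey()
	mu.Lock()
	wireGuardConfigs[name] = key
	mu.Unlock()
	return key
}

// generateKey stands in for an expensive key-generation step.
func generateKey() string {
	buf := make([]byte, 32)
	rand.Read(buf)
	return hex.EncodeToString(buf)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			addPeerScoped(hex.EncodeToString([]byte{byte(n)}))
		}(i)
	}
	wg.Wait()
}
```

With the coarse version, adding capacity helps only because fewer requests contend for each server's single lock; the scoped version removes the contention at its source.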
April 14: Sporadic Orchestration Failures (11:30EST): A small number of customers report Fly Machines that won't boot. Error logging pinpoints a problem with containerd's "leasing" management of Device Mapper devices, which results in flyd (our orchestrator) getting out of sync with containerd and using stale block devices to attempt boots. There isn't an acute phase to this incident so much as an escalation of a slow-burn set of reported problems; it's mitigated by ad-hoc tooling within an hour, and several days of engineering work follow.
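We won't reproduce the ad-hoc tooling here, but as a rough illustration of the kind of check involved, here's a minimal sketch against containerd's Go client (the socket path, namespace, and staleness threshold are assumptions, not flyd's actual logic) that enumerates leases and flags old ones that might be pinning stale Device Mapper devices:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Connect to containerd over its default socket; the namespace is a guess.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "default")

	// List all leases and flag ones older than an arbitrary threshold;
	// a long-lived lease can keep snapshot resources alive that the
	// orchestrator no longer knows about.
	ls, err := client.LeasesService().List(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range ls {
		age := time.Since(l.CreatedAt)
		if age > 24*time.Hour {
			fmt.Printf("possibly stale lease %s (age %s, labels %v)\n", l.ID, age, l.Labels)
		}
	}
}
```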
April 17: Data Center Maintenance in GIG and QRO (12:30EST): GIG is Rio de Janeiro, and QRO is Querétaro, if you're wondering. There wasn't any interactive/operations work on our side for this one; it's an incident we declared internally as a placeholder because our upstream was doing some planned maintenance.