Trailing 2 Weeks Incidents
(Incident timeline chart: larger boxes are longer incidents, darker boxes are higher-impact.)
May 12: API Outage (10:00EST): Metrics alerts, customers, and our support team alerted us, all in quick succession, to our main dashboard returning 500 errors. Some instances of the Rails app powering that API were buckling under a load of WireGuard peer creations, exacerbated by a broken haproxy configuration that was taking instances offline as unhealthy when health checks flapped. The immediate cause of the incident was a Fly Kubernetes orchestration failure (it was stuck in a loop without a rate limit). Central API and UI were unavailable for about 20 minutes, and degraded for a short time after that.
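The "stuck in a loop without a rate limit" part is the lesson here, so here's a minimal sketch (in Go, which is not necessarily what the orchestration code is written in) of the general shape of the fix: put a token bucket in front of the reconciliation loop so a wedged controller can't hammer the central API. The createWireGuardPeer call and the rates are made up for illustration.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"golang.org/x/time/rate"
)

// createWireGuardPeer stands in for the real API call; the actual
// FKS/WireGuard plumbing is not described in this post.
func createWireGuardPeer(ctx context.Context, name string) error {
	return errors.New("upstream API unavailable")
}

func reconcile(ctx context.Context, pending []string) {
	// Token bucket: at most 5 peer creations per second, bursts of 10.
	// Numbers are illustrative, not Fly.io's.
	limiter := rate.NewLimiter(rate.Limit(5), 10)

	for _, name := range pending {
		// Wait blocks until the limiter allows another call, so a retry
		// loop that never succeeds can't flood the central API.
		if err := limiter.Wait(ctx); err != nil {
			return // context cancelled or deadline exceeded
		}
		if err := createWireGuardPeer(ctx, name); err != nil {
			log.Printf("creating peer %s: %v (will retry next pass)", name, err)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	reconcile(ctx, []string{"peer-a", "peer-b", "peer-c"})
}
```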
May 14: Excessive 408 Errors From Machines API in FRA (10:30EST): Incident included here for completeness: a large customer reported 408 errors from Machines API operations. They do some of their own orchestration with our API, and are thus especially strenuous users. Two underlying issues, neither of which would ordinarily rise to the level of a global incident: first, they were concentrating all their fire on a region-local instance of the Machines API endpoints, which aren’t load balanced (easy fix: point hardcore API clients at api.machines.dev); second, the operations they were requesting all pertained to a host that had experienced a hardware failure. More a customer support issue than an infrastructure issue, but we declared an incident, so here you go.
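For reference, pointing a client at the load-balanced endpoint is a one-line change. Here's a minimal Go sketch of listing machines through api.machines.dev; it assumes the documented list-machines path and a FLY_API_TOKEN in the environment, so treat it as illustrative rather than canonical.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	app := "my-app" // hypothetical app name

	// api.machines.dev is the globally load-balanced front door; the
	// region-local endpoints are what got this customer into trouble.
	url := fmt.Sprintf("https://api.machines.dev/v1/apps/%s/machines", app)

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("FLY_API_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```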
May 15: Orchestration Instability in NRT and HKG (11:30EST): Health alerts on worker hosts in NRT and HKG alerted us that flyd, our orchestration server, was not running. It came back up and was functioning again within about 20 minutes. We’re still investigating this, but our current belief is that this was due to very high packet loss in the regions, which kept flyd from being able to talk to Consul.
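To make "talk to Consul" concrete: flyd keeps up a running conversation with Consul, and sustained packet loss turns those calls into timeouts. The sketch below, using the stock hashicorp/consul/api client in Go, shows the kind of leader probe with backoff you'd want so that a lossy network degrades into retries rather than a dead daemon; it is not flyd's actual code.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Default config talks to the local agent on 127.0.0.1:8500; real code
	// would also set a client-side HTTP timeout so calls fail fast.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	backoff := time.Second
	for {
		// Status().Leader() is a cheap "can I reach the cluster at all?" probe.
		leader, err := client.Status().Leader()
		if err == nil && leader != "" {
			log.Printf("consul reachable, leader=%s", leader)
			backoff = time.Second
		} else {
			// Degrade and retry rather than crashing; a separate health
			// alert fires if this persists (roughly what paged us).
			log.Printf("consul unreachable (err=%v), retrying in %s", err, backoff)
			if backoff < 30*time.Second {
				backoff *= 2
			}
		}
		time.Sleep(backoff)
	}
}
```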
May 15: Token Verification Issues (23:30EST): Oh, boy. Inciting issue: sporadic failures verifying Machines API authentication tokens. Extensive debugging followed, including looking at scaling our token issuer. Underlying issue: nothing at all to do with tokens; instead, we had just rolled out a structural change to how we do state management in the proxy (“state management in the proxy” are the scariest 5 words in the Fly.io lexicon). Massive blog post about this bug, which took the networking team past the brink of insanity, to follow. Because we watchdog the proxy now, the issue should have had minimal customer impact.
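"We watchdog the proxy now" is carrying a lot of weight in that last sentence. The real watchdog is internal to our proxy deployment and isn't described here, but the shape of the idea is simple: a supervisor owns the process, probes a health endpoint, and restarts it when the probe fails repeatedly. A toy Go version, with a made-up binary path and health URL:

```go
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

const (
	binary    = "/usr/local/bin/fly-proxy"      // hypothetical path
	healthURL = "http://127.0.0.1:9999/healthz" // hypothetical probe endpoint
	maxFails  = 3
)

func healthy() bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(healthURL)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	for {
		cmd := exec.Command(binary)
		if err := cmd.Start(); err != nil {
			log.Printf("start failed: %v; retrying in 5s", err)
			time.Sleep(5 * time.Second)
			continue
		}
		log.Printf("proxy started, pid %d", cmd.Process.Pid)

		fails := 0
		for fails < maxFails {
			time.Sleep(5 * time.Second)
			if healthy() {
				fails = 0
			} else {
				fails++
			}
		}

		// Probe failed repeatedly: kill and restart rather than letting a
		// wedged-but-alive proxy keep (not) serving traffic.
		log.Printf("health probe failed %d times, restarting proxy", maxFails)
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
	}
}
```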
May 16: Consul Cluster Failure (12:30EST): We’re not especially reliant on Consul anymore, which is good because Consul is a bear in our environment. Two Consul servers lost disks at the same time, and we immediately started the process of replacing them. A short while later, the Consul cluster appeared unstable, which disrupted Vector and NATS across our fleet; those two services are responsible for internal telemetry, plus customer metrics and logs. Two underlying problems: first, in the process of syncing with the cluster, the newly provisioned Consul servers were OOM-ing. Second, and more painfully, the stumbling new Consul servers were answering queries before being completely synced, which put us in a split-brain situation. Problem was resolved by redeploying with IP filters blocking query access to the new servers, so they could fully synchronize before being made available. About 30-45 minutes of log disruption, followed by 3 hours of deployment work that, owing to the filters, shouldn’t have disrupted anything for customers; a new trick in our bag of Consul tricks: Consul works best when you don’t let anything talk to it.
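That last trick generalizes: don't let a rebuilt Consul server answer queries until it has actually caught up. We did it with IP filters and a redeploy; the Go sketch below expresses the same gate in code, polling autopilot's per-server health through the standard hashicorp/consul/api client and only then lifting the block. liftQueryBlock and the server names are placeholders, not anything we actually run.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

// liftQueryBlock is a placeholder for removing the IP filter in front of a
// newly provisioned server; in our case that was a redeploy.
func liftQueryBlock(server string) {
	log.Printf("unblocking queries to %s", server)
}

func main() {
	// Newly provisioned servers, mapped to whether we've unblocked them yet.
	newServers := map[string]bool{"consul-new-1": false, "consul-new-2": false}

	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Autopilot reports per-server raft health, including whether each
		// server is considered healthy (caught up and responsive).
		reply, err := client.Operator().AutopilotServerHealth(nil)
		if err != nil {
			log.Printf("autopilot health check failed: %v", err)
			time.Sleep(10 * time.Second)
			continue
		}

		for _, s := range reply.Servers {
			if unblocked, tracked := newServers[s.Name]; tracked && s.Healthy && !unblocked {
				liftQueryBlock(s.Name)
				newServers[s.Name] = true
			}
		}

		done := true
		for _, unblocked := range newServers {
			done = done && unblocked
		}
		if done {
			log.Print("all new servers synced and unblocked")
			return
		}
		time.Sleep(10 * time.Second)
	}
}
```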