2024-08-24

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents.)

  • August 19: Internal Grafana Outage (10:00EST): No customer impact. Our internal instance of Grafana was brought down by a full disk, a consequence of a bad default that stored graph annotations indefinitely on dashboard graphs; we add those annotations programmatically and they crudded up our storage. This was an internal-only resource; customer Grafana wasn’t impacted. If anything, this was a win for customer Grafana, since we’ve now gotten better at gnarly Grafana restoration scenarios.

  • August 20: Sporadic Authentication Failures With GitHub SSO (13:00EST): While porting SSO API code from our legacy Rails codebase to Elixir, we added a validation on authentication events that checked for profile names; those names aren’t always present in GitHub authentication flows. We pulled the validation, and the GitHub SSO issues were resolved within an hour.

  • August 22: Fly Machines API Disruption On Gateways (06:00EST): No customer impact (we think). Alerting notified us that some of the Fly Machines API servers (flaps) running on our gateway hosts, the servers we use to terminate WireGuard connections for customers, were stuck in a restart loop. The root cause was a previously deployed configuration change having to do with Vault; gateway servers are extremely stable and change rarely, so the problem had only just surfaced. Normal flyctl API patterns don’t use these servers.
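
The GitHub SSO failure above comes down to a validation that treated an optional field as required. Here is a minimal sketch of the safer shape, in Python rather than the Elixir the real code uses. The `AuthEvent` type, its field names, and both helper functions are hypothetical and exist only for illustration; the actual fix was simply to pull the validation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuthEvent:
    # Hypothetical shape of an SSO authentication event; not the real schema.
    provider: str
    uid: str
    name: Optional[str] = None  # profile names are not always present


def validate_auth_event(event: AuthEvent) -> List[str]:
    """Return validation errors; an empty list means the event is accepted."""
    errors = []
    if not event.provider:
        errors.append("provider is required")
    if not event.uid:
        errors.append("uid is required")
    # Deliberately no check on event.name: requiring it would reject
    # otherwise valid logins that arrive without a profile name.
    return errors


def display_name(event: AuthEvent) -> str:
    """Fall back to the provider uid when no profile name was supplied."""
    return event.name or event.uid
```

The point of the sketch is the split: identity fields are validated, and presentation fields get a fallback instead of a hard failure.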

This Week In Infra Engineering

Dusty put Fly.io-authored Ansible provisioning for our upstream hardware providers into motion, getting us close to capping off a project to streamline the provisioning of new physicals. He also set up IPMI sensor alerting across our fleet, so we can detect hardware issues early (our fleet is now large enough that hardware failures, while rare as a fraction of the fleet, aren’t totally out of the ordinary). Early alerting matters more now: with the migration project complete, we’re in a much better position to preemptively migrate workloads.

Somtochi is back in Corrosion, responding to incidents from a couple of weeks ago. Changes include queue caps (one incident was caused by an unbounded queue of changes from nodes) and a fix for a bug that caused Corrosion to request way more data than it needs when syncing a new node with the fleet. She also set up sampled OTel telemetry for fly-proxy (fly-proxy telemetry is tricky because of the enormous volume of requests we handle).
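
To make "queue caps" concrete, here is a minimal sketch of a bounded queue in Python. It is not Corrosion's code (Corrosion is written in Rust), and the `BoundedChangeQueue` name and the reject-newest policy are assumptions: the log does not say what Corrosion does when its cap is hit.

```python
from collections import deque


class BoundedChangeQueue:
    """FIFO queue that rejects new items once `max_len` is reached.

    Hypothetical illustration only; rejecting the newest item is an
    assumed policy, not a description of Corrosion's behavior.
    """

    def __init__(self, max_len: int):
        if max_len <= 0:
            raise ValueError("max_len must be positive")
        self._items = deque()
        self._max_len = max_len
        self.dropped = 0  # count of rejected items, useful as a metric

    def offer(self, change) -> bool:
        """Enqueue `change`; return False (and count a drop) if full."""
        if len(self._items) >= self._max_len:
            self.dropped += 1
            return False
        self._items.append(change)
        return True

    def poll(self):
        """Dequeue the oldest change, or return None if empty."""
        return self._items.popleft() if self._items else None

    def __len__(self) -> int:
        return len(self._items)
```

The trade-off is that memory use stays bounded at the cost of rejecting work under load, so anything built this way needs another path to recover the rejected changes.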

Tom did important top-secret work that we are not in a position to share but that will one day be very fun to talk about. Read between the lines. Also, during the previous week, which Peter monopolized with the petertron4000, Tom did a bunch of Postgres monitoring work, because we’re gearing up to get a lot more serious about managing Postgres.

Peter mitigated a longstanding compatibility issue between us and Cloudflare. Either our HTTP/2 implementation (which is just, like, Rust hyperium or something) or theirs is doing something broken; when we have problems, we now automatically downgrade to HTTP/1.1 for their source IPs.
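
One way to picture a per-source downgrade is to pick which protocols to offer during ALPN negotiation based on the client's address. The Python sketch below assumes a static list of networks and assumes the decision happens at ALPN time; neither is confirmed by this log, which only says the downgrade is automatic. The networks shown are documentation placeholders (RFC 5737), not real Cloudflare ranges, and `alpn_protocols_for` is a hypothetical name.

```python
import ipaddress
from typing import List

# Placeholder networks from the RFC 5737 documentation blocks.
# These are NOT real Cloudflare ranges; substitute a maintained list.
DOWNGRADE_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def alpn_protocols_for(source_ip: str) -> List[str]:
    """Return the ALPN protocol IDs to offer a client at `source_ip`.

    Clients inside a downgrade network are offered only HTTP/1.1;
    everyone else is offered HTTP/2 with HTTP/1.1 as a fallback.
    """
    addr = ipaddress.ip_address(source_ip)
    # Membership tests across IP versions (an IPv6 address against an
    # IPv4 network) evaluate to False, so mixed lists are safe.
    if any(addr in net for net in DOWNGRADE_NETWORKS):
        return ["http/1.1"]
    return ["h2", "http/1.1"]
```

For example, `alpn_protocols_for("192.0.2.10")` yields only `"http/1.1"`, while an address outside the listed networks still gets `"h2"` first.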