Trailing 2 Weeks Incidents
(Incident timeline graphic: larger boxes are longer incidents; darker boxes are higher-impact.)
January 14: Son Of Machine Creation Failures in London (06:00 EST): A repeat of the January 10 API failure; once again, a worker server in London lost connectivity. The worker was made ineligible for new workloads immediately, ending the acute phase of the incident, but complete resolution took most of the day, so we’re calling this a multi-hour event despite the minimal impact. Leaving out a bunch of time-consuming blind alleys, it turned out a stale systemd unit configuration entry on one of our proxy components meant that, after a reboot, workers couldn’t automatically restart the component responsible for routing to internal services. The impacted servers had all been recently rebooted. A one-line configuration change resolved this fleetwide.

January 15: Regional Blue-Green Deployment Failures (08:30 EST): Customers in a small subset of our regions (BOM and OTP in particular), among them our own fullstack team, noted that blue-green deployments were getting stuck on health check failures for “green” machines. We identified a rolling-deployment workaround within an hour or so, and then spent several hours diagnosing the root cause. Like the most annoying of our issues, this one was region-specific, so a lot of our investigation ran aground on “works for me” problems. What it ended up being: we’re in the process of moving all our health-check infrastructure out of HashiCorp Consul and into Corrosion, our in-house statekeeping system, and that cut-over is mostly but not completely deployed. After a recent change, a blue-green deployment that spanned both cut-over and not-yet-cut-over regions would see broken health checks (there’s a sketch of the failure mode below). Resolved fully about two hours later.

January 16: API Failures (12:30 EST): A flood of synthetic alerts came in indicating people were getting 503s from our API. A recent deployment of the API server had failed, and our CI/CD system left it in an inconsistent state. Status paged, then resolved within 10 minutes.
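To make the January 15 failure mode concrete, here’s a minimal, entirely hypothetical sketch (none of these names are our real code) of a health gate that reads from whichever statekeeping system a region has been cut over to; when a change writes new health checks only to Corrosion, any region still reading Consul never sees the “green” machines go healthy:

```go
// Hypothetical sketch: a blue-green health gate that reads machine health
// from whichever store a region has been cut over to. Names (healthSource,
// corrosionHealthy, consulHealthy) are illustrative, not Fly.io's real code.
package main

import "fmt"

type source int

const (
	consul source = iota
	corrosion
)

// Which store each region's health checks have been cut over to.
var healthSource = map[string]source{
	"iad": corrosion, // cut over
	"bom": consul,    // not yet cut over
}

// After the (hypothetical) recent change, new machines' health checks land
// only in Corrosion; regions still reading Consul never see them.
var corrosionHealthy = map[string]bool{"green-iad": true, "green-bom": true}
var consulHealthy = map[string]bool{}

func machineHealthy(region, machine string) bool {
	if healthSource[region] == corrosion {
		return corrosionHealthy[machine]
	}
	return consulHealthy[machine]
}

func main() {
	// A blue-green deploy spanning both kinds of region: the gate waits
	// forever on bom, because bom reads from the store that never received
	// the green machine's health checks.
	for _, d := range []struct{ region, machine string }{
		{"iad", "green-iad"},
		{"bom", "green-bom"},
	} {
		fmt.Printf("%s/%s healthy: %v\n", d.region, d.machine, machineHealthy(d.region, d.machine))
	}
}
```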
This Week In Engineering
Lots of little stuff.
Somtochi spent the week hunting a Corrosion bug that caused periodic stalls, and, long story short, it was an IOPS contention issue with a Corrosion server deployed alongside a particularly demanding customer application. OK, moving on.
JP worked out a runbook for a flyd database capacity issue. Recall that every worker server in our fleet is the sole source of truth for the workloads running on it (we run a “decentralized” orchestrator); that source of truth is a BoltDB kept by flyd, our primary orchestrator service. That database records every transition that happens on every Fly Machine on the server (basic operations like starting or stopping a Fly Machine might incur many such transitions under the hood), and, over time, it accumulates garbage, gradually slowing flyd down. We now have alerting for flyd database operations that exceed a time threshold, and a runbook for compacting the database when they do. Boring, but good. Moving on.
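The runbook itself isn’t reproduced here, but since flyd’s state lives in BoltDB, compaction looks roughly like the following sketch using go.etcd.io/bbolt’s Compact helper; the paths, the ~1 MB transaction size, and the “swap it in while flyd is stopped” step are my assumptions, not the actual procedure.

```go
// Minimal sketch of compacting a bbolt database by copying it into a fresh
// file, which drops accumulated free/garbage pages. Paths are illustrative;
// this is not flyd's real database location or runbook.
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	const srcPath = "/var/lib/flyd/state.db"           // assumed path
	const dstPath = "/var/lib/flyd/state.db.compacted" // assumed path

	src, err := bolt.Open(srcPath, 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := bolt.Open(dstPath, 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// Copy every bucket and key into the fresh file, committing in ~1 MB
	// transactions; the copy holds the same data with none of the garbage.
	if err := bolt.Compact(dst, src, 1<<20); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote compacted copy to %s; swap it into place during maintenance", dstPath)
}
```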
Will bit off all his fingernails keeping tabs on the rollout of our new CPU scheduling. We’re writing a blog post about this. Will’s also working on a long-term infra white whale of ours, which is providing stable private IPv6 addresses for Fly Machines even as they’re migrated to new hardware; this is difficult because our IPv6 addresses encode hardware addresses as part of our routing discipline. There’s some neat engineering happening here.
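Fly’s real 6PN address layout isn’t spelled out here, so this sketch invents one purely to show the shape of the problem: when bits of the address identify the physical host, migrating a Machine to new hardware changes its address unless something rewrites or translates it.

```go
// Illustrative only: a made-up private IPv6 layout that embeds a host
// identifier in the address, to show why hardware migration breaks address
// stability. This is NOT Fly.io's actual 6PN address format.
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// privateAddr packs a prefix byte, a host ID, and a machine ID into a
// 128-bit address, so routing can forward on the embedded host bits.
func privateAddr(hostID uint32, machineID uint32) netip.Addr {
	var a [16]byte
	a[0] = 0xfd // ULA-style prefix byte, arbitrary here
	binary.BigEndian.PutUint32(a[4:8], hostID)
	binary.BigEndian.PutUint32(a[12:16], machineID)
	return netip.AddrFrom16(a)
}

func main() {
	const machine = 0x2a
	before := privateAddr(101, machine) // machine on worker host 101
	after := privateAddr(202, machine)  // same machine, migrated to host 202

	// Same workload, different address: anything holding the old address
	// (peers, DNS caches, app config) now points at the wrong host.
	fmt.Println("before migration:", before)
	fmt.Println("after migration: ", after)
}
```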
Kaz has been leading our (P)reventative (H)ost (M)aintenance (P)roject (P.H.M.P.), which addresses a problem we have on large AMD server hardware: because of a firmware bug, those machines can lock up if their uptime exceeds a (very large) threshold, which some of our worker servers do, because we go way out of our way not to reboot them. We’ve been working through a hit list of servers we’ve succeeded “too” well with, deliberately performing maintenance reboots on them; that means migrating workloads off each one first, because modern servers take forever to reboot (which is exactly why we avoid rebooting them in the first place). Anyways, that’s what Kaz’s life is now: rebooting lots of servers.
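For flavor, here’s the kind of trivial check that builds that hit list: read a host’s uptime and flag it once it gets within a safety margin of the lockup threshold. The threshold and margin values below are invented, not the real firmware numbers.

```go
// Hypothetical sketch: flag hosts whose uptime is creeping toward a firmware
// lockup threshold. The threshold and margin are made up for illustration.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

const (
	lockupThreshold = 1000 * 24 * time.Hour // invented value
	safetyMargin    = 60 * 24 * time.Hour   // reboot well before the cliff
)

// hostUptime reads the first field of /proc/uptime (seconds since boot).
func hostUptime() (time.Duration, error) {
	b, err := os.ReadFile("/proc/uptime")
	if err != nil {
		return 0, err
	}
	secs, err := strconv.ParseFloat(strings.Fields(string(b))[0], 64)
	if err != nil {
		return 0, err
	}
	return time.Duration(secs * float64(time.Second)), nil
}

func main() {
	up, err := hostUptime()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if up > lockupThreshold-safetyMargin {
		fmt.Printf("uptime %v: schedule drain + maintenance reboot\n", up)
		return
	}
	fmt.Printf("uptime %v: fine for now\n", up)
}
```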
Peter is working on regionalizing fly-proxy, which means altering our Anycast routing system so that regions no longer need to maintain state for individual workloads in every other region, a change that addresses two of our most severe outages. But Peter was also mostly on winter vacation during this period, so we’ll check up on him again next week.
Dusty gave me an update that had something like 9 animated emojis in it, about addressing flyd reliability issues. The long and the short of it: he did a study of all recurrent logged errors from flyd and root-caused each of them in turn, regardless of whether metrics and active health checks were nominal for the impacted flyd instances. He found a bunch of Fly Machines in inconsistent states, a lot of machines with dedicated swap block devices that were unhappy, and some edge servers that were only partially provisioned (we have a lot of edge servers, and our system generally routes around janky edges). Also: he found a bug in our metrics reporting code that had our workers continuing to report metrics for dead Fly Machines; once the fix was deployed, it produced some happy graphs of metrics volume.
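The metrics bug isn’t described in detail, so here’s a hedged sketch with invented types: an exporter that skips Fly Machines that have been destroyed, so dead Machines stop producing time series.

```go
// Illustrative sketch (invented types and names): only export metrics for
// Fly Machines that still exist, so destroyed Machines stop producing stale
// time series. Not the real flyd metrics code.
package main

import "fmt"

type machineState int

const (
	stateCreated machineState = iota
	stateStarted
	stateStopped
	stateDestroyed
)

type machine struct {
	id    string
	state machineState
	cpuMS int64
}

func exportMetrics(machines []machine) {
	for _, m := range machines {
		// The bug, roughly: emitting for every machine the worker had ever
		// seen. The fix: skip machines that no longer exist.
		if m.state == stateDestroyed {
			continue
		}
		fmt.Printf("fly_machine_cpu_ms{machine=%q} %d\n", m.id, m.cpuMS)
	}
}

func main() {
	exportMetrics([]machine{
		{id: "e28650db", state: stateStarted, cpuMS: 1234},
		{id: "91b2f0aa", state: stateDestroyed, cpuMS: 98}, // no longer reported
	})
}
```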