2024-06-29

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer-running incidents, darker boxes are higher-impact ones; see last week’s bulletin for details about the top row of incidents).

  • June 25: Authorization Errors With NATS Log Shipping (15:00EST): A customer told our support team about an “Authorization Error” they were getting when connecting to NATS (via our internal API endpoint) to ship logs. (Log shipping is a platform feature, normally used with the Fly Log Shipper, that lets users send their Fly.io platform logs to an off-network log management system.) As it turned out, we’d just done some work tightening up token handling in our internal API server, and missed a corner case: users shipping logs with fully-privileged Fly.io API tokens (don’t do this!). It took about 30 minutes to deploy a fix. (There’s a sketch of what that NATS connection looks like just after this list.)

  • June 27: 502s on Some Edges Due To Corrosion Reseeding (13:00EST): Our monitoring picked up HTTP 502 errors from some of our apps, which we tracked down to stale data in Corrosion, our distributed state tracking system. We’d recently done major maintenance on Corrosion’s database, and it had knocked out Corrosion on a small number of our edges, causing them to miss updates for about 30 minutes. The underlying issue was resolved relatively quickly, but a corner-casey interaction with blue/green deploys left several apps (roughly 10) that deployed during the outage stuck in a bad state, which we had to reconcile manually over the next hour.

  • June 28: 502s in Sao Paulo (16:45EST): About 5 apps, including our own Elixir app, saw sharply elevated HTTP 502 errors, which we again traced to stale Corrosion data, possibly left over from the previous day’s work. We mitigated the issue by resyncing our proxy and Corrosion, which cut the error rate roughly 30-fold but didn’t eliminate it; we then narrowed the remaining errors down to a particular GRU edge server and stopped advertising it, which eliminated the problem. We’re still investigating what happened here.
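
For context on that June 25 incident, here’s roughly what the customer side of log shipping looks like: connect to the platform’s internal NATS endpoint with an org slug as the username and an access token as the password, then subscribe to log subjects. This is a minimal sketch using the nats.go client, not the Fly Log Shipper itself; the endpoint address, subject pattern, and environment variable names follow the Log Shipper’s conventions but should be treated as illustrative.

```go
// Minimal log-shipping client sketch (illustrative, not the Fly Log
// Shipper): authenticate to the internal NATS endpoint with an org
// slug and an access token, then read log messages.
package main

import (
	"log"
	"os"

	"github.com/nats-io/nats.go"
)

func main() {
	// The internal NATS log endpoint reachable over a Fly.io private network.
	url := "nats://[fdaa::3]:4223"

	nc, err := nats.Connect(url,
		// Username: org slug. Password: an API token. Use a narrowly
		// scoped token here, not a fully-privileged one; the June 25 bug
		// was in how the API server handled that corner case.
		nats.UserInfo(os.Getenv("ORG"), os.Getenv("ACCESS_TOKEN")),
	)
	if err != nil {
		log.Fatalf("connect: %v", err) // token problems surface here as authorization errors
	}
	defer nc.Drain()

	// Log subjects look like logs.<app>.<region>.<instance>.
	sub, err := nc.SubscribeSync("logs.>")
	if err != nil {
		log.Fatal(err)
	}
	for {
		msg, err := sub.NextMsg(nats.DefaultTimeout)
		if err != nil {
			continue // timeouts are normal when there's nothing to read
		}
		log.Printf("%s: %s", msg.Subject, msg.Data)
	}
}
```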

This Week In Infra Engineering

Somtochi rolled out a major change to the way we track distributed state with Corrosion. Because Corrosion is a distributed system (based principally on SWIM gossip) and no distributed system is without sin, we have to carefully watch the amount of space it consumes; updates are relatively easy to propagate, but reclaiming the space taken by old, overridden data is difficult. This is the “compaction” problem. Somtochi and Jerome worked out a straightforward scheme for doing compaction, but it required adding an index to a table that had been growing without bound for many months, and building that index would potentially have triggered multi-minute startup lags everywhere Corrosion needed to get reinstalled. Instead of doing that, we “re-seeded” Corrosion: we took a known-good dataset from one of our nodes, compacted it, and used it as the basis for new Corrosion databases. This was rolled out on many hundreds of hosts without event, and on a small number of edge servers (which have much slower disks) with some events, which you just read about above.
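
To make “compaction” a little more concrete, here’s a toy sketch of the problem with a completely made-up schema (this is not Corrosion’s actual data model): state arrives as append-only versioned rows, compaction deletes every row superseded by a newer version of the same key, and without an index on (key, version) that delete has to scan a table that never stops growing.

```go
// Toy model of the compaction problem. The schema is hypothetical,
// not Corrosion's; the point is the shape of the delete and why the
// index matters.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "state.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Gossip keeps appending versioned rows; nothing ever shrinks this table.
		`CREATE TABLE IF NOT EXISTS updates (
		   key     TEXT    NOT NULL,
		   version INTEGER NOT NULL,
		   value   BLOB
		 )`,
		// The kind of index that's cheap on a fresh database and a
		// multi-minute stall on one that's grown unbounded for months.
		`CREATE INDEX IF NOT EXISTS updates_key_version ON updates (key, version)`,
		// Compaction: drop every row superseded by a newer version of the same key.
		`DELETE FROM updates
		  WHERE version < (SELECT MAX(v.version) FROM updates v WHERE v.key = updates.key)`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```

Re-seeding sidesteps paying for that index build on every host: you compact one copy, then hand the result out as the starting database everywhere else.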

Akshit worked on improving the metrics we’re using for bandwidth billing, putting us in a position to true up bandwidth accounting by more carefully tracking inter-region (like, Virginia to Frankfurt) traffic, especially for users with app clusters where only some of the apps have public addresses. You’ll hear more about this from us! This is just the infra side of that work.
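
For a sense of what “more carefully tracking” means mechanically: each measured flow gets tagged with both endpoints’ regions, and region pairs are tallied separately so inter-region traffic can be priced differently from traffic that stays inside a region. The sketch below is purely illustrative; the types and region handling are made up, not our actual metrics pipeline.

```go
// Illustrative only: a made-up model of per-region-pair bandwidth
// accounting. Real billing is driven by our metrics pipeline; the
// types and sample numbers here are hypothetical.
package main

import "fmt"

type Flow struct {
	SrcRegion string // e.g. "iad" (Virginia)
	DstRegion string // e.g. "fra" (Frankfurt)
	Bytes     int64
}

// Tally accumulates bytes per (src, dst) region pair so inter-region
// traffic can be accounted for separately from intra-region traffic.
func Tally(flows []Flow) map[[2]string]int64 {
	totals := make(map[[2]string]int64)
	for _, f := range flows {
		totals[[2]string{f.SrcRegion, f.DstRegion}] += f.Bytes
	}
	return totals
}

func main() {
	flows := []Flow{
		{"iad", "fra", 5 << 30}, // 5 GiB Virginia -> Frankfurt
		{"iad", "iad", 2 << 30}, // 2 GiB staying within Virginia
	}
	for pair, bytes := range Tally(flows) {
		kind := "intra-region"
		if pair[0] != pair[1] {
			kind = "inter-region"
		}
		fmt.Printf("%s -> %s: %d bytes (%s)\n", pair[0], pair[1], bytes, kind)
	}
}
```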

After Peter wrote a brief postmortem of an incident from last week, Ben Ang worked out a system to more carefully track deployments of internal components, especially when those deployments happen piecemeal as opposed to full-system redeploys. Since the first question you ask when you’re handling an incident is “what changed”, anything that gives us quicker answers also gives us shorter incidents.
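
The mechanism is less interesting than the payoff, but to make it concrete: every rollout of an internal component, including a piecemeal one that only touches a few hosts, gets recorded as an event you can filter by time. The sketch below is hypothetical (these aren’t Ben’s actual types); it just shows how “what changed in the last two hours” becomes a lookup.

```go
// Hypothetical sketch of deploy-event tracking, not the actual system:
// record every component rollout (including partial ones) and answer
// the incident-time question "what changed recently?" with a time filter.
package main

import (
	"fmt"
	"time"
)

type DeployEvent struct {
	Component string   // e.g. "fly-proxy", "corrosion"
	Version   string   // build SHA or release tag
	Hosts     []string // piecemeal rollouts list only the hosts that changed
	At        time.Time
}

// ChangedSince returns every deploy event newer than the cutoff,
// which is the first thing you want on an incident call.
func ChangedSince(events []DeployEvent, cutoff time.Time) []DeployEvent {
	var out []DeployEvent
	for _, e := range events {
		if e.At.After(cutoff) {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	now := time.Now()
	events := []DeployEvent{ // hypothetical hosts and versions
		{"fly-proxy", "a1b2c3d", []string{"edge-gru-1"}, now.Add(-30 * time.Minute)},
		{"corrosion", "v0.9.1", []string{"worker-iad-7"}, now.Add(-26 * time.Hour)},
	}
	for _, e := range ChangedSince(events, now.Add(-2*time.Hour)) {
		fmt.Printf("%s %s on %v at %s\n", e.Component, e.Version, e.Hosts, e.At.Format(time.RFC3339))
	}
}
```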

Dusty, John, Simon, and Peter all worked on draining old servers, migrating Fly Machines to newer, faster hardware. This is all we’ve been talking about here for the last month or so, and it’s happening at scale now.

Andres got tipped off by an I/O performance complaint on a Mumbai worker and ended up tracking down a small network of crypto miners. The hosting business; how do you not love it? Andres did other stuff this week, too, but this was the only one that was fun to write about.

Will wrapped up his NATS log-shipping work. We’ll let him tell the story.