Trailing 2 Weeks Incidents
February 16: Platform Outage (04:45EST): From 4:45AM to roughly 7:45AM, we experienced a severe outage. Normally, “we experienced an outage” is a passive-voice dodge, but in this instance, we, the subject of the sentence, truly were the recipients of the action. A detailed postmortem follows.
February 18: Excessive CPU On IAD Edge Servers (06:30EST): Internal metrics alerts prompted us to call an incident for edge servers — the servers that terminate incoming traffic from the Internet and hand it off to our Anycast network — in IAD that were seeing excessive CPU usage. The problem impacted only a small number of servers, and didn’t materially impact customers. There’s a longer story to tell about this, but we’ll tell it on the blog. The underlying incident lasted over a day (during which it popped up occasionally on other servers and was resolved each time by restarting the proxy); we have a lot of edge servers, and a lot of edge server capacity, so it wasn’t impactful, but we document everything here.
February 18: API Background Queue Clogged (08:30EST): Another minimal-impact issue. Testing flagged a problem with starting new instances of MPG, our Managed Postgres offering (we haven’t launched this yet; this is all internal stuff). It turned out that Oban, the background queuing system we use for our Elixir APIs, was jammed up: we had an absolutely huge number of stale, completed jobs recorded in the Postgres database backing Oban, as a result of a bad pruning configuration. Manual pruning and configuration tuning resolved the problem. Deployments may have been slower than normal during this incident, but not outside of our normal tolerances.
February 16 Outage Postmortem
Narrative
At approximately 04:45EST on February 16, we experienced a total outage of our IAD region. For logistical reasons having to do with the centrality of Ashburn, Virginia to the proper function of the entire Internet, IAD is our primary API region, so this outage took our API down with it. During the four hours on Sunday morning that IAD was down, our users could neither deploy nor modify Fly Machines or applications in any region of our fleet, and Fly Machines hosted in IAD would not have been available to users.
We record lots of incidents in this log, but severe, sustained outages are uncommon. When they do happen, they tend to be relatively gnarly combinations of distributed systems issues and operations fallibility. In other words, they tend to be interesting to write about.
Not this one. This outage was simple: our IAD upstream provider had a core switch (a switch upstream of our racks, deeply embedded in their regional network) that faulted out, and it had no redundancy. That made recovery an hours-long project rather than a minutes-long project. The switch that failed wasn’t a top-of-rack device in one of our racks, but rather a transit switch.
We operate a global fleet with hardware in over 35 regions. While sustained platform-wide outages are uncommon, regional network cuts are less uncommon. There are parts of the world, especially in Latin America, where it’s harder to run a stable network than it is in London or New York. We would not normally write a detailed postmortem of an outage that cut off GRU or SCL. The Fly.io platform is resilient to whole-region outages in most of the world.
IAD is different; it hosts our APIs and much of our API backing stores. If our IAD data centers are cut off completely, our API won’t function.
We operate out of several providers in IAD, with multiple upstreams. For example, our HashiCorp cluster — the giant servers managing our Consul and Vault deployments — is hosted at Equinix. Some of our worker servers, on which customer workloads run, are also hosted at Equinix. But the bulk of our servers, including the ones that happen to be hosting our Fly APIs (which are Fly Apps themselves), are not at Equinix (the hardware product we take advantage of there is nosebleed expensive, and has been sunset).
Much of the operator/engineer time we spent during this outage was aimed at bringing our APIs up in a “backup” region (EWR, in Secaucus, to be specific). To their credit, our team got us there, just as our upstream provider managed to restore connectivity. But we want to be clear that being able to bring our API up in a region other than IAD is not a business goal of ours. As a general rule, we accept the risk that if IAD falls off the Internet, people won’t be able to deploy Fly Machines.
This sounds weird (even to some of our own engineers). But maintaining an “IAD-resilient” API is not a costless choice. Keeping our core infrastructure constantly ready to bring up in EWR wouldn’t just be expensive and distracting, but would also add further complexity to our infrastructure. If you scroll back (and back, and back) on this log, you’ll see that overall, it’s complexity, and not backhoes in Virginia, that is our real adversary.
Incident Timeline
2025-02-16 04:45EST: Our first-responders on-call rotation receives a flood of pages for elevated levels of 500 errors in our Fly Machines API service.
2025-02-16 04:45EST: Our object storage partners at Tigris place a direct call into our team asking about the IAD outage.
2025-02-16 05:10EST: An incident team has been formed, an incident channel is created, Tigris’s outage is confirmed, our status page is updated, and our upstream provider is notified about the outage.
2025-02-16 05:20EST: The status page is updated to note the total outage in IAD and the unavailability of our API.
2025-02-16 05:25EST: Work begins to attempt the relocation of our API to EWR, and developers associated with the API are being woken up. This work will ultimately not pay off before the network cut is resolved, and is not something we plan on attempting again.
2025-02-16 05:45EST: Our upstream confirms a dead switch, and has dispatched remote hands to its own data center.
2025-02-16 06:45EST: The incident team has brought part of the Fly Machines API up in an alternate data center, at Equinix. Internal deployments of Fly Machines now function, to an extent, through the direct API; flyctl and our web interface do not.
2025-02-16 07:45EST: Our upstream provider gets remote hands to the cabinet with the defunct transit switch. Meanwhile, the incident team is using its internal deployment capabilities to bring up our API server in EWR. The status page is updated to communicate that effort.
2025-02-16 09:10EST: The incident team has most of the API working in EWR, but is hung up on an haproxy configuration. At this point in the outage, we’re relatively close to bringing the API up in a new region, but there are concerns about opening it up to customers with some of its dependencies offline in IAD. The team deliberates for a long time about whether to hazard it.
2025-02-16 09:45EST: Our upstream provider restores connectivity to IAD. The acute phase of the incident is concluded.
2025-02-16 10:15EST: Our status page is updated to indicate that the incident has been mitigated. The team is hypervigilant about misbehavior of systems inside IAD, because a total, sustained cut of IAD is new territory for them, so the incident is held open on the status page for another several hours.
Forward Looking Statements
Throughout this postmortem, we’ve been at pains to be clear that API resilience to a sustained failure in IAD is not part of our service model. We’ve made a strategic choice to simplify our platform by allowing our API to depend on the availability of the IAD region. This means that if a truly epic bolt of lightning strikes Ashburn, Virginia, we might experience a sustained API outage. Existing Fly Machines outside of IAD will continue to run (that kind of resilience is part of our service model), but deployments and modifications will not function.
We’re impressed with and grateful to our infra and platform engineering teams for coming very close to the finish line of migrating our API out of IAD, on the fly, in the span of just a couple of hours. This postmortem provides some detail about that work out of appreciation for the engineering acumen required to pull that off. But we’re generally waving engineers off of the work of making that kind of migration easier or more reliable.
Instead, all of our effort in response to this is about bulletproofing our connectivity in IAD.
There are two broad things we’re doing to prevent future network cuts in our busiest regions, starting with IAD. The first is working with our upstream provider on network engineering, and the second is diversifying our upstream connectivity. Both efforts are active and ongoing; we should see payoffs (expect them in this infra-log) in the coming weeks.
With our upstream, we’re inventorying and auditing the entire network between our metal and the cross connects to transit providers. We were surprised to learn, that Sunday morning, that our upstream had a non-redundant transit switch in their architecture. We’re going hunting for more single points of failure like that one, and we’ll work in partnership with our upstream to ensure those weaknesses are rectified.
In running postmortems with staff from that upstream, we’ve discovered some important process and communications issues that we’ve been able to resolve. Some of what we’ve uncovered is the result of Fly.io growing organically alongside our upstream partner, which has resulted in a less-than-optimal physical architecture for our hardware and theirs. Some of it is also expectations management; much of the rest of their server footprint is operating in CDN configurations, where the cost of a sustained region outage is less-than-optimal page load times. And some of it is the communications process running between our organizations, with staff engineers on our side talking directly to staff on their side, which has the benefit of warding off downtime (“can we move this server?” “no!”) but also the drawback of deferring maintenance (“can we move this ser—” “no!” “—ver to make the power space we need to install a redundant switch here?”).
Because our strategy depends on high availability for IAD, we’re also investing in additional cross-connects (and network paths to them) from other providers. If this were simply a matter of striking up contracts with transit providers, it would be done by now. Unfortunately for us, as just related, we’ve grown organically and messily in the two data centers our upstream operates in, and server numbering is going to complicate transit diversity for us. We’re in the beginning stages of planning out the automated provisioning and renumbering that will make this possible. It’s not a huge lift, but it’s more of a lift than you’d expect.
Every major incident we deal with surfaces internal process issues we can improve. Here, we’re not especially psyched about how coordinated our immediate response to the outage was: it took us something like 20 minutes to have a coherently communicated understanding of what was happening, from the time the initial pages hit our team. This was complicated by the fact that some of our alerting infrastructure has IAD dependencies. While we don’t plan to make our whole API migratable, our alerting infrastructure needs to function (better) no matter what.
This infra-log is a product of our engineering organization, and does a pretty good job of surfacing strong work from our infra, platform, and fullstack engineers. In this specific incident, we’d like to go out of our way to call out Matthew Cunningham, our VP of Finance. Matthew is excellent for any number of reasons, but here in particular he’s taken the lead on driving forward work and investments now underway with our upstream to ensure an incident like this is unlikely to recur. You’d never know, reading his commentary about cabinets, cross connects, and network inventories, that he was a fleece-vest finance person. Thanks, Matthew!