Trailing 2 Weeks Incidents
The infra-log took last week off for Thanksgiving; the trailing two weeks in this update are two “fresh” weeks’ worth of incidents, though basically nothing happened in the first of those weeks.
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
November 21: Log Query Outage (14:00EST): Customers experienced slow queries on our log retention service, which is operated by an upstream partner and backed by object storage. We created this incident (and public status updates) to track their work resolving those slow queries. Our upstream log partner scaled up resources, which apparently resolved the issue.
November 24: Volume Issue On Seattle Host (04:00EST): Internal alerts fired about the integrity of volume groups on a particular Seattle worker physical. This turned out to be an LVM2 reporting anomaly, cleared up in under a minute with a rescan, which we should just automate for the next time this pops up. Zero customer impact.
November 25: All Hell Breaks Loose (15:00EST): A combination of distinct incidents occurred simultaneously, resulting in both deployment and API outages, with two acute periods of near-total severity. A detailed postmortem follows.
November 26: API Outage (15:15EST): A recurrence of the back half of the previous day’s outage, with the same cause, resolved in about 5 minutes; during those 5 minutes, deployments that required the use of our GraphQL API (mostly: for new apps) would have failed.
November 25 Outage Postmortem
Narrative
At approximately 15:00EST on November 25, we experienced a fleetwide severe orchestration outage. What this means is that for the duration of the incident, both deployments of new applications and changes to existing applications were disrupted; during the acute phase of the outage Fly Machines could not be updated; for the back half of the outage, our API was unavailable. Service was restored completely at 02:30EST.
This was a compound outage with two distinct causes and impacts. The first approximately mirrored the October 22nd orchestration outage and involved a distributed systems failure in Corrosion, our state sharing system. The second was a missing rate limit on an expensive API operation which, combined with an error in a customer app, had the effect of denying service to our API. The two outages overlapped chronologically but were resolved serially, extending the duration of the incident.
We’re going to explain the outage, provide a timeline of what happened, and then get into some of what we’re doing to keep anything like it from happening again.
Orchestration is the process by which software from our customers gets translated to virtual machines running on Fly.io’s hardware. When you deploy an app on Fly.io, or when your GitHub Actions CI kicks off an update after you merge to your main branch, you communicate with our orchestration APIs to package your code as a container, ship it to our container registry, and arrange to have that container unpacked as a virtual machine on one or more of our worker servers.
At the heart of our orchestration scheme is a state-sharing system called Corrosion. Corrosion is a globally-synchronized SQLite database that records the state of every Fly Machine on our platform. Corrosion uses CRDT semantics (via the cr-sqlite crate) to handle SWIM-gossipped updates from worker servers around the world; a reasonable first approximation holds that every edge and worker server in our fleet runs a copy of Corrosion and, through gossip updates, synchronizes its own copy of the (rather large) global state for the Fly.io platform.
The proximate cause of this Corrosion incident is straightforward. About 5 minutes before the incident began, a developer deployed a schema change to Corrosion, fleet-wide.
The change added a nullable (and usually-null) column to the table in Corrosion that tracks all configured services on all Fly Machines (that is to say: if you edit your fly.toml to light up 8443/tcp on your app, this table records that configuration on every Fly Machine started under that app). Surprisingly to the developer, the CRDT semantics on the impacted table meant that Corrosion backfilled every row in the table with the default null value. The table involved is the largest tracked by Corrosion, and this generated an explosion of updates.
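To make the mechanism concrete, here is a rough sketch of how a column-level CRDT table behaves under cr-sqlite: every (row, column) pair carries its own change record, so a migration that backfills a value into every existing row of a very large table mints a change record per row, and in a gossip network each of those records fans out to every node. The table, columns, and extension path below are invented; the crsql_* calls are from cr-sqlite’s documented interface, though the migration helpers may differ between versions. This is an illustration, not our production schema or tooling.

```python
# Illustrative sketch only: a tiny cr-sqlite database standing in for one
# Corrosion node. Table/column names and the extension path are invented;
# the crsql_* functions are cr-sqlite's documented API (check your version).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
conn.load_extension("./crsqlite")  # path to the cr-sqlite loadable extension (assumption)

# A stand-in for a large services-style table, converted to a CRDT-replicated "CRR".
conn.execute("CREATE TABLE services (id PRIMARY KEY NOT NULL, machine_id TEXT, port INTEGER)")
conn.execute("SELECT crsql_as_crr('services')")

# Simulate a "large" table; the real table has vastly more rows than this.
conn.executemany(
    "INSERT INTO services (id, machine_id, port) VALUES (?, ?, ?)",
    [(i, f"machine-{i}", 8443) for i in range(100_000)],
)
before = conn.execute("SELECT count(*) FROM crsql_changes").fetchone()[0]
print("change records before schema change:", before)  # roughly one record per (row, column)

# Schema migration using cr-sqlite's alter helpers (names per its docs; version-dependent).
conn.execute("SELECT crsql_begin_alter('services')")
conn.execute("ALTER TABLE services ADD COLUMN http_handler TEXT DEFAULT NULL")
conn.execute("SELECT crsql_commit_alter('services')")

after = conn.execute("SELECT count(*) FROM crsql_changes").fetchone()[0]
print("change records after schema change:", after)
# In the incident, the backfill behaved like one new change record per existing
# row of our largest table, and every one of those records then had to be
# gossiped to every node in the fleet.

conn.execute("SELECT crsql_finalize()")
conn.close()
```

The point of the sketch is only the scaling behavior: a one-line migration on a CRR touches every existing row, and gossip multiplies that update volume by the number of nodes.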
As with the previous Corrosion outage, because this is a large-scale distributed system, Corrosion quickly drove tens of gigabytes of traffic, saturating switch links at our upstream.
This outage is prolonged by a belief that the root cause is an inconsistent set of schemas on different instances of Corrosion.
The incident begins (and is alarmed and declared and status-paged) promptly after the schema change is deployed. The deployment is immediately halted, and investigation begins. Corrosion is driving enough traffic in some regions to impact networking, and cr-sqlite’s CRDT code is consuming enough CPU and memory on many hosts to throw Corrosion into a restart loop. The deployment is then allowed to complete, to rule out inconsistency as a driver of the update storm. The deployment doesn’t worsen the outage, but does take time, and doesn’t improve the situation.
As with the October 22nd outage, the Corrosion problem is resolved when the decision is made to re-seed the database from an external source of truth. This time, the schema change complicates the process: a backup snapshot of the Corrosion database from prior to the schema change is needed, and downloading and uncompressing it adds time to the resolution.
As with the previous outage, once the snapshot is in place, re-seeding Corrosion takes approximately 20 minutes and resolves the Corrosion half of the outage.
At the same time this is happening, a corner-case interaction between a malfunctioning customer app and our API is choking out our API server.
The customer’s app runs untrusted code on behalf of users (this is a core feature of our platform). It does so by creating a new Fly Machine for each run, loading code onto it, running it to completion, and then destroying the Fly Machine. This works, but is not how the platform is meant to be used; rather, our orchestrator assumes users will create pools of Fly Machines ahead of time (dynamically resizing them as needed), starting and stopping them to handle incoming workloads; a stop of an existing Fly Machine resets it to its original state. Start and stop are much, much faster than create and destroy.
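For contrast, here is a hedged sketch of the pooled pattern described above, written against the public Fly Machines API (endpoint paths follow the public docs as we understand them; the app name, image, token handling, and job plumbing are placeholders, and this is not the customer’s code). The expensive create path runs only when the pool is resized; the per-job path is start and stop.

```python
# Sketch of the intended pattern: a pre-created pool of Machines, started and
# stopped per job, instead of create/destroy in a tight loop. Endpoint paths
# follow Fly's public Machines API docs; app name, image, and token are placeholders.
import os
import requests

API = "https://api.machines.dev/v1"
APP = "untrusted-code-runner"  # hypothetical app name
HEADERS = {"Authorization": f"Bearer {os.environ['FLY_API_TOKEN']}"}


def create_pool(size: int) -> list[str]:
    """Create the pool once, ahead of time (the expensive path)."""
    ids = []
    for _ in range(size):
        resp = requests.post(
            f"{API}/apps/{APP}/machines",
            headers=HEADERS,
            json={"config": {"image": "registry.fly.io/untrusted-code-runner:latest"}},
        )
        resp.raise_for_status()
        ids.append(resp.json()["id"])
    return ids


def run_job(machine_id: str, payload: bytes) -> None:
    """Per-job path: start, do the work, stop. No create/destroy in the loop."""
    resp = requests.post(f"{API}/apps/{APP}/machines/{machine_id}/start", headers=HEADERS)
    resp.raise_for_status()
    try:
        pass  # ship `payload` to the Machine and wait for it to finish (app-specific)
    finally:
        # Stopping resets the Machine to its original state, ready for the next job.
        resp = requests.post(f"{API}/apps/{APP}/machines/{machine_id}/stop", headers=HEADERS)
        resp.raise_for_status()
```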
The customer’s app is suddenly popular, and begins creating dozens of Fly Machines every second, at a rate steadily increasing throughout the outage. This exercises a code path that was not expected to run in a tight loop and is missing a rate limit. In our central Rails API server, which is implicated in create requests (but not starts and stops), this has the effect of jamming the process up with expensive SQL queries.
A different team is investigating and working on resolving this incident alongside the previous one. The team attempts to scale up to accommodate the load, first at the database layer, and then with larger Rails app servers; dysfunction in the Rails API makes the latter difficult and time-consuming, and ultimately neither scale-up resolves the problem: paradoxically, as we create additional capacity for create requests, the lack of backpressure amplifies the number of incoming create requests we receive.
30 minutes before the end of the outage, we reach the customer, who disables their scheduling application. The API outage promptly resolves.
Incident Timeline
This timeline makes reference to the Corrosion outage as “Incident 1”, and the API flood as “Incident 2”.
2024-11-25 14:43PM EST: (Incident 1) A fleetwide deployment of a change to our Corrosion state-sync system begins; it contains a schema update, adding a new column to a large CRDT-governed table, resulting in explosions of backfill changes on the hosts the deployment hits.
2024-11-25 14:58PM EST: (Incident 1) High-urgency alerts begin; the infra-ops team is paged.
2024-11-25 15:00PM EST: (Incident 2) A demanding customer app is generating 30 Fly Machine creations per second.
2024-11-25 15:02PM EST: (Incident 1) Our upstream alerts us to link saturation in a few regions.
2024-11-25 15:05PM EST: (Incident 1) A formal incident is declared.
2024-11-25 15:05PM EST: (Incident 2) The API team notes higher-than-usual RDS usage for our API’s backing store.
2024-11-25 15:10PM EST: Our status page is updated, reflecting degraded API performance.
2024-11-25 15:16PM EST: (Incident 1) The fleetwide deployment from 14:43PM is halted.
2024-11-25 15:26PM EST: (Incident 1) Corrosion traffic is continuing to increase; time-to-recover and propagation metrics are worsening. Out of concern about the impact of inconsistent schemas, the deployment is resumed.
2024-11-25 15:37PM EST: (Incident 1) Corrosion instances on many machines are beginning to restart; at this point, Corrosion is no longer effectively updating state across our fleet.
2024-11-25 16:49PM EST: (Incident 1) An edge physical in Boston and a worker physical in London are offline, due to a combination of CPU and network load.
2024-11-25 17:15PM EST: (Incident 1) The team begins deployment of configuration changes to slow the rate of Corrosion updates; the status page is updated to reflect our diagnosis of the issue. Corrosion is partially functioning (deployments during this phase of the outage are hit-or-miss, especially for new Fly Machines).
2024-11-25 17:36PM EST: (Incident 1) Attempts to restart Corrosion on smaller (edge and gateway) hosts are now timing out, as Corrosion on those hosts struggles to keep up with the rate of updates.
2024-11-25 17:45PM EST: (Incident 1) Working with our upstream, we throttle WireGuard traffic between our physicals.
2024-11-25 18:30PM EST: (Incident 2) The customer app generating the flood of creates has reached 52 requests/sec.
2024-11-25 18:58PM EST: (Incident 1) The Corrosion configuration change completes, but some physicals across our fleet are still distressed. The status page is updated. The Machines API is slow, but functional, at this point.
2024-11-25 19:00PM EST: (Incident 1) The decision is made to “re-seed” Corrosion, creating a new baseline database from external sources of truth, rather than wait out slow recovery on several physicals across our fleet. We begin the process of loading a (large) snapshot of the database on all our servers.
2024-11-25 20:07PM EST: (Incident 1) The process of loading the snapshot across the fleet completes.
2024-11-25 20:48PM EST: (Incident 1) Corrosion is brought up to date with changes from our API and flyd servers occurring after the snapshot time.
2024-11-25 21:30PM EST: (Incident 2) The customer app generating the flood of creates has reached 139 requests/sec.
2024-11-25 21:31PM EST: (Incident 1) The status page is updated to reflect nominal performance of Corrosion and state synchronization.
2024-11-25 22:27PM EST: (Incident 2) Alarms fire about the availability of our web dashboard. Concerns remain about aftereffects from the Corrosion outage, but the problem will turn out to be unrelated.
2024-11-25 22:55PM EST: (Incident 2) Telemetry reveals slow SQL queries jamming up our API servers.
2024-11-25 23:12PM EST: (Incident 2) We begin scaling up our RDS instances, from 8xl to 16xl, and update the status page.
2024-11-26 00:10AM EST: (Incident 2) Scale-up completes, but performance still lags. We begin scaling up our Rails API servers as well; this is complicated by the current load on that API server, which is in the critical path for the planned scale-up deployment. At this point, the team is still attempting to scale out of this problem. The team will continue to attempt permutations of API server scaling, including reversions of recent PRs, for the next 45 minutes.
2024-11-26 01:00AM EST: (Incident 2) The customer app generating the flood of creates has reached 151 requests/sec.
2024-11-26 01:33AM EST: (Incident 2) The customer app generating the flood of creates is stopped.
2024-11-26 01:34AM EST: (Incident 2) The incident resolves.
2024-11-26 02:15AM EST: The status page is updated to reflect incident resolution.
Forward-Looking Statements
A significant fraction of this outage rhymes with our previous orchestration outage, and much of what we’re working on in response to that outage applies here as well.
The most significant thing we’re doing to minimize the impact of outages like these in the future is to reduce global state. Currently, every physical in our fleet has a high-fidelity record of every individual Fly Machine running on the platform. This is a consequence of the original architecture of Fly.io, and it’s a simplifying assumption (“anywhere we need it, we can get any data we want”) that we’ve taken advantage of over the years.
Because of the increased scale we’re working at, we’ve reached a fork in the road. We can continue running into corner-cases and bottlenecks as we scale and manage high-fidelity global state, and develop the muscles to handle those, or we can break the simplifying assumption and do the engineering required to retrofit “regionality” into our state. As was the case late this summer with fly-proxy, we’re choosing the latter: running multiple regional high-fidelity Corrosion clusters, with Fly-Machine-by-Fly-Machine detail about what’s running where, and a single global low-fidelity cluster with pointers down to the regional clusters.
The payoff for regionalized state is a reduced blast radius for distributed systems failures. There’s effectively nothing that can happen with Corrosion today that doesn’t echo across our whole fleet. In a regionalized configuration, most problems can be contained to a single region, where they’ll be more straightforward to resolve and have drastically less impact on the platform. Corrosion, an open-source project we maintain, is already capable of running this way; the work is in how it’s integrated, particularly in our routing layer.
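As a purely conceptual sketch of that layout (invented names and fields, not Corrosion’s schema or API): the global cluster only has to answer “which region owns this Fly Machine?”, and the regional cluster holds the full record.

```python
# Conceptual sketch of regionalized state: a low-fidelity global index that
# points at high-fidelity regional stores. All names and fields are invented;
# this is not Corrosion's schema or API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MachineRecord:
    machine_id: str
    worker_host: str
    services: list[tuple[str, int]]  # e.g. [("tcp", 8443)]
    state: str


# Regional clusters: full, per-Machine detail, scoped to one region.
regional_state: dict[str, dict[str, MachineRecord]] = {
    "ord": {"m-123": MachineRecord("m-123", "worker-ord-7", [("tcp", 8443)], "started")},
    "lhr": {},
}

# Global cluster: only a pointer from Machine to owning region.
global_index: dict[str, str] = {"m-123": "ord"}


def lookup(machine_id: str) -> Optional[MachineRecord]:
    """Resolve a Machine by consulting the global index, then exactly one region."""
    region = global_index.get(machine_id)
    if region is None:
        return None
    return regional_state[region].get(machine_id)


# A failure in the "lhr" regional store can't flood or corrupt "ord":
# the blast radius of a regional problem is that region's records only.
print(lookup("m-123"))
```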
This work has been ongoing for over a month, but it’s a big lift and we can’t rush it. So: we’re cringing at the overlap this outage has with our last one, but it’s not for lack of staffing and effort on the long-term fix.
Two immediately evident pieces of low-hanging fruit that we have already picked in the last week:
First, as we said last time, responding to Corrosion problems by efficiently re-seeding state continues to be an effective and relatively fast fix. Re-seeding was complicated this time by the schema change that precipitated the event. We’ve begun creating processes to simplify and speed up re-seeding under worst-case circumstances. Additionally, some of the delay in kicking off the re-seeding, as with last time, resulted from a cost/benefit calculation: re-seeding requires resynchronizing our API and flyd servers across our fleet with Corrosion, which isn’t automated; the hope was that Corrosion would converge and reach acceptable P95 performance levels soon enough not to need that work. We’re building tooling to minimize that work in the future, so it doesn’t need to be part of the calculation.
We’ve also added comprehensive “circuit-breaker” limits to Corrosion. For already-deployed apps, even a total Corrosion outage shouldn’t break routing; Corrosion synchronizes a SQLite database, and our routing layer can simply read that database, whether or not Corrosion is running. But during the acute phase of this outage, Corrosion wasn’t just not running effectively; it was also consuming host and (especially) network resources. Corrosion now has internal rate limits on gossip traffic, and our hosts have external limits, in the network stack and OS scheduler, to stop runaway processes; this is being rolled out now.
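For the OS-scheduler piece, the external limit has roughly the following shape. This is a sketch against the cgroup v2 interface files; the group name, thresholds, and the choice to write the files directly (rather than, say, via systemd unit settings) are illustrative only, and network shaping is handled separately.

```python
# Rough sketch of an OS-level "circuit breaker" for a runaway process, using
# the cgroup v2 interface files directly. Group name, limits, and the choice
# to write the files from Python (rather than systemd unit settings) are
# illustrative only; this needs root or a delegated cgroup. Network shaping
# (e.g. tc) is separate and not shown here.
import os

CGROUP = "/sys/fs/cgroup/corrosion-limits"  # hypothetical cgroup name


def clamp(pid: int, cpu_pct: int = 50, mem_bytes: int = 4 * 2**30) -> None:
    """Cap a process's CPU and memory so it can't starve the rest of the host."""
    os.makedirs(CGROUP, exist_ok=True)

    # cpu.max takes "<quota> <period>" in microseconds; quota/period is the CPU share.
    period = 100_000
    quota = period * cpu_pct // 100
    with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
        f.write(f"{quota} {period}")

    # memory.max is a hard ceiling in bytes; the kernel reclaims or OOM-kills past it.
    with open(os.path.join(CGROUP, "memory.max"), "w") as f:
        f.write(str(mem_bytes))

    # Move the process into the group so the limits apply to it.
    with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```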
Second, the back half of this outage was due to a pathological condition we hit because we lacked a rate limit on an expensive API operation we didn’t expect users to drive in a tight loop. Obviously, that’s a bug, one we’ve fixed. But there was a process issue here as well: we identified the “pathological” app (it wasn’t doing anything malicious, and in fact was trying to do something we built the platform to do! it was just using the wrong API call to do it), but then engaged in heroics to try to scale up to meet the demand. Without backpressure, this doesn’t do anything.
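As a sketch of the kind of guard that was missing (invented names and thresholds, not our actual Rails code), a per-caller token bucket in front of the expensive create path both caps the damage and gives callers the backpressure signal that was absent here:

```python
# Sketch of a per-caller token bucket guarding an expensive endpoint. Names,
# thresholds, and the idea of returning 429 are illustrative; nothing here
# mirrors our actual Rails API code.
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    rate: float   # tokens refilled per second
    burst: float  # maximum bucket size
    tokens: float = 0.0
    updated: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per caller (org, app, or token) for the expensive create path.
buckets: dict[str, TokenBucket] = {}


def handle_create(caller_id: str) -> tuple[int, str]:
    bucket = buckets.setdefault(caller_id, TokenBucket(rate=1.0, burst=5.0, tokens=5.0))
    if not bucket.allow():
        # Backpressure: the caller is told to slow down instead of the API
        # absorbing an unbounded queue of expensive SQL work.
        return 429, "rate limit exceeded; retry later"
    return 200, "machine created"  # ... the real create work would happen here
```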
So we’re also building out a process runbook for handling/quarantining applications that have spun out, one that incident response teams can follow without having to think hard in the middle of a high-priority incident.