Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
September 1: Fleet-wide Request Routing Outage (15:30 EST): A correlated concurrency bug in fly-proxy caused a fleetwide outage in request handling, with an acute phase lasting roughly 40 minutes for many customers. A full postmortem for this incident appears at the end of this week's log.
September 6: Request Routing Disruption in JNB Region (15:30 EST): A deployment error regresses the September 1 bug in JNB for about 3 minutes. We spotted this with synthetic alerting and fixed it immediately (ordinarily, we'd disable Anycast routing to the region, but redeploying to fix the root cause was faster).
September 7: Tigris Buckets Created Through Fly.io API Can't Be Deleted (17:00 EST): For a few hours, a name scheme change at Tigris that we missed breaks API compatibility with Fly.io, which stops bucket deletes from working until we fix it. Minimal customer disruption, but we called an incident for it, so it's documented here.
This Week In Infra Engineering
Simon is deep — perhaps approaching crush depth — on the volume storage problem of making Fly Machines create faster. The most expensive step in the process of bringing up a Fly Machine is preparing its root filesystem. Today, we rely on containerd for this. When a worker brings up a Fly Machine, it makes sure we've pulled the most recent version of the app's container from our container registry into a local containerd, and then "checks out" the container image into a local block device. If we can replace this process with something faster, we can narrow the gap between starting a stopped Fly Machine (already ludicrously fast) and creating one (this can take double-digit seconds today). The general direction we're exploring is pre-preparing root filesystems, storing them in fast object storage, and serving those images to newly-created Fly Machines over nbd. Essentially, this puts the work we did on inter-host migration to work making the API faster.
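To make that concrete, here's a hypothetical sketch of the create-time decision we're exploring, in Rust. Every function name here is a placeholder, not flyd's real API; the stubs exist only so the sketch compiles. The idea: attach a pre-built root filesystem from object storage over nbd when one exists, and fall back to the containerd pull-and-checkout path when it doesn't.

// All of these helpers are placeholders; the real work (registry pulls,
// snapshotter checkouts, nbd attachment) is far more involved than this.
fn lookup_prebuilt_rootfs(_image_ref: &str) -> Option<String> {
    None // e.g. a key into fast object storage, if we've pre-prepared this image
}

fn attach_over_nbd(_object_key: &str) -> std::io::Result<String> {
    Ok("/dev/nbd0".into()) // hand back a block device backed by the stored image
}

fn pull_into_containerd(_image_ref: &str) -> std::io::Result<()> {
    Ok(()) // today's slow path: pull the image into the local containerd
}

fn checkout_to_block_device(_image_ref: &str) -> std::io::Result<String> {
    Ok("/dev/mapper/rootfs".into()) // ...then "check out" the image into a block device
}

// The decision we'd like the worker to make at Machine create time.
fn rootfs_for(image_ref: &str) -> std::io::Result<String> {
    if let Some(prebuilt) = lookup_prebuilt_rootfs(image_ref) {
        attach_over_nbd(&prebuilt) // fast path: no pull, no checkout
    } else {
        pull_into_containerd(image_ref)?;
        checkout_to_block_device(image_ref)
    }
}

fn main() {
    println!("{:?}", rootfs_for("example-app:latest"));
}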
JP made all our Machines API servers restart more gracefully, by replacing direct socket creation with systemd socket activation. Prior to this change, a redeployment of flaps, the Machines API server, would bounce hundreds of flaps instances across our fleet, causing dozens of API calls to fail as service ports briefly stopped listening. That's ~fine, in that flyctl, our CLI, knows to retry these calls, but obviously it still sucks. Delegating the task of maintaining the socket to systemd eliminates this problem and improves our API reliability metrics.
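Here's a minimal sketch of the socket-activation pattern in Rust (the language used for the code later in this post; this is not flaps's actual code, and the port is made up). Instead of binding its own socket, a socket-activated server checks systemd's LISTEN_PID and LISTEN_FDS environment variables and adopts the already-listening file descriptor it was handed, so a redeploy never leaves a window where nothing is listening on the port.

use std::net::TcpListener;
use std::os::unix::io::FromRawFd;

// systemd passes activated sockets starting at fd 3, and identifies them via
// the LISTEN_PID and LISTEN_FDS environment variables.
const SD_LISTEN_FDS_START: i32 = 3;

fn take_activated_listener() -> Option<TcpListener> {
    let pid_matches = std::env::var("LISTEN_PID")
        .ok()
        .and_then(|v| v.parse::<u32>().ok())
        .map(|pid| pid == std::process::id())
        .unwrap_or(false);
    let fds: i32 = std::env::var("LISTEN_FDS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0);
    if pid_matches && fds >= 1 {
        // SAFETY: systemd handed us this fd and we take sole ownership of it.
        Some(unsafe { TcpListener::from_raw_fd(SD_LISTEN_FDS_START) })
    } else {
        None
    }
}

fn main() -> std::io::Result<()> {
    // Adopt the systemd-managed socket if we were socket-activated; otherwise
    // bind directly (useful for local development).
    let listener = match take_activated_listener() {
        Some(l) => l,
        None => TcpListener::bind("127.0.0.1:8080")?,
    };
    for stream in listener.incoming() {
        let _conn = stream?;
        // ... handle the API request ...
    }
    Ok(())
}

On the systemd side, a matching .socket unit owns the listening socket, so incoming connections queue in the kernel while the service itself restarts, instead of being refused.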
Ben and JP began the process of getting Machine migrations onto flyd's v2 FSM implementation. What's a v2 FSM? We're glad you asked! Recall: flyd, our scheduler, is essentially a BoltDB event log of finite state machine steps; "start an existing Fly Machine" is an FSM, as is "create a Fly Machine" or "migrate it to another host". v1 FSMs (which are all of our in-production flyd FSMs) are pretty minimal. v2 FSMs add observability and introspection to current steps, and tree structures of parent-child relationships to chain FSMs into power combos; they also have a callback API for failure recovery. This is all stuff inter-host migration can make good use of; with observable, coordinated, tracked migration as a first-class API on flyd, we can get more aggressive about using Machine migration to rebalance the fleet and to quickly and automatically react to hardware incidents.
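To give a flavor of what that buys us, here's an illustrative sketch — not flyd's actual types — of the v2 shape: per-step status you can introspect, child FSMs nested under a parent, and a recovery callback that fires when a step fails.

#[derive(Debug)]
enum StepStatus {
    Pending,
    Running,
    Done,
    Failed(String),
}

struct Step {
    name: &'static str,
    status: StepStatus,
    run: fn() -> Result<(), String>,
}

struct Fsm {
    name: &'static str,
    steps: Vec<Step>,
    children: Vec<Fsm>,           // e.g. a "migrate" parent chaining child FSMs
    on_failure: Option<fn(&str)>, // recovery hook, handed the failed step's name
}

impl Fsm {
    fn run(&mut self) -> Result<(), String> {
        for step in &mut self.steps {
            step.status = StepStatus::Running;
            match (step.run)() {
                Ok(()) => step.status = StepStatus::Done,
                Err(e) => {
                    step.status = StepStatus::Failed(e.clone());
                    if let Some(recover) = self.on_failure {
                        recover(step.name);
                    }
                    return Err(format!("{}: step {} failed: {}", self.name, step.name, e));
                }
            }
        }
        for child in &mut self.children {
            child.run()?;
        }
        Ok(())
    }

    // The introspection half: something proxyctl-ish could dump this.
    fn report(&self) {
        for step in &self.steps {
            println!("{}/{}: {:?}", self.name, step.name, step.status);
        }
    }
}

fn reserve_target_host() -> Result<(), String> {
    Ok(()) // placeholder step body
}

fn log_recovery(step: &str) {
    eprintln!("recovering after failed step {step}");
}

fn main() {
    let mut migrate = Fsm {
        name: "migrate-machine",
        steps: vec![Step {
            name: "reserve-target-host",
            status: StepStatus::Pending,
            run: reserve_target_host,
        }],
        children: vec![],
        on_failure: Some(log_recovery),
    };
    let _ = migrate.run();
    migrate.report();
}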
We have a couple of customers that make gigantic Machine create requests — many thousands at once. To make these kinds of transactions more performant, we parallelized them in flyctl. But these parallel requests are evaluated by our scheduler in isolation, which has resulted in suboptimal placement of Machines (the two most common cases being multiple Machines for the same app scheduled unnecessarily on the same worker, and unbalanced regions with some lightly loaded and some heavily loaded workers; "Katamari scheduling"). Kaz and JP made fixes both to flyctl and to our scheduler backend to resolve this; in particular, we now explicitly manage a "placement ID", tracked statefully across scheduling requests using memdb, that allows users of our APIs to spread workloads across our hardware.
In related news, Dusty has been working on improved capacity planning. The most distinctive thing about the Fly Machines API, as orchestration APIs go, is that we’re explicit about the possibility of Machine creation operations (new reservations of compute capacity) failing; the most obvious reason a Machine create might fail is that it’s directed to a region that’s at capacity. What we have learned over several years of providing this API is that customers are not as thrilled with the computer science elegance of this “check if it fails and try again elsewhere” service model as we are. So we’ve been moving heaven and, uh, Dusty to make sure this condition happens as rarely as possible. Dusty’s big project over the last week: integrating our existing host metrics (the capacity metrics you’d think about by default, like CPU and disk utilization, IOPS, etc) with Corrosion, our state tracking database. Exported host metrics are a high-fidelity view into what our hosts actually see, while Corrosion is a distilled view into what we are trying to schedule on hosts. We’ve now got Corrosion reflected into Grafana, which has enabled Dusty to build out a bunch of new capacity planning dashboards.
Dusty also moved half of our AMS region to new hardware; half the region to go!
Peter worked a support rotation. We schedule product engineers to multi-day tours of duty alongside our support engineers, which means watching incoming support requests and pitching in to help solve problems. Peter reports his most memorable support interaction was doing Postgres surgery for a customer who had blown out their WAL file by enabling archive_mode, which preserves WAL segments, without setting archive_command, giving Postgres no place to send the segments.
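For anyone who wants to avoid the same surgery: archive_mode tells Postgres to keep each completed WAL segment until archive_command reports success, so enabling the former without the latter means segments accumulate forever. The two settings want to travel together; something like this (the archive path is a placeholder, and the test-then-cp command is the stock example from the Postgres docs):

# postgresql.conf
archive_mode = on
# Without an archive_command, completed WAL segments are retained indefinitely.
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'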
Tom continued his top-secret work that we can't write about, except to say that this week it involved risk-based CPU priorities and Machine CPU utilization tracking.
Now, deep breath:
September 1 Routing Layer Outage Postmortem
(A less formal version of this postmortem was posted on our community site the day after the incident.)
Narrative
At 3:30PM EST on September 1, we experienced a fleetwide near-total request-routing outage. What this means is that for the duration of the incident, which was acute for roughly 40 minutes and chronically recurring for roughly another hour, apps hosted on Fly.io couldn’t receive requests from the Internet. This is a big deal; our most significant outage since the week we started the infra-log (in which we experienced roughly the same WireGuard mesh outage, which also totally disrupted request routing, twice in a single week). We record lots of incidents in this log, but very few of them disable the entire platform. This one did.
We’re going to explain the outage, provide a timeline of what happened, and then get into some of what we’re doing to keep anything like it from happening again.
Request routing is the mechanism by which we accept connections from the Internet, figure out what protocol they're using, match them to customer applications, find the nearest worker physical running that application, and shuttle the request over to that physical so customer code can handle it. Our request routing layer is broadly composed of these four components:
Anycast routing, which allows us to publish BGP4 updates to our upstreams in all our regions to attract traffic to the closest region.
fly-proxy, our Anycast request router, in its "edge" configuration. In this configuration, fly-proxy works a lot like an application-layer version of an IP router: connections come in, the proxy consults a routing table, and forwards the request (a toy sketch of this lookup follows the list).
That same fly-proxy code in its "backhaul" configuration, which cooperates with the edge proxy to bring up transports (usually HTTP/2) to efficiently relay requests from edges to customer VMs.
Corrosion, our state propagation system. When a Fly Machine associated with a routable app starts or stops on a worker, flyd publishes an update to Corrosion, which is gossiped across our fleet; fly-proxy subscribes to Corrosion and uses the updates to build a routing table in parallel across all the thousands of proxy instances across our fleet.
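Here's that toy sketch of the edge-side lookup (illustrative types only, nothing like fly-proxy's real data structures): gossiped state updates build a table mapping (app, port) to the workers currently running that app's Machines, and each incoming connection picks the closest live backend.

use std::collections::HashMap;

struct Backend {
    worker: String,
    region: String,
    rtt_ms: u32, // stand-in for "closeness" to this edge
}

struct RoutingTable {
    routes: HashMap<(String, u16), Vec<Backend>>, // (app, port) -> candidate workers
}

impl RoutingTable {
    // Called for each gossiped state update: a Machine started or stopped.
    fn apply_update(&mut self, app: &str, port: u16, backend: Backend, up: bool) {
        let entry = self.routes.entry((app.to_string(), port)).or_default();
        entry.retain(|b| b.worker != backend.worker);
        if up {
            entry.push(backend);
        }
    }

    // Called per connection on the edge: pick the nearest live backend.
    fn route(&self, app: &str, port: u16) -> Option<&Backend> {
        self.routes
            .get(&(app.to_string(), port))?
            .iter()
            .min_by_key(|b| b.rtt_ms)
    }
}

fn main() {
    let mut table = RoutingTable { routes: HashMap::new() };
    table.apply_update("example-app", 443, Backend { worker: "worker-a".into(), region: "ord".into(), rtt_ms: 12 }, true);
    table.apply_update("example-app", 443, Backend { worker: "worker-b".into(), region: "ams".into(), rtt_ms: 95 }, true);
    if let Some(b) = table.route("example-app", 443) {
        println!("routing to {} in {}", b.worker, b.region);
    }
}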
During the September 1 outage, practically every instance of fly-proxy running across our fleet became nonresponsive.
Generally, platform/infrastructure components at Fly.io are designed to cleanly survive restarts, so that as a last resort during an incident we can attempt to restore service by doing a fleetwide bounce of some particular service. Bouncing fly-proxy is not that big a deal. We did that here, it restarted cleanly, and service was restored. Briefly. The fleet quickly locked back up again.
Our infra team continued applying the defibrillator paddles to fly-proxy while the proxy team diagnosed what was happening.
The critical clue, identified about 50 minutes into the incident, was that proxyctl, our internal CLI for managing the proxy, was hanging on stuck fly-proxy instances. There's not a lot of mechanism in between proxyctl and the proxy core; if proxyctl isn't working, fly-proxy is locked, not just slowly processing some database backlog or grinding through events. The team immediately and correctly guessed the proxy was deadlocked.
fly-proxy is written in Rust. If you're a Rust programmer, the following code pattern may or may not be familiar to you, and you may taste pennies in your mouth seeing it:
// self.load is an RwLock
if let Some(Load::Local(load)) = &self.load.read().get(...) {
    // do a bunch of stuff with `load`
} else {
    self.init_for(...);
}
An RwLock is a lock that can be taken multiple times concurrently by readers, but only exclusively during any attempt to write. An if let in Rust is an if-statement that succeeds if a pattern matches; here, it succeeds if self.load.read().get() returns a Some instance rather than None, a standard Rust error-checking idiom. In the success case, the result is available inside the success arm of the if let as load. The else arm fires if self.load.read().get() returns None.
The way this if let statement looks, it would appear that the lifetime of the read lock taken in attempting the success case is only the length of the success arm of the if statement, and that the lock is dropped if the else arm triggers. But that is not what happens in Rust. Rather: if let is syntactic sugar for this code:
match &self.load.read().get() {
    Some(load) => { /* do a bunch of stuff with `load` */ },
    _ => {
        self.init_for(...);
    },
}
It is clearer, in this de-sugared code block, that the read() lock taken spans the whole conditional, not just the success arm.
Unfortunately for us, buried a funcall below init_for() is an attempt to take a write lock. Deadlock.
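If you want to feel the pennies yourself, here's a self-contained reproduction of the hazard using std's RwLock (fly-proxy's actual lock type may differ, and these types are stand-ins). Under the 2021 edition, the temporary read guard created in the if let scrutinee lives until the end of the whole if/else expression, so the write-lock attempt in the else arm blocks (or panics, depending on the platform) against our own read lock; the 2024 edition changes exactly this drop order.

use std::sync::RwLock;

struct Router {
    load: RwLock<Option<u32>>,
}

impl Router {
    fn init_for(&self) {
        // Takes the write lock; fatal if the caller is still holding a read guard.
        *self.load.write().unwrap() = Some(0);
    }

    fn lookup(&self) {
        // The guard returned by read() is a temporary in the scrutinee, and
        // (pre-2024-edition) it is not dropped until the whole if/else ends.
        if let Some(load) = *self.load.read().unwrap() {
            println!("cached load: {load}");
        } else {
            self.init_for(); // read guard still held here: deadlock
        }
    }
}

fn main() {
    let router = Router { load: RwLock::new(None) };
    router.lookup(); // never returns on the 2021 edition
}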
This is a code pattern our team was already aware of, and this code was reviewed by two veteran Rust programmers on the team before it was deployed; neither spotted the bug, most likely because the conflicting write lock wasn't lexically apparent in the PR diff.
The PR that introduced this bug had been deployed to production several days earlier. It introduced "virtual services", which decouple request routing from Fly Apps. Conventionally-routed services on Fly.io are tied to apps; the fly.toml configuration for these apps "advertises" services connected to the app, which ultimately end up pushed through Corrosion into the proxy's routing table. Virtual services enable flexible query patterns that match subsets of Fly Machines, by metadata labels, to specific URL paths. We're generally psyched about virtual services, and they're important for FKS, the Fly Kubernetes Service.
The deadlock code path occurs when a specific configuration of virtual service is received in a Corrosion update. That corner case had not occurred in our staging testing, or on production for several days after deployment, but on September 1 a customer testing out the feature managed to trigger it. When that happened, a poisonous configuration update was rapidly gossiped across our fleet, deadlocking every fly-proxy that saw it. Bouncing fly-proxy broke it out of the deadlock, but only long enough for it to witness another Corrosion subscription update poisoning its service catalog again. Distributed systems. Concurrency. They're not as easy as computer science classes tell you they are.
Because we had a strong intuition this was a deadlock bug, and because it's easy for us to isolate recently deployed changes to fly-proxy, and because this particular if let RwLock bug is actually a known Rust foot-gun, we worked out what was happening reasonably quickly. We disabled the API endpoints that enabled users to create virtual services, and rolled out a proxy code fix, restoring service shortly thereafter.
Complicating the diagnosis of this incident was a corner case we ran into with a sysctl change we had made across the fleet. To improve graceful restarts of the proxy, we had applied tcp_migrate_req, which migrates requests across sockets in the same REUSEPORT group. Under certain circumstances with our code, this created a condition where the "backhaul" proxy stopped receiving incoming connection requests. This condition impacted only a very small fraction (roughly 10 servers total) of our physical fleet, and was easily resolved fleetwide by disabling the sysctl; it did slow down our diagnosis of the "real" problem, however.
Incident Timeline
2024-08-28 19:01 UTC: The problematic version of fly-proxy is deployed across all regions. Nothing happens, because nobody is publishing virtual services.
2024-09-01 19:18 UTC: A poisonous virtual services configuration is added by a customer. The configuration is not itself malicious, but triggers the proxy deadlock. The configuration is propagated within seconds to all fly-proxy instances via Corrosion.
2024-09-01 19:25 UTC: Synthetic alerting triggers a formal incident; an incident commander is assigned and creates an incident channel.
2024-09-01 19:30 UTC: Internal host health check alerts begin firing, indicating broad systemwide failures in hardware and software.
2024-09-01 19:31 UTC: Our status page is updated, reporting a "networking outage" impacting our dashboard, API, and customer apps.
2024-09-01 19:33 UTC: Two proxy developers have joined the incident response team.
2024-09-01 19:36 UTC: The infra team confirms the host health checks are a false alarm, triggered by a forwarding dependency for health alerts on the stuck proxies. fly-proxy is implicated in the outage.
2024-09-01 19:41 UTC: The API, which is erroring out on requests, is determined to be blocked on attempts to manage tokens. Token requests from the API are forwarded as internal services to tkdb, an HTTP service that for security reasons runs on isolated hardware. fly-proxy is further implicated in the outage.
2024-09-01 19:56 UTC: Our infra team begins restarting fly-proxy instances. The restart briefly restores service. All attention is now on fly-proxy. The infra team will continue rolling restarts of the proxies as proxy developers attempt to diagnose the problem. The acute phase of the incident has ended.
2024-09-01 20:10 UTC: Noting that proxyctl is failing in addition to request routing, attention is directed to possible concurrency bugs in recent proxy PRs.
2024-09-01 20:12 UTC: Continued rolling restarts of the proxy have cleared deadlocks across the fleet from the original poisoned update; service is nominally restored, and the infra team continues monitoring and restarting. The status page is updated.
2024-09-01 21:13 UTC: The proxy team spots the if let bug in the virtual services PR.
2024-09-01 21:21 UTC: The proxy team disables the Fly Machines API endpoint that configures virtual services, resolving the immediate incident.
Forward-Looking Statements
The most obvious issue to address here is the pattern of concurrency bug we experienced in the proxy codebase. Rust's library design is intended to make it difficult to compile code with deadlocks, at least without those deadlocks being fairly obvious. This is a gap in those safety ergonomics, but an easy one to spot. In addition to code review guidelines (it is unlikely that another if let concurrency bug is going to make it through code review again soon), we've deployed semgrep across all our repositories; this is a straightforward thing to semgrep for.
The deeper problem is the fragility of fly-proxy. The "original sin" of this design is that it operates in a global, flat, unsegmented topology. This simplifies request routing and makes it easy to build the routing features our customers want, but it also increases the blast radius of certain kinds of bugs, particularly anything driven by state updates. We're exploring multiple avenues of "sharding" fly-proxy across multiple instances, so that edges run parallel deployments of the proxy. Reducing the impact of an outage from all customers to 1/N customers would have simplified recovery and minimized the disruption caused by this incident, with potentially minimal added complexity.
One issue we ran into during this outage was internal dependencies on request routing; one basic isolation method we’re exploring is sharding off internal services, such as those needed to run alerting and observability.
fly-proxy is software, and all software is buggy. Most fly-proxy bugs don't create fleetwide disasters; at worst, they panic the proxy, causing it to quickly restart and resume service alongside its redundant peers in a region. This outage was severe both because the proxy misbehavior was correlated, and because the proxy hung rather than panicking. A straightforward next step is to watchdog the proxy, so that our systems notice N-second periods during which the proxy is totally unresponsive. Given the proxy architecture and our experience managing this outage, watchdog restarts could function like a pacemaker, restoring some nominal level of service automatically while humans root-cause the problem. This was our first sustained, correlated proxy lock-up, which is why we hadn't already done that.
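A sketch of the shape that watchdog could take (hypothetical, not a design we've committed to): the proxy's main loop feeds a heartbeat channel, and a watchdog thread aborts the process if the loop goes quiet for too long, so the supervisor (e.g. systemd) restarts it even when the process is wedged rather than panicking.

use std::sync::mpsc;
use std::time::Duration;

// The event loop sends a heartbeat each time it completes a pass; if the
// watchdog doesn't hear one within `timeout`, it aborts the process so the
// supervisor can restart it.
fn spawn_watchdog(timeout: Duration) -> mpsc::Sender<()> {
    let (tx, rx) = mpsc::channel::<()>();
    std::thread::spawn(move || loop {
        match rx.recv_timeout(timeout) {
            Ok(()) => continue, // heartbeat arrived in time
            Err(mpsc::RecvTimeoutError::Timeout) => {
                eprintln!("watchdog: no heartbeat for {timeout:?}, aborting");
                std::process::abort();
            }
            Err(mpsc::RecvTimeoutError::Disconnected) => return, // clean shutdown
        }
    });
    tx
}

fn main() {
    let heartbeat = spawn_watchdog(Duration::from_secs(10));
    loop {
        // ... accept connections, apply state updates, route requests ...
        std::thread::sleep(Duration::from_millis(100)); // stand-in for real work
        heartbeat.send(()).ok();
    }
}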
This incident sucked a lot. We regret the inconvenience it caused. Expect to read more about improvements to request routing resilience in infra-log updates to come.