2024-10-26

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer incidents, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • October 22: Depot Builders Disrupted (10:30EST): About a month ago, we began defaulting to Docker build servers running on our infrastructure but managed by Depot; running good, efficient Docker container builders is Depot’s whole thing and we’re happy to have them do the lifting. Anyways, they deployed a code change that broke the way they handled auth tokens, and, in turn, our default builders, for about 5 minutes. We have fallbacks for Depot (we still have our own builder infrastructure), but this outage didn’t last long enough to warrant changes.

  • October 22: Sustained Orchestration Outage (14:00EST): A cascading failure beginning with a certificate expiration disrupted our orchestration system for over 6 hours, including a 1-hour acute period that broke new deploys of existing applications. A full postmortem follows this update.

  • October 24: Unusual Load In Five Regions (12:00EST): The phased rollout of our new shared CPU scheduling system hit a snag when system commands on 5 physical servers began taking multiple seconds to respond, lagging long enough to generate alerts. These alerts were internal, and we don’t believe they impacted customer workloads; the scheduling change was rolled back 30 minutes into the incident, which resolved it.

October 22 Orchestration Outage Postmortem

Narrative

At 14:00 EST on October 22, we experienced a fleetwide severe orchestration outage. What this means is that for the duration of the incident, both deployments of new applications and changes to existing applications were disrupted; during the most acute stage of the outage, lasting roughly an hour and 40 minutes, that disruption was almost total (Fly Machines could not be updated), and for roughly another 2 hours new application deployments did not function (but changes to existing applications did). Service was restored completely at 21:15 EST.

This outage proceeded through several phases. The initial acute phase was the worst of it, and subsequent phases restored various functions of the platform, so that towards the end of the outage it was largely functional for most customers. At the same time, up until the end of the outage, Fly.io’s orchestration layer was significantly disrupted. That makes this the longest significant outage we’ve recorded, not just on this infra-log but in the history of the company.

We’re going to explain the outage, provide a timeline of what happened, and then get into some of what we’re doing to keep anything like it from happening again.

Orchestration is the process by which software from our customers gets translated to virtual machines running on Fly.io’s hardware. When you deploy an app on Fly.io, or when your GitHub Actions CI kicks off an update after you merge to your main branch, you communicate with our orchestration APIs to package your code as a container, ship it to our container registry, and arrange to have that container unpacked as a virtual machine on one or more of our worker servers.

You can broadly split our orchestration system into three pieces:

  1. flyd, our distributed process supervisor; flyd understands how to download a container, transform it into a block device with a Linux filesystem, boot up a hypervisor on that block device, connect it to the network, and keep track of the state of that hypervisor,
  2. the state sharing system; an individual flyd instance knows only about the Fly Machines running on its own host, by design, and it publishes events to a logically separate state sharing system so that other parts of our platform (most notably our Anycast routers and our API) know what’s running where, and
  3. our APIs, which allow customers to create, start, and stop Fly Machines; these comprise the Fly Machines API, which interacts directly with flyd to start and stop machines, and our GraphQL API, which is used to deploy new applications and manage existing applications.

The outage we experienced broke (2), our state sharing system, but had ripple effects that disrupted (1) and (3).

The outage was a cascading failure with a simple cause: a long-lived CA certificate for a deprecated state-sharing orchestration component expired. Our state sharing system is made up of two major parts:

  1. consul, or “State-Sharing Classic”, manages the configuration of our server components, registers available services on Fly Apps, and manages health checks for individual Fly Machines. consul is a Raft cluster of database servers that take updates from “agent” processes running on all our physical servers. consul used to be the heart of all our state-sharing, but was superseded 18 months ago, by
  2. corrosion, or “New State-Sharing”, tracks the state of every Fly Machine, and every available service, and service health. corrosion is a SWIM-gossip cluster that replicates a SQLite database across our fleet.
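
For illustration, here’s a minimal sketch of the replication model corrosion relies on. This is not corrosion’s actual code, and the Row/Replica names are made up; the point is that every node holds a full local copy of the state table, updates arrive from peers over gossip in no particular order, and a per-row version makes applying them idempotent, so replicas converge on the same answer.

    // Sketch only: last-write-wins replication of a shared state table,
    // standing in for corrosion's gossip-replicated SQLite database.
    package main

    import "fmt"

    // Row is one piece of shared state, e.g. "this Machine is started on this host".
    type Row struct {
        Key     string // e.g. "machine:0e286930" (made-up identifier)
        Value   string // e.g. "started@fra-worker-42"
        Version int64  // logical timestamp assigned by the writer
    }

    // Replica is one node's local copy of the table.
    type Replica struct {
        rows map[string]Row
    }

    // Apply merges a gossiped update. Stale versions are ignored, so updates
    // can arrive late, duplicated, or out of order and every replica still
    // converges to the same state.
    func (r *Replica) Apply(u Row) {
        cur, ok := r.rows[u.Key]
        if ok && cur.Version >= u.Version {
            return
        }
        r.rows[u.Key] = u
    }

    func main() {
        r := &Replica{rows: map[string]Row{}}
        r.Apply(Row{Key: "machine:abc", Value: "started@host-1", Version: 2})
        r.Apply(Row{Key: "machine:abc", Value: "created@host-1", Version: 1}) // stale; ignored
        fmt.Println(r.rows["machine:abc"].Value) // started@host-1
    }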

We began replacing consul with corrosion because of scaling issues as our fleet grew. It’s the nature of our service that every physical server needs, at least in theory, information about every app deployment, in order to route requests; this is what enables an edge in Sydney to handle requests for an app that’s only deployed in Frankfurt. consul can be deployed with regional Raft clusters, but not in a way that shares information automatically between those clusters. Since 2020, we’ve instead operated it in a single flat global cluster. Rather than do a lot of fussy in-house consul-specific engineering to make regional clusters work, we built our own state sharing system, wrapped around the dynamics of our orchestrator. This project is mostly complete.

What we have not finished is completely severing consul from the flyd component of our orchestrator. flyd still updates consul when Fly Machine events (like a start, stop, or create) occur. Those consul updates are slow, because consul doesn’t want to scale the way we’re holding it. But that doesn’t normally matter, because our “live” state-sharing updates come from corrosion, which normally has p95 update times around 1000ms. Still, some of these consul operations do need to complete, especially for Fly Machine creates.

consul runs over mTLS secure connections; that means everything that talks to it needs a CA certificate (to validate the consul server certificate) and a client certificate (to prove that it’s authorized to talk to consul).
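
Concretely, every consul client in the fleet carries TLS configuration along these lines. This is a sketch using Go’s standard library with placeholder file names, not our actual agent code:

    // Sketch of an mTLS client configuration: trust the cluster CA,
    // present a client certificate. Placeholder paths throughout.
    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // The CA certificate validates the consul server's certificate.
        caPEM, err := os.ReadFile("consul-ca.pem")
        if err != nil {
            log.Fatal(err)
        }
        pool := x509.NewCertPool()
        if !pool.AppendCertsFromPEM(caPEM) {
            log.Fatal("no CA certificates found")
        }

        // The client certificate proves this agent may talk to consul.
        clientCert, err := tls.LoadX509KeyPair("agent.pem", "agent-key.pem")
        if err != nil {
            log.Fatal(err)
        }

        client := &http.Client{Transport: &http.Transport{
            TLSClientConfig: &tls.Config{
                RootCAs:      pool,
                Certificates: []tls.Certificate{clientCert},
            },
        }}
        _ = client // every consul request in the fleet rides on a config like this
    }

When any certificate in that chain has expired, the TLS handshake simply fails; there is no degraded mode to fall back to.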

At around 14:00 EST on the day of the outage, consul’s CA certificate expired. Every component of our fleet with any dependence on consul stopped working immediately. Fly Machine creates (but not starts and stops) depend on consul, as do parts of our telemetry and our internal fleet deployment capability. flyctl deploy stopped working.

To resolve this problem, we need to re-key the entire fleet: a new CA certificate, new server certificates, and new client certificates. Complicating matters: our internal deployment system (fcm/fsh) relies on consul to track available physical servers. All told, it takes us about 45 minutes to restore enough connectivity to resume deploys, and another 45 minutes to completely re-key the fleet.

At this point, basic Fly Machines API operations are completing. But there’s another problem: vault, our old secret storage system, is still used for managing disk encryption secrets, and by our API. The fleet rekeying has broken connectivity to vault. Complicating matters further, vault has extraordinarily high uptime, and so when its configuration is updated and the service is bounced, it doesn’t come back cleanly. For about 90 minutes, a team of infra engineers works to diagnose the problem, which turns out to be a different set of certificates that have expired; we’re able to perform X.509 surgery to restore them without rekeying another cluster.
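
One way to do that kind of surgery, sketched with hypothetical file names rather than our actual tooling: copy the expired certificate’s fields into a new template, give it a fresh validity window and serial number, and re-sign it over the same public key, so the service keeps the private key it already has and nothing else needs to be redistributed.

    // Sketch: re-issue an expired certificate around its existing key pair.
    // Hypothetical paths; error handling kept minimal for brevity.
    package main

    import (
        "crypto/rand"
        "crypto/x509"
        "encoding/pem"
        "log"
        "math/big"
        "os"
        "time"
    )

    func parseCert(path string) *x509.Certificate {
        raw, err := os.ReadFile(path)
        if err != nil {
            log.Fatal(err)
        }
        block, _ := pem.Decode(raw)
        cert, err := x509.ParseCertificate(block.Bytes)
        if err != nil {
            log.Fatal(err)
        }
        return cert
    }

    func main() {
        expired := parseCert("expired-leaf.pem") // the certificate that lapsed
        ca := parseCert("ca.pem")

        rawKey, err := os.ReadFile("ca-key.pem")
        if err != nil {
            log.Fatal(err)
        }
        block, _ := pem.Decode(rawKey)
        caKey, err := x509.ParsePKCS8PrivateKey(block.Bytes)
        if err != nil {
            log.Fatal(err)
        }

        // Same subject, SANs, usages, and public key; new serial and validity.
        tmpl := *expired
        tmpl.SerialNumber, _ = rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
        tmpl.NotBefore = time.Now().Add(-time.Hour)
        tmpl.NotAfter = time.Now().AddDate(1, 0, 0)

        der, err := x509.CreateCertificate(rand.Reader, &tmpl, ca, expired.PublicKey, caKey)
        if err != nil {
            log.Fatal(err)
        }
        pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
    }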

The biggest problem in the outage (in terms of difficulty, if not raw impact) now emerges. During the window in which consul was completely offline, flyd has been queueing state updates and retrying them on an exponential backoff timer. These updates can’t complete until consul is back online, but all of them are events that corrosion consumes. They pile up, rapidly and dramatically.
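
The retry pattern itself is the ordinary one. Here’s a sketch (not flyd’s actual code) of why a multi-hour consul outage turns into a mountain of queued events:

    // Sketch: retry a state update with exponential backoff until the
    // downstream dependency (consul, here) accepts it. While the dependency
    // is down, every event produced in the meantime queues up behind this.
    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    func retry(update func() error) {
        delay := 100 * time.Millisecond
        for {
            if err := update(); err == nil {
                return
            }
            time.Sleep(delay)
            if delay < time.Minute {
                delay *= 2 // back off, up to a ceiling
            }
        }
    }

    func main() {
        attempts := 0
        retry(func() error {
            attempts++
            if attempts < 4 {
                return errors.New("consul unavailable") // simulated outage
            }
            return nil
        })
        fmt.Printf("update delivered after %d attempts\n", attempts)
    }

None of this is wrong in isolation; the problem is what happens downstream when hours of deferred updates all complete at once and land in corrosion.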

By the time consul is restored, corrosion is driving 150GB/s of traffic, saturating switch links with our upstream. The data it’s trying to ship is mostly worthless, but it doesn’t know that. It’s a distributed system based on gossip, so it’s not simple to filter out the garbage.

For 6.5 hours, through the acute phase (in which deploys of existing apps aren’t functioning) and subacute phase (in which deploys of new apps aren’t functioning reliably), this will be the major problem we contend with. We need corrosion in order to inform our Anycast proxies of which Fly Machines are available to route traffic to. During the subacute phase of the outage, routing to existing Fly Machines continues to function, but changes to Machines take forever to propagate: at the beginning of the subacute phase, as much as 30 minutes; by the end, P99 latencies of several minutes, still far too slow for real-time Anycast routing.

Ultimately, the decision is made to restore the corrosion cluster from a snapshot (we take daily snapshots), and fill in the gaps (“reseeding” the cluster) from source-of-truth data. This process begins at 18:00 EST and completes by 18:30 EST, at which time P99 latencies for corrosion are back under 2000ms.
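
Conceptually, the reseed step looks something like the sketch below. The types and mechanics here are assumptions for illustration, not the actual runbook: the snapshot supplies the bulk of the table, and then each host we can reach reports its own Machines (which flyd is the source of truth for), overriding whatever the snapshot said about that host.

    // Sketch of "restore and reseed": bulk-load a snapshot, then let each
    // reachable host's own report win for its rows. Made-up types.
    package main

    import "fmt"

    type Row struct {
        Host  string
        Key   string
        Value string
    }

    func reseed(snapshot []Row, freshByHost map[string][]Row) map[string]Row {
        table := map[string]Row{}
        // Keep snapshot rows only for hosts we couldn't get a fresh report from.
        for _, r := range snapshot {
            if _, ok := freshByHost[r.Host]; !ok {
                table[r.Key] = r
            }
        }
        // Each reachable host is the source of truth for its own Machines.
        for _, rows := range freshByHost {
            for _, r := range rows {
                table[r.Key] = r
            }
        }
        return table
    }

    func main() {
        snap := []Row{{Host: "host-1", Key: "machine:a", Value: "started"}}
        fresh := map[string][]Row{
            "host-1": {{Host: "host-1", Key: "machine:a", Value: "stopped"}},
        }
        fmt.Println(reseed(snap, fresh)["machine:a"].Value) // stopped: the fresh report wins
    }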

At this point, orchestration is almost fully functional. We have one remaining problem: deploys for apps that involve creating new volumes (which includes most new apps) fail, because our GraphQL API server needs to talk to consul to complete those deploys (and only those), and it’s disconnected due to the rekeying.

Now that corrosion is stabilized, we’re able to safely redeploy the API server. The deployment hits a snag, which results in HTTP 500 errors from the API for about 20 minutes, at which point we’ve successfully redeployed, restoring the API.

Minutes later, with no known disruption or instability in the platform, the incident is put into “Monitoring” mode.

Incident Timeline

  • 2024-10-22 18:00 UTC: The CA certificate for our consul cluster expires, breaking much of our customer-facing API surface as well as our internal deployment system. The status page is updated to report widespread API failures.
  • 2024-10-22 18:04 UTC: The infra team has root-caused the outage and made the decision to re-key the consul cluster.
  • 2024-10-22 18:10 UTC: Our support and infra teams observe that existing applications are running, but that applications with auto-stop/start Fly Machines will be impacted.
  • 2024-10-22 18:20 UTC: The re-keying operation has been implemented, automated, and tested against a single server.
  • 2024-10-22 18:35 UTC: The consul server cluster is fully re-keyed, and our deployment server is re-keyed, so we can begin to restore internal deployment capability and re-key the entire worker fleet.
  • 2024-10-22 18:45 UTC: An upstream notifies us that we’re generating so much traffic it’s impacting top-of-rack switches; work begins on resolving the corrosion issue (it will continue for several hours). Meanwhile: internal deployment is restored.
  • 2024-10-22 19:15 UTC: corrosion has been stopped and restarted, but the volume of updates hasn’t been mitigated. The infra and platform teams diagnose the problem: the consul outage has left flyd with a giant backlog of retried state updates, most of them spurious.
  • 2024-10-22 19:20 UTC: The whole fleet has been re-keyed for consul. Fly Machine start/stops are now functioning, though state updates are delayed, at this point by as much as 30 minutes. The acute phase of the outage is over; the subacute phase has begun. The status page is updated to report the partial fix, and the state update delays.
  • 2024-10-22 19:30 UTC: vault alarms are going off; the servers, which link to consul, have lost connectivity due to the re-key. Fly Machine operations that require vault fail; for most Fly Machines, this information is cached, but starts of long-quiescent Fly Machines will fail if they have volumes attached.
  • 2024-10-22 19:35 UTC: We quickly reconfigure vault with new consul certificates, but vault restarts into a nonfunctional state. Work diagnosing vault begins.
  • 2024-10-22 20:35 UTC: The infra team has discovered the root cause of the failure: a vault-specific certificate has expired, which went undetected owing to the extremely high uptime of the service. The prospect of re-keying the whole vault cluster is discussed.
  • 2024-10-22 20:45 UTC: The infra team rebuilds a valid certificate around the existing key, restoring the vault cluster.
  • 2024-10-22 20:55 UTC: vault is functioning fleetwide again.
  • 2024-10-22 21:10 UTC: Our GraphQL API server is generating alerts from an excessively high number of errors. The problem is immediately diagnosed: we’ve re-keyed the physical fleet, but the API server also relies on consul.
  • 2024-10-22 21:30 UTC: corrosion may or may not be working its way through the queue of spurious updates in a tractable amount of time. There are three intervals to consider: (a) the amount of time it will take for corrosion to organically resolve this problem, (b) the amount of lag in state updates while corrosion remains in this state, and (c) the amount of time it will take to restore and reseed the cluster. The platform team begins considering restoring it from a backup and re-seeding it; the “organic resolution” time estimate (a) is stretching into multiple hours, but the “lag time” estimate (b) is dropping rapidly.
  • 2024-10-22 21:50 UTC: Support informs the incident team that customer perception is that the outage has largely resolved, complicating the decision to restore the corrosion cluster; state update “lag times” might be tolerable.
  • 2024-10-22 22:30 UTC: We’re seeing P50 corrosion lag times in the single-digit seconds but P99 lag times of around 3 minutes.
  • 2024-10-22 23:00 UTC: The decision is made to restore corrosion and reseed it.
  • 2024-10-22 23:30 UTC: The corrosion cluster is restored and re-seeded. P99 lag times are now under 2000ms.
  • 2024-10-23 00:30 UTC: Now that corrosion has been restored, our GraphQL API server can be re-deployed. A branch has been prepped and tested during the outage; it deploys now.
  • 2024-10-23 00:31 UTC: API requests are generating HTTP 500 errors.
  • 2024-10-23 00:40 UTC: Though the API server has been deployed with the new consul and vault keys, a corner-case issue is preventing them from being used.
  • 2024-10-23 00:40 UTC: The GraphQL server is re-deployed again; the API is restored. The subacute phase of the outage has ended. The status page entry for this incident is set to “Monitoring”.

Forward-Looking Statements

The simplest thing to observe here is that it shouldn’t have been possible for us to approach the expiration time of a load-bearing internal certificate without warnings and escalations. Ordinarily we think about situations like this in terms of proximal causes (“we were missing a critical piece of alerting”) and some root cause; here, it’s more useful to look at two distinct root causes.

The first cause of this incident was that our infra team was overscheduled on project work. For the past year, we’ve pursued a blended ops/development strategy, with increasing responsibility inside the infra team for platform component development. If you follow the infra log, you’ve seen a lot of that work, especially the Fly Machine volume migration, which was largely completed by infra team members. We have developed a tendency to think about reliability work in terms of big projects with reliability payoffs. That makes sense, but needs to be balanced with nuts-and-bolts ops work. The “fix” for this problem will be denominated in big-ticket infrastructure projects we defer into next year to make room for old-fashioned systems and network ops work.
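
To make “nuts-and-bolts ops work” concrete: the alerting we were missing is not sophisticated. Something on the order of the sketch below (the certificate list and 30-day threshold are placeholders), run on a schedule against every internal certificate we depend on, pages weeks before anything expires.

    // Sketch of a certificate-expiry check. The certificate list and the
    // warning threshold are placeholders; the point is that the check is cheap.
    package main

    import (
        "crypto/x509"
        "encoding/pem"
        "fmt"
        "log"
        "os"
        "time"
    )

    const warnWindow = 30 * 24 * time.Hour

    func main() {
        for _, path := range []string{"consul-ca.pem", "vault-leaf.pem"} {
            raw, err := os.ReadFile(path)
            if err != nil {
                log.Fatal(err)
            }
            block, _ := pem.Decode(raw)
            if block == nil {
                log.Fatalf("%s: no PEM data", path)
            }
            cert, err := x509.ParseCertificate(block.Bytes)
            if err != nil {
                log.Fatal(err)
            }
            if remaining := time.Until(cert.NotAfter); remaining < warnWindow {
                fmt.Printf("ALERT: %s expires in %s (NotAfter %s)\n",
                    path, remaining.Round(time.Hour), cert.NotAfter.Format(time.RFC3339))
            }
        }
    }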

The second cause of this incident is a system architecture decision.

We shipped the first iteration of the Fly.io platform on a single global Consul cluster. That made sense at the time. It took years for us to scale to a point where Consul became problematic. When we approached that point, we had a decision to make:

  • we could do the engineering work to break our single Consul cluster into multiple regional clusters, and then replicate state between them, retaining Consul’s role in our architecture but allowing it to scale, or
  • for a similar amount of engineering effort, we could replace Consul with an in-house state-sharing system that was designed for our workloads.

The latter decision was sensible: we could make Consul scale, but making it fast enough for real-time routing to Fly Machines that start in under 200ms was challenging; a new, gossip-based system, taking advantage of architectural features that eliminated the need for distributed consensus, would make it much easier to address that challenge.

Unfortunately, we chose a half-measure. We replaced Consul with corrosion, but we retained Consul dependencies inside of our orchestration system, using Consul as a kind of backstop for corrosion, and keeping data in Consul relatively fresh so that old components could continue using it. Consul inevitably became a dusty old corner in our architecture, and so nobody was up nights worrying about managing it. Thus, our longest-ever outage.

The moral of the story is, no more half-measures. Work has already begun to completely sever Consul from our orchestration (we’ll still be using it for what it’s good at, which is managing configuration information across our fleet; Consul is great, it’s just not meant to be run, in its default configuration, for a half-million applications globally).

Finally, you may notice from the timeline that it took us an oddly long time to pull the trigger on restoring and reseeding the corrosion cluster, especially since, once we did, the process completed in just 30 minutes. Restoring corrosion was straightforward because we have a tested runbook for doing so. But that runbook doesn’t have higher-level process information about when to restore corrosion, and what service impact to expect when doing so. If we’d had that information ready, we could have decided to perform the restore much earlier, shaving potentially 4 hours off the disruption.
