A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and is a superset both of our status page events and of customer-impacting events on the platform. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
December 5: Vault Server Outage (16:45EST): An infrastructure deploy run was kicked off to add a new server to our HashiCorp Consul cluster; because of a stale script in our infrastructure tooling, that run had the effect of rolling back a certificate update we’d done a month earlier, which broke our Vault cluster (Vault has a cross-dependency on Consul, an artifact of our old Nomad+Consul+Vault platform). Fortunately, several months back we deployed Pet Sematary, our in-house replacement for Vault, in a configuration that has PetSem kick in when Vault is unavailable (and vice versa), so this likely had no customer impact over the 40 minutes it took to investigate and resolve (those 40 minutes also being an artifact of the reduced staffing the PetSem mitigation allows).
December 6: Fly Machines API Outage (08:50EST): An engineer deployed a change to our machines API code that had the effect of breaking the Fly Machines create API once deployed in production. The error was discovered immediately through alerts, and rolled back within 10 minutes (it should have been 5, but a CI issue disrupted the first attempted deploy).
Update: Nov 30, 2024
Trailing 2 Weeks Incidents
The infra-log took last week off for Thanksgiving; the trailing two weeks in this update are two “fresh” weeks worth of incidents, though basically nothing happened in the first of those weeks.
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
November 21: Log Query Outage (14:00EST): Customers experienced slow queries on our log retention service, which is operated by an upstream partner and backed by object storage. We created this incident (and public status updates) to track their work resolving those slow queries. Our upstream log partner scaled up resources, which apparently resolved the issue.
November 24: Volume Issue On Seattle Host (04:00EST): Internal alerts fired about the integrity of volume groups on a particular Seattle worker physical. This turned out to be an LVM2 reporting anomaly, cleared up in under a minute with a rescan, which we should just automate for next time this pops up. Zero customer impact.
November 25: All Hell Breaks Loose (15:00EST): A combination of distinct incidents occurred simultaneously, resulting in both deployment and API outages, with two acute periods of near-total severity. A detailed postmortem follows.
November 26: API Outage (15:15EST): A recurrence of the back half of the previous day’s outage, with the same cause, resolved in about 5 minutes; during those 5 minutes, deployments that required the use of our GraphQL API (mostly: for new apps) would have failed.
November 25 Outage Postmortem
Narrative
At approximately 15:00EST on November 25, we experienced a fleetwide severe orchestration outage. What this means is that for the duration of the incident, both deployments of new applications and changes to existing applications were disrupted; during the acute phase of the outage Fly Machines could not be updated; for the back half of the outage, our API was unavailable. Service was restored completely at 02:30EST.
This was a compound outage with two distinct causes and impacts. The first approximately mirrored the October 22nd orchestration outage and involved a distributed systems failure in Corrosion, our state sharing system. The second was an API limit problem, which combined with an error in a customer app had the effect of denying service to our API. The two outages overlapped chronologically but were resolved serially, extending the duration of the incident.
We’re going to explain the outage, provide a timeline of what happened, and then get into some of what we’re doing to keep anything like it from happening again.
Orchestration is the process by which software from our customers gets translated to virtual machines running on Fly.io’s hardware. When you deploy an app on Fly.io, or when your Github Actions CI kicks off an update after you merge to your main branch, you communicate with our orchestration APIs to package your code as a container, ship it to our container registry, and arrange to have that container unpacked as a virtual machine on one or more of our worker servers.
At the heart of our orchestration scheme is a state-sharing system called Corrosion. Corrosion is a globally-synchronized SQLite database that records the state of every Fly Machine on our platform. Corrosion uses CRDT semantics (via the cr-sqlite crate) to handle SWIM-gossipped updates from worker servers around the world; a reasonable first approximation holds that every edge and worker server in our fleet runs a copy of Corrosion and, through gossip updates, synchronizes its own copy of the (rather large) global state for the Fly.io platform.
The proximate cause of this Corrosion incident is straightforward. About 5 minutes before the incident began, a developer deployed a schema change to Corrosion, fleet-wide.
The change added a nullable (and usually-null) column to the table in Corrosion that tracks all configured services on all Fly Machines (that is to say: if you edit your fly.toml to light up 8443/tcp on your app, this table records that configuration on every Fly Machine started under that app). Surprisingly to the developer, the CRDT semantics on the impacted table meant that Corrosion backfilled every row in the table with the default null value. The table involved is the largest tracked by Corrosion, and this generated an explosion of updates.
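To make the mechanics concrete, here’s a toy model of the backfill. This is an illustration, not cr-sqlite’s actual code, but it captures why a “cheap” nullable column turned into a flood:

```python
# Toy model (not cr-sqlite's real implementation) of why the schema
# change fanned out into an update storm: in a CRDT-tracked table, each
# (row, column) value carries its own version clock, so backfilling a
# newly added column touches every existing row, and every touch is an
# update that has to be gossipped fleet-wide.

def backfill_updates(row_ids, new_column, default=None):
    """Return the per-row updates a naive CRDT backfill would gossip."""
    return [(row_id, new_column, default) for row_id in row_ids]

# The services table has roughly one row per (Fly Machine, service)
# pair, so even a nullable, usually-null column means a huge fan-out:
updates = backfill_updates(range(1000), "new_flag")
assert len(updates) == 1000  # one gossipped update per existing row
```

The point of the sketch: the cost of the change is proportional to the number of existing rows, not the number of rows that will ever hold a non-null value.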
As with the previous Corrosion outage, because this is a large-scale distributed system, Corrosion quickly drove tens of gigabytes of traffic, saturating switch links at our upstream.
The outage is prolonged by an early belief that the root cause is an inconsistent set of schemas on different instances of Corrosion.
The incident begins (and is alarmed and declared and status-paged) promptly after the schema change is deployed. The deployment is immediately halted, and investigation begins. Corrosion is driving enough traffic in some regions to impact networking, and cr-sqlite‘s CRDT code is consuming enough CPU and memory on many hosts to throw Corrosion into a restart loop. The deployment is now allowed to complete, to rule out schema inconsistency as a driver of the update storm. Completing the deployment doesn’t worsen the outage, but does take time, and doesn’t improve the situation.
As with the October 22nd outage, the Corrosion problem is resolved when the decision is made to re-seed the database from an external source of truth. This time, the schema change complicates the process: a backup snapshot of the Corrosion database from prior to the schema change is needed, and downloading and uncompressing it adds time to the resolution.
As with the previous outage, once the snapshot is in place, re-seeding Corrosion takes approximately 20 minutes and resolves the Corrosion half of the outage.
At the same time this is happening, a corner-case interaction between a malfunctioning customer app and our API is choking out our API server.
The customer’s app runs untrusted code on behalf of users (this is a core feature of our platform). It does so by creating a new Fly Machine for each run, loading code onto it, running it to completion, and then destroying the Fly Machine. This works, but is not how the platform is meant to be used; rather, our orchestrator assumes users will create pools of Fly Machines ahead of time (dynamically resizing them as needed), starting and stopping them to handle incoming workloads; stopping an existing Fly Machine resets it to its original state. Start and stop are much, much faster than create and destroy.
The customer’s app is suddenly popular, and begins creating dozens of Fly Machines every second, at a rate steadily increasing throughout the outage. This exercises a code path not expected to be run in a tight loop and missing a rate limit. In our central Rails API server, which is implicated in create requests (but not starts and stops), this has the effect of jamming the process up with expensive SQL queries.
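The intended pattern looks roughly like this sketch; the `client` object and its method names here are hypothetical stand-ins for the real Machines HTTP API:

```python
# Sketch of the pool pattern our orchestrator expects: pay the expensive
# create cost once up front, then start/stop machines per job. The
# client object and its method names are hypothetical stand-ins for the
# real Fly Machines HTTP API.

class MachinePool:
    def __init__(self, client, size):
        self.client = client
        # Expensive create happens once per pool slot, not once per job.
        self.idle = [client.create_machine() for _ in range(size)]

    def run(self, job):
        machine_id = self.idle.pop()       # machine already exists
        self.client.start(machine_id)      # fast path
        try:
            return self.client.exec(machine_id, job)
        finally:
            # stop resets the machine to its original state for reuse
            self.client.stop(machine_id)
            self.idle.append(machine_id)
```

A pool like this keeps the per-job work entirely on the fast start/stop path, and never touches the create path (or our central Rails API) in a tight loop.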
A different team is investigating and working on resolving this incident alongside the previous one. The team attempts to scale up to accommodate the load, first at the database layer, and then with larger Rails app servers; dysfunction in the Rails API makes the latter difficult and time-consuming, and ultimately neither scale-up resolves the problem: paradoxically, as we create additional capacity for create requests, the lack of backpressure amplifies the number of incoming create requests we receive.
30 minutes before the end of the outage, we reach the customer, who disables their scheduling application. The API outage promptly resolves.
Incident Timeline
This timeline makes reference to the Corrosion outage as “Incident 1”, and the API flood as “Incident 2”.
2024-11-25 14:43 EST: (Incident 1) A fleetwide deployment of a change to our Corrosion state-sync system begins; it contains a schema update, adding a new column to a large CRDT-governed table, resulting in an explosion of backfill changes on the hosts the deployment hits.
2024-11-25 14:58 EST: (Incident 1) High-urgency alerts begin; the infra-ops team is paged.
2024-11-25 15:00 EST: (Incident 2) A demanding customer app is generating 30 Fly Machine creations per second.
2024-11-25 15:02 EST: (Incident 1) Our upstream alerts us to link saturation in a few regions.
2024-11-25 15:05 EST: (Incident 1) A formal incident is declared.
2024-11-25 15:05 EST: (Incident 2) The API team notes higher-than-usual RDS usage for our API’s backing store.
2024-11-25 15:10 EST: Our status page is updated, reflecting degraded API performance.
2024-11-25 15:16 EST: (Incident 1) The fleetwide deployment from 14:43 is halted.
2024-11-25 15:26 EST: (Incident 1) Corrosion traffic is continuing to increase; time-to-recover and propagation metrics are worsening. Concerned about the impact of inconsistent schemas, the team resumes the deployment.
2024-11-25 15:37 EST: (Incident 1) Corrosion instances on many machines are beginning to restart; at this point, Corrosion is no longer effectively updating state across our fleet.
2024-11-25 16:49 EST: (Incident 1) An edge physical in Boston and a worker physical in London are offline, due to a combination of CPU and network load.
2024-11-25 17:15 EST: (Incident 1) The team begins deployment of configuration changes to slow the rate of Corrosion updates; the status page is updated to reflect our diagnosis of the issue. Corrosion is partially functioning (deployments during this phase of the outage are hit-or-miss, especially for new Fly Machines).
2024-11-25 17:36 EST: (Incident 1) Attempts to restart Corrosion on smaller (edge and gateway) hosts are now timing out, as Corrosion on those hosts struggles to keep up with the rate of updates.
2024-11-25 17:45 EST: (Incident 1) Working with our upstream, we throttle WireGuard traffic between our physicals.
2024-11-25 18:30 EST: (Incident 2) The customer app generating the flood of creates has reached 52 requests/sec.
2024-11-25 18:58 EST: (Incident 1) The Corrosion configuration change completes, but some physicals across our fleet are still distressed. The status page is updated. The Machines API is slow, but functional, at this point.
2024-11-25 19:00 EST: (Incident 1) The decision is made to “re-seed” Corrosion, creating a new baseline database from external sources of truth, rather than wait out slow recovery on several physicals across our fleet. We begin the process of loading a (large) snapshot of the database on all our servers.
2024-11-25 20:07 EST: (Incident 1) The process of loading the snapshot across the fleet completes.
2024-11-25 20:48 EST: (Incident 1) Corrosion is brought up to date with changes from our API and flyd servers occurring after the snapshot time.
2024-11-25 21:30 EST: (Incident 2) The customer app generating the flood of creates has reached 139 requests/sec.
2024-11-25 21:31 EST: (Incident 1) The status page is updated to reflect nominal performance of Corrosion and state synchronization.
2024-11-25 22:27 EST: (Incident 2) Alarms fire about the availability of our web dashboard. Concerns remain about aftereffects from the Corrosion outage, but the problem will turn out to be unrelated.
2024-11-25 22:55 EST: (Incident 2) Telemetry reveals slow SQL queries jamming up our API servers.
2024-11-25 23:12 EST: (Incident 2) We begin scaling up our RDS instances, from 8xl to 16xl, and update the status page.
2024-11-26 00:10 EST: (Incident 2) Scale-up completes, but performance still lags. We begin scaling up our Rails API servers as well; this is complicated by the current load on that API server, which is in the critical path for the planned scale-up deployment. At this point, the team is still attempting to scale out of this problem. The team will continue to attempt permutations of API server scaling, including reversions of recent PRs, for the next 45 minutes.
2024-11-26 01:00 EST: (Incident 2) The customer app generating the flood of creates has reached 151 requests/sec.
2024-11-26 01:33 EST: (Incident 2) The customer app generating the flood of creates is stopped.
2024-11-26 01:34 EST: (Incident 2) The incident resolves.
2024-11-26 02:15 EST: (Incident 2) The status page is updated to reflect incident resolution.
Forward-Looking Statements
A significant fraction of this outage rhymes with our previous orchestration outage, and much of what we’re working on in response to that outage applies here as well.
The most significant thing we’re doing to minimize the impact of outages like these in the future is to reduce global state. Currently, every physical in our fleet has a high-fidelity record of every individual Fly Machine running on the platform. This is a consequence of the original architecture of Fly.io, and it’s a simplifying assumption (“anywhere we need it, we can get any data we want”) that we’ve taken advantage of over the years.
Because of the increased scale we’re working at, we’ve reached a fork in the road. We can continue running into corner-cases and bottlenecks as we scale and manage high-fidelity global state, and develop the muscles to handle those, or we can break the simplifying assumption and do the engineering required to retrofit “regionality” into our state. As was the case late this summer with fly-proxy, we’re choosing the latter: running multiple regional high-fidelity Corrosion clusters, with Fly-Machine-by-Fly-Machine detail about what’s running where, and a single global low-fidelity cluster with pointers down to the regional clusters.
The payoff for regionalized state is a reduced blast radius for distributed systems failures. There’s effectively nothing that can happen with Corrosion today that doesn’t echo across our whole fleet. In a regionalized configuration, most problems can be contained to a single region, where they’ll be more straightforward to resolve and have drastically less impact on the platform. Corrosion, an open-source project we maintain, is already capable of running this way; the work is in how it’s integrated, particularly at our routing layer.
This work has been ongoing for over a month, but it’s a big lift and we can’t rush it. So: we’re cringing at the overlap this outage has with our last one, but it’s not for lack of staffing and effort on the long-term fix.
Two immediately evident pieces of low-hanging fruit that we have already picked in the last week:
First, as we said last time, responding to Corrosion issues by efficiently re-seeding state continues to be an effective and relatively fast fix to these issues. Re-seeding was complicated this time by the schema change that precipitated the event. We’ve begun creating processes to simplify and speed up re-seeding under worst-case circumstances. Additionally, some of the delay in kicking off the re-seeding, as with last time, resulted from a cost/benefit calculation: re-seeding requires resynchronizing our API and flyd servers across our fleet with Corrosion, which isn’t automated; the hope was that Corrosion would converge and reach acceptable P95 performance levels soon enough not to need to do that work. We’re building tooling to minimize that work in the future, so that doesn’t need to be part of the calculation.
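Mechanically, a re-seed boils down to “restore a snapshot, then replay everything newer than the snapshot from the external sources of truth”. A toy version, with all data shapes hypothetical:

```python
# Toy model of a Corrosion re-seed: restore a pre-incident snapshot,
# then replay only the events newer than the snapshot from the external
# sources of truth (flyd and our API). All data shapes here are
# hypothetical; the real system syncs a SQLite database, not a dict.

def reseed(snapshot_rows, snapshot_time, event_log):
    state = {row["machine_id"]: row for row in snapshot_rows}
    for event in sorted(event_log, key=lambda e: e["time"]):
        if event["time"] > snapshot_time:  # pre-snapshot changes are already in
            state[event["row"]["machine_id"]] = event["row"]
    return state
```

The slow parts in practice are exactly the ones outside this sketch: finding and shipping a suitable snapshot to every server, and resynchronizing the API and flyd event sources, which is what the new tooling targets.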
We’ve also added comprehensive “circuit-breaker” limits to Corrosion. For already-deployed apps, even a total Corrosion outage shouldn’t break routing; Corrosion synchronizes a SQLite database, and our routing layer can simply read that database, whether or not Corrosion is running. But during the acute phase of this outage, Corrosion wasn’t just not running effectively; it was also consuming host and (especially) network resources. Corrosion now has internal rate limits on gossip traffic, and our hosts have external limits, in the network stack and OS scheduler, to stop runaway processes; this is being rolled out now.
Second, the back half of this outage was due to a pathological condition we hit because we lacked a rate limit in an expensive API operation we didn’t expect users to drive in a tight loop. Obviously, that’s a bug, one we’ve fixed. But there was a process issue here as well: we identified the “pathological” app (it wasn’t doing anything malicious, and in fact was trying to do something we built the platform to do! it was just using the wrong API call to do it), but then engaged in heroics to try to scale up to meet the demand. Without backpressure, this doesn’t do anything.
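The fix is the boring one: a rate limit with real backpressure on the create path. In miniature, something like this token bucket (a sketch, not our actual API middleware):

```python
import time

# Minimal token-bucket rate limit of the sort the create path was
# missing -- a sketch, not our actual API code. Callers that exceed the
# budget get an immediate 429-style rejection, which provides the
# backpressure that scaling up alone cannot.

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate        # tokens replenished per second
        self.capacity = burst   # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should return HTTP 429, not queue the work

bucket = TokenBucket(rate=5, burst=10)  # e.g. 5 creates/sec per org
```

Note the `return False` branch: the request is rejected immediately rather than queued, so a misbehaving caller sees errors right away instead of silently filling the system with expensive work.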
So we’re also building out a process runbook for handling/quarantining applications that have spun out, one that incident response teams don’t have to think hard about when in the middle of high-priority incidents.
Update: Nov 16, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
November 12: Network Disruption in Querétaro (14:00EST): We experienced a complete network outage (couldn’t reach anything in our DC), which cleared up inside of 3 minutes, followed by almost 4 hours of sporadic severe routing disruption (latency, packet drops, occasional loss of connectivity), sufficient to disrupt orchestration in this region. The problem was traced to an upstream of our upstream in Mexico.
This Week In Engineering
Despite a quiet week incident-wise, the whole team was unusually interrupt-driven this week; a consequence of catching a bunch of stuff before it could actually become an incident.
Update: Nov 9, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
November 5: Customer Metrics Ingestion Disrupted (13:30EST): We operate a large VictoriaMetrics cluster that collects Prometheus-style metrics from Fly Machines (this is a built-in feature for all apps running on Fly.io; we generate metrics even if you don’t define any, but you can add a [metrics] configuration to your app to add additional ones). We attempted an upgrade of our cluster, which only partially succeeded, leaving the cluster in a state where queries and raw storage were functioning, but ingestion of metrics (vminsert) was compromised, generating a large and growing backlog of new entries. Several hours later, we determined that the ingestion app instances had an aggressive GOGC setting configured; restoring it to the default got ingestion moving again, and upgrading the ingestion service machine types and redeploying them cleared the backlog. Ingestion began recovering at 16:20EST and was fully restored around 17:00EST.
This Week In Engineering
Somtochi is working on bringing up regional Corrosion clusters. Recall that Corrosion is our state-sync service; think of it as a replacement for Consul, driven by SWIM gossip rather than Raft consensus, and with a SQLite interface (it essentially gossip-syncs a big SQLite database of everything running). In the wake of the Anycast outage from a few months back, we’ve been working on splitting Corrosion into a much smaller global cluster than we currently run (that is: gossiping less state into the global cluster) and then supporting it with regional clusters. A good first approximation of what we’re talking about: the global cluster knows every Fly App running in every region on the fleet, but the regional clusters know the specific Fly Machines for those Fly Apps running on the worker physicals in their region. Anyways, that’s what Somtochi is working on; this week, that mostly involved teaching our Corrosion-backed internal DNS service how to fetch information for machines in another regional cluster.
JP and Jerome diagnosed and fixed a gnarly volume migration bug that temporarily broke Jerome’s Fediverse server. If a Fly Volume is extended while that volume is in the process of being migrated (meaning that behind the scenes, dm-clone is still “hydrating” the volume over a temporary iSCSI connection from the origin worker physical), the underlying volume operation could apply to the wrong block device (the temporary clone device, not the final device). This was a missing step in the flyd FSM for restarting Fly Machines, now fixed.
Will upgraded VictoriaMetrics. The one incident we had last week was from an aborted partial attempt to upgrade Vicki. Well, we succeeded this week. In the process of investigating that outage and completing the upgrade this week, Will spotted a perf issue in upstream Vicki that degraded cache performance in Vicki clusters with large numbers of tenants (like we operate), and wrote an upstream PR for it.
Steve was on support rotation. Engineers across the team all do time, a couple days at a time, as technical escalation for our support team. Our support team is great, but being directly exposed to customers is as helpful for product engineering as it is for the support team. Steve and Peter are also hip-deep in working out plans to reboot large numbers of worker physicals, which is a fun problem we’ll be writing about in the weeks to come. Nothing dramatic is going on, we just need a process to reliably schedule reboots and maintenance windows.
Peter spent the week rolling out lazy-loading Corrosion state in fly-proxy. Currently, all the state we hold about every app running on the fleet is kept in-memory in fly-proxy (the component that picks up your HTTP requests from the Internet and relays them to your Fly Machines). As part of the work we’re doing to make Corrosion more resilient (along with regional clusters), we’re changing this, so that we load state for apps only when they’re actually requested. By way of example: the author of this log had one bourbon too many back in 2022 and booted up “paulgra-ham.com” on Fly.io, which is an app that has never once been requested since. Ever since that moment, fly-proxy has assiduously kept abreast of the current state of “paulgra-ham.com”, every minute of every day on every edge and worker in the fleet. This is dumb, and makes fly-proxy brittle. So we’re not doing it anymore.
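In miniature, the change is from “eagerly sync everything” to “fetch on first request, then cache”; all names in this sketch are hypothetical:

```python
import functools

# Miniature version of lazy state loading (all names hypothetical):
# instead of eagerly mirroring every app's state on every proxy, fetch
# an app's state the first time it's actually requested, then serve
# from cache. Apps that are never requested are never loaded at all.

FETCHES = {"count": 0}

def fetch_from_corrosion(app_name):
    # Stand-in for a query against the local Corrosion-synced SQLite DB.
    FETCHES["count"] += 1
    return {"app": app_name, "machines": []}

@functools.lru_cache(maxsize=65536)
def app_state(app_name):
    # Only the first request for a given app pays the lookup cost.
    return fetch_from_corrosion(app_name)

app_state("example-app")
app_state("example-app")
assert FETCHES["count"] == 1  # second request hit the cache
```

The real version also has to invalidate cached state when the app changes, which is where streaming update notifications from Corrosion come in; the sketch skips that entirely.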
Update: Nov 2, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
October 31: Degraded Time-To-Recover In Corrosion (13:00EST): Corrosion is our state propagation system; it’s a SQLite database synchronized across our fleet by SWIM gossip messages. When a Fly Machine starts or stops, its status is delivered via Corrosion to our Anycast network so that our proxies can route to it. For about an hour, our telemetry detected that an IAD worker in our fleet was lagging; further investigation showed that it was jammed up in SQLite, and that CPU utilization on the host itself was high. It turned out this host was being used as the primary “bridge” for Corrosion clients (many of our fleet components use Corrosion “directly” by querying the underlying SQLite database, but some use remote clients to avoid synchronizing their own database), including our internal administration application. Two problems ensued: first, we weren’t effectively spreading the load of Corrosion bridge clients, and second, the admin app had some very dumb SQL queries. During the incident, a very small number of applications (we’re already talking to you if you were impacted) running on a single physical in our fleet would have seen some lag between Fly Machine start/stop and request routing.
This Week In Engineering
We apologize for the delay this week. We’re a US company and the US was eventful! Also, there wasn’t much incident stuff to write about. We pledge to be more timely in the weeks to come.
Somtochi is back to doing surgery on Corrosion. It now exposes a lighter-weight update interface that streams the primary keys of updated nodes over an HTTP connection, rather than repeatedly applying queries. Corrosion also favors newer updates over older ones during sync, which speeds time-to-recovery when bringing nodes online and dealing with large volumes of updates.
Akshit has been working on static egress IPs. Some of our customers run Fly Machines that interface with remote APIs, and some of those APIs have IP filters. Normally, Fly Machines aren’t promised any particular egress IP address for outgoing connections, but we’re rolling out a feature that assigns a routable static egress IP. Akshit also wrote a runbook for diagnosing issues with egress IPs and their integration with nftables and our internal routing system.
Dusty built a custom iPXE installation process for bringing up new hardware on the fleet. Our hardware providers rack and plug in our servers, and PXE pulls a custom initrd and kernel down from our own infrastructure. This eliminates an old process where we effectively had to uninstall the operating system configuration our hardware shipped with, makes it faster to roll out new hardware, and hardens our installation process.
In response to capacity issues in some regions (particularly in Europe), Kaz rolled out default per-organization capacity limits. These kinds of circuit-breaker limits are par for the course in public clouds, but we’re relatively new and had been getting away with not having them. We’re happy to let you scale up pretty much arbitrarily! But it’s to everyone’s benefit if we default to some kind of cap, because our DX makes it really easy to scale to lots of Fly Machines without thinking about it. Most capacity issues we’ve had over the last year have taken the form of “someone decided to spontaneously create 10,000 Fly Machines in one very specific region”, so this should be an impactful change.
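The shape of the limit is deliberately simple. A sketch, with made-up numbers rather than our real defaults:

```python
# Sketch of a default per-organization capacity cap; the numbers here
# are made up, not our real defaults. A hard ceiling turns
# "spontaneously create 10,000 Fly Machines in one region" into a fast,
# clear error instead of a regional capacity incident.

DEFAULT_MACHINE_CAP = 500  # per organization, per region (hypothetical)

class CapacityError(Exception):
    pass

def check_capacity(current_count, requested, cap=DEFAULT_MACHINE_CAP):
    if current_count + requested > cap:
        raise CapacityError(
            f"machine cap of {cap} reached; contact support to raise it")
    return True
```

The error path matters as much as the number: the caller finds out immediately, with a message pointing at how to get the cap raised, rather than discovering the limit as a mysterious scheduling failure.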
We run a relatively large (for our age) hardware fleet, and we generate a lot of logs. We have a relatively large (for our age) logging cluster that absorbs all those logs. Well, now we absorb 80% less log traffic, because Tom spent a week using Vector (or, rather, holding it better) to parse and drop dumb and duplicative stuff.
Update: Oct 26, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
October 22: Depot Builders Disrupted (10:30EST): About a month ago, we began defaulting to Docker build servers running on our infrastructure but managed by Depot; running good, efficient Docker container builders is Depot’s whole thing and we’re happy to have them do the lifting. Anyways, they deployed a code change that broke the way they handled auth tokens, and, in turn, our default builders, for about 5 minutes. We have fallbacks for Depot (we still have our own builder infrastructure), but this outage didn’t last long enough to warrant changes.
October 22: Sustained Orchestration Outage (14:00EST): A cascading failure beginning with a certificate expiration disrupted our orchestration system for over 6 hours, including a 1-hour acute period that broke new deploys of existing applications. A full postmortem follows this update.
October 24: Unusual Load In Five Regions (12:00EST): The phased rollout of our new shared CPU scheduling system hit a snag when system commands on 5 physical servers began taking multiple seconds to respond, lagging long enough to generate alerts. These alerts were internal, and we don’t believe they impacted customer workloads; the scheduling change was rolled back 30 minutes into the incident, which resolved it.
October 22 Orchestration Outage Postmortem
Narrative
At 14:00 EST on October 22, we experienced a fleetwide severe orchestration outage. What this means is that for the duration of the incident, both deployments of new applications and changes to existing applications were disrupted; during the most acute stage of the outage, lasting roughly an hour and 40 minutes, that disruption was almost total (Fly Machines could not be updated), and for roughly another 2 hours new application deployments did not function (but changes to existing applications did). Service was restored completely at 21:15 EST.
This outage proceeded through several phases. The earliest acute phase was the worst of it, and subsequent phases restored various functions of the platform, so that towards the end of the outage it was largely functional for most customers. At the same time, up until the end of the outage, Fly.io’s orchestration layer was significantly disrupted. That makes this the longest significant outage we’ve recorded, not just on this infra-log but in the history of the company.
We’re going to explain the outage, provide a timeline of what happened, and then get into some of what we’re doing to keep anything like it from happening again.
Orchestration is the process by which software from our customers gets translated to virtual machines running on Fly.io’s hardware. When you deploy an app on Fly.io, or when your Github Actions CI kicks off an update after you merge to your main branch, you communicate with our orchestration APIs to package your code as a container, ship it to our container registry, and arrange to have that container unpacked as a virtual machine on one or more of our worker servers.
You can broadly split our orchestration system into three pieces:
1. flyd, our distributed process supervisor; flyd understands how to download a container, transform it into a block device with a Linux filesystem, boot up a hypervisor on that block device, connect it to the network, and keep track of the state of that hypervisor,
2. the state sharing system; an individual flyd instance knows only about the Fly Machines running on its own host, by design, and it publishes events to a logically separate state sharing system so that other parts of our platform (most notably our Anycast routers and our API) know what’s running where, and
3. our APIs, which allow customers to create, start, and stop Fly Machines; our APIs consist of the Fly Machines API, which interacts directly with flyd to start and stop machines, and our GraphQL API, which is used to deploy new applications and manage existing applications.
The outage we experienced broke (2), our state sharing system, but had ripple effects that disrupted (1) and (3).
The outage was a cascading failure with a simple cause: a long-lived CA certificate for a deprecated state-sharing orchestration component expired. Our state sharing system is made up of two major parts:
consul, or “State-Sharing Classic”, manages the configuration of our server components, registers available services on Fly Apps, and manages health checks for individual Fly Machines. consul is a Raft cluster of database servers that take updates from “agent” processes running on all our physical servers. consul used to be the heart of all our state-sharing, but was superseded 18 months ago, by
corrosion, or “New State-Sharing”, tracks the state of every Fly Machine, and every available service, and service health. corrosion is a SWIM-gossip cluster that replicates a SQLite database across our fleet.
We began replacing consul with corrosion because of scaling issues as our fleet grew. It’s the nature of our service that every physical server needs, at least in theory, information about every app deployment, in order to route requests; this is what enables an edge in Sydney to handle requests for an app that’s only deployed in Frankfurt. consul can be deployed with regional Raft clusters, but not in a way that shares information automatically between those clusters. Since 2020, we’ve instead operated it in a single flat global cluster. Rather than do a lot of fussy in-house consul-specific engineering to make regional clusters work, we built our own state sharing system, wrapped around the dynamics of our orchestrator. This project is mostly complete.
What we have not yet completed is a full severance of consul from the flyd component of our orchestrator. flyd still updates consul when Fly Machine events (like a start, stop, or create) occur. Those consul updates are slow, because consul doesn’t want to scale the way we’re holding it. But that doesn’t normally matter, because our “live” state-sharing updates come from corrosion, which normally has p95 update times around 1000ms. Still, some of these consul operations do need to complete, especially for Fly Machine creates.
consul runs over mTLS secure connections; that means everything that talks to it needs a CA certificate (to validate the consul server certificate) and a client certificate (to prove that it’s authorized to talk to consul).
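As a sketch of the kind of monitoring that would have caught this: Python’s stdlib `ssl.cert_time_to_seconds` parses the `notAfter` field you get back from a peer certificate, which is enough to build an expiry alert. The function names and the 30-day threshold here are ours, purely illustrative, not anything from our actual alerting stack.

```python
import ssl
import time

# Hypothetical alert threshold; a real deployment would pick its own.
WARN_DAYS = 30

def days_until_expiry(not_after, now=None):
    """Days until a certificate expires, given its notAfter field
    (the format returned in ssl.SSLSocket.getpeercert(), e.g.
    "Oct 22 18:00:00 2024 GMT")."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400

def should_alert(not_after, now=None):
    """True if the certificate is inside the warning window."""
    return days_until_expiry(not_after, now) < WARN_DAYS
```

The point isn’t the code; it’s that every load-bearing certificate needs something like this watching it, escalating well before the deadline.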
At around 14:00EST on the day of the outage, consul’s CA certificate expired. Every component of our fleet which had any dependence on consul stopped working immediately. Fly Machine creates (but not starts and stops) depend on consul, as does some of our telemetry and internal fleet deployment capability. flyctl deploy stopped working.
To resolve this problem, we need to re-key the entire fleet: a new CA certificate, new server certificates, and new client certificates. Complicating matters: our internal deployment system (fcm/fsh) relies on consul to track available physical servers. All told, it takes us about 45 minutes to restore enough connectivity to resume internal deploys, and another 45 minutes to completely rekey the fleet.
At this point, basic Fly Machines API operations are completing. But there’s another problem: vault, our old secret storage system, is still used for managing disk encryption secrets, and by our API. The fleet rekeying has broken connectivity to vault. Complicating matters further, vault has extraordinarily high uptime, and so when its configuration is updated and the service is bounced, it doesn’t come back cleanly. For about 90 minutes, a team of infra engineers works to diagnose the problem, which turns out to be a different set of certificates that have expired; we’re able to perform X.509 surgery to restore them without rekeying another cluster.
The biggest problem in the outage (in terms of difficulty, if not raw impact) now emerges. During the window in which consul was completely offline, flyd has been queueing state updates and retrying them on an exponential backoff timer. These updates can’t complete until consul is back online, but all of them are events that corrosion consumes. They pile up, rapidly and dramatically.
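The retry schedule itself is the textbook pattern. A minimal sketch (the constants are illustrative, not flyd’s actual values): each failed update waits out a doubling, capped, jittered delay and tries again. Note that nothing in this pattern bounds how many updates can be waiting at once, which is exactly how the backlog grew.

```python
import random

def backoff_delays(base=1.0, cap=60.0, jitter=0.5):
    """Yield successive retry delays: exponential growth up to a cap,
    plus random jitter so retries from many hosts don't synchronize.
    Every failed event holds one of these generators and stays queued
    until its publish finally succeeds."""
    delay = base
    while True:
        yield delay + random.uniform(0, jitter * delay)
        delay = min(delay * 2, cap)
```

With the downstream system offline for 45+ minutes, every event on every host hits the cap and keeps retrying, and the queue only drains, all at once, when the downstream comes back.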
By the time consul is restored, corrosion is driving 150GB/s of traffic, saturating switch links with our upstream. The data it’s trying to ship is mostly worthless, but it doesn’t know that. It’s a distributed system based on gossip, so it’s not simple to filter out the garbage.
For 6.5 hours, through the acute phase (in which deploys of existing apps aren’t functioning) and subacute phase (in which deploys of new apps aren’t functioning reliably), this will be the major problem we contend with. We need corrosion in order to inform our Anycast proxies of which Fly Machines are available to route traffic to. During the subacute phase of the outage, routing to existing Fly Machines continues to function, but changes to Machines take forever to propagate: at the beginning of the subacute phase, as much as 30 minutes; by the end, P99 latencies of several minutes — still far too slow for real-time Anycast routing.
Ultimately, the decision is made to restore the corrosion cluster from a snapshot (we take daily snapshots), and fill in the gaps (“reseeding” the cluster) from source-of-truth data. This process begins at 18:00 EST and completes by 18:30 EST, at which time P99 latencies for corrosion are back under 2000ms.
At this point, orchestration is almost fully functional. We have one remaining problem: deploys for apps that involve creating new volumes (which includes most new apps) fail, because our GraphQL API server needs to talk to consul to complete them (and only them), and it’s disconnected due to the rekeying.
Now that corrosion is stabilized, we’re able to safely redeploy the API server. The deployment hits a snag, which results in HTTP 500 errors from the API for about 20 minutes, at which point we’ve successfully redeployed, restoring the API.
Minutes later, with no known disruption or instability in the platform, the incident is put into “Monitoring” mode.
Incident Timeline
2024-10-22 18:00 UTC: The CA certificate for our consul cluster expires, breaking much of our customer-facing API surface and also our internal deployment system. The status page is updated to report widespread API failures.
2024-10-22 18:04 UTC: The infra team has root-caused the outage and made the decision to re-key the consul cluster.
2024-10-22 18:10 UTC: Our support and infra teams observe that existing applications are running, but that applications with auto-stop/start Fly Machines will be impacted.
2024-10-22 18:20 UTC: The re-keying operation has been implemented, automated, and tested against a single server.
2024-10-22 18:35 UTC: The consul server cluster is fully re-keyed, and our deployment server is re-keyed, so we can begin to restore internal deployment capability and re-key the entire worker fleet.
2024-10-22 18:45 UTC: An upstream notifies us that we’re generating so much traffic it’s impacting top-of-rack switches; work begins on resolving the corrosion issue (it will continue for several hours). Meanwhile: internal deployment is restored.
2024-10-22 19:15 UTC: corrosion has been stopped and restarted, but the volume of updates hasn’t been mitigated. The infra and platform teams diagnose the problem: the consul outage has caused a giant backlog of spurious, retried state updates from flyd.
2024-10-22 19:20 UTC: The whole fleet has been re-keyed for consul. Fly Machine start/stops are now functioning, though state updates are delayed, at this point by as much as 30 minutes. The acute phase of the outage is over; the subacute phase has begun. The status page is updated to report the partial fix, and the state update delays.
2024-10-22 19:30 UTC: vault alarms are going off; the servers, which link to consul, have lost connectivity due to the re-key. Fly Machine operations that require vault fail; for most Fly Machines, this information is cached, but starts of long-quiescent Fly Machines will fail if they have volumes attached.
2024-10-22 19:35 UTC: We quickly reconfigure vault with new consul certificates, but vault restarts into a nonfunctional state. Work diagnosing vault begins.
2024-10-22 20:35 UTC: The infra team has discovered the root cause of the failure; a vault-specific certificate has expired, which went undetected owing to the extremely high uptime of the service. The prospect of re-keying the whole vault cluster is discussed.
2024-10-22 20:45 UTC: The infra team rebuilds a valid certificate around the existing key, restoring the vault cluster.
2024-10-22 20:55 UTC: vault is functioning fleetwide again.
2024-10-22 21:10 UTC: Our GraphQL API server is generating alerts from an excessively high number of errors. The problem is immediately diagnosed: we’ve re-keyed the physical fleet, but the API server also relies on consul.
2024-10-22 21:50 UTC: Support informs the incident team that customer perception is that the outage has largely resolved, complicating the decision to restore the corrosion cluster; state update “lag times” might be tolerable.
2024-10-22 22:30 UTC: We’re seeing P50 corrosion lag times in the single-digit seconds but P99 lag times of around 3 minutes.
2024-10-22 23:00 UTC: The decision is made to restore corrosion and reseed it.
2024-10-22 23:30 UTC: The corrosion cluster is restored and re-seeded. P99 lag times are now under 2000ms.
2024-10-23 00:30 UTC: Now that corrosion has been restored, our GraphQL API server can be re-deployed. A branch has been prepped and tested during the outage; it deploys now.
2024-10-23 00:31 UTC: API requests are generating HTTP 500 errors.
2024-10-23 00:40 UTC: Though the API server has been deployed with the new consul and vault keys, a corner-case issue is preventing them from being used.
2024-10-23 00:40 UTC: The GraphQL server is re-deployed again; the API is restored. The subacute phase of the outage has ended. The status board status for this incident is set to “Monitoring”.
Forward-Looking Statements
The simplest thing to observe here is that it shouldn’t have been possible for us to approach the expiration time of a load-bearing internal certificate without warnings and escalations. Ordinarily we think about situations like this in terms of proximal causes (“we were missing a critical piece of alerting”) and some root cause; here, it’s more useful to look at two distinct root causes.
The first cause of this incident was that our infra team was overscheduled on project work. For the past year, we’ve pursued a blended ops/development strategy, with increasing responsibility inside the infra team for platform component development. If you follow the infra log, you’ve seen a lot of that work, especially with the Fly Machine volume migration work, which was largely completed by infra team members. We have developed a tendency to think about reliability work in terms of big projects with reliability payoffs. That makes sense, but needs to be balanced with nuts-and-bolts ops work. The “fix” for this problem will be denominated in big-ticket infrastructure projects we defer into next year to make room for old-fashioned systems and network ops work.
The second cause of this incident is a system architecture decision.
We shipped the first iteration of the Fly.io platform on a single global Consul cluster. That made sense at the time. It took years for us to scale to a point where Consul became problematic. When we approached that point, we had a decision to make:
we could do the engineering work to break our single Consul cluster into multiple regional clusters, and then replicate state between them, retaining Consul’s role in our architecture but allowing it to scale, or
for a similar amount of engineering effort, we could replace Consul with an in-house state-sharing system that was designed for our workloads.
The latter decision was sensible: we could make Consul scale, but making it fast enough for real-time routing to Fly Machines that start in under 200ms was challenging; a new, gossip-based system, taking advantage of architectural features that eliminated the need for distributed consensus, would make it much easier to address that challenge.
Unfortunately, we chose a half-measure. We replaced Consul with corrosion, but we retained Consul dependencies inside of our orchestration system, using Consul as a kind of backstop for corrosion, and keeping data in Consul relatively fresh so that old components could continue using it. Consul inevitably became a dusty old corner in our architecture, and so nobody was up nights worrying about managing it. Thus, our longest-ever outage.
The moral of the story is, no more half-measures. Work has already begun to completely sever Consul from our orchestration (we’ll still be using it for what it’s good at, which is managing configuration information across our fleet; Consul is great, it’s just not meant to be run, in its default configuration, for a half-million applications globally).
Finally, you may notice from the timeline that it took an odd amount of time to pull the trigger on restoring and reseeding the corrosion cluster, especially since once we did so, the process was completed in just 30 minutes. Restoring corrosion was straightforward because we have a tested runbook for doing so. But that runbook doesn’t have higher-level process information about when to restore corrosion, and what service impact to expect when doing so. If we’d had that information ready, we could have decided to perform the restore much earlier, shaving potentially 4 hours off the disruption.
Update: Oct 19, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
October 17: Flycast Internal Network Outage (11:00EST): A change was deployed in our fly-proxy Anycast request router, which had the effect of breaking Flycast (internal networking at Fly.io that runs through the proxy, as opposed to direct networking, which we call 6PN). The change was reverted fleetwide within about 10 minutes. There’s not much interesting to say about the breaking change itself (it was tested for a week in staging and in a remote region prior to rollout; it was rolled back quickly), but more to say about internal alerting; we had internal alerts firing on the staging change, but they were inadequately escalated.
October 19: Networking Outage in Denver (20:00EST): We lost Denver for about 16 hours. Well, “we” didn’t lose it. Our upstream network provider did. Specifically: a large switch in their data center threw a rod, and the spare equipment they had in the data center turned out not to be adequate to resolve the outage. Our hardware was fine, just sitting there wondering where the Internet went. Our physical footprint in Denver is small (8 physicals, give or take); this was a broader outage that didn’t just hit us. Still: not OK. It is the case that we have large regions with heavily diversified connectivity and a major hardware footprint, and smaller regions (generally speaking: if neither GCP, OCI, nor AWS are in a region, it’s probably a small region) with potentially longer disaster recovery times, and we’re clearly not communicating this well. More to come.
This Week In Infra Engineering
Stuff got done, but to generate these updates, the author of the infra-log needs to go interview infra people 1:1, and infra is heads-down responding to this week’s incident to foreclose on anything like it happening again. Some of that work is the same as the work we’re doing responding to the August Anycast routing incident (also a state explosion problem, also addressable by regionalizing state propagation and distributing aggregates globally instead of fine-grained updates); some of it isn’t. We’ll write more about it next week.
Update: Oct 12, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
October 9: Networking Issues in SEA (16:00EST): We detected a BGP4 upstream misconfiguration, routing some of our traffic from SEA towards LON. In the course of investigating this issue, which added significant latency to some connections out of SEA, our regional upstream uncovered a problem at SIX, the Seattle area exchange point. While working around that issue, our upstream attempted to reroute traffic to a different peering provider at SIX, which had the effect of breaking all Seattle routing for a period of about 10 minutes, which was the point at which we called the incident (though our reporting here starts it several hours earlier).
October 11: Capacity Issues in ARN and AMS (06:00EST): A customer spun up thousands of Fly Machines across Northern Europe, saturating our capacity in two regions. Some of this utilization was apparently from a mistake in their own scheduling; much of it was genuine. For several hours, we had severely limited (and inconsistent) ability to create new Fly Machines in ARN and AMS. We added additional hardware to ARN and, with some guidance from the customer, collected resources from stale instances of their applications; completely resolving this incident, and bringing capacity in these two regions to within our comfort level, took the better part of the day.
This Week In Infra Engineering
Let’s make up for some lost time.
Peter has deployed the first stage of the “sharding” of fly-proxy, the engine of our Anycast request routing system. Recall from our September 1 Anycast outage that one major identified problem was that we run a global, flat, unsegmented topology of proxies; as a result, a control-plane outage is as likely to disrupt the entire fleet as it is to disrupt a single proxy. We’re pursuing two strategies to address that: regional segmentation, which limits the propagation of control-plane updates (in somewhat the same fashion as an OSPF area does), and sharding of instances. Sharding here means that, within a single region and on a single edge physical server, we run multiple instances of the proxy.
The first stage of making that happen is to add a layer of indirection between the kernel network stack and our proxy; that layer, the fly-proxy-acceptor, picks up incoming TCP connections from the kernel, and then routes them to particular instances of the “full” proxy using a Unix domain socket and file descriptor passing. This allows us to add and remove proxy instances without reconfiguring or contending for the same network ports. In the early stages of deployment, both the proxy-acceptor and the proxy itself listen for TCP connections (meaning the acceptor can blow up, and we’ll continue to handle connections, though nothing has blown up yet).
Unix file descriptor passing is textbook Unix systems programming (literally: you can find it in the W. Richard Stevens books), but it’s surprisingly tricky to get right; for instance, connect and accept completion are separate events, and we have to be fastidious about which instances we route file descriptors to (the bug where you let two different proxies see the same request file descriptor is very problematic).
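The underlying mechanism is `SCM_RIGHTS` ancillary data over a Unix domain socket. fly-proxy-acceptor is Rust, but Python (3.9+) exposes the same syscalls directly via `socket.send_fds`/`socket.recv_fds`, which makes for a compact illustration of the shape of the handoff; the function names here are ours, not anything from the real acceptor.

```python
import os
import socket

def hand_off(acceptor, conn_fd):
    """Acceptor side: ship a connection's file descriptor to one proxy
    instance over a Unix domain socket. Exactly one instance must
    receive each fd (two proxies seeing the same connection is the
    bug you must never write)."""
    socket.send_fds(acceptor, [b"conn"], [conn_fd])

def receive(proxy):
    """Proxy side: pick up the descriptor; the kernel dup()s it into
    this process, so the proxy owns its own copy from here on."""
    msg, fds, flags, addr = socket.recv_fds(proxy, 1024, maxfds=1)
    return fds[0]
```

In practice you’d pass the fd of an accepted TCP connection; the same machinery works for any descriptor, including a pipe, which is an easy way to test it.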
Peter, Dov, and Pavel have been in a protracted disagreement with systemd. From a few weeks back: Dov added systemd watchdog support to fly-proxy. Recall that the diagnosis of the September 1 outage involved us noticing that the entire proxy event loop had locked up (it was a mutex deadlock; that’s what happens in a deadlock). It shouldn’t have been possible for the proxy to lock up without us noticing, and now it can’t.
Anyways, Dov read the systemd source code as it relates to watchdogs to make sure that when the proxy entered a shutting down state, the watchdog would be disabled. Things seemed fine, but then alerts began firing every time we did a deploy; the watchdog was tripping while the proxy was doing its orderly shutdown. Peter discovered a bug in systemd: it assumes that signal handling and watchdog logic share a thread.
In our case they don’t, which created a race condition that triggered watchdogs right after the systemd unit went into stopping state, which caused systemd to re-enable the watchdog. We stopped preempting the watchdog task and let it run until proxy’s bitter end.
There was more. In some cases it can take more than 10 seconds (our watchdog interval) for fly-proxy to exit after our tokio::main completes. Boom, watchdog kill. “Ok, fine, you got us!” we said to systemd, and simply disabled the watchdog at runtime when the watchdog task was preempted. This, finally, worked, and proxies would no longer get watchdog-killed when shutting down.
Except that sometimes they did? Turns out that our few older-distro hosts (remember: we have up-to-date kernels everywhere, but not up-to-date distros; systemd is the one big problem with that) use a pretty old version of systemd. That systemd does not support disabling the watchdog at runtime. Peter landed what we hope is the final blow this week; instead of disabling the watchdog at runtime, he set it to a very large non-zero value. You may read further adventures of Peter, Dov, and Pavel in their battles with systemd next week.
Speaking of distro updates, Steve continues our steady march towards getting our whole fleet on a recent distro. He’s picked up where Ben left off a few weeks ago, testing and re-testing and re-re-testing our provisioning to ensure that swapping distros out from under our running workloads doesn’t confuse our orchestration; we now have something approaching unit/integration testing for our OS provisioning process.
Tom spent the week spiking alternative log infrastructure to replace ElasticSearch, with which we are now at our wit’s end. We’re generally pretty reliable at log ingestion with ES, but experience sporadic ES outages with log retrieval. What we’ve come to learn as a business is that our customers are less sanguine about log disruption than we are; what sometimes feels to us like secondary infrastructure reads as core platform health to them. That being the case, we can’t keep limping with the ES architecture we booted up in 2021.
Finally: a couple weeks ago, Daniel had an on-call shift, and was, like everyone working an on-call shift here, triggered by alerts about storage capacity issues; everybody on-call sees at least a couple of these. You check to make sure the host isn’t actually running out of space, clear the alert, and go back to sleep. Unless you’re Daniel.
Daniel has had it in for the way we track available volume storage since back when he shipped GPU support for Fly Machines. There are two big problems with the way we’ve been doing this: the first is, going back to 2021 when we first shipped volumes, the system of record for available storage has been the RDS database backing our GraphQL API; that’s a design that predates flyd and our move away from centralized resource scheduling. The second big problem is that flyd itself has erroneous logic for querying available storage in our LVM configuration (it pulls disk usage from the wrong LVM object, causing it to misreport available space).
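The fix amounts to asking LVM for the right number. A hedged sketch of the kind of query involved (flyd is not Python, and this is not its code): `vgs` will report a volume group’s free space directly, in plain bytes, with the right flags.

```python
import subprocess

def vg_free_bytes(vg_name):
    """Ask LVM how much space is free in a volume group, in bytes.
    Requires root and the lvm2 tools; illustrative, not flyd's code."""
    out = subprocess.check_output(
        ["vgs", "--noheadings", "--units", "b", "--nosuffix",
         "-o", "vg_free", vg_name],
        text=True,
    )
    return parse_vg_free(out)

def parse_vg_free(out):
    """vgs prints a single whitespace-padded column, e.g.
    '  107374182400\n'."""
    return int(float(out.strip()))
```

Querying the volume group (rather than summing usage off some other LVM object) is the point: it’s the authoritative answer to “how much room is left on this host”.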
The result of this situation is that we’ve been managing available storage, and, worse, storage resource scheduling (deciding which physical server to boot up new Fly Machines on) manually — and, not just manually, but largely in response to alerts, some of which are arriving in the middle of the night.
Daniel fixed the flyd resource calculation and surfaced it to our Fly Machines API service, starting in São Paulo, where our API storage tracking went from reporting an average 95% storage utilization across all our physicals to an average 5%. The change has since been rolled out fleetwide, and, in addition to reducing alert load, has drastically improved Fly Machine scheduling. In every region we now have significantly more headroom and, just as importantly, more diversity in our options for deploying new Machines.
Update: Oct 5, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
October 2: API Errors Creating Machines (14:00EST): A change was deployed in our Rails GraphQL API code that broke a model validation, causing attempts to create or update machines to fail; we detected the problem immediately and resolved it in about 10 minutes.
This Week In Infra Engineering
It’s coming! But the infra log author was late getting to this update and doesn’t want to put all the infra people on the spot, so we’re getting the update about our one incident this week up first.
Update: Sep 28, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
September 25: Loss of LATAM Connectivity (09:00EST): An upstream routing issue (impacting a whole provider network) took QRO, GIG, EZE, and SCL down for about 20 minutes (recovery took a few more minutes for QRO).
September 27: Connectivity Issues in ORD (01:00EST): A top-of-rack switch misconfiguration at our upstream provider, possibly involving a LACP issue and possibly involving upstream routing, generated high packet loss (but not total loss of connectivity) on a subset of our ORD hosts for roughly 60 minutes.
This Week In Infra Engineering
Peter worked on restructuring the connection handling code in fly-proxy, the engine for our Anycast layer, to support process-based sharding of proxy instances. This is work responding to the September 1 Anycast outage; the proximate cause of that outage was a Rust concurrency bug, which we’ve now audited for, but the root cause was the fact that a single concurrency bug could create a fleetwide outage to begin with. Process-based sharding runs multiple instances of fly-proxy on every edge, spreading customer load across them, not for performance (the single-process fly-proxy is probably marginally more performant) but to reduce the blast radius of any given bug in the proxy.
Kaz is rolling out size-aware Fly Machine limits. Obviously (it may have been more obvious to you than to us), you can’t expose something like the Fly Machines API without some kind of circuit-breaker-style limits on the resources a single user can request. Our current limits are coarse-grained: N concurrent Fly Machines, regardless of size. Clearly, these limits should be expressed in terms of underlying resources — a shared-1x is a tiny fraction of a performance-16x. Getting this working has required us to rationalize and internally document the relationships between these scaling parameters. Most of our users will never notice this (especially if we do it well), but it should make it less likely that you’ll hit a limit and have to ask support to remove it.
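One way to denominate limits in underlying resources is to weight each machine size in abstract “units” and quota the sum rather than the count. The weights below are made up for illustration; the real relationships between Fly Machine sizes are the internal documentation Kaz has been rationalizing.

```python
# Hypothetical weights: compute units per machine size. Illustrative
# only; not the platform's actual ratios.
SIZE_UNITS = {
    "shared-cpu-1x": 1,
    "shared-cpu-2x": 2,
    "performance-1x": 8,
    "performance-16x": 128,
}

def within_limit(machine_sizes, limit_units):
    """Check a set of machines against a resource-denominated quota
    instead of a flat machine count."""
    used = sum(SIZE_UNITS[size] for size in machine_sizes)
    return used <= limit_units
```

Under a scheme like this, a fleet of small shared instances and a single large performance instance can draw from the same budget, which is the behavior a flat machine count can’t express.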
Somtochi has continued working on Corrosion, the SWIM-gossip SQLite state-sharing system the proxy uses to route traffic. The net effect of Corrosion is a synchronized SQLite database that is currently available across our whole fleet of edges, workers, and gateway servers. We’re refactoring this architecture to reduce the number of machines that keep full Corrosion replicas, allowing the others to subscribe to and track changes in Corrosion databases stored elsewhere (this allows us to deploy more edges, by reducing the compute and storage requirements for those hosts).
Will did a bunch of reliability and ergonomic work on fsh, our internal deployment tool; better DX for fsh means more reliable deploys of new code means fewer incidents. fsh now integrates with PagerDuty to abort deploys automatically if incidents occur during a deployment; it will also fail fast on errors (this is an issue on fleet-scale rollouts, where it can be hard to spot errors across hundreds of servers being updated); it now directly supports staged deploys (something we were hacking together with shell scripts previously) and stepwise concurrency (slow-start style, running a single deploy, then 2 on success, and so on).
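The stepwise-concurrency idea is simple enough to sketch: deploy to one host, then two, then four, capping the batch size, and abort the whole rollout if any batch fails. This is an illustration of the slow-start pattern, not fsh’s actual implementation.

```python
def rollout_batches(hosts, initial=1, growth=2, max_batch=50):
    """Yield host batches for a slow-start deploy: 1 host, then 2,
    then 4, ..., capped at max_batch. The caller stops iterating
    (aborting the rollout) if any batch reports errors."""
    i, batch = 0, initial
    while i < len(hosts):
        yield hosts[i:i + batch]
        i += batch
        batch = min(batch * growth, max_batch)
```

The payoff is that a bad build takes out one host, not hundreds, before the deploy tool notices and fails fast.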
Dusty is continuing his capacity planning work by integrating our business intelligence tools with our capacity dashboard, so we can reflect dollar costs and revenue into our technical capacity planning.
Update: Sep 21, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
September 16: Internal ElasticSearch Outage (16:00EST): Concurrent with an internal deployment of our flyd orchestrator, our internal logs stopped responding to queries for newly ingested logs. Shortly thereafter, we got alerts about ElasticSearch components crashing (we run a large ES cluster with many hosts, the alerts only impacted a few). Diagnosing ES issues is a black art; we were exceeding our field limit (or had been for some time and just noticed); bouncing ES components got things back into a steady state. Minimal known customer impact.
September 18: Depot Builders Failing (09:00EST): Most containers that run on Fly.io are built using “remote builders” — Fly Machines that run Buildkit (“Docker”, essentially) so that user laptops, many of which are ARM64, don’t have to. We’ve offered remote builders since 2021, but recently have begun a partnership with Depot, a team that specializes in doing fast container builds. This morning, Depot builds, which were within the last couple weeks made the default on the platform, stopped working; Depot’s telemetry alerted us to Fly Machines Volumes API errors. Fly’s own “native” remote builders continue to work; we posted a workaround on the status page within a few minutes, and then cut the platform over to native builders with a feature flag shift, which resolved the incident. The issue turned out to have been an API incompatibility (really, a database access bug surfaced by an API change) in our orchestrator code.
September 18: Internal ElasticSearch Outage (13:00EST): Roughly the same thing happened as on September 16th, but this time with the added bonus fun of certificate expiration issues that forced us to re-authenticate the cluster. Restarting ES this time involved a lengthy recovery process that stalled repeatedly and pushed some components close to their Java heap max sizes; in other words, a lot of handholding was involved in fixing the cluster. This was again primarily an internal issue (we take it seriously, because we rely on these logs for our own incident response); it would have messed up some UI features that print log lines “in passing”, but not customer logs in general.
September 19: Capacity Issues In East Coast Regions (12:00EST): A customer running batch jobs allocated a very large number of performance instances across IAD, BOS, and EWR. To this point, our rate and CPU limits have focused largely on shared instances (which are inexpensive and see a lot of abuse); this was a totally legit customer that just happened to be demanding a surprising amount of instantaneous compute. For about 30 minutes, compute jobs in these regions saw CPU steal and performance degradation, resolved for the regions by adjusting CPU limits. The medium-term fix, which we’ll talk about in a bit, involves improving workload scheduling to avoid concentrating these kinds of workloads on specific worker hosts.
This Week In Infra Engineering
Will did a deep dive into I/O scheduling in our LHR clusters, after Tigris nudged us about performance/reliability issues in their FoundationDB cluster running in that region. Using metrics, system configuration, and statekeeping data, Will isolated Tigris’s workloads to a concentrated cluster of SSDs with a particular make/model, which we now know to have an iffy performance envelope for the kinds of work we do. The bigger problem wasn’t so much the drives as it was the scheduling we did: because Tigris created their series of volumes for this region in rapid succession, they all got scheduled on a small subset of our storage capacity in the region. Worse, a consequence of our scheduling algorithm was that the periodic snapshot backups of these volumes were all scheduled to fire in tandem, concentrating a large amount of I/O activity on a small number of drives (the graph of what was happening looked like a stable EKG). Scheduling improvements are a theme of the infra work that we’re doing right now, but this issue in particular surfaced an unexpected (and straightforward to fix) issue: we needed to be adding jitter to the timing of our backup jobs.
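The jitter fix is small enough to show in full. A sketch of the idea (parameter names and the 10% spread are illustrative, not our actual backup scheduler): instead of firing every periodic job at exactly its base interval, randomize each run within a window, so volumes created at the same moment stop snapshotting in lockstep on the same drives.

```python
import random

def jittered_interval(base_seconds, jitter_fraction=0.1):
    """Return the delay until a periodic job's next run: the base
    interval plus or minus a random fraction of it. Jobs created
    simultaneously drift apart instead of firing in tandem."""
    spread = base_seconds * jitter_fraction
    return base_seconds + random.uniform(-spread, spread)
```

With hourly snapshots and 10% jitter, a batch of volumes created together decorrelates within a few cycles, flattening the EKG-shaped I/O graph described above.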
Ben is hip-deep in a fleetwide upgrade of our worker OS distributions. This is a tricky and annoying problem. Most of what runs on a worker physical for us is software we build and ship ourselves, in some cases several times a week. Beyond that, we have an established runbook for upgrading our OS kernels; we don’t run the distro-native kernel anywhere (we have fussy eBPF dependencies, among other things). But the distro itself, which in particular sticks us with a specific version of systemd, is a gigantic pain to upgrade; we have consistent OS kernels across the fleet, but not consistent distro versions. That’s changing, but it’s a complicated process, involving workload migrations and, in some cases, reprovisioning servers, which surfaces fun bugs like “the semi-random identifiers we create for Fly Machines are influenced by the provenance of the worker physical on which they were created, meaning a reprovisioned host can cause Machine ID collisions”. That bug hasn’t happened, because Ben is auditing the stack for problems like that.
Dusty has been the capacity czar for the past couple months. As you can see from this week’s BOS outage, this is an important issue for us. We’ve integrated our scheduler state with system metrics and used that to create new capacity threshold numbers for the fleet, which now informs our provisioning; a lot of the guesswork has been taken out of where we’re shipping and racking new servers. We’ve shifted some capacity (ORD gave some servers to EWR, for instance), and reallocated a backlog of servers to different regions. We now have a runbook for provisioning new capacity in existing regions that uses a lightweight version of our machine migration (for apps without storage) to rebalance as capacity is added.
JP has been working on improving Fly Machine scheduling across the fleet, which was also implicated in an incident this week. We now have stricter placement logic that ensures multiple Fly Machines for the same app created concurrently are distributed across worker physicals. Our Machines API now handles some of the retry logic that our CLI, flyctl, was using before; unlike flyctl, which is open-source and doesn’t have APIs that can place a Fly Machine on a specific worker physical, our API gateways do have visibility into the physicals in a region, and we now take advantage of that.
Update: Sep 14, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
September 9: Scheduler Instability On IAD Hosts (9:00EST): Internal metrics alerts from trace telemetry showed a spike in latency for “add new machine” calls in our flyd orchestrator on a subset of IAD hosts; normally these calls are very fast, and now they weren’t. We’ve seen that before and correlated it with errors in our APIs, though none were evident; we called an incident, and updated the status page with a warning about potential slowness in the API. After about 80 minutes of debugging, the culprit was identified as a series of internal apps running on Fly Machines that were caught in a particularly weird crash loop; deleting those apps resolved the immediate problem.
September 9: Fly-Metrics Log Cluster Outage (21:00EST): The log retrieval interface for a Quickwit log cluster used for fly-metrics (but not flyctl logs, our internal logs, or our log shipper) stopped returning logs; logs were ingested and saved but not returned in queries. After roughly an hour of diagnosis, the culprit was determined to be a broken indexing service; destroying and recreating its Fly Machine resolved the problem.
September 13: Scheduler Outage (13:00EST): For a period of about 10 minutes, a large fraction of our flyd scheduler services in multiple regions were locked in a crash loop. The proximate cause of the outage was us rotating a Vault token used by the service; the root cause was an infrastructure orchestration bug in how we managed those tokens (a configuration management tool got into a state where it held on to and attempted to renew a non-renewable token).
September 13: DNS Failures in Europe (17:30EST): For about 10 minutes, we observed internal DNS failures in European regions. We briefly stopped advertising FRA edges (which resolved the problem, though not to our satisfaction) and then bounced DNS services across edges in the region (which also resolved the problem). We identified some ongoing upstream networking issues and kept some edges un-advertised into the next week. Minimal, uncertain customer impact.
This Week In Infra Engineering
In the interests of getting this infra log update up in a timely fashion and also giving the infra log writer a break, we’re going to talk about this week in infra engineering… next week.
Update: Sep 7, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
September 1: Fleet-wide Request Routing Outage (15:30EST): A correlated concurrency bug in fly-proxy caused a fleetwide outage in request handling with an acute phase lasting roughly 40 minutes for many customers. A full postmortem for this incident appears at the end of this week’s log.
September 6: Request Routing Disruption in JNB Region (15:30EST): A deployment error regressed the September 1 bug in JNB for about 3 minutes. We spotted this with synthetic alerting and fixed it immediately (ordinarily, we’d disable Anycast routing to the region, but redeploying to fix the root cause was faster).
September 7: Tigris Buckets Created Through Fly.io API Can’t Be Deleted (17:00EST): For a few hours, a naming scheme change at Tigris that we missed broke API compatibility with Fly.io, which stopped bucket deletes from working until we fixed it. Minimal customer disruption, but we called an incident for it, so it’s documented here.
This Week In Infra Engineering
Simon is deep — perhaps approaching crush depth — on the volume storage problem of making Fly Machines create faster. The most expensive step in the process of bringing up a Fly Machine is preparing its root filesystem. Today, we rely on containerd for this. When a worker brings up a Fly Machine, it makes sure we’ve pulled the most recent version of the app’s container from our container registry into a local containerd, and then “checks out” the container image into a local block device. If we can replace this process with something faster, we can narrow the gap between starting a stopped Fly Machine (already ludicrously fast) and creating one (this can take double-digit seconds today). The general direction we’re exploring is pre-preparing root filesystems and storing them in fast object storage, and serving those images to newly-created Fly Machines over nbd. Essentially, this puts the work we did on inter-host migration to work making the API faster.
JP made all our Machines API servers restart more gracefully, by replacing direct socket creation with systemd socket activation. Prior to this change, a redeployment of flaps, the Machines API server, would bounce hundreds of flaps instances across our fleet, causing dozens of API calls to fail as service ports briefly stopped listening. That’s ~fine, in that flyctl, our CLI, knows to retry these calls, but obviously it still sucks. Delegating the task of maintaining the socket to systemd eliminates this problem, and improves our API reliability metrics.
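For the shape of this pattern (the unit names and port below are hypothetical, not flaps’s real configuration): systemd binds and holds the listening socket, and hands it to the service via the sd_listen_fds protocol, so the port keeps accepting connections even while the service binary is being swapped out:

```ini
# flaps.socket (hypothetical): systemd owns the listening socket
[Unit]
Description=Fly Machines API listener

[Socket]
ListenStream=0.0.0.0:4280

[Install]
WantedBy=sockets.target

# flaps.service (hypothetical): instead of calling bind()/listen() itself,
# the server receives the already-bound socket as fd 3 (per $LISTEN_FDS),
# so a restart never leaves the port unlistened
[Service]
ExecStart=/usr/bin/flaps
```

A deploy then restarts flaps.service while flaps.socket stays active; connections made mid-restart queue in the kernel’s listen backlog and are accepted by the new process when it comes up.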
Ben and JP began the process of getting Machine migrations onto flyd’s v2 FSM implementation. What’s a v2 FSM? We’re glad you asked! Recall: flyd, our scheduler, is essentially a BoltDB event log of finite state machine steps; “start an existing Fly Machine” is an FSM, as is “create a Fly Machine” or “migrate it to another host”. v1 FSMs (which are all of our in-production flyd FSMs) are pretty minimal. v2 FSMs add observability and introspection to current steps, and tree structures of parent-child relationships to chain FSMs into power combos; they also have a callback API for failure recovery. This is all stuff inter-host migration can make good use of; with observable, coordinated, tracked migration as a first-class API on flyd, we can get more aggressive about using Machine migration to rebalance the fleet and to quickly and automatically react to hardware incidents.
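To make that concrete, here’s a loose Rust sketch of the properties described; the types and names are ours for illustration, not flyd’s actual v2 API. Timestamped, named steps give observability, an optional parent pointer gives the chaining, and a callback handles failure recovery:

```rust
use std::time::Instant;

// Hypothetical shape of a v2 FSM: each executed step is recorded with a
// name and timestamps (observability/introspection), FSMs can reference
// a parent (the tree structure used to chain FSMs), and a failure
// callback runs when a step errors out.
#[derive(Debug, Clone)]
struct StepRecord {
    name: &'static str,
    started: Instant,
    finished: Option<Instant>,
}

struct Fsm {
    name: &'static str,
    parent: Option<Box<Fsm>>,           // parent-child chaining
    steps: Vec<StepRecord>,             // introspectable step log
    on_failure: Option<fn(&StepRecord)>, // failure-recovery callback
}

impl Fsm {
    fn new(name: &'static str) -> Self {
        Fsm { name, parent: None, steps: Vec::new(), on_failure: None }
    }

    // Run one named step, recording it whether it succeeds or fails.
    fn run_step(&mut self, name: &'static str, step: impl Fn() -> Result<(), ()>) -> Result<(), ()> {
        let mut rec = StepRecord { name, started: Instant::now(), finished: None };
        let out = step();
        rec.finished = Some(Instant::now());
        if out.is_err() {
            if let Some(cb) = self.on_failure {
                cb(&rec);
            }
        }
        self.steps.push(rec);
        out
    }
}

fn main() {
    let mut fsm = Fsm::new("migrate-machine");
    fsm.run_step("reserve-target", || Ok(())).ok();
    fsm.run_step("copy-volume", || Ok(())).ok();
    println!("{} completed {} steps", fsm.name, fsm.steps.len());
}
```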
We have a couple customers that make gigantic Machine create requests — many thousands at once. To make these kinds of transactions more performant, we parallelized them in flyctl. But these parallel requests are evaluated by our scheduler in isolation, which has resulted in suboptimal placement of Machines (the two most common cases being multiple Machines for the same app scheduled unnecessarily on the same worker, and unbalanced regions with some lightly loaded and some heavily loaded workers; “Katamari scheduling”). Kaz and JP made fixes both to flyctl and to our scheduler backend to resolve this; in particular, we now explicitly manage a “placement ID”, tracked statefully across scheduling requests using memdb, that allows users of our APIs to spread workloads across our hardware.
In related news, Dusty has been working on improved capacity planning. The most distinctive thing about the Fly Machines API, as orchestration APIs go, is that we’re explicit about the possibility of Machine creation operations (new reservations of compute capacity) failing; the most obvious reason a Machine create might fail is that it’s directed to a region that’s at capacity. What we have learned over several years of providing this API is that customers are not as thrilled with the computer science elegance of this “check if it fails and try again elsewhere” service model as we are. So we’ve been moving heaven and, uh, Dusty to make sure this condition happens as rarely as possible. Dusty’s big project over the last week: integrating our existing host metrics (the capacity metrics you’d think about by default, like CPU and disk utilization, IOPS, etc) with Corrosion, our state tracking database. Exported host metrics are a high-fidelity view into what our hosts actually see, while Corrosion is a distilled view into what we are trying to schedule on hosts. We’ve now got Corrosion reflected into Grafana, which has enabled Dusty to build out a bunch of new capacity planning dashboards.
Dusty also moved half of our AMS region to new hardware; half the region to go!
Peter worked a support rotation. We schedule product engineers to multi-day tours of duty alongside our support engineers, which means watching incoming support requests and pitching in to help solve problems. Peter reports his most memorable support interaction was doing Postgres surgery for a customer who had blown out their WAL file by enabling archive_mode, which preserves WAL segments, without setting archive_command, giving Postgres no place to send the segments.
Tom continued his top-secret work that we can’t write about, except that to say this week it involved risk-based CPU priorities and Machine CPU utilization tracking.
Now, deep breath:
September 1 Routing Layer Outage Postmortem
(A less formal version of this postmortem was posted on our community site the day after the incident.)
Narrative
At 3:30PM EST on September 1, we experienced a fleetwide near-total request-routing outage. What this means is that for the duration of the incident, which was acute for roughly 40 minutes and chronically recurring for roughly another hour, apps hosted on Fly.io couldn’t receive requests from the Internet. This is a big deal; our most significant outage since the week we started the infra-log (in which we experienced roughly the same WireGuard mesh outage, which also totally disrupted request routing, twice in a single week). We record lots of incidents in this log, but very few of them disable the entire platform. This one did.
We’re going to explain the outage, provide a timeline of what happened, and then get into some of what we’re doing to keep anything like it from happening again.
Request routing is the mechanism by which we accept connections from the Internet, figure out what protocol they’re using, match them to customer applications, find the nearest worker physical running that application, and shuttle the request over to that physical so customer code can handle it. Our request routing layer is broadly comprised of these four components:
Anycast routing, which allows us to publish BGP4 updates to our upstreams in all our regions to attract traffic to the closest region.
fly-proxy, our Anycast request router, in its “edge” configuration. In this configuration, fly-proxy works a lot like an application-layer version of an IP router: connections come in, the proxy consults a routing table, and forwards the request.
That same fly-proxy code in its “backhaul” configuration, which cooperates with the edge proxy to bring up transports (usually HTTP/2) to efficiently relay requests from edges to customer VMs.
Corrosion, our state propagation system. When a Fly Machine associated with a routable app starts or stops on a worker, flyd publishes an update to Corrosion, which is gossiped across our fleet; fly-proxy subscribes to Corrosion and uses the updates to build a routing table in parallel across all the thousands of proxy instances across our fleet.
During the September 1 outage, practically every instance of fly-proxy running across our fleet became nonresponsive.
Generally, platform/infrastructure components at Fly.io are designed to cleanly survive restarts, so that as a last resort during an incident we can attempt to restore service by doing a fleetwide bounce of some particular service. Bouncing fly-proxy is not that big a deal. We did that here, it restarted cleanly, and service was restored. Briefly. The fleet quickly locked back up again.
Our infra team continued applying the defibrillator paddles to fly-proxy while the proxy team diagnosed what was happening.
The critical clue, identified about 50 minutes into the incident, was that proxyctl, our internal CLI for managing the proxy, was hanging on stuck fly-proxy instances. There’s not a lot of mechanism in between proxyctl and the proxy core; if proxyctl isn’t working, fly-proxy is locked, not just slowly processing some database backlog or grinding through events. The team immediately and correctly guessed the proxy was deadlocked.
fly-proxy is written in Rust. If you’re a Rust programmer, the following code pattern may or may not be familiar to you, and you may taste pennies in your mouth seeing it:
```rust
// RWLock self.load
if let Some(Load::Local(load)) = &self.load.read().get(...) {
    // do a bunch of stuff with `load`
} else {
    self.init_for(...);
}
```
An RwLock is a lock that can be taken multiple times concurrently by readers, but only exclusively during any attempt to write. An if let in Rust is an if-statement that succeeds if a pattern matches; here, it succeeds if self.load.read().get() returns a Some instance rather than None (a standard Rust error-checking idiom). In the success case, the matched value is available inside the success arm of the if let as load. The else arm fires if self.load.read().get() returns None.
The way this if let statement looks, it would appear that the lifetime of the read lock taken in attempting the success case is only the length of the success arm of the if statement, and that the lock is dropped if the else arm triggers. But that is not what happens in Rust. Rather: if let is syntactic sugar for this code:
```rust
match &self.load.read().get(...) {
    Some(load) => { /* do a bunch of stuff with `load` */ },
    _ => { self.init_for(...); },
}
```
It is clearer, in this de-sugared code block, that the read() lock taken spans the whole conditional, not just the success arm.
Unfortunately for us, buried a funcall below init_for() is an attempt to take a write lock. Deadlock.
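Here is a minimal, self-contained reproduction of the pattern, with the standard library’s RwLock and illustrative names standing in for fly-proxy’s internals. It uses the desugared match form from above (which behaves this way in every Rust edition; the if let form does too in editions before 2024, after which the temporary’s lifetime was shortened to close this footgun), and probes with try_write() instead of deadlocking on write():

```rust
use std::sync::RwLock;

// Minimal reproduction of the lock-held-across-arms footgun. `Proxy`
// and its field are illustrative stand-ins, not fly-proxy's real types.
struct Proxy {
    load: RwLock<Option<u64>>,
}

impl Proxy {
    // Returns true if the read guard from the scrutinee is still held
    // in the fallback arm (where the real code's init_for() lived).
    fn else_arm_still_read_locked(&self) -> bool {
        match *self.load.read().unwrap() {
            Some(load) => {
                let _ = load;
                false
            }
            // The read guard created in the scrutinee is still alive
            // here, so write() would block forever; try_write() lets us
            // observe that without hanging.
            _ => self.load.try_write().is_err(),
        }
    }
}

fn main() {
    let p = Proxy { load: RwLock::new(None) };
    // The fallback arm runs while the read lock is held:
    println!("{}", p.else_arm_still_read_locked()); // prints "true"
}
```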
This is a code pattern our team was already aware of, and this code was reviewed by two veteran Rust programmers on the team, before it was deployed, but neither spotted the bug, most likely because the conflicting write lock wasn’t lexically apparent in the PR diff.
The PR that introduced this bug had been deployed to production several days earlier. It introduced “virtual services”, which decouple request routing from Fly Apps. Conventionally-routed services on Fly.io are tied to apps; the fly.toml configuration for these apps “advertise” services connected to the app, which ultimately end up pushed through Corrosion into the proxy’s routing table. Virtual services enable flexible query patterns that match subsets of Fly Machines, by metadata labels, to specific URL paths. We’re generally psyched about virtual services, and they’re important for FKS, the Fly Kubernetes Service.
The deadlock code path occurs when a specific configuration of virtual service is received in a Corrosion update. That corner case had not occurred in our staging testing, or on production for several days after deployment, but on September 1 a customer testing out the feature managed to trigger it. When that happened, a poisonous configuration update was rapidly gossiped across our fleet, deadlocking every fly-proxy that saw it. Bouncing fly-proxy broke it out of the deadlock, but only long enough for it to witness another Corrosion subscription update poisoning its service catalog again. Distributed systems. Concurrency. They’re not as easy as computer science classes tell you they are.
Because we had a strong intuition this was a deadlock bug, and because it’s easy for us to isolate recently deployed changes to fly-proxy, and because this particular if let/RwLock bug is actually a known Rust foot-gun, we worked out what was happening reasonably quickly. We disabled the API endpoints that enabled users to create virtual services, and rolled out a proxy code fix, restoring service shortly thereafter.
Complicating the diagnosis of this incident was a corner case we ran into with a sysctl change we had made across the fleet. To improve graceful restarts of the proxy, we had applied tcp_migrate_req, which migrates requests across sockets in the same REUSEPORT group. Under certain circumstances with our code, this created a condition where the “backhaul” proxy stopped receiving incoming connection requests. This condition impacted only a very small fraction (roughly 10 servers total) of our physical fleet, and was easily resolved fleetwide by disabling the sysctl; it did slow down our diagnosis of the “real” problem, however.
Incident Timeline
2024-08-28 19:01 UTC: The problematic version of fly-proxy is deployed across all regions. Nothing happens, because nobody is publishing virtual services.
2024-09-01 19:18 UTC: A poisonous virtual services configuration is added by a customer. The configuration is not itself malicious, but triggers the proxy deadlock. The configuration is propagated within seconds to all fly-proxy instances via Corrosion.
2024-09-01 19:25 UTC: Synthetic alerting triggers a formal incident; an incident commander is assigned and creates an incident channel.
2024-09-01 19:30 UTC: Internal host health check alerts begin firing, indicating broad systemwide failures in hardware and software.
2024-09-01 19:31 UTC: Our status page is updated, reporting a “networking outage”, impacting our dashboard, API, and customer apps.
2024-09-01 19:33 UTC: Two proxy developers have joined the incident response team.
2024-09-01 19:36 UTC: The infra team confirms the host health checks are a false alarm triggered by a forwarding dependency for health alerts on the stuck proxies. fly-proxy is implicated in the outage.
2024-09-01 19:41 UTC: The API, which is erroring out on requests, is determined to be blocked on attempts to manage tokens. Token requests from the API are forwarded as internal services to tkdb, an HTTP service that for security reasons runs on isolated hardware. fly-proxy is further implicated in the outage.
2024-09-01 19:56 UTC: Our infra team begins restarting fly-proxy instances. The restart briefly restores service. All attention is now on fly-proxy. The infra team will continue rolling restarts of the proxies as proxy developers attempt to diagnose the problem. The acute phase of the incident has ended.
2024-09-01 20:10 UTC: Noting that proxyctl is failing in addition to request routing, attention is directed to possible concurrency bugs in recent proxy PRs.
2024-09-01 20:12 UTC: Continued rolling restarts of the proxy have cleared deadlocks across the fleet from the original poisoned update; service is nominally restored, and the infra team continues monitoring and restarting. The status page is updated.
2024-09-01 21:13 UTC: The proxy team spots the if let bug in the virtual services PR.
2024-09-01 21:21 UTC: The proxy team disables the Fly Machines API endpoint that configures virtual services, resolving the immediate incident.
Forward-Looking Statements
The most obvious issue to address here is the pattern of concurrency bug we experienced in the proxy codebase. Rust’s library design is intended to make it difficult to compile code with deadlocks, at least without those deadlocks being fairly obvious. This is a gap in those safety ergonomics, but an easy one to spot. In addition to code review guidelines (it is unlikely that another if let concurrency bug is going to make it through code review again soon), we’ve deployed semgrep across all our repositories; this is a straightforward thing to semgrep for.
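As a sketch of what such a rule might look like (the pattern syntax for semgrep’s Rust support is an assumption here, and the rule we actually deployed may differ), flag any if let whose scrutinee takes a read lock and that has an else arm:

```yaml
rules:
  - id: read-guard-held-across-if-let-else
    languages: [rust]
    severity: ERROR
    message: >
      The guard returned by .read() in this `if let` scrutinee stays
      locked through the else arm (in pre-2024 editions); taking a
      write lock anywhere under the else arm can deadlock.
    pattern: |
      if let $PAT = $LOCK.read().$METHOD(...) { ... } else { ... }
```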
The deeper problem is the fragility of fly-proxy. The “original sin” of this design is that it operates in a global, flat, unsegmented topology. This simplifies request routing and makes it easy to build the routing features our customers want, but it also increases the blast radius of certain kinds of bugs, particularly anything driven by state updates. We’re exploring multiple avenues of “sharding” fly-proxy across multiple instances, so that edges run parallel deployments of the proxy. Reducing the impact of an outage from all customers to 1/N customers would have simplified recovery and minimized the disruption caused by this incident, with potentially minimal added complexity.
One issue we ran into during this outage was internal dependencies on request routing; one basic isolation method we’re exploring is sharding off internal services, such as those needed to run alerting and observability.
fly-proxy is software, and all software is buggy. Most fly-proxy bugs don’t create fleetwide disasters; at worst, they panic the proxy, causing it to quickly restart and resume service alongside its redundant peers in a region. This outage was severe both because the proxy misbehavior was correlated, and because the proxy hung rather than panicking. A straightforward next step is to watchdog the proxy, so that our systems notice N-second periods during which the proxy is totally unresponsive. Given the proxy architecture and our experience managing this outage, watchdog restarts could function like a pacemaker, restoring some nominal level of service automatically while humans root-cause the problem. This was our first sustained correlated proxy lock-up, which is why we hadn’t already done that.
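One off-the-shelf way to build that watchdog, assuming the proxy runs under systemd (the unit fragment and timeout below are hypothetical): systemd’s software watchdog, where the service’s event loop must check in every few seconds or get killed and restarted:

```ini
# fly-proxy.service fragment (hypothetical)
[Service]
# The proxy must call sd_notify("WATCHDOG=1") from its main event loop
# at least every 15 seconds; a deadlocked proxy can't, so systemd
# aborts it (capturing a core for diagnosis) and restarts it.
WatchdogSec=15
Restart=on-failure
RestartSec=1
```

The restart is the pacemaker; the core dump left behind is what the humans root-cause from.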
This incident sucked a lot. We regret the inconvenience it caused. Expect to read more about improvements to request routing resilience in infra-log updates to come.
Update: Aug 31, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
August 26: Fly Machines Lease Acquisition Failures (05:30EST): We began observing elevated HTTP 500 errors for the lease endpoint in flyd, our orchestrator kernel, on certain hosts. Leases are used to take exclusive control of a particular Fly Machine in order to update it; leasing is a basic component of doing a deployment. Our initial investigation turned up a particular Fly Application that appeared to be absolutely hammering the flyd lease endpoint. That turned out to be a customer CI job that was (reasonably!) rapidly responding to a lease-timeout deployment failure by re-queueing jobs. After an exhaustive investigation, we determined that the flyd BoltDB on three impacted hosts had found their way to a pathological state similar to the August 15 issue previously in the log. Rebuilding the database on the impacted hosts resolved the problem. Customers on the small number of impacted hosts would have seen sporadic deployment timeouts during the 3-4 hours this investigation took. That’s longer than we’re comfortable with; we’ve added substantial telemetry for this particular problem.
August 27: Background Job Starvation In API Server (10:30EST): In a previous episode of this log we discussed the July 12 incident in which a Redis/Sidekiq interaction locked up all our background job processing, which caused a 5-minute incident in which deploys failed. In response to that issue, we ported our API server to managed Redis. We ran into problems (all us, not them) that delayed background jobs and caused our billing pages not to update for approximately an hour; we rolled back the change. Minimal customer impact.
This Week In Infra Engineering
This was a holiday week during which we experienced a significant reliability incident that has the infra-log busily copyediting an in-progress postmortem, so the infra-log is giving itself (and the team) a break. More of the continuing adventures of Peter and Somtochi in the weeks to come. Thanks!
Update: Aug 24, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
August 19: Internal Grafana Outage (10:00EST): No customer impact. Our internal instance of Grafana was brought down by a full disk, a consequence of a bad default that stored graph annotations indefinitely on dashboard graphs; we add those annotations programmatically and they crudded up our storage. This was an internal-only resource; customer Grafana wasn’t impacted. If anything, this was a win for customer Grafana, since we’ve now gotten better at gnarly Grafana restoration scenarios.
August 20: Sporadic Authentication Failures With Github SSO (13:00EST): In the process of porting SSO API code from our legacy Rails codebase to Elixir, we added a validation on authentication events that checked for profile names; those names aren’t always present in Github authentication flows. We pulled the validation; Github SSO issues were resolved within an hour.
August 22: Fly Machines API Disruption On Gateways (06:00EST): No customer impact (we think). Alerting notified us that some of the Fly Machines API servers (flaps) running on our gateway hosts — the servers we use to terminate WireGuard connections for customers — were stuck in a restart loop. The root cause was a previously deployed configuration change having to do with Vault; gateway servers are extremely stable and change rarely, so the change had only just come up. Normal flyctl API patterns don’t use these servers.
This Week In Infra Engineering
Dusty put into motion Fly.io-authored Ansible provisioning with our upstream hardware providers, getting us close to capping off a project to streamline the provisioning of new physicals. Additionally, he set up IPMI sensor alerting across our fleet, so we can do early detection of hardware issues (our fleet is now of a size where hardware failures, while rare as a fraction of the fleet itself, aren’t totally out of the ordinary); we now have better early alerting for server physical issues, which is important because, with the completion of the migration project, we’re in a much better position to preemptively migrate workloads.
Somtochi is back into Corrosion, responding to incidents from a couple weeks ago. Changes include queue caps (one incident was caused by an unbounded queue of changes from nodes) and a fix for a bug that caused Corrosion to request way more data than it needed when syncing up a new node with the fleet. She also set up sampled otel telemetry for fly-proxy (fly-proxy telemetry is tricky because of the enormous volume of requests we handle).
Tom did important top-secret work that we are not in a position to share but will one day be very fun to talk about. Read between the lines. Also, the previous week, which Peter monopolized with the petertron4000, Tom did a bunch of Postgres monitoring work, because we’re gearing up to get a lot more serious about managing Postgres.
Peter mitigated a longstanding compatibility issue between us and Cloudflare; either our HTTP/2 implementation (which is just, like, Rust hyperium or something) or theirs is doing something broken; when we have problems we now automatically downgrade to HTTP/1.1 for their source IPs.
Update: Aug 17, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
August 13: Edge Capacity Saturated in DFW (04:00EST): Synthetic alerts indicated to the on-call team that we were having capacity issues in the DFW region, which may have been due to a routing configuration that effectively had some of our DFW edge capacity offline. We mitigated the problem within minutes, and added additional long-term edge capacity to the region.
August 15: Network Latency Spike In CDG (09:00EST): A customer deployment of a very high-traffic public API saw a spike in requests that overwhelmed their deployment; at the same time, they had hard request limits (the threshold past which our load balancer won’t forward traffic to a loaded instance) effectively disabled. The combination of the under-scaled service and overridden load balancing set up a small request storm that degraded network performance regionally for about 30 minutes.
August 15: Regional Orchestrator Outage In EWR (09:30EST): The first of three related incidents. Internal monitoring alerted us to slow and timing-out Fly Machines API operations in EWR. Investigation turned up a stuck instance of Corrosion, our internal state sharing system. Most services in our orchestration system are customers of Corrosion, and queries to Corrosion were effectively hanging, which caused cascading failures in the region. Corrosion (in EWR) was stuck with a very full queue of writes to apply and a very large WAL. Bouncing Corrosion and arranging the truncation of the WAL mitigated the problem, which was acute in EWR for about an hour.
August 15: Orchestrator Out-Of-Memory Crashes (14:30EST): The second and most central of three related issues. In flyd, the core of our orchestrator, metrics had begun showing that very old Fly Machines, with long histories recorded in flyd‘s BoltDB store, could see slow queries on account of flyd needing to iterate through their entire history to load their current state. A PR was merged to consistently track the current state, in order to speed those queries up. Over the course of several months, the extra per-Machine metadata this change tracked bloated both flyd’s BoltDB database and the in-memory state flyd tracks (specifically: the process of cloning and merging current state over time caused some keys in that metadata to balloon with duplicated data). On a small number of busy worker servers (roughly 10), this problem became acute and resulted in OOM crashes. A fix was rolled out over the course of several hours (this involved both a flyd code change and some brain surgery to manage the embiggened databases).
August 15: Elixir API Server Deployment Failure (16:30EST): The third of three related incidents. A failed deployment and rollback of ui-ex, our Elixir API server, left that service in an inconsistent state with respect to our load balancer, with old, defunct instances remaining installed in our routing table and newly online instances remaining cordoned off. Two things went wrong here; the deployment failure itself was organic (a developer did something wrong), but it was amplified by orchestrator instability. During the acute phase of this incident, lasting around 30 minutes, API clients and UI users may have seen sporadic HTTP 502 errors. A careful manual redeploy resolved this issue in about 45 minutes. (Full disclosure: this incident merges two tracked incidents over the same time period, both of which tracked this same underlying issue.)
August 15th was spicy; with the exception of the Elixir API issue, the problems were contained regionally or to a small minority of specific hosts, but it was an infra-intensive day.
This Week In Infra Engineering
Several infra people were out on vacation this week, and the rest of the team did interesting work that deserves a showcase, but we’re giving this week in infra engineering over to Peter; everybody else will get their due next week.
One of the features we offer to applications hosted on Fly.io is a dense set of Prometheus performance metrics, along with Grafana dashboards to render them. One of the things our users can monitor with these metrics is TLS handshake latency. Why do we expose TLS latency metrics? We don’t know. It was easy to do.
Anyways, a consequence of exposing TLS latency metrics is that customers can, if they’re watchful, notice spikes in TLS latency. And so someone did, and reported to us anomalously slow TLS handshakes originating from AWS, which set off several weeks of engineering effort to isolate the problem.
Two things complicated the analysis. First, while we could identify slower-than-usual latencies, we couldn’t (yet) isolate extreme latency. Second, packet-level analytics didn’t show any weird intra-handshake packet latency originating “from us”; the TCP 3WH completed quickly, our TLS ServerHello messages rapidly followed after ClientHello, etc.
What we did notice were specific clients that appeared to incur large penalties on “their side” of the TLS handshake: delays of up to half a second between the TCP ACK for our ServerHello and the ChangeCipherSpec message that would move the handshake forward. Reverse DNS for all of these events traced back to AWS. This was interesting evidence of something, but it didn’t dispose of the question of whether there were also events causing our own Anycast proxies to lag.
To better catch and distinguish these cases, what we needed was a more fine-grained breakdown of the handshake. A global tcpdump would produce a huge pcap file on each of our edges within seconds, and even if we could do that, it would be painful to filter out connections with slow TLS handshakes from those files.
Instead, Peter ended up (1) making fly-proxy log all slow handshakes, and (2) writing a tool called petertron4000 (real infra-log heads may remember its progenitor, petertron3000) to temporarily save all packets in a fixed-size ring buffer, ingest the slow-handshake log events from systemd-journald, and write durable logs of the flows corresponding to those slow handshakes.
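A sketch of the core data structure (our guess at the shape of it; the real petertron4000 works on raw packets and journald events, not strings): a fixed-size ring buffer that captures constantly, plus a lookup that extracts just the packets for a flagged flow:

```python
from collections import deque

# Sketch of the petertron4000 idea (names and structure are our invention):
# hold recent packets in a bounded ring, and when a slow-handshake log
# event arrives, durably save only the packets for that flow.

class PacketRing:
    def __init__(self, capacity):
        self.ring = deque(maxlen=capacity)  # oldest packets fall off the end

    def capture(self, flow_id, packet):
        self.ring.append((flow_id, packet))

    def flow_packets(self, flow_id):
        """Called when a slow-handshake event names this flow."""
        return [pkt for fid, pkt in self.ring if fid == flow_id]

ring = PacketRing(capacity=3)
for i in range(5):
    ring.capture("flow-a" if i % 2 else "flow-b", f"pkt{i}")

# Only the 3 most recent packets survive; flow-a's older pkt1 is gone.
assert len(ring.ring) == 3
assert ring.flow_packets("flow-a") == ["pkt3"]
```

The fixed `maxlen` is the whole trick: capture can run continuously without the unbounded disk growth a global tcpdump would cause.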
With petertron4000 in hand, he was able to spot check a random selection of edges. The outcome:
Several latency issues we traced to bad upstream routing.
Recurrences of the issue described above, where clients lag on “their side” in the middle of the handshake, overwhelmingly sourced from AWS.
Peter was able to reproduce the problem directly on EC2: running 8 concurrent curl requests on a t2.medium reliably triggers it. Accompanying the TLS lag: high CPU usage on the instance. Further corroborating it: we can get the same thing to happen curl-ing Google.
This was an investigation set in motion by an ugly looking graph in the metrics we expose, and not from any “organically observed” performance issues; a customer saw a spike and wanted to know what it was. We’re pretty sure we know now: during the times of those spikes, they were getting requests from overloaded EC2 machines, which slow-played a TLS handshake and made our metrics look gross. :shaking-fist-emoji:.
That said: by dragnetting slow TLS handshakes, we still uncovered a bunch of pessimal routing issues, and we’re on a quest this summer to shake all of those out. The big question we’re working on now: what’s the best way to roll out something like the petertron4000 (call it the petertron4500) so that it can run semipermanently and alert for us? Fun engineering problem! We definitely don’t want to be doing high-volume low-level packet capture work indefinitely on all our edges, right? Stay tuned.
Update: Aug 10, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
August 7: Partially Broken flyd Deployment (13:30EST): Our Fly Machines development team rolled out a change to flyd, our orchestrator, which altered the way it stored configurations for Fly Machines: from a hand-rolled struct to the on-the-wire version of that struct the Protobuf compiler generates. The change missed a corner case: if it was deployed on a flyd that was in the process of updating metadata for a Machine, the new code could reach an unhandled error path. Additionally, the PR at issue tempted fate by leading off in its description with “We’ve seen enough foot guns with trying to maintain two different representations”. For roughly 20 minutes, some subset of Fly Machines on a subset of our hosts would have returned API errors on routine Machines API operations; a complete fix was rolled out within an hour.
August 7: Long Response Times From Metrics Cluster (11:00EST): A customer alerted us through a support ticket to intermittent severe slow-downs (on the order of low double-digit seconds) for queries to Fly Metrics. The problem was traced to a recent configuration change on our VictoriaMetrics backend (in one region) that was causing its internal request router component to OOM.
August 8: Storage Capacity Exhausted In Sweden (15:00EST): Requests to place Fly Machines with volumes in ARN began failing due to inadequate disk capacity on worker servers. The physicals in ARN were running in a nonstandard disk configuration; the workers impacted really were out of disk, annoyingly. We mitigated the issue within 40 minutes by migrating workloads to clear space for new Fly Machines, and initiated some long term capacity-planning and data center backup work for newer Fly.io regions. Note that “creating a new volume in a different region” is an operation that is allowed to fail in our API! With the exception of a few core regions, users are expected to be prepared to retry or place workloads in alternate regions.
This Week In Infra Engineering
Akshit began a project to diversify our edge providers and edge routing. Recall that our production infra is broadly divided into edge hosts (that receive traffic from the Internet, terminate TLS, and route it) and worker hosts (that actually host Fly Machine VMs). We have more flexibility on which providers and datacenters we can run edge hosts on, because they don’t require chonky servers. Akshit is working out the processes and network engineering (like per-region-provider IP blocks) required for us to take better advantage of available host and network inventory for edges. Ideally, we’ll wrap this project up with same-region backup routing (via different providers) in our key regions.
Peter spent the week sick. He wants you to know he feels better now.
Steve is working on rehabilitating RAID10 hosts. This is a beast of an issue that has been taunting us since late 2022: we took delivery of a bunch of extremely chonky worker servers that would handle our workloads just fine for a period, and then lock up in unkillable LVM threads. We solved those problems for customers by migrating workloads off those machines, and now Steve is doing the storage subsystem brain surgery required to find out if we can bring them back into service.
Somtochi has moved from Pet Sematary (our Vault replacement, which she got deployed fleetwide) to Corrosion2 (which she drastically improved the performance and resiliency of) to fly-proxy, the engine for our Anycast network. She’s picking up where Peter left off with backup routing, extending it to raw TCP connections and not just HTTP (reminder: if it speaks TCP on any port, you can run it here).
Dusty is working with one of our upstream hardware providers to get us end-to-end control over machine provisioning, rather than having them hand off physicals with BMC connections for us to provision. Faster, tighter physical host provisioning means we can bring up capacity in regions more quickly; we’ve been O.K. there to date, but that leads us to…
John is working on “infinite capacity” burst provisioning processes, which is a shorthand for “you can ask us for 1000 16G Fly Machines” (it has happened) “and we will just automatically spin up the underlying hardware capacity needed to satisfy that request”. We’re a ways off on this but expect it to be a theme of our updates (if it pans out). Again, this is primarily of interest to people who expect to have sudden or sporadic needs for large numbers of Fly Machines in very specific places.
Update: Aug 3, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
July 29: GraphQL API Unavailable (04:00EST): This is a pretty big deal, because deploys depend on the GQL API. The GQL API is a (legacy) Rails app. It depends on a Golang service to talk to our Macaroon validation server; that service is called tksidecar. tksidecar is built from a separate repo, and our CI system pulls it into the container build for the API. Somehow, we managed to build a tksidecar that was truncated; somehow, the SHA256 checksum for the build didn’t prevent this from getting deployed. The resulting build brought up a GQL API server that HTTP 503’d any useful request. Our API was unavailable for about 90 minutes while we performed build and CI surgery to roll back the change. Numerous process improvements ensued.
July 31: Elevated GraphQL API Errors (19:30EST): Metrics indicated elevated errors from our Machines API, which we traced to a callback to our Rails GraphQL server. Like other Rails apps of the same vintage, our GQL API relies on Sidekiq to process background jobs; those jobs include code that records new deploys in Corrosion, our fleetwide state sharing system. We stopped seeing reliable Corrosion updates (thus causing the Machines API errors). A restart and rescale patched up the problem 20 minutes later; several hours of investigation uncovered that a new set of billing jobs were driving the Fly Machines powering Sidekiq jobs to out-of-memory death.
August 2: GraphQL API Unavailable (13:30EST): Similar outcome to July 29, but shorter duration. This time, we deployed an update with a “benign Postgres migration” (there is no such thing as a benign migration); all it did was add a single column. Unbeknownst to the deployer, we run recurring business analytics queries against that Postgres database that take upwards of 30 minutes to complete. This is ordinarily not that big of a deal; the Postgres server is beefy and the analytics queries don’t block writes. Unfortunately, the “benign Postgres migration”, like any DDL change, takes an exclusive lock on the table, which does conflict with the long-running analytics query. The result: a hanging GraphQL API server. We reverted the change and restored the API server to good function within 15 minutes, and moved these analytics queries to an OLAP database.
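For intuition, here’s a minimal simulation (our own toy, not anyone’s production code) of the Postgres lock-queue behavior at play: the ACCESS EXCLUSIVE request from the DDL queues behind the long-running ACCESS SHARE holder, and because Postgres grants locks in queue order, every later ACCESS SHARE request from an API query then queues behind the DDL:

```python
# Toy model of Postgres table-lock queuing (simplified; real Postgres has
# eight lock modes, but SHARE-vs-EXCLUSIVE captures this incident).

SHARE, EXCLUSIVE = "share", "exclusive"

def can_grant(request, holders, waiters):
    if waiters:                  # locks are granted strictly in queue order
        return False
    if request == SHARE:
        return EXCLUSIVE not in holders
    return not holders           # exclusive needs the table to itself

holders, waiters = [SHARE], []   # the 30-minute analytics query is running
for request in [EXCLUSIVE, SHARE, SHARE]:   # the DDL, then two API queries
    if can_grant(request, holders, waiters):
        holders.append(request)
    else:
        waiters.append(request)

# Nothing new got granted: the DDL waits on the analytics query, and the
# API queries wait behind the DDL -- a hanging API server.
assert holders == [SHARE]
assert waiters == [EXCLUSIVE, SHARE, SHARE]
```

This is also why `lock_timeout` on migrations is a popular mitigation: the DDL gives up quickly instead of parking in front of all new readers.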
This Week In Infra Engineering
Akshit rolled out opt-in granular bandwidth billing. The new bandwidth billing scheme saves most of our customers money (especially if you make good use of our private networks, for instance by running highly utilized Postgres clusters), but, because it can end up costing a bit extra for users that don’t use private networks and are deploying in expensive regions (most notably India), it’s opt-in for existing customers. This work involved working through bugs with our upstream billing partner; Akshit has our sympathies. Akshit was also part of the response to the July 29 GQL outage, which meant they spent a chunk of this week reworking parts of our GQL server CI/CD system.
Steve got cross-connects deployed between Fly.io and Oracle Cloud, in order to accelerate object storage for our partners; objects stored off-network should no longer traverse the public Internet.
Andres and Matt improved synthetic monitoring (we built a new synthetic monitoring system a few weeks back), notably by creating and deploying new reference apps for us to measure. We have improved visibility into behavior we weren’t directly alerting on, like obviously-broken routing (think Asia->Europe->Asia). Synthetics surfaced some fly-proxy bugs, which got fixed. We flirted with making synthetics a customer-visible feature and decided we hadn’t worked out the privacy issues yet.
Ben, Dusty, Steve and John continued migrating workloads from old servers to newly provisioned ones; this involved building out more migration tooling, fixing bugs in migration tooling, and wrestling with particularly persnickety physical servers in Asia. We are asked to relay the following: “Servers go out. Servers come in. We are thus ever trapped in samsara”. Ok then.
Peter rolled out fallback routing fleetwide. He writes it up better than I do, as usual. In addition to metrics-based fallback routing, we now have rule-based routing that takes known backbone topology issues into account. Peter also resolved the LVM2 metadata issues we talked about a few weeks ago, and is deep into debugging (very) sporadic TLS handshake time delays.
Kaz updated and simplified the public status page, which now does a better job of answering the most important question (is the problem my app, or something going wrong at Fly.io?).
Update: Jul 27, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
July 26: Elevated HTTP 500 Errors From Rails API (03:00EST): Metrics flagged elevated errors from our Rails API server, though synthetics were fine. Whatever it was, it appeared to have been tied to a deploy 15 minutes earlier; it resolved minutes after the incident was called. Indeterminate (probably minimal) customer impact.
This Week In Infra Engineering
The Infra Log author had the flu this (mercifully uneventful) week, so let’s just do this week and the next together in one block.
Update: Jul 20, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
July 14: CPU Steal On A Subset of IAD (01:00EST): A customer turned up a large, multi-machine, high performance CPU reservation, which impacted shared CPU Machines on the servers their own Machines ran on. We provisioned additional IAD capacity within about an hour and rebalanced workloads.
July 15: Storage System Failure in ORD (15:00EST): A newly-provisioned physical in ORD had an incompatible RAID10 configuration, which historically causes LVM2 to get stuck. This tripped alerts; the problem was diagnosed within a few minutes and resolved by draining the server, which was later reprovisioned.
July 16: Upstream Network Connectivity Between IAD and ORD (06:00EST): The on-call team raised an incident when our product team couldn’t get a working deploy for our Elixir API backend. After a great deal of time pointlessly debugging whether a kernel update had resulted in incompatible WireGuard implementations (it’s never kernel WireGuard), we tracked the issue down to a routing loop in ORD that involved Cogent (iykyk). Resolved by our upstream providers at around 08:00EST.
July 16: Elevated Registry Errors (17:00EST): Upstream network problems at Cogent persisted throughout the day. Our container registry is a distributed application running in multiple regions with an object storage backend; it began generating persistent 500 errors, which would have corresponded to transient deployment failures for someone running flyctl. The interim solution: kill the registry Fly Machines that were throwing errors; sometimes it’s good to be distributed; we’re not default-free routed, so we have limited control over our transit, but we have application-layer control over the routing of the portions of our stack that are fully distributed.
July 18: Elevated DNS Errors (10:30EST): A team member called an incident after getting lots of metrics errors for DNS services. These weren’t customer-visible issues; they were internal DNS errors, caused by maintenance at an upstream in Melbourne. No impact, but we document everything here.
July 18: Memory Capacity Limited in SJC (16:00EST): A customer suddenly burst to several hundred high-memory performance instances in SJC, which request we were able to satisfy, but left us out of capacity for additional deployments in the region. We provisioned additional capacity, they naturally scaled down their workload without us doing anything, and we worked on the long term resource limit policy we’re going to use to address these issues in the future (we can handle almost literally any load customers want to generate, but for very large allocations in some regions we’re going to want some notice).
This Week In Infra Engineering
One thing you’re starting to see now is that Fly Machine migration and host draining is ironed out enough to be a casual solution to problems that would not have been casually resolved a year ago; “bring up new capacity, move the noisy workloads there” is a no-escalation runbook for us now. See the last 10 infra-logs for some of the effort that it took us to get to this point.
Akshit shipped a new egress billing model, which applies only to new customers for now. Under the new scheme, we segregate egress bandwidth by region in invoices, and private networking (between Fly Machines in different regions) is now cheaper. Our product team shipped a new billing system last month, and billing improvements are likely to be a continuing theme of our work.
Andres continued improving our internal synthetics monitoring systems.
Ben fixed several Fly Machine migration bugs: migration RPCs were breaking the configuration for static assets (we can serve static file HTTP directly off our proxies without bouncing HTTP requests off your Fly Machines, if you ask us to); we had a coordination bug in one of the FSMs our orchestrator flyd uses to migrate volumes; and high-availability Postgres cluster migrations were made less tricky (we do these by hand currently, for reasons we’re blogging about this week).
Matt shipped alert-critic, a chat-ops service that monitors our busiest alerting channels and tracks first-responder satisfaction with those alerts, in order to generate reports that spotlight problematic alerts that are either poorly reviewed or that don’t end up needing responses at all.
Peter generated network telemetry data to inform a fleetwide rollout of fly-proxy fallback routing, which routes requests through our overlay network during periods of network instability, at the application layer, automatically. This was deployed in Singapore last week, and is deployed more widely this week.
Tom overhauled our alerting layer for internal server health check alerts; we have hundreds of these, and they currently route directly from our health check system to our on-call system (and thus, ultimately, PagerDuty). We’ve scaled past our alert system’s ability to reliably alert (for very ambitious values of “reliably”). The new alert system routes through Vector, like the rest of our logs and telemetry, and fires alerts from Grafana; both these systems are used for customer workloads and were built to scale, unlike our internal server health check system.
Update: Jul 13, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
July 8: Capacity Issues In ORD (10:00EST): For roughly an hour, Machine launches in ORD failed for lack of physical server capacity. This was a combination of issues: constrained capacity due to decommissioning older physicals and Machine migration, physical hosts being marked ineligible due to maintenance that completed a while ago, and also just user growth in the ORD region, which normally wouldn’t cause problems, but did in this case because of the preceding two problems. Fixing the eligibility status resolved the immediate incident, and we’ve provisioned additional capacity in ORD.
July 10: Elixir API Server Down (23:00EST): A failed deploy took down our Elixir API server. Most of our day-to-day APIs are served from our legacy Rails API server, and our Machines API server is served from a fleet of Golang API servers deployed around the world, but we have some internal APIs used by partners that are served from Elixir. This should have had minimal customer impact. A revert fixed the problem within a few minutes.
July 11: Request Routing Disruption in LAX (9:00EST): A failed deploy took Corrosion, our state-management system, down in the LAX region for 5-10 minutes. During that window of time and within the LAX region, request routing and deployment information may have been stale.
July 12: Redis Capacity Issues Disrupted APIs (18:00EST): For legacy reasons, our legacy Rails API, which serves the majority of our user-facing API calls (including our GraphQL API) is backed by a Redis server we manage ourselves on an ad-hoc basis. A change in how we track Sidekiq background jobs caused a spike in the amount of storage we demand from that Redis server, which got us to a place where Redis was erroring for about 5 minutes while we extended the underlying volume. During that window, deployments would have failed.
This Week In Infra Engineering
Will shipped bottomless storage volumes backed by Tigris. This is big! Last fall, Matthew Ingwersen announced log-structured virtual disks that cache blocks while writing them to object storage for durability; the net effect is a “bottomless volume” that is continuously in snapshotted state. The tradeoff: until now, blocks had to be written to off-network object storage, like S3, which adds an order of magnitude of latency to reads of uncached blocks. Tigris is S3-flavored object storage that is both directly attached to Fly.io and also localized to the regions we operate in, which drastically improves performance. It’s early days yet, and this feature is experimental, but we’d like to get this tuned well enough to be a sane default choice for general-purpose storage.
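A drastically simplified sketch of the write-through idea (our own toy, with invented names; the real thing is a log-structured block device, not a dict): every write lands durably in object storage, a bounded local cache absorbs hot reads, and reads of evicted blocks fall back to the object store, which is where region-local Tigris latency matters:

```python
# Toy "bottomless volume": durable writes to object storage plus a
# bounded local block cache. (Invented names; not the real implementation.)

class BottomlessVolume:
    def __init__(self, object_store, cache_blocks):
        self.store = object_store        # stands in for an S3-style bucket
        self.cache = {}                  # insertion-ordered: oldest first
        self.cache_blocks = cache_blocks

    def write(self, block_no, data):
        self.store[block_no] = data      # durable copy first
        self.cache[block_no] = data
        while len(self.cache) > self.cache_blocks:
            self.cache.pop(next(iter(self.cache)))   # evict oldest block

    def read(self, block_no):
        if block_no in self.cache:       # fast path: local cache hit
            return self.cache[block_no]
        return self.store[block_no]      # slow path: fetch from objects

store = {}
vol = BottomlessVolume(store, cache_blocks=2)
for n in range(3):
    vol.write(n, f"block{n}")

assert store == {0: "block0", 1: "block1", 2: "block2"}  # all durable
assert 0 not in vol.cache                                # evicted locally
assert vol.read(0) == "block0"                           # served by store
```

The “continuously snapshotted” property falls out of the write path: the object store always holds every block, so losing the cache loses no data, only speed.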
Andres shipped a first cut of a new synthetic monitoring system (“synthetics” is the cool-kid way of saying “actually making requests and seeing if they complete”, as opposed to watching metrics). We had some synthetic monitoring, but now we have substantially more, broken out into regions, particularly for the APIs reachable from flyctl, our CLI.
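A toy synthetic probe (ours; the real system is more elaborate, regional, and aimed at flyctl-reachable APIs) makes the definition concrete: issue a real request against a reference app, record success and latency. Here the “reference app” is a local stand-in server so the sketch is self-contained:

```python
import http.server
import threading
import time
import urllib.request

# A trivial stand-in for a reference app.
class RefApp(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep output quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), RefApp)
threading.Thread(target=server.serve_forever, daemon=True).start()

def probe(url, timeout=5.0):
    """Synthetics in one function: really make the request, record
    whether it completed and how long it took."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"ok": resp.status == 200,
                    "latency_s": time.monotonic() - start}
    except OSError:
        return {"ok": False, "latency_s": time.monotonic() - start}

result = probe(f"http://127.0.0.1:{server.server_port}/")
server.shutdown()
assert result["ok"]
```

Run a few of these per region on a schedule, tag the results with probe and target regions, and you have the raw material for catching Asia->Europe->Asia routing before a metrics graph does.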
Akshit and Steve worked on internal bandwidth tracking, in part to support the egress pricing work Akshit talked about a few weeks back. Steve’s work gives us improved visibility for our own internal traffic between all pairs of servers, regions, and data centers.
John worked on our continuing theme of migrating from and decommissioning older hardware, and, in the process, resolved a gnarly problem with LVM2 metadata stores running near capacity. LVM2 is the userland counterpart to devicemapper, the kernel’s block storage framework; if you think of LVM2 and devicemapper together as an implementation of a software RAID controller, you’re not far off. LVM2 virtualizes block storage devices on top of physical devices, and reserves space on each physical to track metadata about which sectors are being used where; if space runs out, all hell breaks loose, and extending metadata space is tricky to do, but is much less tricky now. This is one of those random backend infra engineering problems that make migrations tricky (to balance workloads between servers and migrate off old servers, you sometimes want to migrate jobs onto hosts where there’s LVM2 metadata pressure); once solved, it’s much easier for us to migrate jobs without ceremony. Maybe you have to have dealt with LVM2 PV metadata issues for them to be as interesting to you as they are to us. We’ll shut up now.
Dusty is on a top-secret mission to increase the speed of OCI image pulls from containerd. Recall: you deploy, and push a Docker image to our registry. Then, a worker server, running containerd, pulls that image from the registry into its own local storage, and converts it to a block storage device we can boot a VM on. That containerd image pull is the dominant factor in how long it takes to create a Fly Machine, and we’d like create to be asymptotically as fast as start (which is so fast you can start a Fly Machine to handle an incoming HTTP request on the fly).
Peter shipped fallback routing in fly-proxy, and we can’t write it up any better than he did, so go follow that link.
Tom did a bunch of anti-abuse stuff we’re not allowed to talk about. In lieu of a fun writeup of the anti-abuse stuff Tom did, we’ve instead been asked to describe the on-call drama that kept him busy for much of the week:
ElasticSearch randomly exploded when we rolled over an index because of an incompatibility between our log ingestion (which expects JSON logs), Vector (which expects and manipulates JSON logs), and the feature flagging library we use, which does not log JSON.
Our adoption of OverlayFS for containers sharply increases the number of LVM2 volumes we need to track, which puts pressure on LVM2’s metadata storage (see above), which requires us to reprovision physical storage disks with increased metadata storage. This is especially painful because ongoing Machine migration has the side-effect of converting Fly Machines to OverlayFS backing store, and we’re migrating a lot of stuff.
Update: Jul 6, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
July 1: Consul Template Tooling (Internal) (23:00EST): Following a system update earlier in the day, consul-templaterb began exploding on a small number of our edge servers. It’s been over a year since we used Consul to track user application state, so this tooling isn’t in the critical path for user applications; in other words, this incident had no customer impact. It turned out to be an incompatible system Ruby configuration (our software update zapped a Ruby gem consul-templaterb depended on).
July 2: Poor Network Performance for Tigris in IAD (4:00EST): Tigris is our object storage partner and you should definitely check them out. At 4AM on Tuesday, they reported that they were seeing slow downloads from east coast regions, especially IAD. This turned out to be an upstream networking issue, resolved by a transit provider adjusting routes roughly 2 hours later.
July 2: Connectivity Loss in IAD (15:00EST): A BGP change at an upstream provider broke connectivity to our IAD data center for several minutes; this was unrelated to the previous incident, but much more severe (and thankfully brief).
July 3: Hardware Failure Breaks Upstash Redis in IAD (12:00EST): Upstash is our Redis partner and you should definitely check them out. Upstash runs distributed clusters of Redis servers in each of our regions. A quorum of the Fly Machines running their IAD cluster had been scheduled onto a single server months earlier; this week, that server failed. The server was recovered several hours later, and during the interval the Upstash cluster was rebuilt with a different Fly Machine on a different IAD server. This problem impacted only the IAD region, but IAD is an important region.
July 4: LiteFS Cloud RAFT Cluster Failure in IAD (20:00EST): LiteFS Cloud is a managed LiteFS service we run for our customers. Our internal LiteFS clusters run a Raft quorum scheme for leader election and cluster tracking. An open-files rlimit configuration bug forced a node in the lfsc-iad-1 cluster to restart, which in turn tickled a bug in dragonboat, the Golang Raft library the service used, which in turn forced us to rebuild the cluster. This incident had marginal customer impact and maximal Ben Johnson impact.
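One cheap defense against this class of bug (our suggestion, not LiteFS Cloud code) is to check the open-files rlimit at startup and refuse to run, rather than discovering the misconfiguration mid-flight when the process trips over it:

```python
import resource

def nofile_headroom(minimum=8192):
    """Return (ok, soft): ok iff the soft open-files limit meets the
    minimum this service needs. Call at startup and fail fast if not.
    (8192 is an arbitrary illustrative threshold.)"""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft >= minimum, soft

# At startup: bail out loudly instead of restarting under load later.
ok, soft = nofile_headroom(minimum=1)  # any sane environment clears 1
assert ok
```

A startup assertion turns a confusing distributed-systems failure (a node restarting and tickling a Raft library bug) into a boring, local configuration error.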
July 5: Elevated Machine Creation Alerts (1:00EST): Our infra team was alerted about elevated errors from the Fly Machines API. A different internal team had created a Fly Kubernetes cluster with an invalid name. Not a real incident, no customer impact, but we document everything here; that’s the rule.
This Week In Infra Engineering
The 4th of July hit on a Thursday this year, making this an extended holiday weekend for a big chunk of our team.
The big stories this week are mostly the same as last week. We continued deploying and ironing out bugs in Corrosion record compaction, we migrated off a bunch of old physical servers and continued building out migration tooling to make it even easier to drain workloads from arbitrary servers, and we improved incident alerting for customers in our UI and in flyctl.
The most important work that happened in this abbreviated week was all internal process stuff: we roadmapped out the next 12 weeks of infra work for networking, block storage, observability, hardware provisioning, and Corrosion. Lots of new projects hitting, which we’ll be talking about in upcoming posts.
Update: Jun 29, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
June 25: Authorization Errors With NATS Log Shipping (15:00EST): A customer informed our support team about an “Authorization Error” received when connecting NATS (via our internal API endpoint) to ship logs (this is a feature of the platform, normally used with the Fly Log Shipper, intended to allow users to connect their Fly.io platform logs to an off-network log management platform). As it turned out, we’d just done some work tightening up the token handling in our internal API server, and missed a corner case (users using fully-privileged Fly.io API tokens — don’t do this! — to ship logs). It took about 30 minutes to deploy a fix.
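The shape of that regression, as a hypothetical Python sketch (all names and structure invented by us): a tightened authorization check that accepts only purpose-scoped tokens silently rejects the fully-privileged tokens that used to work:

```python
# Hypothetical sketch of the token-handling corner case (invented names).

def authorized_old(token):
    # before the tightening: any valid token could ship logs
    return token["valid"]

def authorized_tightened_buggy(token):
    # after: only purpose-scoped log-shipping tokens pass...
    return token["valid"] and token["scope"] == "logs"

def authorized_fixed(token):
    # ...fix: fully-privileged tokens imply every scope, logs included
    return token["valid"] and token["scope"] in ("logs", "root")

root_token = {"valid": True, "scope": "root"}  # discouraged, but in use
assert authorized_old(root_token)
assert not authorized_tightened_buggy(root_token)  # the regression
assert authorized_fixed(root_token)
```

The general lesson transfers: when narrowing an authorization check, enumerate the over-privileged credentials that were passing before, because someone is always using them.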
June 27: 502s on Some Edges Due To Corrosion Reseeding (13:00EST): Our monitoring picked up HTTP 502 errors from some of our apps, which we tracked down to stale data in Corrosion, our distributed state tracking system. We’d recently done major maintenance with Corrosion’s database, and it had knocked out Corrosion on a small number of our edges, causing it to miss updates for about 30 minutes. The underlying issue was resolved relatively quickly, but a corner-casey interaction with blue/green deploys stuck several apps (roughly 10) that deployed during the outage in a bad state that we had to reconcile manually over the next hour.
June 28: 502s in Sao Paulo (16:45EST): About 5 apps, including our own Elixir app, saw sharply elevated HTTP 502 errors, which we again traced to stale Corrosion data, possibly from the previous day’s work. We mitigated the issue by resyncing our proxy and Corrosion, which cut errors by more than an order of magnitude but didn’t eliminate them; we narrowed the remaining errors to a particular GRU edge server and stopped advertising it, which eliminated the problem. We’re still investigating what happened here.
This Week In Infra Engineering
Somtochi rolled out a major change to the way we track distributed state with Corrosion. Because Corrosion is a distributed system (based principally on SWIM gossip) and no distributed system is without sin, we have to carefully watch the amount of data it consumes; updates are relatively easy to propagate, but reclaiming space from old, overridden data is difficult; this is the “compaction” problem. Somtochi and Jerome worked out a straightforward scheme for doing compaction, but it required adding an index to a table that had been growing without bound for many months, and would potentially trigger multi-minute startup lags everywhere Corrosion needed to get reinstalled. Instead of doing that, we “re-seeded” Corrosion, taking a known-good dataset from one of our nodes, compacting it, and then using it as the basis for new Corrosion databases. This was rolled out on many hundreds of hosts without event, and on a small number of edge servers (which have much slower disks) with some events, which you just read about above.
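Hand-waving past Corrosion’s real CRDT bookkeeping, compaction amounts to rebuilding the dataset keeping only the winning version of each key; a minimal sketch (ours, with invented names):

```python
# Toy compaction: an ever-growing log of versioned rows shrinks to just
# the newest version per key. (Corrosion's real scheme is more involved.)

def compact(rows):
    """rows: (key, version, value) tuples, many versions per key."""
    latest = {}
    for key, version, value in rows:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return sorted((k, v, val) for k, (v, val) in latest.items())

rows = [("app1", 1, "v1"), ("app1", 2, "v2"),
        ("app2", 1, "up"), ("app1", 3, "v3")]
assert compact(rows) == [("app1", 3, "v3"), ("app2", 1, "up")]
```

Doing this query efficiently in place is what wanted the new index; re-seeding sidestepped that by compacting one known-good copy offline and distributing the result.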
Akshit worked on improving the metrics we’re using for bandwidth billing, putting us in a position to true up bandwidth accounting by more carefully tracking inter-region (like, Virginia to Frankfurt) traffic, especially for users with app clusters where only some of the apps have public addresses. You’ll hear more about this from us! This is just the infra side of that work.
After Peter wrote a brief postmortem of an incident from last week, Ben Ang worked out a system to more carefully track deployments of internal components, especially when those deployments happen piecemeal as opposed to full-system redeploys. Since the first question you ask when you’re handling an incident is “what changed”, anything that gives us quicker answers also gives us shorter incidents.
Dusty, John, Simon, and Peter all worked on draining old servers, migrating Fly Machines to newer, faster hardware. This is all we’ve been talking about here for the last month or so, and it’s happening at scale now.
Andres got tipped off by an I/O performance complaint on a Mumbai worker and ended up tracking down a small network of crypto miners. The hosting business; how do you not love it? Andres did other stuff this week, too, but this was the only one that was fun to write about.
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
June 17: Southeast Asia Connectivity (15:00EST): We saw high packet loss and interface flapping on edge servers in Singapore. We stopped announcing SIN Anycast routes, redirecting SIN traffic to other nearby regions, while we investigated the problem, which resolved roughly an hour later. There would have been minimal customer impact (for the duration of the event, which didn’t impact worker servers, we would have had somewhat worse Anycast routes).
June 17: Internal App Deployment Failure That Turned Out To Be Nothing (20:30EST): An infra team member made a change to our API server and later deployed it; the deployment took upwards of an hour, and in a “where there’s smoke there’s fire” move they called an incident. The incident: a typo in the code they were deploying, along with a bug that made our API server exit with a non-failure status. No customer impact.
June 18: Volume Capacity in Brazil (10:00EST): The platform began reporting a lack of available space for new volumes in our GRU region. We were not in fact low on available volume space; rather, a change we pushed out to Corrosion, our internal state-sharing system, had a SQL bug that mis-sorted worker servers (on a condition that only occurred in GRU). We had a workaround published within 15 minutes (you could restart your “builder” Machine, the thing we run to build containers for you, and dodge the problem), and a sitewide fix within 90 minutes.
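As a purely illustrative example of how a SQL sorting bug can make a region look full (this is not the actual Corrosion query; the schema, the hostnames, and the specific mistake are all invented): if capacity is compared as text rather than as a number, "9" sorts above "100", and the placement logic picks the nearly-full host.

```python
import sqlite3

# Invented example: free space stored as TEXT, so ORDER BY compares
# lexicographically and "9" > "100".
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE workers (host TEXT, free_gb TEXT)")  # TEXT by mistake
db.executemany("INSERT INTO workers VALUES (?, ?)",
               [("gru-1", "9"), ("gru-2", "100")])

# Buggy: text sort picks the host with 9GB free as the "roomiest".
buggy = db.execute(
    "SELECT host FROM workers ORDER BY free_gb DESC LIMIT 1").fetchone()[0]

# Fixed: compare numerically.
fixed = db.execute(
    "SELECT host FROM workers ORDER BY CAST(free_gb AS INTEGER) DESC LIMIT 1"
).fetchone()[0]
print(buggy, fixed)  # gru-1 gru-2
```

A bug of this shape only fires when the data makes the orderings diverge, which is consistent with a condition that shows up in one region and nowhere else.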
June 21: Midsummer Night’s Billing Outage (3:00EST): For an interval of about 2 hours, an upstream billing provider had an outage, which in turn broke some of our invoice reporting features; notably, if you had been issued a credit that you only tried to redeem during the outage, it would not have shown up (you wouldn’t have lost the credit, but you couldn’t have used it at 3:00EST).
This Week In Infra Engineering
Intra-region host migrations are unblocked again! This is huge for us.
Peter worked with our upstream providers to eliminate pathological AS-path routes impacted by recent APAC undersea cable cuts. This work started with us noticing relatively high packet loss in Asian regions, and resulted in drastically reduced timeouts in our own telemetry and tooling, and better network quality for users. A very big win that we’re looking to compound with better monitoring and tooling. He also figured out a configuration bug that was causing Fly Machines not to use BBR congestion control on private networking traffic, which is now fixed.
Dusty and Matt got all our multi-node Postgres clusters in condition to migrate (recall: multi-node Postgres clusters had been problematic for us, because they were configured to use literal IPv6 addresses for their peer configurations, and migration breaks those addresses, which embed routing information).
In addition to spending 30 working hours getting a single email announcement (about migrations) out to customers, John, along with Ben, shipped our 6PN address forwarding tooling out to the fleet, making it possible to migrate clusters that refer to literal IPv6 addresses. Dusty, Peter, John, and Matt began draining hosts, moving the Machines running on them to more stable, modern, resilient systems on better upstreams, and lining us up to decom the much older machines. Ben drained an old server live during our internal Town Hall meeting. It was an emotional moment.
Still a bunch of people out this week! It’s summer (for most of us)!
Update: Jun 15, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
June 11: WireGuard Connectivity Issues In California and Frankfurt (12:00EST): We spent a few hours debugging roughly ten minutes of widespread but intermittent WireGuard failures from flyctl (if you were impacted, you’d have seen a “failed probing… context deadline exceeded” error). This turned out to be a transient networking problem at an upstream network provider.
June 13: Networking and Deployment Failures in Singapore (10:30EST): We (and our customers) saw elevated packet loss and sporadic errors in Singapore. This too turned out to be a problem with an upstream networking provider, who was in turn having a problem with one of their upstreams (Cogent); it was solved by disabling Cogent.
Can’t complain too much. There may come a day when we are large enough not to experience transient failures somewhere in the world, but that day is not this day. Two things we’re working aggressively on:
Monitoring systems sufficient to be sure our infra team are the first to detect these things and call incidents, rather than our support team (we’re good at this, but the bar is high).
Eliminating cursed Golang error messages like “context deadline exceeded” and “context cancelled” from our flyctl output; these content-free errors are all essentially bugs we need to fix.
This Week In Infra Engineering
Bunch of people out this week! It’s summer (for most of us)!
Andres shipped a long-overdue feature for flyctl: if you run a flyctl command that involves some physical host on our platform (most commonly: the worker server your Machine is on), we’ll warn you if we’re currently dealing with an issue on that host. We’ve had these notices in the UI for a bit, and Andres recently shipped email alerts for any host drama that impacts your Fly Machines, but we suspect this might be the more important reporting channel, since so many of our users are CLI users.
Ben integrated some work from Saleem on our ProdSec team that, during a Fly Machine migration, makes the original Machine’s 6PN address still appear to work for other members of the same network. Recall: our 6PN private network feature works under the hood by embedding routing information into IPv6 addresses; moving a Machine from one physical worker to another breaks that routing. This is only a problem for a small subset of apps that embed literal IPv6 addresses in their configurations. Saleem’s work applies network address translation during and after migrations; Ben’s work links this capability into Corrosion, our global state sharing system, to keep everyone’s Machine updated.
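A toy sketch of the underlying idea, with a completely invented address layout (Fly.io’s real 6PN scheme is not spelled out here, and the bit positions below are made up): routing information is packed into bits of the IPv6 address, so an address stored before a migration keeps pointing at the old physical host.

```python
import ipaddress

# Invented layout: a ULA prefix, a 32-bit host ID, a machine ID in the
# low bits. The real 6PN layout differs; this only shows the mechanism.
PREFIX = int(ipaddress.IPv6Address("fdaa::"))

def addr_for(host_id: int, machine_id: int) -> ipaddress.IPv6Address:
    return ipaddress.IPv6Address(PREFIX | (host_id << 32) | machine_id)

def host_of(addr: ipaddress.IPv6Address) -> int:
    return (int(addr) >> 32) & 0xFFFFFFFF

old = addr_for(host_id=17, machine_id=0xABCD)
assert host_of(old) == 17  # packets route toward host 17, wherever the VM is

# After migration to host 42, apps that stored `old` still route to host 17.
# A NAT-style translation table (which is the shape of the fix described
# above) rewrites the old address to the new one:
nat_table = {old: addr_for(host_id=42, machine_id=0xABCD)}
print(host_of(nat_table[old]))  # 42
```

This is also why apps that resolve names via DNS are unaffected: DNS answers are updated to the new address, while literal stored addresses are frozen in time.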
Peter is working on stalking cluster apps people have deployed that use statically-configured 6PN addresses, and thus need the mitigation Ben is working on. He’s doing that by detecting connections that originate prior to DNS lookups, and tracking them in SQLite databases, using a tool we call petertron3000.
Akshit and Ben did a bunch of work this week updating and improving metrics, for internal vs. edge traffic, FlyCast traffic, gateways, and flyd. Ben also caught and fixed some flyd migration bugs.
Kaz did a bunch of bug fixing and ops work in the background, but this week we’ll call out the stuff he’s been doing with customer comms, in particular this Machine Create success rate metric on our public status page, which is now much more accurate.
Simon did some rocket surgery on flyd to ensure that applications that are migrated with multiple deployed instances are migrated serially rather than concurrently, to eliminate corner cases in distributed applications.
Steve spent some time talking to Oracle about cross connects, because we have users and partners that want especially fast and reliable connectivity to Oracle OCI. So that’ll happen.
Steve also spent a bunch of time this week refactoring parts of fcm, our bespoke, Bourne Shell based physical host provisioning tool, so that it can be run from arbitrary production hosts rather than the specially designated host that it runs from now. I mean, it can’t be, not yet, but we’re… steps closer to that? We don’t know why he did this work. Sometimes people just get nerd sniped. This page is all about transparency, and Steve is this week’s designated Victim Of Transparency.
Will is working with Shaun on our platform team on a volumes project so awesome that we don’t want to spoil it yet. (Similarly: Somtochi is still working on the huge Corrosion project she was working on last week which is also such a big deal you won’t hear about it until it ships or fails).
Update: Jun 8, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
June 2: Network Outage in Bucharest (05:45EST): Our upstream provider had a network hardware outage, which took our OTP region offline for about 90 minutes.
June 3: Sporadic TLS Hangs From Github Actions (11:30EST): We spent about an hour diagnosing sporadic connection failures to Fly.io apps specifically from Github Actions. Github Actions run from VMs on Microsoft Azure. Something on the Azure network path causes repeated connections to reuse the same source port, which we think may have tripped a network flood control countermeasure. This should have been minimally (if at all) impactful to users, but it ate a bunch of infra time.
June 4: Single Host WireGuard Mesh Disruption (17:45EST): Depending on whether you ask Tom or not, either a bug in a script we use to decommission hosts or a bug in Consul resulted in two nodes in our WireGuard mesh being deleted, the intended host we were decommissioning and an extra-credit host we were not (the bug was unexpected prefix matching on a Consul KV path). This very briefly broke connectivity to the extra-credit host (low single-digit minutes). However, rather than restoring the backup WireGuard configuration we maintain, somebody (Tom) regenerated a WireGuard configuration, giving the victim host a new IPv4 address on our WireGuard mesh. This broke 6PN private networking on the host for about 20 minutes (for a small number of apps, whose operators we contacted).
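The prefix-matching failure mode is easy to model (key names here are invented): a recursive delete on a KV path treats "hosts/edge-1" as a prefix of "hosts/edge-10", and an extra-credit host’s entry goes with it.

```python
# Invented KV contents; the real Consul paths are not shown here.
kv = {"hosts/edge-1": "wg-peer", "hosts/edge-10": "wg-peer",
      "hosts/edge-2": "wg-peer"}

def delete_recurse(store, prefix):
    # What a naive recursive/prefix KV delete does.
    return {k: v for k, v in store.items() if not k.startswith(prefix)}

def delete_key(store, key):
    # Safer: delete the exact key and its children only.
    return {k: v for k, v in store.items()
            if k != key and not k.startswith(key + "/")}

after_buggy = delete_recurse(kv, "hosts/edge-1")  # edge-10 deleted too!
after_fixed = delete_key(kv, "hosts/edge-1")
print(sorted(after_buggy))  # ['hosts/edge-2']
print(sorted(after_fixed))  # ['hosts/edge-10', 'hosts/edge-2']
```

Appending a path separator (or matching the exact key) before recursing is the standard defense against this class of bug.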
June 5: Interruption In Machine Creation (11:45EST): A deployment picked up an unexpected change to our init binary, which broke boots for about 15 minutes for physical servers that got the init update.
June 6: Hardware Failure In IAD (22:00EST): A single machine in our old Equinix data center in IAD had an NVMe disk failure. Fly Machines without associated volumes were immediately migrated to our other, newer IAD data center deployment; over the course of several hours, Fly Machines with volumes attached were manually migrated. If you were affected, we’ve reached out directly. We’re in the process of decommissioning these hosts, in part because they have less-resilient disk configurations.
This Week In Infra Engineering
This week’s series of small regional incidents kept the infra team hopping.
Apart from incident response, this week’s work looked a lot like the last week’s. Rather than break it out by person, we’ll just document the themes:
Physical host migration remains the biggest ticket item for the infra team. We’re pushing forward on decommissioning old Equinix data center deployments and moving them to newer, more resilient, more cost-effective hardware. The big obstacle we’re facing right now remains applications that may (sometimes surreptitiously) be saving and reusing literal 6PN IPv6 private networking addresses, rather than DNS names. Because 6PN addresses are bound to specific physical hardware, these apps may break when migrated, which isn’t acceptable. We’re doing lots of things, from careful manual migration of apps (like Fly Postgres) where we control the cluster, to alerting and eBPF-based fixes. We knocked out another dozen or two old physicals this week.
Better host status alerting is a big deal for us. We’re going to keep seeing regional and host-local outages, which is just the nature of running a large fleet of physical servers. We’re now doing email alerting for customers on impacted hosts, and have PRs in to display alerts in flyctl whenever a user touches an app impacted by a host alert, to continue closing that loop.
Corrosion scalability and reliability work continues; Somtochi has some design changes that could further minimize the amount of state we have to share, which we’ll talk about more when they pay off. Corrosion is a super important service in our infrastructure (it’s the basis for our request routing) and reliability improvements since infra took it over have been a big win.
Update: Jun 1, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
May 31: Billing Issue With Upstash Redis (8:00EST): We’re in the middle of a transition from our old billing system to a new one based on Metronome. Billing is a gnarly problem. On Friday morning, someone called an incident after noticing a bug wherein a small number of Upstash Redis customers might have gotten double billed for something. We refunded them. This was an issue we detected internally, with no customer impact, but technically we called an incident for it and by the rules of this page we have to log it.
May 31: Network Filtering Breaks Flycast (13:00EST): As part of a project we’re running to do automatic authenticated connections between Fly Machines, our ProdSec team rolled out an nftables change. It was tested in our dev region, but had an unexpected interaction with our deployment tooling (something about the order in which tables are dropped and rebuilt). The net effect was that the fleetwide deployment broke FlyCast. Diagnosis and remediation took about 30 minutes.
This Week In Infra Engineering
Short week. Couple people out sick.
Kaz worked on getting Fly Machine creation success rates onto our status page, which you should see soon. The two most important things you can know about the Fly Machines API: “create” and “start” are two different operations (“start” is the fast one; you can pre-“create” a bunch of stopped machines and start them whenever you need them), and “create” can fail; for instance, you can ask for more resources than are available in the region you target. Read more about that here. We (well, Kaz, but we agree with him) want the success rates for this operation to be visible to customers.
Dusty and Simon spent the week heads-down on Postgres cluster migration. Read last week’s bulletin for more on that. We’re getting somewhere, but we’re not done until we can push a button and safely clear all the Machines of a physical server without having to worry too much about it.
Will won his next boss battle with NATS. We’ve successfully upgraded the whole fleet to current NATS (recall: the last attempt drove a terabit-scale message storm), on a custom branch with some of his fixes from last week. Message-volume metrics are down by up to 90% across the board (a good thing) and problems we’ve been having with connection stability after network outages (inevitable at our scale!) seem to have resolved. Will’s writing a Fresh Produce release about this and we won’t steal any more of his thunder here.
Matt spent the week making log monitoring more resilient. “Logs” here mean “the platform feature we offer that ships logs off physical servers and to customers using NATS”. What Matt’s doing is, we run a Machine on every physical server in our fleet, the “debug app”, and it checks various things and freaks out and generates alerts when things go wrong. One more thing “debug” does now is track our server inventory, and make sure we’re getting NATS logs from all of them. In other words, another constantly-running, all-points end-to-end test of log shipping, from the vantage point of our customers.
Tom is doing topological work on Corrosion. As we keep saying, we have “edge” servers and “worker” servers; the “edges” are much, much smaller than the “workers”, and we don’t want to tax them too much, so they can just do their thing terminating TLS and routing traffic. But that routing function depends on Corrosion, our gossip-based state tracking system, and Corrosion is expensive. One answer, which Tom is pursuing, is for (most) edges not to run it at all, but instead to be remote clients of it on other machines.
Dave (and Matt and Will and Simon) did a bunch of hiring work, including revamping our challenges and updating our internal processes for reviewing them. We should be much more responsive to infra candidates (we already were within tolerances, but we’re raising the bar for ourselves).
Update: May 25, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
May 23: Capacity Issues in FRA (7:00EST): FRA has of late become one of our busier regions, and we’ve been continuously adding edge capacity (recall: edge hosts take in traffic from the Internet and route it to worker hosts, which run Fly Machines for customers). We needed more edge capacity this morning, and added it. The annoyance here was compounded by telemetry issues: in our “current” configuration, capacity issues degrade the performance both of Corrosion, our internal gossip service catalog, and NATS, the messaging system we use to communicate load between proxies in our Anycast network. There’s a bunch of work happening around making those systems less sensitive to edge load.
So, yeah, pretty easy week, as far as the infra team is concerned.
This Week In Infra Engineering
The big news for the past several weeks has been intra-region Fly Machine migration: minimal-downtime migration of workloads, including large volumes, from one physical worker server to another. We hit a snag here: Fly Postgres wasn’t originally designed to be migrated, so many instances of it are intolerant of being moved and booted up on new 6PN IPv6 addresses. A bunch of work is happening to resolve this; we’ll certainly be migrating Fly Postgres instances in the near future, it’s just a question of “how”.
Simon is designing migration tooling to make large-scale migration and host draining work for everything we can safely migrate, including single-node Postgres instances (do not run single-node Postgres instances in production! — but thanks for being easy to migrate).
Ben A. worked on migrating and draining workloads to balance workers. Fly Machines bill customers primarily for the time the Machine is actually running. When they’re stopped, a Machine is a commitment to some amount of resources on its associated worker, and a promise that we will start that Machine within some n-hundred millisecond time budget. This commitment/promise dance is drastically simpler and less expensive for us to honor now that we can migrate stopped Machines.
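The accounting behind that dance can be sketched in a few lines (the capacities, reservations, and utilization threshold below are all invented): a stopped Machine still reserves resources on its worker, and migration is what lets us move those reservations off hot workers.

```python
# Invented numbers: each worker tracks capacity and the sum of resources
# reserved by its Machines, running or stopped.
workers = {
    "worker-a": {"capacity_gb": 256, "reserved_gb": 240},  # running hot
    "worker-b": {"capacity_gb": 256, "reserved_gb": 120},
}
THRESHOLD = 0.85  # invented utilization target

def overloaded(w: dict) -> bool:
    return w["reserved_gb"] / w["capacity_gb"] > THRESHOLD

def migrate(src: str, dst: str, gb: int) -> None:
    # Moving stopped Machines moves their *reservations*, not live load.
    workers[src]["reserved_gb"] -= gb
    workers[dst]["reserved_gb"] += gb

assert overloaded(workers["worker-a"])
migrate("worker-a", "worker-b", 32)  # ~32GB of stopped Machines
assert not overloaded(workers["worker-a"])
print(workers)
```

Before migration existed, honoring the sub-second start promise meant permanently stranding capacity wherever a stopped Machine happened to live; now the reservations can follow the free space.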
Now that Pet Sematary is up and running, Somtochi has switched up and is working with Sage on Corrosion, our Rust-based gossip statekeeping system. Corrosion is a (large) SQLite database managed by SWIM gossip updates. The work this week is primarily testing and bugfixing, but they may have figured out a way to reduce the size of our database by a factor of 3, which we’ll certainly write about next week if it pans out.
Dusty and Sage have been adding more edge hosts to keep up with capacity. Dusty also began trial migrations of Fly Postgres, using an IP mapping hack by Saleem, and built some internal dashboards to assist in the sort of manual host rebalancing work that Ben A. was doing this week.
Akshit cleaned up some log messages. Yawn. But also he graduated university! Congratulations to Akshit.
John rolled out a fleetwide fix for an interaction between Corrosion and our eBPF UDP forwarding path. You can run a fully-functional DNS server as a Fly Machine, because our Anycast network handles UDP as well as TCP. We do this by transparently encapsulating and routing UDP in the kernel using XDP and TC BPF. The routing logic for this scheme is written into BPF maps by a process (udpcatalogd) that subscribes to Corrosion. We decommissioned a large physical worker in AMS, which generated a big Corrosion update, which tickled a bug in a particular SQL query pattern only udpcatalogd uses, which caused ghost services for that AMS ex-worker to get stuck in our routing maps. There were a bunch of fixes for this, but the immediate thing that cleared the problem operationally involved… turning udpcatalogd off for a moment and then back on, fleetwide. Thanks, John! John also did a bunch of retrospective work on our learnings about Fly Postgres clusters. He’s also taken up residency on our public community site. Go say ‘hi’ to him (or complain about something that infra is involved in, and he’ll apparently show up).
Will has been heads-down in NATS land for the past 2 weeks, after an attempted NATS upgrade briefly melted our network a bit over a week ago. Will has been chatting with the Synadia people about our topology, and, in the meantime, found two scaling issues in the NATS core code that drive excess system chatter in our current topology. He’s prepared a couple upstream PRs.
Steve spent some time this week building new features for Drift, our Elixir/Phoenix internal server hardware inventory tool. Drift now tracks server lifecycle (for things like decommissioning), and, where our upstreams support it, automatically adding new servers to our inventory.
Update: May 18, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
May 13: Metrics Outage (9:00EST): Every once in a while, on some of our AMD-based servers with Linux IOMMU enabled, we see a weird lockup that forces us to hard-restart the host. Technically, we only need IOMMU support on hosts running GPU workloads, but we currently have it widely enabled. Anyways, that happened to an IAD host that ran part of our Metrics system, which meant that for about 20 minutes we had broken metrics ingestion while we rebooted that machine. You’d have seen a 20-30 minute gap in metrics on Grafana graphs.
May 16: Postgres Cluster Migration Failure (17:00EST): For the past several weeks we’ve been exercising machine+volume migration — the ability to move workloads, including storage, from one physical server to another by having the original server temporarily serve as a SAN server. For reasons having nothing to do with storage but rather the particulars of our IPv6 addressing scheme, migrations confused the repmgr process that manages Fly Postgres clusters. A limited number of Fly.io Postgres customers saw Postgres cluster outages, over the course of about an hour, after their underlying machines were migrated and before we halted all migrations of Postgres clusters.
May 8: Capacity Issues in DFW (5:00EST): We unexpectedly hit saturation on our “edge” servers (reminder: “edges” terminate HTTPS and serve our Anycast network, “workers” run VMs for Fly Machines), forcing us to quickly add additional edge servers in that region. This would be no big deal, but our Elixir web dashboards are served from this region, so for about 30 minutes we had degraded performance of our interface before we were able to add additional capacity.
This was an easier week than last week. The middle outage, migrating Postgres clusters, was very noticeable to impacted customers — but also quickly root-caused. The other two incidents were limited in scope, unless you’re carefully watching Fly Metrics. Are you using Fly Metrics? An incident that broke them is a weird place to pitch them, we know, but they’re pretty neat and you get them for free. Hold our feet to the fire on them being reliable!
This Week In Infra Engineering
Dusty provisioned new hardware capacity in San Jose, Singapore, Warsaw, Sydney, Atlanta, and Seattle.
Will had a conversation with engineers at Synadia (last week’s NATS outage hit right during an all-hands meeting for them!) and got some advice on reconfiguring our internal NATS topology, shifting most of our hosts to “leaf” nodes and minimizing the number of “clustering” nodes we have per region; this should trade an imperceptible amount of latency (which doesn’t matter with our NATS use case) for drastically reduced chatter. Thanks, Synadians!
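For reference, the hub-and-leaf split in NATS looks roughly like this in server configuration (hostnames and ports below are invented; this is a minimal sketch of the documented leafnode feature, not our actual topology files). The clustering nodes accept leaf connections, and everyone else dials in as a leaf rather than participating in full cluster gossip:

```conf
# On a regional "clustering" node: accept leaf connections.
leafnodes {
  port: 7422
}

# On every other host in the region: connect as a leaf node.
# (hub.region.internal is an invented hostname.)
leafnodes {
  remotes = [
    { url: "nats-leaf://hub.region.internal:7422" }
  ]
}
```

Leaf nodes only receive traffic for subjects they have local interest in, which is where the reduction in fleet-wide chatter comes from.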
Akshit finished an upgrade to Firecracker 1.7 across our fleet.
1.7 does asynchronous block I/O with io_uring. We’ve noticed, since we rolled out Cloud Hypervisor for our GPU workloads (ask us about the security work we had to do here!) that Cloud Hypervisor was doing a better job handling busy disks than the version of Firecracker we were running. We’re optimistic that the new version will close the gap.
Steve finished up the provisioning tooling for the fou-tunnel-and-SNAT monstrosity that we talked about last week: giving Fly Machines static IP addresses, for people who talk to IP-restricted 3rd party APIs.
Tom and Ben A (ask us how many Bens work here!) completed the migration and draining of workloads from the cursed “edge worker” machines we mentioned last week. Edge-workers are no more. In the process, Tom debugged a bunch of draining tooling issues (being good at this is a big deal, because we’d like to be able to drain a sus server anywhere in the world at the drop of a hat), and Ben wrote up internal playbooks for draining hosts. Requiescat, Edge Workers, 2020-2024.
Simon continued low-level work on Machine/Volume migration, which is the platform kernel of the host draining stuff Tom and Ben were doing. This week’s work focused on large volume migration. Recall that our migration system causes the “source” physical server to temporarily serve as an ad-hoc SAN for the “target” physical, allowing us to “move” a Machine from one physical to another in seconds while the actual volume block clone happens in the background; Simon’s instrumentation work may have shaved ~10s off this process (about a third of the total time).
Andres got host alerting (notifying users of hardware issues with the specific hosts they’re using, both on their personal status page and directly via email) integrated with our internal support admin tool.
Somtochi rolled out the first iteration of Pet Sematary to our flyd orchestrator. We now have two (hardware-isolated) secret stores: Hashicorp Vault, and our internal Pet Sematary. The big thing here is, if we can’t read secrets, we can’t boot machines; now, if Vault has an availability issue, we “fall back” to Pet Sematary. Requiescat, Vault-related Outages, 2020-2024.
Update: May 11, 2024
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
May 7: NATS Storm (5:30EST): Some components of our platform, most notably log-shipping, run on top of the NATS messaging system. We’ve been fighting with NATS reliability issues for the past several months, and one thing we’ve needed to do is upgrade the fleet NATS version; more recent NATS releases have a number of bug fixes. We did a staged deployment of 2.10; it looked fine; we rolled it out further; it generated a 1.7Tb/s (that’s terabits, with a ‘t’) message storm. Server CPU (on a small number of servers) buckled long before the network did; some users would have seen increased CPU steal and degraded performance. Log shipping was totally disrupted for about an hour.
May 8: Vault Certificate Breakage (7AM EST): The primary backend for secret storage at Fly.io is currently Hashicorp Vault (which is great). When Fly Machines start up, flyd, our orchestrator, fetches secrets from Vault to merge into the configuration. Vault is locked down with mTLS across our fleet; you need a client cert to talk to it at all. Due to a leaf/intermediate certificate configuration issue (we’re not even going to attempt to explain it), client certs across our fleet were invalidated, preventing flyd from fetching secrets, which prevented Fly Machines from booting.
May 8 (5:30 EST): Registry Load Balancing in AMS: Every application deployed on Fly.io is shipped in Docker (OCI) containers, and most are stored in our own Docker registries. For the past 6 months, those registries have been geographically distributed using LiteFS, with an accelerated S3 storage backend. Under heavy deployment load (because of the time of day), deploys using the AMS registry began to sporadically time out. We investigated this with AWS, and with our upstream provider, and mitigated temporarily by forcing builds to other regions; the issue resolved itself (never good news) within an hour or so. It turned out to have been a side effect of a fly-proxy change that fixed a bug with large HTTP POST bodies.
A pretty straightforward week. The most painful incident was the Vault “outage”, in part because it happened on the eve of us cutting over to Pet Sematary, our Vault replacement; in our new post-PetSem world, it’ll take an outage of both Vault and PetSem to disrupt deploys. The other two incidents were more limited in scope.
This Week In Infra Engineering
Dusty built out telemetry and monitoring for Fly Machine migration, in preparation for a regional migration of some Machines to a new upstream provider.
In addition to doing a cubic heckload of routine hiring work (do these updates sound fun? we’re hiring!), Matt and Tom revised one of our technical work sample tests, eliminating an inadvertent cheat code some candidates had discovered; a comprehensively broken environment we ask candidates to diagnose had a way to straightforwardly dump out the changes we had made to break it. Respect to those candidates for figuring that out, and helping us level up the challenge a bit.
Steve has had a fun week. He’s working on shipping (you heard it here first) static IP address assignments for individual Fly Machines — this means Fly Machines can make direct requests to the Internet (for instance, to internal on-prem APIs) with predictable IP addresses. The original plan was to run an IGP across our fleet, but Steve worked out a combination of fou tunnels and SNAT that keeps our routing discipline static while allowing addresses to float. It’s a neat trick.
Steve would also like us to tell you that he rebooted dev-pkt-dc10-9b7e.
Ben built out tooling for host draining. Last week we talked about Simon’s work shipping inter-server volume migrations. Now that we can straightforwardly move workloads between physicals, storage and all, we can rebuild the “drain” feature we had with Hashicorp Nomad back in 2020 (before we had storage), which means that when servers get janky (inevitable at our scale), or things need to be rebalanced, we can straightforwardly move all the Fly Machines to new physical homes, with minimal downtime. There’s a lot of corner cases to this (for instance: not all the volumes on a physical are necessarily attached to Machines), so this is a tooling-intensive problem.
Andres and Kaz re-established telemetry, metrics, and alerting on our Rails API after an incident last week; it didn’t directly impact deploys, but it would have made incidents involving API server problems, which are not unheard of, harder to detect and more difficult to resolve.
Kaz worked on fly-proxy-initiated Fly Machine migration. True fact: you can start a Fly Machine with an HTTP request; if a request is routed to a Fly Machine in stopped state, it’ll start. Kaz is working towards automatic migration of Machines from hosts that are overloaded (i.e., exceeding our internal utilization thresholds): instead of starting the Machine on an overloaded host, we can first migrate it to a less-loaded one. Recall that the core idea of our migration system is temporary SAN-style connections: a Machine can boot up on a new physical long before its entire volume has been copied over. Automatic migration isn’t happening yet, but it’s getting closer.
Akshit worked on cloud-hypervisor integration with our flyctl developer experience. cloud-hypervisor is like Firecracker except Intel ships it instead of AWS (they are both memory-safe Rust KVM hypervisors with minimal footprints; they even share a bunch of crates). We use cloud-hypervisor for GPU machines because it supports VFIO IOMMU device passthrough (ask us about the security work we did here, please). Operating cloud-hypervisor is similar enough to Firecracker that it’s almost a drop-in, but we’re still smoothing out the differences so they feel indistinguishable to users.
Tom and John are decommissioning our old, cursed “edge workers”. We run mainly two kinds of servers: edges that take traffic from the Internet and feed them into our proxy network, and workers that run actual Fly Machines. For historical reasons (those being: the founders made annoying decisions) we have on one of our upstreams a bunch of dual-role machines. Not for long. You may not like it, but this is what peak performance looks like:
root@edge-nac-fra1-558f: ~
$ danger-host-self-destruct-i-want-pain
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!DANGER!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!DANGER!!!!!
!!!!!DANGER!!!!! _____ _ _ _____ ______ _____ !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | __ \ /\ | \ | |/ ____| ____| __ \ !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | | | | / \ | \| | | __| |__ | |__) | !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | | | |/ /\ \ | . ` | | |_ | __| | _ / !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |__| / ____ \| |\ | |__| | |____| | \ \ !!!!!DANGER!!!!!
!!!!!DANGER!!!!! |_____/_/ \_\_| \_|\_____|______|_| \_\ !!!!!DANGER!!!!!
!!!!!DANGER!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!DANGER!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
This script will TOTALLY DECOMMISSION and DESTROY this host and REMOVE IT
PERMANENTLY from the Fly.io fleet.
To proceed, enter the hostname: edge-nac-fra1-558f
Correct, this host is edge-nac-fra1-558f.
To proceed, repeat verbatim "Yes, IRREVERSIBLY decommission"
-> Yes, IRREVERSIBLY decommission
This is your LAST CHANCE. Press ENTER to run away to safety. Press '4' to begin.
Migration is a theme of this bulletin; like we said last week, it has been kind of our “white whale”.
We have not forgotten last week’s promise to publish Matt’s incident handling process documents, but Matt wants to clean them up a bit more. We’ll keep mentioning it in updates until Matt lets us release them.
This is a small fraction of our infra team! These are just highlights; things that stuck out to us at the end of the week.
Update: May 5, 2024
This is a new thing we’re doing to surface the work our infra team does. We’re trying to accomplish two things here: 100% fidelity reporting of internal incidents, regardless of how impactful they are, and a weekly highlights reel of project work by infra team members. We’ll be posting these once a week, and bear with us while we work out the format and tone.
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact).
Apr 24: WireGuard Mesh Outage (12:30EST): All physical hosts on Fly.io are linked with a global full mesh of WireGuard connections. The control plane for this mesh is managed with Consul KV. A bad Consul input broke Consul watches across the fleet, disrupting our internal network. Brief but severe impact to fleetwide request routing; longer sporadic impact to logging, metrics, and API components, which found new ways to be susceptible to network outages.
Apr 25: Regional API Errors (2AM EST): We saw an uptick in 500 responses and found Machines API servers (flaps) in some regions had cached an unexpected nil during the previous WireGuard outage, which generated exceptions on some requests. Sporadic disruption to deployments in impacted regions.
Apr 25 (9:15 EST): Concurrency Issue With Logging: Under heavy load, a middleware component in our Rails/Ruby API server unsafely accessed an instance variable, corrupting logs. A small number of customers experienced some log disruption.
Apr 26 (5:30 EST): WireGuard Mesh Outage II: Automation code built and deployed to mitigate/prevent the WireGuard incident that occurred on Apr 24 exhibited a bug that effectively broke the WireGuard Mesh again, with the same impact and severity: brief but severe impact to request routing, longer sporadic impact to logging, metrics, and API.
Apr 30 (5:30 EST): Upstream Data Center Power Outage in SJC: One of our 2 data center deployments in SJC experienced a total loss of power, taking more than half our SJC workers (17 of 27) offline for about 30 minutes.
Apr 30 (2AM EST): Docker Registry Resource Exhaustion: The Docker registries that host customer containers are themselves Fly Machines applications, with their own resource constraints. A Registry machine unexpectedly reached a storage limit, disrupting deploys that pushed to that registry for about 20 minutes.
May 1 (8AM EST): Token Service Deployment Failure: Fly.io’s Macaroon tokens, which authenticate API calls for people with mandatory SSO enabled, or with special-purpose deploy tokens, are served by the tkdb service, which runs on isolated hardware in 3 regions, replicated with LiteFS. A bad deploy to the ams region’s tkdb jammed up LiteFS, disrupting deploys for SSO users.
May 2 (4:30 EST): Internal Metrics Failure for API Server: A Prometheus snag caused some instances of our API server to stop reporting metrics. This event had no customer impact (beyond tying up some infra engineers for an hour).
May 4 (4AM EST): Excessive Load In BOS: Network maintenance at our upstream provider for BOS broke connectivity between physical hosts, which in turn caused excessive queuing in our telemetry systems, which in turn drove up load. Brief performance degradation in BOS was resolved manually.
This was a difficult time interval, dominated by a pair of first-of-their-kind outages in the control plane for our global WireGuard mesh, which subjected us to several days of involuntary chaos testing, followed by a surprisingly long upstream power loss in one of our regions. “Incidents” for infra engineering occur somewhat routinely; these were atypically impactful to customers.
This Week In Infra Engineering
Somtochi completed an initial integration between flyd, our orchestrator, and Pet Sematary, our internal replacement for Hashicorp Vault. Fly Machines now read from both secret stores when they’re scheduled. This is the first phase of real-world deployment for Pet Sematary. Because Vault relies on a centralized Raft cluster with global client connections, and because secrets reads have to work in order to schedule Fly Machines, it has historically been a source of instability (though not within the last few months, after we drastically increased the resources we allocate to it). Pet Sematary has a much simpler data model, relying on LiteFS for leader/replica distribution, and is easier to operate. Somtochi’s work makes deployments significantly more resilient.
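A dual-read during a store cutover is a common pattern; here is a minimal, hedged sketch of the shape of it. The function and argument names are invented for illustration, not the actual flyd or Pet Sematary API.

```python
# Hypothetical dual-read sketch for a secrets-store cutover; names are
# illustrative, not the real flyd / Pet Sematary interface.
def read_secret(key: str, petsem: dict, vault: dict):
    """Prefer the new store, fall back to the legacy one.

    Reading from both during the transition means a Machine schedule
    never depends on a single secrets store being healthy.
    """
    if key in petsem:
        return petsem[key]
    return vault.get(key)  # legacy Vault fallback during migration
```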
Simon got Fly Machine inter-server volume migrations working reliably, the payoff of a months-long project that is one of the “white whales” of our platform engineering. Volumes attached to Fly Machines are locally-attached NVMe storage; Fly Machines without Volumes can be trivially moved from one server to another, but Volumes historically could not be without an uptime-sapping snapshot restore. The new migration system exploits dm-clone, which effectively creates temporary SAN connections between our physical servers to allow Fly Machines to boot on physical while reading from a Volume on another physical while the Volume is cloned. Simon’s work allows us to drain workloads from sus physical machines, and to rebalance workloads within regions.
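To make the dm-clone mechanism concrete, here’s a rough sketch of what standing up a clone target looks like with stock `dmsetup`. Every device path and size below is invented, and the actual migration pipeline certainly wraps this in tooling; treat it as an illustration of the kernel facility, not Fly.io’s implementation.

```shell
# Illustrative dm-clone setup; device paths and sizes are made up.
# On the destination host: a clone target serves reads from the remote
# source device (the temporary SAN-style connection, e.g. an nbd export)
# until the local copy ("hydration") completes.
#   table: <start> <size-in-sectors> clone <metadata dev> <dest dev> <source dev> <region size>
dmsetup create vol-clone --table \
  "0 20971520 clone /dev/fly/meta0 /dev/fly/local0 /dev/nbd0 8"

# The Fly Machine can boot against /dev/mapper/vol-clone immediately,
# while dm-clone hydrates regions in the background.
dmsetup status vol-clone   # reports hydration progress
```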
Andres built new internal tooling for host-specific customer alerts. At the scale we’re operating at, host failures are increasingly common; more hosts, more surface area for cosmic rays to hit. These issues generally impact only the small subset of customers deployed on that hardware, so we report them out in “personal status pages”. But we’re a CLI-first platform, and many of our customers don’t use our Dashboard. Andres has rolled out preemptive email notification, so affected customers hear from us directly even if they never open the Dashboard.
Dusty beefed up metrics and alerting around “stopped” Fly Machines. The premise of the Machines platform is that Machines are reasonably fast to create, but ultra-fast to start and stop: you can create a pool of Machines and keep them on standby, ready to start to field specific requests. Making this work reliably requires us to carefully monitor physical host capacity, so that we’re always ready to boot up a stopped Fly Machine. This is a capacity-planning issue unique to our platform.
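The core invariant is easy to state: a host must keep enough headroom to boot every stopped Machine parked on it. A toy sketch, with invented names and units:

```python
# Hypothetical capacity check; names and units are illustrative only.
def can_start_all_stopped(host_total_mb: int, running_mb: int,
                          stopped_mb: list) -> bool:
    """True if the host could boot every stopped Machine placed on it.

    Unlike classic bin-packing, stopped Machines consume no memory now
    but must be able to claim it at any moment.
    """
    return running_mb + sum(stopped_mb) <= host_total_mb
```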
Will continued our ongoing project to move all of Fly Metrics off of special-purpose hosts on OVH, hosts that have been flappy over the years, and onto Fly Machines running on our own platform. Metrics consumes an eye-popping amount of storage, and Will spent the week adding storage nodes to our new Fly Machine metrics cluster.
Matt capped off the week by, appropriately enough, fleshing out our incident response and review process documentation. We could say more, but what we’ll probably do instead is just make them public next week.