This is a new thing we’re doing to surface the work our infra team does. We’re trying to accomplish two things here: 100% fidelity reporting of internal incidents, regardless of how impactful they are, and a weekly highlights reel of project work by infra team members. We’ll be posting these once a week, and bear with us while we work out the format and tone.
Trailing 2 Weeks Incidents
(Incident timeline graphic: larger boxes are longer incidents, darker boxes are higher-impact.)
- Apr 24 (12:30 EST): WireGuard Mesh Outage: All physical hosts on Fly.io are linked with a global full mesh of WireGuard connections. The control plane for this mesh is managed with Consul KV. A bad Consul input broke Consul watches across the fleet, disrupting our internal network. Brief but severe impact to fleetwide request routing; longer sporadic impact to logging, metrics, and API components, which found new ways to be susceptible to network outages.
- Apr 25 (2AM EST): Regional API Errors: We saw an uptick in 500 responses and found Machines API servers (`flaps`) in some regions had cached an unexpected `nil` during the previous WireGuard outage, which generated exceptions on some requests (see the sketch just after this list). Sporadic disruption to deployments in impacted regions.
- Apr 25 (9:15 EST): Concurrency Issue With Logging: Under heavy load, a middleware component in our Rails/Ruby API server unsafely accessed an instance variable, corrupting logs. A small number of customers experienced some log disruption.
- Apr 26 (5:30 EST): WireGuard Mesh Outage II: Automation built and deployed to prevent a recurrence of the Apr 24 WireGuard incident exhibited a bug that effectively broke the WireGuard mesh again, with the same impact and severity: brief but severe impact to request routing, longer sporadic impact to logging, metrics, and API.
- Apr 30 (5:30 EST): Upstream Data Center Power Outage in SJC: One of our 2 data center deployments in SJC experienced a total loss of power, taking more than half our SJC workers (17 of 27) offline for about 30 minutes.
- Apr 30 (2AM EST): Docker Registry Resource Exhaustion: The Docker registries that host customer containers are themselves Fly Machines applications, with their own resource constraints. A Registry machine unexpectedly reached a storage limit, disrupting deploys that pushed to that registry for about 20 minutes.
- May 1 (8AM EST): Token Service Deployment Failure: Fly.io’s Macaroon tokens, which authenticate API calls for people with mandatory SSO enabled, or with special-purpose deploy tokens, are served by the `tkdb` service, which runs on isolated hardware in 3 regions, replicated with LiteFS. A bad deploy to the `ams` region’s `tkdb` jammed up LiteFS, disrupting deploys for SSO users.
- May 2 (4:30 EST): Internal Metrics Failure for API Server: A Prometheus snag caused some instances of our API server to stop reporting metrics. This event had no customer impact (beyond tying up some infra engineers for an hour).
- May 4 (4AM EST): Excessive Load In BOS: Network maintenance at our upstream provider for BOS broke connectivity between physical hosts, which in turn caused excessive queuing in our telemetry systems, driving up load. Brief performance degradation in BOS was resolved manually.
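About that cached `nil` from the Apr 25 incident: the sketch below is hypothetical Go, not `flaps` itself (every name in it is invented), but it shows the general shape of the bug: a lookup that fails during a network outage has its empty result cached, so the failure outlives the outage that caused it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical Go, not flaps itself: every name here is invented. The bug
// shape is that a lookup which fails while the network is down gets its
// empty result cached, so later requests dereference a nil long after the
// mesh has recovered.

type Region struct{ Gateway string }

type regionCache struct {
	mu      sync.Mutex
	regions map[string]*Region
}

func (c *regionCache) lookup(name string) *Region {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.regions[name]; ok {
		return r // a cached nil from a failed fetch is returned as if it were valid
	}
	r, err := fetchRegion(name)
	if err != nil {
		// Bug: caching the zero value means the error outlives the outage.
		// Fix: return the error without caching, or cache a tombstone with a TTL.
		c.regions[name] = nil
	} else {
		c.regions[name] = r
	}
	return c.regions[name]
}

// fetchRegion stands in for a control-plane call that fails while the
// WireGuard mesh is down.
func fetchRegion(name string) (*Region, error) {
	return nil, errors.New("mesh unreachable: " + name)
}

func main() {
	c := &regionCache{regions: map[string]*Region{}}
	r := c.lookup("ams")
	fmt.Println(r.Gateway) // panics: nil pointer dereference on the cached nil
}
```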
This was a difficult time interval, dominated by a pair of first-of-their-kind outages in the control plane for our global WireGuard mesh, which subjected us to several days of involuntary chaos testing, followed by a surprisingly long upstream power loss in one of our regions. “Incidents” for infra engineering occur somewhat routinely; these were atypically impactful to customers.
This Week In Infra Engineering
Somtochi completed an initial integration between `flyd`, our orchestrator, and Pet Semetary, our internal replacement for HashiCorp Vault. Fly Machines now read from both secret stores when they’re scheduled. This is the first phase of real-world deployment for Pet Semetary. Because Vault relies on a centralized Raft cluster with global client connections, and because secrets reads have to work in order to schedule Fly Machines, it has historically been a source of instability (though not within the last few months, after we drastically increased the resources we allocate to it). Pet Semetary has a much simpler data model, relying on LiteFS for leader/replica distribution, and is easier to operate. Somtochi’s work makes deployments significantly more resilient.
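To give a feel for what “read from both secret stores” means in practice, here’s a hypothetical sketch; none of these names come from `flyd`, but the shape is a primary read against Pet Semetary with a fallback to Vault, so scheduling succeeds as long as either store answers.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical sketch of the "read from both stores" phase; none of these
// names come from flyd itself. During the transition, the scheduler treats
// Pet Semetary as the primary secret store and falls back to Vault, so a
// Machine can still be scheduled as long as either store answers.

type SecretStore interface {
	Get(ctx context.Context, app, key string) ([]byte, error)
}

type dualReader struct {
	primary  SecretStore // Pet Semetary
	fallback SecretStore // Vault
}

func (d *dualReader) Get(ctx context.Context, app, key string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	val, primaryErr := d.primary.Get(ctx, app, key)
	if primaryErr == nil {
		return val, nil
	}
	val, fallbackErr := d.fallback.Get(ctx, app, key)
	if fallbackErr == nil {
		return val, nil
	}
	return nil, fmt.Errorf("secret %s/%s unavailable: %w; %w", app, key, primaryErr, fallbackErr)
}

// mapStore is a stand-in for either backend.
type mapStore map[string][]byte

func (m mapStore) Get(_ context.Context, app, key string) ([]byte, error) {
	if v, ok := m[app+"/"+key]; ok {
		return v, nil
	}
	return nil, errors.New("not found")
}

func main() {
	d := &dualReader{
		primary:  mapStore{},                                      // Pet Semetary, not yet populated
		fallback: mapStore{"myapp/DATABASE_URL": []byte("pg://…")}, // Vault still has the secret
	}
	v, err := d.Get(context.Background(), "myapp", "DATABASE_URL")
	fmt.Println(string(v), err)
}
```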
Simon got Fly Machine inter-server volume migrations working reliably, the payoff of a months-long project that is one of the “white whales” of our platform engineering. Volumes attached to Fly Machines are locally-attached NVMe storage; Fly Machines without Volumes can be trivially moved from one server to another, but Volumes historically could not be, without an uptime-sapping snapshot restore. The new migration system exploits `dm-clone`, which effectively creates temporary SAN connections between our physical servers, allowing a Fly Machine to boot on one physical host while reading from a Volume on another, while the Volume is cloned. Simon’s work allows us to drain workloads from sus physical machines, and to rebalance workloads within regions.
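If you haven’t run into `dm-clone` before, here’s a rough sketch of the mechanics it gives us, written as Go driving `dmsetup`. The device names are made up and this is not our migration code, which has to handle much more (exporting the source Volume across hosts, swapping the table out when hydration finishes, cleanup on failure).

```go
package main

// Rough sketch of dm-clone mechanics, not our actual migration code. Assumes
// root, dmsetup, a kernel with the dm-clone target, and that the source
// Volume on the old host is already exported to this host somehow. All
// device paths below are invented.

import (
	"fmt"
	"os/exec"
	"time"
)

func run(args ...string) (string, error) {
	out, err := exec.Command("dmsetup", args...).CombinedOutput()
	return string(out), err
}

func main() {
	const (
		name        = "vol-clone"            // hypothetical dm device name
		sectors     = "209715200"            // volume size in 512-byte sectors (100 GiB)
		metadataDev = "/dev/vg0/clone-meta"  // small local metadata device
		destDev     = "/dev/vg0/vol-local"   // local NVMe destination
		sourceDev   = "/dev/nvme-fabric/vol" // source Volume exported from the old host
		regionSize  = "8"                    // 8 sectors = 4 KiB copy regions
	)

	// Create the clone target with hydration initially disabled. The Machine
	// can mount /dev/mapper/vol-clone and boot right away; reads of regions
	// that haven't been copied yet are served from the remote source device.
	table := fmt.Sprintf("0 %s clone %s %s %s %s 1 no_hydration",
		sectors, metadataDev, destDev, sourceDev, regionSize)
	if out, err := run("create", name, "--table", table); err != nil {
		panic(out)
	}

	// Kick off background hydration: copying every region to the local disk.
	if out, err := run("message", name, "0", "enable_hydration"); err != nil {
		panic(out)
	}

	// dmsetup status reports hydrated/total region counts for clone targets.
	// Real code would parse those counts, and once hydration completes, swap
	// the table for a plain linear target and tear down the cross-host link.
	for {
		out, _ := run("status", name)
		fmt.Print(out)
		time.Sleep(5 * time.Second)
	}
}
```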
Andres built new internal tooling for host-specific customer alerts. At the scale we’re operating at, host failures are increasingly common: more hosts, more surface area for cosmic rays to hit. These issues generally impact only the small subset of customers deployed on that hardware, so we report them out in “personal status pages”. But we’re a CLI-first platform, and many of our customers don’t use our Dashboard. Andres has rolled out preemptive email notification, so impacted customers hear about host issues directly, even if they never open the Dashboard.
Dusty beefed up metrics and alerting around “stopped” Fly Machines. The premise of the Machines platform is that Machines are reasonably fast to create, but ultra-fast to start and stop: you can create a pool of Machines and keep them on standby, ready to start to field specific requests. Making this work reliably requires us to carefully monitor physical host capacity, so that we’re always ready to boot up a stopped Fly Machine. This is a capacity planning issue unique to our platform.
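To make the capacity-planning point concrete, here’s a toy sketch (hypothetical names, not our real tooling) of the invariant the new alerting watches: every host needs enough headroom to start all of the stopped Machines parked on it.

```go
package main

// Hypothetical sketch, not our real tooling: a stopped Fly Machine uses no
// RAM today, but we have to be able to start it later, so each host's free
// memory has to cover what its stopped Machines would consume if they all
// woke up at once.

import "fmt"

type Machine struct {
	State    string // "started" or "stopped"
	MemoryMB int
}

type Host struct {
	Name          string
	TotalMemoryMB int
	Machines      []Machine
}

// stoppedDemandMB is how much memory the host would need to start every
// stopped Machine currently placed on it.
func stoppedDemandMB(h Host) int {
	demand := 0
	for _, m := range h.Machines {
		if m.State == "stopped" {
			demand += m.MemoryMB
		}
	}
	return demand
}

func usedMB(h Host) int {
	used := 0
	for _, m := range h.Machines {
		if m.State == "started" {
			used += m.MemoryMB
		}
	}
	return used
}

func main() {
	h := Host{
		Name:          "worker-1",
		TotalMemoryMB: 256 * 1024,
		Machines: []Machine{
			{State: "started", MemoryMB: 128 * 1024},
			{State: "stopped", MemoryMB: 64 * 1024},
			{State: "stopped", MemoryMB: 96 * 1024},
		},
	}
	free := h.TotalMemoryMB - usedMB(h)
	if demand := stoppedDemandMB(h); demand > free {
		// In the real system this feeds a metric and an alert, not a print.
		fmt.Printf("%s oversubscribed: stopped Machines need %d MB, only %d MB free\n",
			h.Name, demand, free)
	}
}
```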
Will continued our ongoing project to move all of Fly Metrics off of special-purpose hosts on OVH, which have been flappy over the years, and onto Fly Machines running on our own platform. Metrics consumes an eye-popping amount of storage, and Will spent the week adding storage nodes to our new Fly Machine metrics cluster.
Matt capped off the week by, appropriately enough, fleshing out our incident response and review process documentation. We could say more, but what we’ll probably do instead is just make them public next week.