Trailing 2 Weeks Incidents
[Incident timeline graphic: larger boxes are longer incidents, darker boxes are higher-impact.]
January 23: Partial Logging Failures in SYD and SJC (10:00 EST): We use (and plan to keep using) HashiCorp Consul to manage our internal system configurations. Our Consul deployment is a cluster of large server machines managed separately from the rest of our production fleet. One of those servers hiccupped, and consul-templaterb, the “Consul driver” we use on our worker servers, got a blank value for a Vector configuration on a small number of workers in Sydney and San Jose. That shut Vector down, which stopped log ingestion for about an hour while we investigated and mitigated the problem. This impacted a very small number of apps (most hosts didn’t get the weird Consul blip).
January 23: Depot Builds Failing (13:00 EST): For the last several months, remote Docker builds on Fly.io have defaulted to Depot, a partner provider that is better at doing efficient Docker builds than we are. They had a database outage. For about an hour, users needed to pass --depot=false on their deploys while their outage resolved.
January 23: Capacity Issues in IAD (14:00 EST): IAD is our busiest region. We experienced a sudden and unexpected usage shift and ran low on memory capacity. It took about an hour to bring new capacity online from a backup provider, during which time users may have sporadically been unable to deploy their apps in IAD.
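An aside on the first incident: the failure mode was “a template rendered to an empty value and it got installed anyway.” Here’s a minimal sketch, in Go, of the kind of guard a render-and-reload wrapper can apply before restarting something like Vector. This is purely illustrative; it is not our actual consul-templaterb setup, and the function and path names are made up.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// installRenderedConfig refuses to replace a working config with a blank one.
// Hypothetical wrapper; the real pipeline is consul-templaterb driving Vector.
func installRenderedConfig(rendered string, path string) error {
	if strings.TrimSpace(rendered) == "" {
		// A blank render usually means the KV read failed, not that the
		// config is supposed to be empty. Keep the old file and complain.
		return fmt.Errorf("refusing to install empty config for %s", path)
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, []byte(rendered), 0o644); err != nil {
		return err
	}
	// Atomic rename so a half-written file never gets loaded.
	return os.Rename(tmp, path)
}

func main() {
	if err := installRenderedConfig("", "/etc/vector/vector.toml"); err != nil {
		fmt.Println(err)
	}
}
```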
This Week In Engineering
Ben A. is working on observability/telemetry for Fly Machine migrations. A Fly Machine is brought into being through a recorded log of finite state machine steps stored in a BoltDB; “migration” is just another FSM, as far as our orchestrator is concerned. The migration FSM is somewhat complicated. We need better visibility into how each FSM step can fail. A reasonable way to think about this is that we’re trying to do for migrations, which form a fully asynchronous distributed system, what we do with OTel tracing elsewhere in our platform: get clear sightlines without scraping logs, which is what we’ve been stuck with ‘til now. See, Ben, I had no trouble writing something interesting out of that.
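For a sense of what that looks like mechanically, here’s a minimal sketch (the Step type, names, and wiring are assumptions, not the orchestrator’s real code) that wraps each FSM step in an OpenTelemetry span, so a failed step shows up as an errored span on the migration trace instead of a log line to grep for.

```go
package fsmtrace

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// Step is a single migration FSM transition. This is an illustrative type,
// not the real orchestrator's.
type Step struct {
	Name string
	Run  func(ctx context.Context) error
}

var tracer = otel.Tracer("machine-migration")

// runSteps executes each FSM step inside its own span, so a failure is
// visible as an errored child span on the migration trace.
func runSteps(ctx context.Context, migrationID string, steps []Step) error {
	ctx, root := tracer.Start(ctx, "migrate-machine")
	root.SetAttributes(attribute.String("migration.id", migrationID))
	defer root.End()

	for _, s := range steps {
		stepCtx, span := tracer.Start(ctx, s.Name)
		if err := s.Run(stepCtx); err != nil {
			span.RecordError(err)
			span.SetStatus(codes.Error, err.Error())
			span.End()
			root.SetStatus(codes.Error, "migration failed at "+s.Name)
			return err
		}
		span.End()
	}
	return nil
}
```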
Dusty created a unified closed-loop system for tracking hardware issues across all our various providers, in order to stop being our single source of “which providers are working on which issues” truth. Those kinds of issues are now bridged into our Slack and recorded in a single issue database. I could tell a story about how this will resolve hardware issues and bring capacity online more quickly for users, but really, this is just making life better for our infra team.
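I don’t know the shape of Dusty’s actual schema, but the “single issue database” idea boils down to a normalized record per provider ticket that can be mirrored into a Slack thread; something like this hypothetical sketch.

```go
package hwissues

import "time"

// HardwareIssue is a hypothetical normalized record for a provider ticket;
// the field names are illustrative, not the real schema.
type HardwareIssue struct {
	Provider      string    // upstream hosting provider
	ProviderRef   string    // the provider's own ticket ID
	Host          string    // affected physical host
	Region        string    // e.g. "iad", "syd"
	Summary       string    // what's broken (NVMe, NIC, PSU, ...)
	Status        string    // open / awaiting-provider / resolved
	SlackThreadTS string    // the bridged Slack thread, closing the loop
	OpenedAt      time.Time
	UpdatedAt     time.Time
}
```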
Will and Tim are working on stable 6PN addressing for Fly Machines. This is motivated by a terrible mistake we made several years ago when we launched Fly Postgres and had it auto-configure with IPv6 address literals; those addresses embed hardware addresses in them, which cause obvious problems when we need to migrate a database from one host to another. Will also did a deep dive into IO scheduling and performance; there are things we can be doing to manage IOPS load between different applications, and also things we can do with our hypervisors to increase IO performance (Firecracker, our mainstay hypervisor, is single-process-per-VM, which means that CPU, disk, and network can all cause contention with each other; there are fixes for this, but they depend on kernel updates we don’t uniformly deploy).
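The fix for the Postgres half of that is conceptually simple: re-resolve a name at connect time instead of pinning a 6PN literal that encodes a host. A minimal sketch (the app name is hypothetical; inside a Fly Machine, .internal names are answered by our internal DNS):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Don't do this: a pinned 6PN literal encodes where the database *used*
	// to live, and breaks when the Machine migrates to another host.
	//   conninfo := "host=fdaa:... port=5432"

	// Do this: resolve the app's .internal name at connect time, so the
	// answer tracks wherever the Machine currently is. "my-postgres" is a
	// made-up app name.
	addrs, err := net.DefaultResolver.LookupIPAddr(ctx, "my-postgres.internal")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Println("current 6PN address:", a.IP)
	}
}
```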
Somtochi spent the week on a support rotation, working with Lillian (welcome back Lillian!) to address gnarly customer issues. We all do rotations in support, except for me; I cop out by saying I’m too busy writing this bulletin.
Peter is moving stuff forward on regionalizing fly-proxy. The core idea here is to decrease the distributed systems failure blast radius, by relaxing the design constraint that every proxy region has fine-grained knowledge of every Fly Machine on our platform. This is straightforward to do for HTTP services, which have a command/control system that allows us to bounce requests around. It’s a real problem for raw TCP services, though: once we make a TCP connection, we can’t naively retry it elsewhere. Peter has a bananas solution for this that (I am not making this up) uses the TCP URG pointer as a signaling mechanism. We’ll write more about this if it doesn’t end up dooming us all.
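For the curious: TCP’s urgent mechanism lets a sender mark a byte that the receiver can pick up out-of-band rather than in the normal stream, which is presumably what makes it attractive as a signaling side channel. Here’s a sketch of the sending side on Linux; this is not Peter’s fly-proxy code, just the MSG_OOB mechanics, and the endpoint is made up.

```go
package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

// sendUrgentByte writes a single byte of TCP urgent data on an established
// connection; the URG pointer marks it so the peer can treat it as a signal
// rather than part of the normal byte stream.
func sendUrgentByte(conn *net.TCPConn, b byte) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var sendErr error
	ctlErr := raw.Control(func(fd uintptr) {
		// MSG_OOB is what sets the URG flag/pointer on the outgoing segment.
		sendErr = unix.Sendmsg(int(fd), []byte{b}, nil, nil, unix.MSG_OOB)
	})
	if ctlErr != nil {
		return ctlErr
	}
	return sendErr
}

func main() {
	// Hypothetical endpoint, purely for illustration.
	c, err := net.Dial("tcp", "127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	if err := sendUrgentByte(c.(*net.TCPConn), 0x01); err != nil {
		log.Fatal(err)
	}
}
```

The receiving side has to go looking for the urgent mark (recv with MSG_OOB, or SO_OOBINLINE/SIOCATMARK), and that’s as far as this sketch goes.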
Kaz continues to work on the AMD-Vi reboot process (there’s an AMD virtualization hardware bug that pops up randomly once an affected host gets past a threshold uptime, so we’re doing orderly maintenance-window reboots of upcoming potential victims). This week: lots of customer notification work.
Steve pitched in to Fly Kubernetes (FKS), building a backup system that stores etcd data in object storage. Steve was also in the middle of the Consul outage we documented this week: that turned out to be a hardware issue, a machine burning out all its NVMe drives within an hour, and then a scramble to replace it.
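I don’t know the details of Steve’s backup system beyond “etcd snapshot into object storage,” but the bones of that look something like this sketch; the endpoint, bucket name, and choice of S3-compatible uploader are all assumptions.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Endpoint and bucket names here are made up for illustration.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Stream a point-in-time snapshot of the etcd keyspace.
	snap, err := cli.Snapshot(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer snap.Close()

	// Pipe the snapshot straight into an S3-compatible bucket.
	uploader := s3manager.NewUploader(session.Must(session.NewSession()))
	key := fmt.Sprintf("fks/etcd-%s.snap", time.Now().UTC().Format("2006-01-02T15-04-05Z"))
	_, err = uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("fks-etcd-backups"),
		Key:    aws.String(key),
		Body:   snap,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("uploaded", key)
}
```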