Update: May 18, 2024

Infra is the largest team in our engineering organization. This page is a week-by-week record of the work that team does.

Trailing 2 Weeks Incidents

A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and will generally be a superset of our status page events. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • May 13: Metrics Outage (9:00EST): Every once in awhile, on some of our AMD-based servers, with Linux IOMMU enabled, we see a weird lockup that forces us to hard-restart a host. Technically, we only need IOMMU stuff on hosts running GPU workloads, but right now we had it widely enabled. Anyways, that happened to an IAD host that ran part of our Metrics system, which meant that for about 20 minutes we had broken metrics ingestion while we rebooted that machine. You’d have seen a 20-30 minute gap in metrics on Grafana graphs.

  • May 16: Postgres Cluster Migration Failure (5PM EST): For the past several weeks we’ve been exercising machine+volume migration — the ability to move workloads, including storage, from one physical server to another by having the original server temporarily serve as a SAN server. For reasons having nothing to do with storage but rather the particulars of our IPv6 addressing scheme, migrations confused the repmgr process that manages Fly Postgres clusters. A limited number of Fly.io Postgres customers saw Postgres cluster outages after their underlying machines we migrated, and before we halted all migrations of Postgres clusters, over the course of about an hour.

  • May 8 (5:00 EST): Capacty Issues in DFW: We unexpectedly hit saturation on our “edge” servers (reminder: “edges” terminate HTTPS and serve our Anycast network, “workers” run VMs for Fly Machines), forcing us to quickly add additional edge servers in that region. This would be no big deal, but our Elixir web dashboards are served for this region, so for about 30 minutes we had degraded performance of our interface before we were able to add additional capacity.

This was an easier week than last week. The middle outage, migrating Postgres clusters, was very noticeable to impacted customers — but also quick root-caused. The other two incidents were limited in scope. Unless you’re carefully watching Fly Metrics. Are you using Fly Metrics? An incident that broke them is a weird place to pitch them, we know, but they’re pretty neat and you get them for free. Hold our feet to the fire on them being reliable!

This Week In Infra Engineering

Dusty provisioned new hardware capacity in San Jose, Singapore, Warsaw, Sydney, Atlanta, and Seattle.

Will had a conversation with engineers at Synadia (last week’s NATS outage hit right during an all-hands meeting for them!) and got some advice on reconfiguring our internal NATS topology, shifting most of our hosts to “leaf” nodes and minimizing the number of “clustering” notes we have per region; this should trade an imperceptible amount of latency (which doesn’t matter with our NATS use case) for drastically reduced chatter. Thanks, Synadians!

Akshit finished an upgrade to Firecracker 1.7 across our fleet. 1.7 does asynchronous block I/O with io_uring. We’ve noticed, since we rolled out Cloud Hypervisor for our GPU workloads (ask us about the security work we had to do here!) that Cloud Hypervisor was doing a better job handling busy disks than the version of Firecracker we were running. We’re optimistic that the new version will close the gap.

Steve finished up the provisioning tooling for the fou-tunnel-and-SNAT monstrosity that we talked about last week: giving Fly Machines static IP addresses, for people who talk to IP-restricted 3rd party APIs.

Tom and Ben A (ask us how many Bens work here!) completed the migration and draining of workloads from the cursed “edge worker” machines we mentioned last week. Edge-workers are no more. In the process, Tom debugging a bunch of draining tooling issues (being good at this is a big deal, because we’d like to be able to drain a sus server anywhere in the world at the drop of a hat), and Ben wrote up internal playbooks for draining hosts. Requiescat, Edge Workers, 2020-2024.

Simon continued low-level work on Machine/Volume migration, which is the platform kernel of the host draining stuff Tom and Ben were doing. This week’s work focused on large volume migration. Recall that our migration system causes the “source” physical server to temporarily serve as an ad-hoc SAN for the “target” physical, allowing us to “move” a Machine from one physical to another in seconds while the actual volume block clone happens in the background; Simon’s instrumentation work may have shaved about ~10s off this process (about 1/3rd of the total time).

Andres got host alerting (notifying users of hardware issues with the specific hosts they’re using, both on their personal status page and directly via email) integrated with our internal support admin tool.

Somtochi rolled out the first iteration of Pet Semetary to our flyd orchestrator. We now have two (hardware-isolated) secret stores: Hashicorp Vault, and our internal Pet Semetary. The big thing here is, if we can’t read secrets, we can’t boot machines; now, if Vault has an availability issue, we “fall back” to Pet Semetary. Requeiscat, Vault-related Outages, 2020-2024.

Update: May 11, 2024

Infra is the largest team in our engineering organization. This page is a week-by-week record of the work that team does.

Trailing 2 Weeks Incidents

A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and will generally be a superset of our status page events. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • May 7: NATS Storm (5:30EST): Some components of our platform, most notably log-shipping, run on top of the NATS messaging system. We’ve been fighting with NATS reliability issues for the past several months, and one thing we’ve needed to do is upgrade the fleet NATS version; more recent NATS releases have a number of bug fixes. We did a staged deployment of 2.10; it looked fine; we rolled it out further; it generated a 1.7tb/s (that’s ‘t’, with a ‘t’) message storm. Server CPU (on a small number of servers) buckled long before the network did; some users would have seen increased CPU steal and degraded performance. Log shipping was totally disrupted for about an hour.

  • May 8: Vault Certificate Breakage (7AM EST): The primary backend for secret storage at Fly.io is currently Hashicorp Vault (which is great). When Fly Machines start up, flyd, our orchestrator, fetches secrets from Vault to merge into the configuration. Vault is locked down with mTLS across our fleet; you need a client cert to talk to it at all. Due to a leaf/intermediate certificate configuration issue (we’re not even going to attempt to explain it), client certs across our fleet were invalidated, preventing flyd from fetching secrets, which prevented Fly Machines from booting.

  • May 8 (5:30 EST): Registry Load Balancing in AMS: Every application deployed on Fly.io is shipped in Docker (OCI) containers, and most are stored in our own Docker registries. For the past 6 months, those registries have been geographically distributed using LiteFS, with an accelerated S3 storage backend. Under heavy deployment load (because of the time of day), deploys using the AMS registry began to sporadically time out. We investigated this with AWS, and with our upstream provider, and mitigated temporarily by forcing builds to other regions; the issue resolved itself (never good news) within an hour or so. It turned out to have been a side effect of a fly-proxy change that fixed a bug with large HTTP POST bodies.

A pretty straightforward week. The most painful incident was the Vault “outage”, in part because it happened on the eve of us cutting over to Pet Semetary, our Vault replacement; in our new post-petsem world, it’ll take an outage of both Vault and PetSem to disrupt deploys. The other two incidents were more limited in scope.

This Week In Infra Engineering

Dusty built out telemetry and monitoring for Fly Machine migration, in preparation for a regional migration of some Machines to a new upstream provider.

In addition to doing a cubic heckload of routine hiring work (do these updates sound fun? we’re hiring!), Matt and Tom revised one of our technical work sample tests, eliminating an inadvertent cheat code some candidates had discovered; a comprehensively broken environment we ask candidates to diagnose had a way to straightforwardly dump out the changes we had made to break it. Respect to those candidates for figuring that out, and helping us level up the challenge a bit.

Steve has had a fun week. He’s working on shipping (you heard it here first) static IP address assignments for individual Fly Machines — this means Fly Machines can make direct requests to the Internet (for instance, to internal on-prem APIs) with predictable IP addresses. The original plan was to run an IGP across our fleet, but Steve worked out a combination of fou tunnels and SNAT that keeps our routing discipline static while allowing address to float. It’s a neat trick.

Steve would also like us to tell you that he rebooted dev-pkt-dc10-9b7e.

Ben built out tooling for host draining. Last week we talked about Simon’s work shipping inter-server volume migrations. Now that we can straightforwardly move workloads between physicals, storage and all, we can rebuild the “drain” feature we had when with Hashicorp Nomad back in 2020 (before we had storage), which means that when servers get janky (inevitable at our scale), or things need to be rebalanced, we can straightforwardly move all the Fly Machines to new physical homes, with minimal downtime. There’s a lot of corner cases to this (for instance: not all the volumes on a physical are necessarily attached to Machines), so this is a tooling-intensive problem.

Andres and Kaz re-established telemetry, metrics, and alerting on our Rails API, after an incident last week - it didn’t directly impact deploys, but would have made incidents involving API server problems, which are not unheard of, harder to detect and more difficult to resolve.

Kaz worked on fly-proxy-initiated Fly Machine migration. True fact: you can start a Fly Machine with an HTTP request; if a request is routed to a Fly Machine in stopped state, it’ll start. Kaz is working towards automatic migration of Machines from hosts that overloaded (i.e., exceeding our internal utilization thresholds): instead of starting on a busy machine, we can initiate a migration to a less-loaded machine. Recall that the core idea of our migration system is temporary SAN-style connections: a Machine can boot up on a new physical long before its entire volume has been copied over. Automatic migration isn’t happening yet, but it’s getting closer.

Akshit worked on cloud-hypervisor integration with our flyctl developer experience. cloud-hypervisor is like Firecracker except Intel ships it instead of AWS (they are both memory-safe Rust KVM hypervisors with minimal footprints; they even share a bunch of crates). We use cloud-hypervisor for GPU machines because it supports VFIO IOMMU device passthrough (ask us about the security work we did here, please). Operating cloud-hypervisor is similar enough to Firecracker that it’s almost a drop-in, but we’re still smoothing out the differences so they feel indistinguishable to users.

Tom and John are decommissioning our old, cursed “edge workers”. We run mainly two kinds of servers: edges that take traffic from the Internet and feed them into our proxy network, and workers that run actual Fly Machines. For historical reasons (those being: the founders made annoying decisions) we have on one of our upstreams a bunch of dual-role machines. Not for long. You may not like it, but this is what peak performance looks like:

root@edge-nac-fra1-558f: ~
$ danger-host-self-destruct-i-want-pain
!!!!!DANGER!!!!!  _____          _   _  _____ ______ _____   !!!!!DANGER!!!!!
!!!!!DANGER!!!!! |  __ \   /\   | \ | |/ ____|  ____|  __ \  !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |  | | /  \  |  \| | |  __| |__  | |__) | !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |  | |/ /\ \ | . ` | | |_ |  __| |  _  /  !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |__| / ____ \| |\  | |__| | |____| | \ \  !!!!!DANGER!!!!!
!!!!!DANGER!!!!! |_____/_/    \_\_| \_|\_____|______|_|  \_\ !!!!!DANGER!!!!!

This script will TOTALLY DECOMMISSION and DESTROY this host and REMOVE IT
PERMANENTLY from the Fly.io fleet.
To proceed, enter the hostname: edge-nac-fra1-558f

Correct, this host is edge-nac-fra1-558f.

To proceed, repeat verbatim "Yes, IRREVERSIBLY decommission"
-> Yes, IRREVERSIBLY decommission
This is your LAST CHANCE. Press ENTER to run away to safety. Press '4' to begin.

Migration is a theme of this bulletin; like we said last week, it has been kind of our “white whale”.

We have not forgotten last week’s promise to publish Matt’s incident handling process documents, but Matt wants to clean them up a bit more. We’ll keep mentioning it in updates until Matt lets us release them.

This is a small fraction of our infra team! These are just highlights; things that stuck out to us at the end of the week.

Update: May 5, 2024

Infra is the largest team in our engineering organization. This page is a week-by-week record of the work that team does.

This is a new thing we’re doing to surface the work our infra team does. We’re trying to accomplish two things here: 100% fidelity reporting of internal incidents, regardless of how impactful they are, and a weekly highlights reel of project work by infra team members. We’ll be posting these once a week, and bear with us while we work out the format and tone.

Trailing 2 Weeks Incidents

A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and will generally be a superset of our status page events. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact).

  • Apr 24: WireGuard Mesh Outage (12:30EST): All physical hosts on Fly.io are linked with a global full mesh of WireGuard connections. The control plane for this mesh is managed with Consul KV. A bad Consul input broke Consul watches across the fleet, disrupting our internal network. Brief but severe impact to fleetwide request routing; longer sporadic impact to logging, metrics, and API components, which found new ways to be susceptible to network outages.
  • Apr 25: Regional API Errors (2AM EST): We saw an uptick in 500 responses and found Machines API servers (flaps) in some regions had cached an unexpected nil during the previous WireGuard outage, which generated exceptions on some requests. Sporadic disruption to deployments in impacted regions.
  • Apr 25 (9:15 EST): Concurrency Issue With Logging: Under heavy load, a middleware component in our Rails/Ruby API server unsafely accessed an instance variable, corrupting logs. A small number of customers experienced some log disruption.
  • Apr 26 (5:30 EST): WireGuard Mesh Outage II: Automation code built and deployed to mitigate/prevent the WireGuard incident that occurred on Apr 24 exhibited a bug that effectively broke the WireGuard Mesh again, with the same impact and severity: brief but severe impact to request routing, longer sporadic impact to logging, metrics, and API.
  • Apr 30 (5:30 EST): Upstream Data Center Power Outage in SJC: One of our 2 data center deployments in SJC experienced a total loss of power, taking more than half our SJC workers (17 of 27) offline for about 30 minutes.
  • Apr 30 (2AM EST): Docker Registry Resource Exhaustion: The Docker registries that host customer containers are themselves Fly Machines applications, with their own resource constraints. A Registry machine unexpectedly reached a storage limit, disrupting deploys that pushed to that registry for about 20 minutes.
  • May 1 (8AM EST): Token Service Deployment Failure: Fly.io’s Macaroon tokens, which authenticate API calls for people with mandatory SSO enabled, or with special-purpose deploy tokens, is served by the tkdb service, which runs on isolated hardware in 3 regions, replicated with LiteFS. A bad deploy to the ams region’s tkdb jammed up LiteFS, disrupting deploys for SSO users.
  • May 2 (4:30 EST): Internal Metrics Failure for API Server: A Prometheus snag caused some instances of our API server to stop reporting metrics. This event had no customer impact (beyond tying up some infra engineers for an hour).
  • May 4 (4AM EST): Excessive Load In BOS: Network maintance at our upstream provider for BOS broke connectivity between physical hosts, which in turn caused excessive queuing in our telemetry systems, which in turn drove up load. Brief performance degradation in BOS was resolved manually.

This was a difficult time interval, dominated by a pair of first-of-their-kind outages in the control plane for our global WireGuard mesh, which subjected us to several days of involuntary chaos testing, followed by a surprisingly long upstream power loss in one of our regions. “Incidents” for infra engineering occur somewhat routinely; these were atypically impactful to customers.

This Week In Infra Engineering

Somtochi completed an initial integration between flyd, our orchestrator, and Pet Semetary, our internal replacement for Hashicorp Vault. Fly Machines now read from both secret stores when they’re scheduled. This is the first phase of real-world deployment for Pet Semetary. Because Vault relies on a centralized Raft cluster with global client connections, and because secrets reads have to work in order to schedule Fly Machines, it has historically been a source of instability (though not within the last few months, after we drastically increased the resources we allocate to it). Pet Semetary has a much simpler data model, relying on LiteFS for leader/replica distribution, and is easier to operate. Somtochi’s work makes deployments significantly more resilient.

Simon got Fly Machine inter-server volume migrations working reliably, the payoff of a months-long project that is one of the “white whales” of our platform engineering. Volumes attached to Fly Machines are locally-attached NVMe storage; Fly Machines without Volumes can be trivially moved from one server to another, but Volumes historically could not be without an uptime-sapping snapshot restore. The new migration system exploits dm-clone, which effectively creates temporary SAN connections between our physical servers to allow Fly Machines to boot on physical while reading from a Volume on another physical while the Volume is cloned. Simon’s work allows us to drain workloads from sus physical machines, and to rebalance workloads within regions.

Andres built new internal tooling for host-specific customer alerts. At the scale we’re operating at, host failures are increasingly common; more hosts, more surface area for cosmic rays to hit. These issues generally impact only the small subset of customers deployed on that hardware, so we report them out in “personal status pages”. But we’re a CLI-first platform, and many of our customers don’t use our Dashboard. Andres has rolled out preemptive email notification, so customers get direct notification.

Dusty beefed up metrics and alerting around “stopped” Fly Machines. The premise of the Machines platform is that Machines are reasonably fast to create, but ultra-fast to start and stop: you can create a pool of Machines and keep them on standby, ready to start to field specific requests. Making this work reliably requires us to carefully monitor physical host capacity, so that we’re always ready to boot up a stopped Fly Machine. This is capacity planning issue unique to our platform.

Will continued our ongoing project to move all of Fly Metrics off of special-purpose hosts on OVH, which hosts have been flappy over the years, and onto Fly Machines running on our own platform. Metrics consumes an eye-popping amount of storage, and Will spent the week adding storage nodes to our new Fly Machine metrics cluster.

Matt capped off the week by, appropriately enough, fleshing out our incident response and review process documentation. We could say more, but what we’ll probably do instead is just make them public next week.