2024-05-11

Trailing 2 Weeks Incidents

[Diagram: two weeks of incidents]

(Larger boxes are longer-lasting incidents, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents.)

  • May 7: NATS Storm (5:30 EST): Some components of our platform, most notably log-shipping, run on top of the NATS messaging system. We’ve been fighting NATS reliability issues for the past several months, and one thing we’ve needed to do is upgrade the fleet’s NATS version; more recent NATS releases have a number of bug fixes. We did a staged deployment of 2.10; it looked fine; we rolled it out further; it generated a 1.7Tb/s (that’s terabits, with a ‘t’) message storm. Server CPU (on a small number of servers) buckled long before the network did; some users would have seen increased CPU steal and degraded performance. Log shipping was totally disrupted for about an hour.

  • May 8: Vault Certificate Breakage (7 AM EST): The primary backend for secret storage at Fly.io is currently HashiCorp Vault (which is great). When Fly Machines start up, flyd, our orchestrator, fetches secrets from Vault and merges them into the Machine’s configuration. Vault is locked down with mTLS across our fleet; you need a client cert to talk to it at all (there’s a rough sketch of that client setup just after this list). Due to a leaf/intermediate certificate configuration issue (we’re not even going to attempt to explain it), client certs across our fleet were invalidated, preventing flyd from fetching secrets, which in turn prevented Fly Machines from booting.

  • May 8: Registry Load Balancing in AMS (5:30 EST): Every application deployed on Fly.io is shipped as Docker (OCI) containers, and most are stored in our own Docker registries. For the past six months, those registries have been geographically distributed using LiteFS, with an accelerated S3 storage backend. Under heavy deployment load (because of the time of day), deploys using the AMS registry began to sporadically time out. We investigated with AWS and with our upstream provider, and temporarily mitigated by forcing builds to other regions; the issue resolved itself (never good news) within an hour or so. It turned out to have been a side effect of a fly-proxy change that fixed a bug with large HTTP POST bodies.

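Since the mTLS detail above is doing a lot of work in that Vault story, here’s a minimal Go sketch of what a client locked down that way looks like. This is not our actual code; the file names and the Vault URL are invented. The relevant point for the incident: the client certificate the caller presents has to chain up to a root the server trusts, so the leaf/intermediate bundle on disk has to be right, or the handshake fails and no secrets get fetched.

```go
// Minimal sketch, not Fly's actual code: an HTTPS client that authenticates
// to Vault with a client certificate (mTLS). File names and the Vault URL are
// hypothetical.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// client.pem should contain the leaf cert followed by its intermediate;
	// if that bundle is wrong, the server can't build a chain and rejects us.
	clientCert, err := tls.LoadX509KeyPair("client.pem", "client-key.pem")
	if err != nil {
		log.Fatal(err)
	}

	// CA bundle used to verify Vault's own server certificate.
	caPEM, err := os.ReadFile("vault-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	roots := x509.NewCertPool()
	roots.AppendCertsFromPEM(caPEM)

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{clientCert},
				RootCAs:      roots,
			},
		},
	}

	// Vault's health endpoint; a broken client-cert chain shows up here as a
	// TLS handshake error, which is roughly what flyd saw fleet-wide.
	resp, err := client.Get("https://vault.example.internal:8200/v1/sys/health")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}
```
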
A pretty straightforward week. The most painful incident was the Vault “outage”, in part because it happened on the eve of our cutover to Pet Semetary, our Vault replacement; in the new post-PetSem world, it’ll take simultaneous outages of both Vault and PetSem to disrupt deploys. The other two incidents were more limited in scope.

This Week In Infra Engineering

Dusty built out telemetry and monitoring for Fly Machine migration, in preparation for a regional migration of some Machines to a new upstream provider.

In addition to doing a cubic heckload of routine hiring work (do these updates sound fun? we’re hiring!), Matt and Tom revised one of our technical work sample tests, eliminating an inadvertent cheat code some candidates had discovered; a comprehensively broken environment we ask candidates to diagnose had a way to straightforwardly dump out the changes we had made to break it. Respect to those candidates for figuring that out, and helping us level up the challenge a bit.

Steve has had a fun week. He’s working on shipping (you heard it here first) static IP address assignments for individual Fly Machines, which means Fly Machines will be able to make direct requests out to the Internet (for instance, to internal on-prem APIs) from predictable IP addresses. The original plan was to run an IGP across our fleet, but Steve worked out a combination of fou (foo-over-UDP) tunnels and SNAT that keeps our routing discipline static while allowing addresses to float. It’s a neat trick.
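
To make the fou-and-SNAT trick slightly more concrete, here’s a hand-wavy sketch of what the egress-host side of such a setup could look like. This is not our tooling: the addresses, device names, and port are invented, and it just shells out to iproute2 and iptables (Linux, run as root). The idea it illustrates: Machine traffic reaches the static-IP host over a fou-encapsulated IPIP tunnel, and SNAT stamps the static address on the way out, so the underlying routing never has to change as addresses move around.

```go
// Hypothetical sketch of a static-egress host: decapsulate fou/IPIP traffic
// from a worker, SNAT it to a dedicated public address, route replies back
// through the tunnel. All names and addresses are made up.
package main

import (
	"log"
	"os/exec"
)

func run(args ...string) {
	if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
		log.Fatalf("%v: %v\n%s", args, err, out)
	}
}

func main() {
	// Accept fou-encapsulated IPIP (protocol 4) on UDP port 5555.
	run("ip", "fou", "add", "port", "5555", "ipproto", "4")

	// IPIP tunnel to the worker, carried over UDP so the underlay never needs
	// to learn new routes when endpoints move.
	run("ip", "link", "add", "egress0", "type", "ipip",
		"local", "198.51.100.2", "remote", "203.0.113.10",
		"encap", "fou", "encap-sport", "auto", "encap-dport", "5555")
	run("ip", "link", "set", "egress0", "up")

	// The static public address reserved for this Machine lives on this host.
	run("ip", "addr", "add", "198.51.100.77/32", "dev", "eth0")

	// Replies to the Machine's private address go back through the tunnel.
	run("ip", "route", "add", "172.19.8.4/32", "dev", "egress0")

	// Forward, and SNAT the Machine's traffic to its static address on egress.
	run("sysctl", "-w", "net.ipv4.ip_forward=1")
	run("iptables", "-t", "nat", "-A", "POSTROUTING",
		"-s", "172.19.8.4/32", "-o", "eth0", "-j", "SNAT",
		"--to-source", "198.51.100.77")
}
```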

Steve would also like us to tell you that he rebooted dev-pkt-dc10-9b7e.

Ben built out tooling for host draining. Last week we talked about Simon’s work shipping inter-server volume migrations. Now that we can straightforwardly move workloads between physicals, storage and all, we can rebuild the “drain” feature we had back when we ran on HashiCorp Nomad in 2020 (before we had storage). That means when servers get janky (inevitable at our scale), or things need to be rebalanced, we can move all the Fly Machines to new physical homes with minimal downtime. There are a lot of corner cases to this (for instance: not all the volumes on a physical are necessarily attached to Machines), so this is a tooling-intensive problem.
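
As a toy illustration of that corner case, here’s a sketch of what a drain planner has to keep in mind. The types and output are invented and vastly simpler than the real tooling; the point is just that a drain walks the volumes on a host, not only the Machines, or the unattached volumes get left behind.

```go
// Toy drain planner, for illustration only: group a host's volumes by the
// Machine they're attached to, and treat unattached volumes as their own
// migration work items. Real draining is far more involved than this.
package main

import "fmt"

type Volume struct {
	ID        string
	MachineID string // empty if the volume isn't attached to any Machine
}

func planDrain(host string, volumes []Volume) []string {
	attached := map[string][]string{} // Machine ID -> its volume IDs
	var orphans []string
	for _, v := range volumes {
		if v.MachineID != "" {
			attached[v.MachineID] = append(attached[v.MachineID], v.ID)
		} else {
			orphans = append(orphans, v.ID)
		}
	}

	var plan []string
	for machine, vols := range attached {
		plan = append(plan, fmt.Sprintf("migrate Machine %s (volumes %v) off %s", machine, vols, host))
	}
	for _, vol := range orphans {
		// The easy-to-forget case: volumes with no Machine still need a new home.
		plan = append(plan, fmt.Sprintf("migrate unattached volume %s off %s", vol, host))
	}
	return plan
}

func main() {
	vols := []Volume{{"vol_a", "mach_1"}, {"vol_b", ""}, {"vol_c", "mach_1"}}
	for _, step := range planDrain("worker-example-1", vols) {
		fmt.Println(step)
	}
}
```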

Andres and Kaz re-established telemetry, metrics, and alerting on our Rails API after they lapsed during an incident last week. That lapse didn’t directly impact deploys, but it would have made incidents involving API server problems, which are not unheard of, harder to detect and more difficult to resolve.

Kaz worked on fly-proxy-initiated Fly Machine migration. True fact: you can start a Fly Machine with an HTTP request; if a request is routed to a Fly Machine in the stopped state, it’ll start. Kaz is working towards automatic migration of Machines away from hosts that are overloaded (i.e., exceeding our internal utilization thresholds): instead of starting a Machine on a busy host, we can initiate a migration to a less-loaded one and start it there. Recall that the core idea of our migration system is temporary SAN-style connections: a Machine can boot up on a new physical long before its entire volume has been copied over. Automatic migration isn’t happening yet, but it’s getting closer.
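
A cartoon version of that decision, just to pin down the shape of it. The types, thresholds, and names below are invented (and it’s in Go; fly-proxy itself is Rust): if the host a stopped Machine lives on is over the utilization threshold, pick a less-loaded host and migrate-then-start instead of starting in place.

```go
// Invented types and thresholds, purely to illustrate the start-vs-migrate
// decision; this is not fly-proxy code.
package main

import "fmt"

type Host struct {
	Name        string
	Utilization float64 // 0.0 to 1.0, against a (hypothetical) internal threshold
}

const maxUtilization = 0.85 // made-up cutoff

// placeStoppedMachine decides where a stopped Machine should wake up when a
// request arrives for it.
func placeStoppedMachine(current Host, candidates []Host) (string, Host) {
	if current.Utilization <= maxUtilization {
		return "start in place", current
	}
	// Host is too busy: boot on the least-loaded candidate and let the volume
	// follow over the temporary SAN-style connection.
	best := current
	for _, h := range candidates {
		if h.Utilization < best.Utilization {
			best = h
		}
	}
	return "migrate and start", best
}

func main() {
	busy := Host{Name: "worker-a", Utilization: 0.97}
	others := []Host{{Name: "worker-b", Utilization: 0.41}, {Name: "worker-c", Utilization: 0.63}}
	action, target := placeStoppedMachine(busy, others)
	fmt.Println(action, "on", target.Name)
}
```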

Akshit worked on cloud-hypervisor integration with our flyctl developer experience. cloud-hypervisor is like Firecracker, except Intel ships it instead of AWS (they’re both memory-safe Rust KVM hypervisors with minimal footprints; they even share a bunch of crates). We use cloud-hypervisor for GPU Machines because it supports VFIO IOMMU device passthrough (ask us about the security work we did here, please). Operating cloud-hypervisor is similar enough to operating Firecracker that it’s almost a drop-in replacement, but we’re still smoothing out the differences so the two feel indistinguishable to users.

Tom and John are decommissioning our old, cursed “edge workers”. We run mainly two kinds of servers: edges, which take traffic from the Internet and feed it into our proxy network, and workers, which run actual Fly Machines. For historical reasons (those being: the founders made annoying decisions), we have a bunch of dual-role machines on one of our upstreams. Not for long. You may not like it, but this is what peak performance looks like:

root@edge-nac-fra1-558f: ~
$ danger-host-self-destruct-i-want-pain
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!DANGER!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!DANGER!!!!!
!!!!!DANGER!!!!!  _____          _   _  _____ ______ _____   !!!!!DANGER!!!!!
!!!!!DANGER!!!!! |  __ \   /\   | \ | |/ ____|  ____|  __ \  !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |  | | /  \  |  \| | |  __| |__  | |__) | !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |  | |/ /\ \ | . ` | | |_ |  __| |  _  /  !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |__| / ____ \| |\  | |__| | |____| | \ \  !!!!!DANGER!!!!!
!!!!!DANGER!!!!! |_____/_/    \_\_| \_|\_____|______|_|  \_\ !!!!!DANGER!!!!!
!!!!!DANGER!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!DANGER!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

This script will TOTALLY DECOMMISSION and DESTROY this host and REMOVE IT
PERMANENTLY from the Fly.io fleet.
To proceed, enter the hostname: edge-nac-fra1-558f

Correct, this host is edge-nac-fra1-558f.

To proceed, repeat verbatim "Yes, IRREVERSIBLY decommission"
-> Yes, IRREVERSIBLY decommission
This is your LAST CHANCE. Press ENTER to run away to safety. Press '4' to begin.

Migration is a theme of this bulletin; as we said last week, it has been something of a “white whale” for us.

We have not forgotten last week’s promise to publish Matt’s incident handling process documents, but Matt wants to clean them up a bit more. We’ll keep mentioning it in updates until Matt lets us release them.

This is a small fraction of our infra team! These are just highlights: the things that stuck out to us at the end of the week.