2024-05-18

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • May 13: Metrics Outage (9:00EST): Every once in awhile, on some of our AMD-based servers, with Linux IOMMU enabled, we see a weird lockup that forces us to hard-restart a host. Technically, we only need IOMMU stuff on hosts running GPU workloads, but right now we had it widely enabled. Anyways, that happened to an IAD host that ran part of our Metrics system, which meant that for about 20 minutes we had broken metrics ingestion while we rebooted that machine. You’d have seen a 20-30 minute gap in metrics on Grafana graphs.

  • May 16: Postgres Cluster Migration Failure (5PM EST): For the past several weeks we’ve been exercising machine+volume migration — the ability to move workloads, including storage, from one physical server to another by having the original server temporarily serve as a SAN server. For reasons having nothing to do with storage but rather the particulars of our IPv6 addressing scheme, migrations confused the repmgr process that manages Fly Postgres clusters. A limited number of Fly.io Postgres customers saw Postgres cluster outages after their underlying machines we migrated, and before we halted all migrations of Postgres clusters, over the course of about an hour.

  • May 8 (5:00 EST): Capacty Issues in DFW: We unexpectedly hit saturation on our “edge” servers (reminder: “edges” terminate HTTPS and serve our Anycast network, “workers” run VMs for Fly Machines), forcing us to quickly add additional edge servers in that region. This would be no big deal, but our Elixir web dashboards are served for this region, so for about 30 minutes we had degraded performance of our interface before we were able to add additional capacity.

This was an easier week than last week. The middle outage, migrating Postgres clusters, was very noticeable to impacted customers — but also quick root-caused. The other two incidents were limited in scope. Unless you’re carefully watching Fly Metrics. Are you using Fly Metrics? An incident that broke them is a weird place to pitch them, we know, but they’re pretty neat and you get them for free. Hold our feet to the fire on them being reliable!

This Week In Infra Engineering

Dusty provisioned new hardware capacity in San Jose, Singapore, Warsaw, Sydney, Atlanta, and Seattle.

Will had a conversation with engineers at Synadia (last week’s NATS outage hit right during an all-hands meeting for them!) and got some advice on reconfiguring our internal NATS topology, shifting most of our hosts to “leaf” nodes and minimizing the number of “clustering” notes we have per region; this should trade an imperceptible amount of latency (which doesn’t matter with our NATS use case) for drastically reduced chatter. Thanks, Synadians!

Akshit finished an upgrade to Firecracker 1.7 across our fleet. 1.7 does asynchronous block I/O with io_uring. We’ve noticed, since we rolled out Cloud Hypervisor for our GPU workloads (ask us about the security work we had to do here!) that Cloud Hypervisor was doing a better job handling busy disks than the version of Firecracker we were running. We’re optimistic that the new version will close the gap.

Steve finished up the provisioning tooling for the fou-tunnel-and-SNAT monstrosity that we talked about last week: giving Fly Machines static IP addresses, for people who talk to IP-restricted 3rd party APIs.

Tom and Ben A (ask us how many Bens work here!) completed the migration and draining of workloads from the cursed “edge worker” machines we mentioned last week. Edge-workers are no more. In the process, Tom debugging a bunch of draining tooling issues (being good at this is a big deal, because we’d like to be able to drain a sus server anywhere in the world at the drop of a hat), and Ben wrote up internal playbooks for draining hosts. Requiescat, Edge Workers, 2020-2024.

Simon continued low-level work on Machine/Volume migration, which is the platform kernel of the host draining stuff Tom and Ben were doing. This week’s work focused on large volume migration. Recall that our migration system causes the “source” physical server to temporarily serve as an ad-hoc SAN for the “target” physical, allowing us to “move” a Machine from one physical to another in seconds while the actual volume block clone happens in the background; Simon’s instrumentation work may have shaved about ~10s off this process (about 1/3rd of the total time).

Andres got host alerting (notifying users of hardware issues with the specific hosts they’re using, both on their personal status page and directly via email) integrated with our internal support admin tool.

Somtochi rolled out the first iteration of Pet Semetary to our flyd orchestrator. We now have two (hardware-isolated) secret stores: Hashicorp Vault, and our internal Pet Semetary. The big thing here is, if we can’t read secrets, we can’t boot machines; now, if Vault has an availability issue, we “fall back” to Pet Semetary. Requeiscat, Vault-related Outages, 2020-2024.