2024-06-08

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • June 2: Network Outage in Bucharest (05:45EST): Our upstream provider had a network hardware outage, which took our OTP region offline for about 90 minutes.

  • June 3: Sporadic TLS Hangs From Github Actions (11:30EST): We spent about an hour diagnosing sporadic connection failures to Fly.io apps specifically from Github Actions. Github Actions run from VMs on Microsoft Azure. Something on the Azure network path causes repeated connections to reuse the same source port, which we think may have tripped a network flood control countermeasure. This should have been minimally (if at all) impactful to users, but it ate a bunch of infra time.

  • June 4: Single Host WireGuard Mesh Disruption (17:45EST): Depending on whether you ask Tom or not, either a bug in a script we use to decommission hosts or a bug in Consul resulted in two nodes in our WireGuard mesh being deleted, the intended host we were decommissioning and an extra-credit host we were not (the bug was unexpected prefix matching on a Consul KV path). This very briefly broke connectivity to the extra-credit host (low single-digit minutes). However, rather than restoring the backup WireGuard configuration we maintain, somebody (Tom) regenerated a WireGuard configuration, giving the victim host a new IPv4 address on our WireGuard mesh. This broke 6PN private networking on the host for about 20 minutes (for a small number of apps, whose operators we contacted).

  • June 5: Interruption In Machine Creation (11:45EST): A deployment picked up an unexpected change to our init binary, which broke boots for about 15 minutes for physical servers that got the init update.

  • June 6: Hardware Failure In IAD (22:00EST): A single machine in our old Equinix data center in IAD had an NVMe disk failure. Fly Machines without associated volumes were immediately migrated to our other, newer IAD data center deployment; over the course of several hours, Fly Machines with volumes attached were manually migrated. If you were affected, we’ve reached out directly. We’re in the process of decommissioning these hosts, in part because they have less-resilient disk configurations.

This Week In Infra Engineering

This week’s series of small regional incidents kept the infra team hopping.

Apart from incident response, this week’s work looked a lot like the last week’s. Rather than break it out by person, we’ll just document the themes:

Physical host migration remains the biggest ticket item for the infra team. We’re pushing forward on decommissioning old Equinix data center deployments and moving them to newer, more resilient, more cost-effective hardware. The big obstacle we’re facing right now remains applications that may (sometimes surreptitiously) be saving and reusing literal 6PN IPv6 private networking addresses, rather than DNS names. Because 6PN addresses are bound to specific physical hardware, these apps may break when migrated, which isn’t acceptable. We’re doing lots of things, from careful manual migration of apps (like Fly Postgres) where we control the cluster, to alerting and eBPF-based fixes. We knocked out another dozen or two old physicals this week.

Better host status alerting is a big deal for us. We’re going to keep seeing regional and host-local outages, which is just the nature of running a large fleet of physical servers. We’re now doing email alerting for customers on impacted hosts, and have PRs in for displaying alerts whenever a user touches an app impacted by a host alert with flyctl to continue closing that loop.

Corrosion scalability and reliability work continues; Somtochi has some design changes that could further minimize the amount of state we have to share, which we’ll talk about more when they pay off. Corrosion is a super important service in our infrastructure (it’s the basis for our request routing) and reliability improvements since infra took it over have been a big win.