2024-10-19

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents.)

October 17: Flycast Internal Network Outage (11:00 EST): A change deployed to our fly-proxy Anycast request router broke Flycast (internal networking at Fly.io that runs through the proxy, as opposed to direct networking, which we call 6PN). The change was reverted fleetwide within about 10 minutes. There’s not much interesting to say about the breaking change itself (it was tested for a week in staging and in a remote region prior to rollout; it was rolled back quickly). There’s more to say about internal alerting: we had internal alerts firing on the staging change, but they were inadequately escalated.
October 19: Networking Outage in Denver (20:00 EST): We lost Denver for about 16 hours. Well, “we” didn’t lose it. Our upstream network provider did. Specifically: a large switch in their data center threw a rod, and the spare equipment they had in the data center turned out not to be adequate to resolve the outage. Our hardware was fine, just sitting there wondering where the Internet went. Our physical footprint in Denver is small (8 physicals, give or take); this was a broader outage that didn’t just hit us. Still: not OK. We have large regions with heavily diversified connectivity and a major hardware footprint, and smaller regions (generally speaking: if neither GCP, OCI, nor AWS is in a region, it’s probably a small region) with potentially longer disaster recovery times, and we’re clearly not communicating this well. More to come.

This Week In Infra Engineering

Stuff got done, but to generate these updates, the author of the infra-log needs to go interview infra people 1:1, and infra is heads-down responding to the incident from this week to foreclose on something like it happening again. Some of that work is the same as the work we’re doing in response to the August Anycast routing incident (also a state explosion problem, also addressable by regionalizing state propagation and distributing aggregates globally instead of fine-grained updates); some of it isn’t. We’ll write more about it next week.
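
For readers wondering what “distributing aggregates globally instead of fine-grained updates” could look like, here is a toy sketch. It is not our actual implementation: `InstanceUpdate`, `RegionState`, and `global_view` are hypothetical names made up for illustration, and the choice of “healthy instance count per app” as the aggregate is an assumption. The idea it shows is that per-instance churn stays inside a region, and the rest of the world only sees a small per-region summary.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceUpdate:
    """A fine-grained update: one instance of one app changed health (hypothetical shape)."""
    region: str
    app: str
    instance_id: str
    healthy: bool


class RegionState:
    """Holds fine-grained, per-instance state for a single region only."""

    def __init__(self, region: str) -> None:
        self.region = region
        self._healthy: dict[tuple[str, str], bool] = {}

    def apply(self, update: InstanceUpdate) -> None:
        # Fine-grained updates never cross a region boundary in this sketch.
        if update.region != self.region:
            raise ValueError("fine-grained updates stay inside their own region")
        self._healthy[(update.app, update.instance_id)] = update.healthy

    def aggregate(self) -> dict[str, int]:
        """Summarize the region as a healthy-instance count per app (assumed aggregate)."""
        counts = Counter(app for (app, _), ok in self._healthy.items() if ok)
        return dict(counts)


def global_view(regions: list[RegionState]) -> dict[str, dict[str, int]]:
    """What would be distributed globally: one small aggregate per region."""
    return {r.region: r.aggregate() for r in regions}


if __name__ == "__main__":
    den = RegionState("den")
    den.apply(InstanceUpdate("den", "web", "i-1", True))
    den.apply(InstanceUpdate("den", "web", "i-2", True))
    den.apply(InstanceUpdate("den", "web", "i-2", False))  # churn stays local
    print(global_view([den]))  # {'den': {'web': 1}}
```

Three instance-level updates collapse into a single number in the global view, so the amount of state that has to travel between regions scales with the number of apps per region rather than with every instance flap.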

Next post ↑: 2024-10-26
Previous post ↓: 2024-10-12