Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
September 25: Loss of LATAM Connectivity (09:00EST): An upstream routing issue (impacting a whole provider network) took QRO, GIG, EZE, and SCL down for about 20 minutes (recovery took a few more minutes for QRO).

September 27: Connectivity Issues in ORD (01:00EST): A top-of-rack switch misconfiguration at our upstream provider, possibly involving a LACP issue and possibly involving upstream routing, generated high packet loss (but not total loss of connectivity) on a subset of our ORD hosts for roughly 60 minutes.
This Week In Infra Engineering
Peter worked on restructuring the connection handling code in fly-proxy, the engine for our Anycast layer, to support process-based sharding of proxy instances. This is work responding to the September 1 Anycast outage; the proximate cause of that outage was a Rust concurrency bug, which we've now audited for, but the root cause was the fact that a single concurrency bug could create a fleetwide outage to begin with. Process-based sharding runs multiple instances of fly-proxy on every edge, spreading customer load across them, not for performance (the single-process fly-proxy is probably marginally more performant) but to reduce the blast radius of any given bug in the proxy.
Kaz is rolling out size-aware Fly Machine limits. Obviously (it may have been more obvious to you than to us), you can't expose something like the Fly Machines API without some kind of circuit-breaker-style limits on the resources a single user can request. Our current limits are coarse-grained: N concurrent Fly Machines, regardless of size. Clearly, these limits should be expressed in terms of underlying resources — a shared-1x is a tiny fraction of a performance-16x. Getting this working has required us to rationalize and internally document the relationships between these scaling parameters. Most of our users will never notice this (especially if we do it well), but it should make it less likely that you'll hit a limit and have to ask support to remove it.
Somtochi has continued working on Corrosion, the SWIM-gossip SQLite state-sharing system the proxy uses to route traffic. The net effect of Corrosion is a synchronized SQLite database that is currently available across our whole fleet of edges, workers, and gateway servers. We're refactoring this architecture to reduce the number of machines that keep Corrosion replicas, and to allow the remaining machines to subscribe to and track changes in Corrosion databases stored elsewhere (this lets us deploy more edges, by reducing the compute and storage requirements for those hosts).
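For a sense of what "subscribe to and track changes" means in practice, here's a conceptual Rust sketch of a host that keeps only a small subscribed view instead of a full replica. The change shape and field names are hypothetical, not Corrosion's actual API.

```rust
// Conceptual sketch, not Corrosion's real API: an edge host no longer holds a
// full replica; it subscribes to a change feed for the rows it cares about
// and applies updates to a small local view.
use std::collections::HashMap;

/// One row-level change streamed from the upstream store (hypothetical shape).
struct Change {
    key: String,           // e.g. an app or machine identifier
    value: Option<String>, // None means the row was deleted upstream
    version: u64,          // monotonically increasing change version
}

/// A thin local view kept by a host that does not store a full replica.
struct SubscribedView {
    rows: HashMap<String, String>,
    last_version: u64,
}

impl SubscribedView {
    fn new() -> Self {
        Self { rows: HashMap::new(), last_version: 0 }
    }

    /// Apply a batch of changes streamed from wherever the replica lives.
    fn apply(&mut self, changes: impl IntoIterator<Item = Change>) {
        for c in changes {
            if c.version <= self.last_version {
                continue; // already seen; change feeds can redeliver
            }
            match c.value {
                Some(v) => { self.rows.insert(c.key, v); }
                None => { self.rows.remove(&c.key); }
            }
            self.last_version = c.version;
        }
    }
}

fn main() {
    let mut view = SubscribedView::new();
    view.apply([
        Change { key: "app:123".into(), value: Some("edge=ord".into()), version: 1 },
        Change { key: "app:123".into(), value: Some("edge=qro".into()), version: 2 },
        Change { key: "app:456".into(), value: None, version: 3 },
    ]);
    println!("{:?} @ v{}", view.rows.get("app:123"), view.last_version);
}
```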
Will did a bunch of reliability and ergonomics work on fsh, our internal deployment tool; better DX for fsh means more reliable deploys of new code, which means fewer incidents. fsh now integrates with PagerDuty to abort deploys automatically if incidents occur during a deployment; it will also fail fast on errors (a real issue on fleet-scale rollouts, where it can be hard to spot errors across hundreds of servers being updated); it now directly supports staged deploys (something we were previously hacking together with shell scripts) and stepwise concurrency (slow-start style: running a single deploy, then 2 on success, and so on).
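The stepwise concurrency part is easy to picture. Here's a Rust sketch of a slow-start rollout loop, assuming wave sizes that double on success and fail-fast on the first error; deploy_one is a stand-in, and none of this is fsh's real code.

```rust
// Sketch of the slow-start rollout pattern described above. Hosts are deployed
// in waves of increasing size, and the rollout aborts on the first failure
// instead of plowing through the rest of the fleet.
use std::thread;

/// Stand-in for whatever actually updates one host (hypothetical).
fn deploy_one(host: &str) -> Result<(), String> {
    println!("deploying to {host}");
    Ok(())
}

fn rollout(hosts: &[String]) -> Result<(), String> {
    let mut batch = 1usize;
    let mut idx = 0usize;
    while idx < hosts.len() {
        let wave = &hosts[idx..(idx + batch).min(hosts.len())];
        // Deploy the whole wave concurrently, one thread per host.
        let handles: Vec<_> = wave
            .iter()
            .cloned()
            .map(|h| thread::spawn(move || deploy_one(&h)))
            .collect();
        for h in handles {
            // Fail fast: the first error aborts the remaining waves.
            h.join().map_err(|_| "deploy thread panicked".to_string())??;
        }
        idx += wave.len();
        batch *= 2; // slow start: 1, then 2, then 4, ... (doubling is assumed)
    }
    Ok(())
}

fn main() {
    let hosts: Vec<String> = (1..=7).map(|i| format!("edge-{i}")).collect();
    match rollout(&hosts) {
        Ok(()) => println!("rollout complete"),
        Err(e) => eprintln!("rollout aborted: {e}"),
    }
}
```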
Dusty is continuing his capacity planning work by integrating our business intelligence tools with our capacity dashboard, so we can factor dollar costs and revenue into our technical capacity planning.