2024-12-14

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer-running incidents, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • December 12: Fleeting Fleetwide Forwarding Failure (11:50 EST): A recurrence of the September 1 fly-proxy bug: Rust code grabbed the read side of a read-write lock in an if let, and then code lower in the call tree attempted to grab the write side, deadlocking the proxy. The trigger was a Corrosion update (meaning: an update gossiped across the whole fleet). In September, this caused the most significant outage we’d ever experienced on the platform; this time, watchdog safeguards we put in after that incident immediately caught the problem and restarted the proxy, so the impact was instead a few minutes of networking disruption (less than 10, percolating across regions at different times in that interval; a given region might have seen a minute or two). Further countermeasures for this lovely Rust language feature: regionalizing Corrosion clusters (so that triggering state updates are likely to be confined within regions), and semgrep rules to catch if let lock grabs. We also didn’t diagnose this as fast as we would have liked (for 5-10 minutes of that window we believed we were experiencing upstream networking issues), so there are some observability follow-ons here.

This Week In Engineering

We’re back, baby! (But give us a break over the next week or so, for obvious reasons).

Ben and Simon worked out that we were double-storing container images for customer apps. Recall that we exploit containerd to “stage” the block devices we boot VMs onto with customer app images. Those images were being stored once as the products of the containerd snapshotting plugin (so, as LVM snapshots, stored in a large LVM thin pool) and once as blobs in the containerd content store (as temporary storage while containerd makes the LVM snapshot). The blob store is on our root storage device. This is bad. containerd GC’s the content store; this is good. But flyd, our orchestrator, labels content in containerd in such a way that GC doesn’t work. This is bad. Ben and Simon are fixing that, which is good.

First and foremost, Dusty provisioned all our standby hardware; we now have zero unprovisioned servers. If we own it, it’s ready to go into prod. But more ambitiously, Dusty is leading efforts on what we’re calling the “hardware resiliency” project, which is what it sounds like. Most of the docket for this project right now is about I/O performance, and heavily concentrated on our volume backups (because that’s a major I/O load that we actually control, unlike your apps, which we do not). Another item on that checklist that rhymes with a prior outage: a fleetwide audit of all our certificates and their expirations, which is now done.

JP and Senyo released Pilot, our new init. Why don’t I just let Annie explain this one?

Tom’s white whale right now is sourcing new “burst capacity” for us. We have long-term stable hardware and hosting, but it’s not super fast to bring online, and every once in a while we get spikes of demand in particular regions. We have providers we can “burst” into while we bring long-term capacity online for those regions, but it’s pricey and a little precarious, so we’re building a deeper back bench of providers to burst into. Tom also re-did all our BGP4 configuration management, so we now have an effective CI/CD process for BGP4 announcement changes, as well as a fully documented configuration.

Peter is doing some of the most important reliability work in the company: getting fly-proxy load balancing working with regionalized Corrosion. Regionalized Corrosion, again, is the effort to take most of the state we track right now and make it local (within a region, like IAD or SYD) rather than global. This breaks a huge assumption we’ve made about what information our proxies have to work with! We now have regionalized load balancing up and running behind a feature flag (your apps: not balanced regionally yet), which is huge. Another huge lift: getting fly-replay, which is like our signature feature, working in the new world order where a given fly-proxy actually doesn’t know what all the specific instances of an app are; this is tricky because fly-replay lets you direct requests to specific instances by name.