Trailing 2 Weeks Incidents
(In the incident chart, larger boxes are longer-lasting and darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents.)
- October 31: Degraded Time-To-Recover In Corrosion (13:00 EST): Corrosion is our state propagation system; it’s a SQLite database synchronized across our fleet by SWIM gossip messages. When a Fly Machine starts or stops, its status is delivered via Corrosion to our Anycast network so that our proxies can route to it. For about an hour, our telemetry detected that an IAD worker in our fleet was lagging; further investigation showed that it was jammed up in SQLite, and that CPU utilization on the host itself was high. It turned out this host was being used as the primary “bridge” for Corrosion clients: many of our fleet components use Corrosion “directly” by querying the underlying SQLite database, but some, including our internal administration application, use remote clients to avoid synchronizing their own database (the two access patterns are sketched below). Two problems ensued: first, we weren’t effectively spreading the load of Corrosion bridge clients, and second, the admin app had some very dumb SQL queries. During the incident, a very small number of applications (we’re already talking to you if you were impacted) running on a single physical server in our fleet would have seen some lag between Fly Machine start/stop and request routing.
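For the curious, here’s a rough sketch of the two access patterns described above, assuming a hypothetical table name, database path, bridge endpoint, and request shape; the real Corrosion schema and HTTP API differ.

```go
// Rough sketch of "direct" vs. "bridge" Corrosion clients. Names and
// endpoints here are placeholders, not Corrosion's actual interface.
package main

import (
	"database/sql"
	"fmt"
	"io"
	"net/http"
	"strings"

	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

// Direct client: read the locally synchronized SQLite replica.
func stateFromLocalReplica(machineID string) (string, error) {
	db, err := sql.Open("sqlite3", "file:/var/lib/corrosion/state.db?mode=ro")
	if err != nil {
		return "", err
	}
	defer db.Close()
	var state string
	// "machines" stands in for whatever table Corrosion actually replicates.
	err = db.QueryRow(`SELECT state FROM machines WHERE id = ?`, machineID).Scan(&state)
	return state, err
}

// Bridge client: ask a remote Corrosion node to run the query, avoiding the
// cost of gossiping and storing a full local copy of the database.
func stateViaBridge(bridgeURL, query string) (string, error) {
	resp, err := http.Post(bridgeURL+"/v1/queries", "application/json",
		strings.NewReader(query)) // hypothetical endpoint and body format
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	if state, err := stateFromLocalReplica("3d8d9970"); err == nil {
		fmt.Println("direct:", state)
	}
	if out, err := stateViaBridge("http://corrosion-bridge.internal:8080",
		`"SELECT state FROM machines WHERE id = '3d8d9970'"`); err == nil {
		fmt.Println("bridge:", out)
	}
}
```

The incident was the second pattern concentrating on one host: every bridge client, including the admin app and its dumb queries, was leaning on the same worker’s SQLite.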
This Week In Engineering
We apologize for the delay this week. We’re a US company and the US was eventful! Also, there wasn’t much incident stuff to write about. We pledge to be more timely in the weeks to come.
Somtochi is back to doing surgery on Corrosion. It now exposes a lighter-weight update interface that streams the primary keys of updated nodes over an HTTP connection, rather than repeatedly applying queries. Corrosion also favors newer updates over older ones during sync, which speeds time-to-recovery when bringing nodes online and dealing with large volumes of updates.
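Here’s a minimal sketch of what consuming that kind of stream looks like from a client’s perspective. The endpoint path and line-delimited framing are assumptions, not Corrosion’s actual API; the point is that the client receives only the primary keys of updated entries and re-reads those, rather than having every change’s query re-applied against its own database.

```go
// Sketch of a client consuming a lighter-weight update stream: one primary
// key per line over a long-lived HTTP response (assumed framing).
package main

import (
	"bufio"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://corrosion.internal:8080/v1/updates/machines") // hypothetical endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		pk := scanner.Text()
		// Re-fetch just this entry and update whatever the component
		// derives from it (for our proxies, routing state).
		log.Printf("updated: %s", pk)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```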
Akshit has been working on static egress IPs. Some of our customers run Fly Machines that interface with remote APIs, and some of those APIs have IP filters. Normally, Fly Machines aren’t promised any particular egress IP address for outgoing connections, but we’re rolling out a feature that assigns a routable static egress IP. Akshit also wrote a runbook for diagnosing issues with egress IPs and their integration with nftables and our internal routing system.
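To make the nftables side concrete, here’s an illustrative sketch of the kind of source-NAT mapping a static egress IP implies: outbound traffic from a specific Fly Machine’s private address gets rewritten to its assigned public address. The table and chain names, addresses, and interface are made up, and this is not our actual configuration; it just shows the shape of what the runbook is diagnosing.

```go
// Emit hypothetical nftables commands that SNAT one machine's outbound
// traffic to a static egress address. Printing keeps the sketch
// side-effect free; real tooling would apply these via `nft -f` or netlink.
package main

import "fmt"

func main() {
	machineAddr := "172.19.4.2"  // hypothetical Fly Machine private address
	egressAddr := "203.0.113.40" // hypothetical assigned static egress IP

	rules := []string{
		`add table ip static_egress`,
		`add chain ip static_egress postrouting { type nat hook postrouting priority 100; policy accept; }`,
		fmt.Sprintf(`add rule ip static_egress postrouting ip saddr %s oifname "eth0" snat to %s`, machineAddr, egressAddr),
	}
	for _, r := range rules {
		fmt.Println("nft", r)
	}
}
```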
Dusty built a custom iPXE installation process for bringing up new hardware in the fleet. Our hardware providers rack and plug in our servers, and PXE pulls a custom kernel and initrd down from our own infrastructure. That eliminates an old process in which we effectively had to uninstall the operating system configuration our hardware shipped with, makes it faster to roll out new hardware, and hardens our installation process.
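As a rough sketch of the server side of that flow, assuming placeholder hostnames, paths, and kernel arguments: a freshly racked box network-boots, fetches a boot script from our infrastructure, and that script points iPXE at our kernel and initrd instead of whatever OS the machine shipped with.

```go
// Minimal sketch of serving an iPXE boot script plus kernel/initrd images.
// Hostnames, paths, and kernel arguments are placeholders.
package main

import (
	"fmt"
	"log"
	"net/http"
)

const bootScript = `#!ipxe
kernel http://boot.internal/fly/vmlinuz console=ttyS0 fly.install=1
initrd http://boot.internal/fly/initrd.img
boot
`

func main() {
	http.HandleFunc("/boot.ipxe", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, bootScript)
	})
	// Kernel and initrd images are served from the same infrastructure.
	http.Handle("/fly/", http.StripPrefix("/fly/", http.FileServer(http.Dir("/srv/boot"))))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```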
In response to capacity issues in some regions (particularly in Europe), Kaz rolled out default per-organization capacity limits. These kinds of circuit-breaker limits are par for the course in public clouds, but we’re relatively new and had been getting away with not having them. We’re happy to let you scale up pretty much arbitrarily! But it’s to everyone’s benefit if we default to some kind of cap, because our DX makes it really easy to scale to lots of Fly Machines without thinking about it. Most capacity issues we’ve had over the last year have taken the form of “someone decided to spontaneously create 10,000 Fly Machines in one very specific region”, so this should be an impactful change.
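For a sense of what a circuit-breaker cap like this looks like, here’s a sketch of the idea: machine creation in a region is refused once an organization crosses a default limit, which can be raised per org on request. The limit value and the in-memory store are illustrative, not our actual implementation.

```go
// Illustrative per-organization, per-region capacity cap with overrides.
package main

import (
	"fmt"
	"sync"
)

const defaultMachineLimit = 500 // made-up default cap per org, per region

type capacityLimiter struct {
	mu        sync.Mutex
	counts    map[string]int // key: org + "/" + region
	overrides map[string]int // orgs that asked us to raise their cap
}

func (l *capacityLimiter) allowCreate(org, region string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	limit := defaultMachineLimit
	if v, ok := l.overrides[org]; ok {
		limit = v
	}
	key := org + "/" + region
	if l.counts[key] >= limit {
		return false // circuit breaker: refuse rather than exhaust the region
	}
	l.counts[key]++
	return true
}

func main() {
	l := &capacityLimiter{counts: map[string]int{}, overrides: map[string]int{}}
	for i := 0; i < defaultMachineLimit+1; i++ {
		if !l.allowCreate("acme", "ams") {
			fmt.Printf("create #%d refused: over the default cap for ams\n", i+1)
		}
	}
}
```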
We run a relatively large (for our age) hardware fleet, and we generate a lot of logs. We have a relatively large (for our age) logging cluster that absorbs all those logs. Well, now we absorb 80% less log traffic, because Tom spent a week using Vector (or, rather, holding it better) to parse and drop dumb and duplicative stuff.