2024-05-25

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

May 23: Capacity Issues in FRA (7:00EST): FRA has of late become one of our busier regions, and we’ve been continuously adding edge capacity (recall: edge hosts take in traffic from the Internet and route it to worker hosts, which run Fly Machines for customers). We needed more edge capacity this morning, and added it. The annoyance here was compounded by telemetry issues: in our “current” configuration, capacity issues degrade the performance both of Corrosion, our internal gossip service catalog, and NATS, the messaging system we use to communicate load between proxies in our Anycast network. There’s a bunch of work happening around making those systems less sensitive to edge load.

So, yeah, pretty easy week, as far as the infra team is concerned.

This Week In Infra Engineering

The big news for the past several weeks has been intra-region Fly Machine migration: minimal downtown migration of workloads, including large volumes, from one physical worker server to another. We hit a snag here: Fly Postgres wasn’t designed originally to be migrated, so many instances of it are intolerant to being moved and booted up on new 6PN IPv6 addresses. A bunch of work is happening to resolve this; we’ll certainly be migrating Fly Postgres instances in the near future, it’s just a question of “how”.

Simon is designing migration tooling work to make large-scale migration and host draining work for everything we can safely migrate, including single-node Postgres instances (do not run single-node Postgres instances in production! — but thanks for being easy to migrate).

Ben A. worked on migrating and draining workloads to balance workers. Fly Machines bill customers primarily for the time the Machine is actually running. When they’re stopped, a Machine is a commitment to some amount of resources on its associated worker, and a promise that we will start that Machine within some n-hundred millisecond time budget. This commitment/promise dance is drastically simpler and less expensive for us to honor now that we can migrate stopped Machines.

Now that Pet Semetary is up and running, Somtochi has switched up and is working with Sage on Corrosion, our Rust-based gossip statekeeping system. Corrosion is a (large) SQLite database managed by SWIM gossip updates. The work this week is primarily testing and bugfixing, but they may have figured out a way to reduce the size of our database by a factor of 3, which we’ll certainly write about next week if it pans out.

Dusty and Sage have been adding more edge hosts to keep up with capacity. Dusty also began trial migrations of Fly Postgres, using an IP mapping hack by Saleem, and built some internal dashboards to assist in the sort of manual host rebalancing work that Ben A. was doing this week.

Akshit cleaned up some log messages. Yawn. But also he graduated university! Congratulations to Akshit.

John rolled out a fleetwide fix for an interaction between Corrosion and our eBPF UDP forwarding path. You can run a fully-functional DNS server as a Fly Machine, because our Anycast network handles UDP as well as TCP. We do this by transparently encapsulating and routing UDP in the kernel using XDP and TC BPF. The routing logic for this scheme is written into BPF maps by a process (udpcatalogd) that subscribes to Corrosion. We decommissioned a large physical worker in AMS, which generated a big Corrosion update, which tickled a bug in a particular SQL query pattern only udpcatalogd uses, which caused ghost services for that AMS ex-worker to get stuck in our routing maps. There were a bunch of fixes for this, but the immediate thing that cleared the problem operationally involved… turning udpcatalogd off for a moment and then back on, fleetwide. Thank, John! John also did a bunch of retrospective work on our learnings abouting Fly Postgres clusters. He’s also taken up residency on our public community site. Go say ‘hi’ to him (or complain about something that infra is involved in, and he’ll apparently show up.)

Will has been heads-down in NATS land for the past 2 weeks, after an attempted NATS upgrade briefly melted our network a bit over a week ago. Will has been chatting with the Synadia people about our topology, and, in the meantime, found two scaling issues in the NATS core code that drive excess system chatter in our current topology. He’s prepared a couple upstream PRs.

Steve spent some time this week building new features for Drift, our Elixir/Phoenix internal server hardware inventory tool. Drift now tracks server lifecycle (for things like decommissioning), and, where our upstreams support it, automatically adding new servers to our inventory.

Next post ↑: 2024-06-01
Previous post ↓: 2024-05-18