2024-10-12

A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and is a superset both of our status page events and of customer-impacting events on the platform. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • October 9: Networking Issues in SEA (16:00 EST): We detected a BGP4 upstream misconfiguration that routed some of our traffic from SEA towards LON. In the course of investigating this issue, which added significant latency to some connections out of SEA, our regional upstream uncovered a problem at SIX, the Seattle-area exchange point. While working around that issue, our upstream attempted to reroute traffic to a different peering provider at SIX, which broke all Seattle routing for about 10 minutes; that was the point at which we called the incident (though our reporting here starts it several hours earlier).

  • October 11: Capacity Issues in ARN and AMS (06:00 EST): A customer spun up thousands of Fly Machines across Northern Europe, saturating our capacity in two regions. Some of this utilization was apparently from a mistake in their own scheduling; much of it was genuine. For several hours, we had severely limited (and inconsistent) ability to create new Fly Machines in ARN and AMS. We added hardware to ARN and, with some guidance from the customer, collected resources from stale instances of their applications; completely resolving this incident, and bringing capacity in these two regions to within our comfort level, took the better part of the day.

This Week In Infra Engineering

Let’s make up for some lost time.

Peter has deployed the first stage of the “sharding” of fly-proxy, the engine of our Anycast request routing system. Recall from our September 1 Anycast outage that one major identified problem was that we run a global, flat, unsegmented topology of proxies; as a result, a control-plane outage is as likely to disrupt the entire fleet as it is to disrupt a single proxy. We’re pursuing two strategies to address that: regional segmentation, which limits the propagation of control-plane updates (somewhat the way an OSPF area does), and sharding of instances. Sharding here means that, within a single region and on a single edge physical server, we run multiple instances of the proxy.

The first stage of making that happen is to add a layer of indirection between the kernel network stack and our proxy; that layer, the fly-proxy-acceptor, picks up incoming TCP connections from the kernel, and then routes them to particular instances of the “full” proxy using a Unix domain socket and file descriptor passing. This allows us to add and remove proxy instances without reconfiguring or contending for the same network ports. In the early stages of deployment, both the proxy-acceptor and the proxy itself listen for TCP connections (meaning the acceptor can blow up, and we’ll continue to handle connections, though nothing has blown up yet).

Unix file descriptor passing is textbook Unix systems programming (literally: you can find it in the W. Richard Stevens books), but it’s surprisingly tricky to get right. For instance, connect and accept completion are separate events, and we have to be fastidious about which instance we route each file descriptor to (the bug where you let two different proxies see the same request file descriptor is very problematic).
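
For a concrete picture of the mechanism (not fly-proxy’s actual code, which isn’t public), here’s a minimal sketch of SCM_RIGHTS descriptor passing using the nix crate, roughly its 0.27-era API (exact signatures shift between nix versions): an acceptor takes TCP connections off a listener and hands each one, as ancillary data over a Unix domain socket, to a proxy instance.

```rust
// Sketch only: an "acceptor" passes accepted TCP connections to a proxy
// instance over a Unix socket via SCM_RIGHTS. Error handling is minimal.
use std::io::{IoSlice, IoSliceMut};
use std::net::{TcpListener, TcpStream};
use std::os::fd::{AsRawFd, FromRawFd, RawFd};
use std::os::unix::net::UnixStream;

use nix::cmsg_space;
use nix::sys::socket::{recvmsg, sendmsg, ControlMessage, ControlMessageOwned, MsgFlags};

// Acceptor side: accept from the kernel, ship the fd to a chosen proxy instance.
fn acceptor(listener: TcpListener, proxy: &UnixStream) -> nix::Result<()> {
    loop {
        let (conn, _peer) = listener.accept().expect("accept");
        let fds = [conn.as_raw_fd()];
        let iov = [IoSlice::new(b"x")]; // one byte of payload; the fd rides in the cmsg
        let cmsg = [ControlMessage::ScmRights(&fds)];
        sendmsg::<()>(proxy.as_raw_fd(), &iov, &cmsg, MsgFlags::empty(), None)?;
        // `conn` drops (and our copy of the fd closes) here; the proxy now
        // owns its own duplicate of the descriptor.
    }
}

// Proxy side: receive the fd and treat it as an already-accepted connection.
fn receive_conn(from_acceptor: &UnixStream) -> nix::Result<Option<TcpStream>> {
    let mut byte = [0u8; 1];
    let mut iov = [IoSliceMut::new(&mut byte)];
    let mut cmsg_buf = cmsg_space!(RawFd);
    let msg = recvmsg::<()>(
        from_acceptor.as_raw_fd(),
        &mut iov,
        Some(&mut cmsg_buf),
        MsgFlags::empty(),
    )?;
    for cmsg in msg.cmsgs() {
        if let ControlMessageOwned::ScmRights(fds) = cmsg {
            if let Some(&fd) = fds.first() {
                // Safety: this descriptor was just passed to us and we own it.
                return Ok(Some(unsafe { TcpStream::from_raw_fd(fd) }));
            }
        }
    }
    Ok(None)
}
```

The trickiness described above lives around exactly this hand-off: the acceptor has to route each descriptor to exactly one instance, and close its own copy once the pass succeeds.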

Peter, Dov, and Pavel have been in a protracted disagreement with systemd. From a few weeks back: Dov added systemd watchdog support to fly-proxy. Recall that the diagnosis of the September 1 outage involved us noticing that the entire proxy event loop had locked up (it was a mutex deadlock). It shouldn’t have been possible for the proxy to lock up without us noticing, and now it can’t.
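
For reference, the watchdog protocol itself is small: the unit file sets WatchdogSec=, systemd exports NOTIFY_SOCKET and WATCHDOG_USEC to the service, and the service sends “WATCHDOG=1” datagrams at some fraction of that interval or gets killed. Here’s a minimal hand-rolled sketch of that heartbeat (fly-proxy’s real integration is wired into its event loop and certainly looks different):

```rust
// Minimal sd_notify-style watchdog heartbeat, written against the documented
// NOTIFY_SOCKET protocol; an illustration, not fly-proxy's code.
use std::env;
use std::os::unix::net::UnixDatagram;
use std::time::Duration;

fn notify(state: &str) -> std::io::Result<()> {
    // systemd points NOTIFY_SOCKET at a datagram socket; if it's unset, we're
    // not running under systemd and this is a no-op. (Abstract-namespace
    // sockets, whose paths start with '@', need handling this sketch skips.)
    let Ok(path) = env::var("NOTIFY_SOCKET") else { return Ok(()) };
    let sock = UnixDatagram::unbound()?;
    sock.send_to(state.as_bytes(), path)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    notify("READY=1")?;
    // WATCHDOG_USEC is the timeout from WatchdogSec=, in microseconds.
    let usec: u64 = env::var("WATCHDOG_USEC")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(10_000_000);
    loop {
        // The real proxy should only ping when it has evidence the event loop
        // is actually making progress; a bare timer would defeat the purpose.
        notify("WATCHDOG=1")?;
        std::thread::sleep(Duration::from_micros(usec / 2));
    }
}
```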

Anyways, Dov read the systemd source code as it relates to watchdogs to make sure that when the proxy entered a shutting down state, the watchdog would be disabled. Things seemed fine, but then alerts began firing every time we did a deploy; the watchdog was tripping while the proxy was doing its orderly shutdown. Peter discovered a bug in systemd: it assumes that signal handling and watchdog logic share a thread.

In our case they don’t, which created a race: the watchdog task could get one last ping in right after the systemd unit went into its stopping state, and that ping caused systemd to re-enable the watchdog we had just disabled. Our fix was to stop preempting the watchdog task and let it run until the proxy’s bitter end.

There was more. In some cases it can take more than 10 seconds (our watchdog timeout) for fly-proxy to exit after our tokio::main completes. Boom, watchdog kill. “Ok, fine, you got us!” we said to systemd, and simply disabled the watchdog at runtime when the watchdog task was preempted. This, finally, worked, and proxies no longer got watchdog-killed while shutting down.

Except that sometimes they did? Turns out that our few older-distro hosts (remember: we have up-to-date kernels everywhere, but not up-to-date distros; systemd is the one big problem with that) use a pretty old version of systemd. That systemd does not support disabling the watchdog at runtime. Peter landed what we hope is the final blow this week; instead of disabling the watchdog at runtime, he set it to a very large non-zero value. You may read further adventures of Peter, Dov, and Pavel in their battles with systemd next week.

Speaking of distro updates, Steve continues our steady march towards getting our whole fleet on a recent distro. He’s picked up where Ben left off a few weeks ago, testing and re-testing and re-re-testing our provisioning to ensure that swapping distros out from under our running workloads doesn’t confuse our orchestration; we now have something approaching unit/integration testing for our OS provisioning process.

Tom spent the week spiking alternative log infrastructure to replace ElasticSearch, with which we are now at our wits’ end. We’re generally pretty reliable at log ingestion with ES, but we experience sporadic ES outages with log retrieval. What we’ve come to learn as a business is that our customers are less sanguine about log disruption than we are; what sometimes feels to us like secondary infrastructure reads as core platform health to them. That being the case, we can’t keep limping along on the ES architecture we booted up in 2021.

Finally: a couple weeks ago, Daniel had an on-call shift, and was, like everyone working an on-call shift here, triggered by alerts about storage capacity issues; everybody on-call sees at least a couple of these. You check to make sure the host isn’t actually running out of space, clear the alert, and go back to sleep. Unless you’re Daniel.

Daniel has had it in for the way we track available volume storage since back when he shipped GPU support for Fly Machines. There are two big problems with the way we’ve been doing this. The first is that, going back to 2021 when we first shipped volumes, the system of record for available storage has been the RDS database backing our GraphQL API; that’s a design that predates flyd and our move away from centralized resource scheduling. The second is that flyd itself has erroneous logic for querying available storage in our LVM configuration (it pulls disk usage from the wrong LVM object, causing it to misreport available space).
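
We don’t know exactly which object flyd was reading, but here’s the general shape of this class of bug, assuming (our assumption, not the post’s) an LVM thin-pool setup: free extents in the volume group tell you almost nothing once the pool is carved out; the real headroom comes from the pool’s size and data_percent as reported by lvs. A hedged sketch of the kind of query that reads the right numbers:

```rust
// Hypothetical sketch (not flyd's code): derive headroom for an LVM thin pool
// from `lvs` JSON output, rather than from the volume group's free extents,
// which stay near zero once the pool itself is allocated.
use std::process::Command;

fn thin_pool_headroom_bytes(vg: &str, pool: &str) -> Option<u64> {
    let target = format!("{vg}/{pool}"); // e.g. "data/thinpool"; names are made up
    let out = Command::new("lvs")
        .args(["--reportformat", "json", "--units", "b", "--nosuffix",
               "-o", "lv_size,data_percent"])
        .arg(&target)
        .output()
        .ok()?;
    if !out.status.success() {
        return None;
    }
    let text = String::from_utf8(out.stdout).ok()?;
    // Crude field extraction to keep the sketch dependency-free; real code
    // would use a JSON parser.
    let size: f64 = field(&text, "lv_size")?.parse().ok()?;
    let used_pct: f64 = field(&text, "data_percent")?.parse().ok()?;
    Some((size * (1.0 - used_pct / 100.0)) as u64)
}

// Pull the string value for `key` out of lvm's JSON report.
fn field<'a>(json: &'a str, key: &str) -> Option<&'a str> {
    let pat = format!("\"{key}\":\"");
    let start = json.find(&pat)? + pat.len();
    let end = json[start..].find('"')? + start;
    Some(&json[start..end])
}
```

Whether this matches flyd’s actual bug is a guess; the point is only which numbers you have to read to get a truthful utilization figure.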

The result of this situation is that we’ve been managing available storage, and, worse, storage resource scheduling (deciding which physical server to boot up new Fly Machines on) manually — and not just manually, but largely in response to alerts, some of which arrive in the middle of the night.

Daniel fixed the flyd resource calculation and surfaced it to our Fly Machines API service, starting in São Paulo, where our API storage tracking went from reporting an average 95% storage utilization across all our physicals to an average 5%. The change has since been rolled out fleetwide, and, in addition to reducing alert load, has drastically improved Fly Machine scheduling. In every region we now have significantly more headroom and, just as importantly, more diversity in our options for deploying new Machines.