2024-11-09

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • November 5: Customer Metrics Ingestion Disrupted (13:30EST): We operate a large VictoriaMetrics cluster that collects Prometheus-style metrics from Fly Machines (this is a built-in feature for all apps running on Fly.io; we’re generating metrics even if you don’t define any, but you can add a [metrics] configuration to your app to add additional ones). We attempted an upgrade of our cluster, which only partially succeeded, leaving the cluster in a state where queries and raw storage were functioning, but ingestion of metrics (vminsert) was compromised and generating a large and growing backlog of new entries. Several hours later, we determined that the ingestion app instances had an aggressive GOGC setting configured; restoring it to the default unjammed ingestion, and upgrading the ingestion service machine types and redeploying them cleared the backlog. Ingestion began functioning again at 16:20EST and was restored around 17:00EST.
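
For readers who haven't tuned GOGC: it controls how far the heap may grow past the live heap before Go's garbage collector starts another cycle. What follows is a rough sketch of that arithmetic, using a simplified form of the formula in the Go GC guide (GC roots ignored) and made-up heap sizes rather than numbers from this incident. It assumes "aggressive" means a low GOGC value, which the incident summary doesn't spell out.

```python
def next_gc_target_mib(live_heap_mib, gogc=100):
    """Approximate heap size (MiB) at which the next Go GC cycle starts.

    Simplified pacing rule: target = live heap * (1 + GOGC / 100).
    GC roots (stacks, globals) are left out to keep the sketch short.
    """
    if gogc < 0:
        raise ValueError("negative GOGC values are not modeled in this sketch")
    return live_heap_mib * (1 + gogc / 100)


# Hypothetical live heap of 1 GiB; the GOGC values are illustrative only.
for gogc in (10, 30, 100):
    headroom = next_gc_target_mib(1024, gogc) - 1024
    print(f"GOGC={gogc}: about {headroom:.0f} MiB of new allocation per GC cycle")
```

The lower the value, the less a process can allocate between collections, so a service that allocates heavily spends more of its CPU on garbage collection.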

This Week In Engineering

Somtochi is working on bringing up regional Corrosion clusters. Recall that Corrosion is our state-sync service; think of it as a replacement for Consul, driven by SWIM gossip rather than Raft consensus, and with a SQLite interface (it essentially gossip-syncs a big SQLite database of everything running). In the wake of the Anycast outage from a few months back, we’ve been working on splitting Corrosion into a much smaller global cluster than we currently run (that is: gossiping less state into the global cluster) and then supporting it with regional clusters. A good first approximation of what we’re talking about: the global cluster knows every Fly App running in every region on the fleet, but the regional clusters know the specific Fly Machines for those Fly Apps running on the worker physicals in their region. Anyways, that’s what Somtochi is working on; this week, that mostly involved teaching our Corrosion-backed internal DNS service how to fetch information for machines in another regional cluster.
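
To make the global/regional split concrete, here is a toy sketch of a lookup that first asks the global view which regions run an app, then asks each region for its machines. Every name and data structure in it is hypothetical; it is not Corrosion's schema or the DNS service's actual code, only an illustration of the "first approximation" described above.

```python
# Hypothetical data only; real Corrosion state lives in gossip-synced SQLite.

# Global cluster: for each app, only the regions it runs in.
GLOBAL_APP_REGIONS = {
    "example-app": ["ams", "ord"],
}

# Regional clusters: the specific machines for each app in that region.
REGIONAL_MACHINES = {
    "ams": {"example-app": ["machine-ams-1"]},
    "ord": {"example-app": ["machine-ord-1", "machine-ord-2"]},
}


def machines_for_app(app):
    """Return {region: [machine, ...]} by consulting each regional view."""
    result = {}
    for region in GLOBAL_APP_REGIONS.get(app, []):
        # In a real system this step would be a query to another region's cluster.
        regional_state = REGIONAL_MACHINES.get(region, {})
        result[region] = list(regional_state.get(app, []))
    return result


print(machines_for_app("example-app"))
```

The point of the split is that the global table stays small (apps and regions), while the per-machine detail only has to be gossiped within the region that owns it.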

JP and Jerome diagnosed and fixed a gnarly volume migration bug that temporarily broke Jerome’s Fediverse server. If a Fly Volume is extended while that volume is in the process of being migrated (meaning that behind the scenes, dm-clone is still “hydrating” the volume over a temporary iSCSI connection from the origin worker physical), the underlying volume operation could apply to the wrong block device (the temporary clone device, not the final device). This was a missing step in the flyd FSM for restarting Fly Machines, now fixed.

Will upgraded VictoriaMetrics. The one incident we had last week was from an aborted partial attempt to upgrade Vicki. Well, we succeeded this week. In the process of investigating that outage and completing the upgrade this week, Will spotted a perf issue in upstream Vicki that degraded cache performance in Vicki clusters with large numbers of tenants (like we operate), and wrote an upstream PR for it.

Steve was on support rotation. Engineers across the team all do time, a couple days at a time, as technical escalation for our support team. Our support team is great, but being directly exposed to customers is as helpful for product engineering as it is for the support team. Steve and Peter are also hip-deep in working out plans to reboot large numbers of worker physicals, which is a fun problem we’ll be writing about in the weeks to come. Nothing dramatic is going on; we just need a process to reliably schedule reboots and maintenance windows.

Peter spent the week rolling out lazy-loading Corrosion state in fly-proxy. Currently, all the state we hold about every app running on the fleet is kept in-memory in fly-proxy (the component that picks up your HTTP requests from the Internet and relays them to your Fly Machines). As part of the work we’re doing to make Corrosion more resilient (along with regional clusters), we’re changing this, so that we load state for apps only when they’re actually requested. By way of example: the author of this log had one bourbon too many back in 2022 and booted up “paulgra-ham.com” on Fly.io, which is an app that has never once been requested since. Ever since that moment, fly-proxy has assiduously kept abreast of the current state of “paulgra-ham.com”, every minute of every day on every edge and worker in the fleet. This is dumb, and makes fly-proxy brittle. So we’re not doing it anymore.
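
The lazy-loading idea can be sketched in a few lines. This is a minimal illustration, not fly-proxy's implementation (which isn't shown in this log): the class name, the fetch callback, and the shape of the returned state are all assumptions made for the example.

```python
class LazyAppState:
    """Load per-app state on first request instead of holding it all up front."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callback that looks up one app's state
        self._loaded = {}        # only apps that have actually been requested
        self.fetch_count = 0

    def get(self, app_name):
        if app_name not in self._loaded:
            self.fetch_count += 1
            self._loaded[app_name] = self._fetch(app_name)
        return self._loaded[app_name]

    def loaded_apps(self):
        return sorted(self._loaded)


def fake_fetch(app_name):
    # Stand-in for a state lookup; the returned fields are made up.
    return {"app": app_name, "machines": []}


state = LazyAppState(fake_fetch)
state.get("busy-app")
state.get("busy-app")  # second request is served from memory
print(state.fetch_count, state.loaded_apps())
```

An app that is never requested never costs the proxy anything in this model. A real version would also need to keep already-loaded entries up to date and evict idle ones, which this sketch leaves out.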