Trailing 2 Weeks Incidents
(Incident chart: larger boxes are longer incidents, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents.)
July 8: Capacity Issues in ORD (10:00 EST): For roughly an hour, Machine launches in ORD failed for lack of physical server capacity. This was a combination of issues: capacity constrained by the decommissioning of older physical servers and by Machine migration, physical hosts marked ineligible for maintenance that had completed a while ago, and ordinary user growth in the ORD region, which normally wouldn’t cause problems but did in this case because of the preceding two problems. Fixing the eligibility status resolved the immediate incident, and we’ve provisioned additional capacity in ORD.
July 10: Elixir API Server Down (23:00 EST): A failed deploy took down our Elixir API server. Most of our day-to-day APIs are served from our legacy Rails API server, and our Machines API is served by a fleet of Golang API servers deployed around the world, but we have some internal APIs used by partners that are served from Elixir. This should have had minimal customer impact. A revert fixed the problem within a few minutes.
July 11: Request Routing Disruption in LAX (9:00 EST): A failed deploy took Corrosion, our state-management system, down in the LAX region for 5-10 minutes. During that window, and within the LAX region, request routing and deployment information may have been stale.
July 12: Redis Capacity Issues Disrupted APIs (18:00 EST): For legacy reasons, our legacy Rails API, which serves the majority of our user-facing API calls (including our GraphQL API), is backed by a Redis server we manage ourselves on an ad-hoc basis. A change in how we track Sidekiq background jobs caused a spike in the amount of storage we demand from that Redis server, which got us to a place where Redis was erroring for about 5 minutes while we extended the underlying volume. During that window, deployments would have failed.
This Week In Infra Engineering
Will shipped bottomless storage volumes backed by Tigris. This is big! Last fall, Matthew Ingwersen announced log-structured virtual disks that cache blocks locally while writing them to object storage for durability; the net effect is a “bottomless volume” that is continuously in a snapshotted state. The tradeoff was that blocks had to be written to off-network object storage, like S3, which adds an order of magnitude of latency to uncached reads. Tigris is S3-flavored object storage that is both directly attached to Fly.io and localized to the regions we operate in, which drastically improves performance. It’s early days yet, and this feature is experimental, but we’d like to get it tuned well enough to be a sane default choice for general-purpose storage.
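To make “cache blocks locally, persist them to object storage” concrete, here’s a rough sketch of the read/write path such a virtual disk might have, ignoring the log-structured layout. This is not Fly.io’s implementation; the ObjectStore interface, block size, and key scheme are assumptions for illustration.

```go
package blockcache

import (
	"context"
	"fmt"
	"sync"
)

// BlockSize is a made-up block size for illustration.
const BlockSize = 256 * 1024

// ObjectStore is a stand-in for an S3-compatible backend (e.g. Tigris);
// the real system's interface will differ.
type ObjectStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, data []byte) error
}

// Volume caches blocks locally and falls back to object storage on a miss.
type Volume struct {
	mu    sync.Mutex
	cache map[uint64][]byte // block index -> data; a real cache would be bounded
	store ObjectStore
	name  string
}

func New(name string, store ObjectStore) *Volume {
	return &Volume{cache: make(map[uint64][]byte), store: store, name: name}
}

// ReadBlock serves a block from the local cache when possible; an uncached
// block costs a round trip to object storage, which is why keeping that
// storage in-region matters so much for latency.
func (v *Volume) ReadBlock(ctx context.Context, idx uint64) ([]byte, error) {
	v.mu.Lock()
	if b, ok := v.cache[idx]; ok {
		v.mu.Unlock()
		return b, nil
	}
	v.mu.Unlock()

	b, err := v.store.Get(ctx, fmt.Sprintf("%s/%d", v.name, idx))
	if err != nil {
		return nil, err
	}
	v.mu.Lock()
	v.cache[idx] = b
	v.mu.Unlock()
	return b, nil
}

// WriteBlock persists the block durably before acknowledging, which is what
// keeps the volume continuously in a snapshotted state.
func (v *Volume) WriteBlock(ctx context.Context, idx uint64, data []byte) error {
	if err := v.store.Put(ctx, fmt.Sprintf("%s/%d", v.name, idx), data); err != nil {
		return err
	}
	v.mu.Lock()
	v.cache[idx] = data
	v.mu.Unlock()
	return nil
}
```

In a sketch like this, the uncached ReadBlock round trip is exactly the cost that moving object storage into the region shrinks.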
Andres shipped a first cut of a new synthetic monitoring system (“synthetics” is the cool-kid way of saying “actually making requests and seeing if they complete”, as opposed to watching metrics). We had some synthetic monitoring, but now we have substantially more, broken out by region, particularly for the APIs reachable from flyctl, our CLI.
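Here’s the flavor of a synthetic check, as a minimal Go sketch; the endpoint and timing budget are invented, and the real system records results per region rather than printing them.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe performs one synthetic check: issue a real request against an
// endpoint and report success plus latency, the way a metrics-only view can't.
func probe(url string) (ok bool, latency time.Duration) {
	client := &http.Client{Timeout: 10 * time.Second}
	start := time.Now()
	resp, err := client.Get(url)
	latency = time.Since(start)
	if err != nil {
		return false, latency
	}
	defer resp.Body.Close()
	return resp.StatusCode < 500, latency
}

func main() {
	// Hypothetical target; real checks would cover the APIs flyctl talks to.
	ok, d := probe("https://api.example.com/healthz")
	fmt.Printf("ok=%v latency=%s\n", ok, d)
}
```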
Akshit and Steve worked on internal bandwidth tracking, in part to support the egress pricing work Akshit talked about a few weeks back. Steve’s work gives us improved visibility into our own internal traffic between all pairs of servers, regions, and data centers.
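One plausible way to expose that kind of per-pair accounting is a counter labeled by source and destination; the sketch below uses a Prometheus-style metric, and the metric and label names are our invention, not the real schema.

```go
package trafficmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

// internalBytes counts bytes moved between pairs of regions. The metric and
// label names here are illustrative, not Fly.io's actual schema.
var internalBytes = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "internal_traffic_bytes_total",
		Help: "Bytes of internal traffic between source and destination.",
	},
	[]string{"src_region", "dst_region"},
)

func init() {
	prometheus.MustRegister(internalBytes)
}

// recordTransfer is called wherever byte counts are observed.
func recordTransfer(srcRegion, dstRegion string, n float64) {
	internalBytes.WithLabelValues(srcRegion, dstRegion).Add(n)
}
```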
John worked on our continuing theme of migrating from and decommissioning older hardware, and, in the process, resolved a gnarly problem with LVM2 metadata stores running near capacity. LVM2 is the userland counterpart to devicemapper, the kernel’s block storage framework; if you think of LVM2 and devicemapper together as an implementation of a software RAID controller, you’re not far off. LVM2 virtualizes block storage devices on top of physical devices, and reserves space on each physical device to track metadata about which sectors are being used where; if that space runs out, all hell breaks loose, and extending metadata space is tricky to do, but it’s much less tricky now. This is one of those random backend infra engineering problems that make migrations tricky (to balance workloads between servers and migrate off old servers, you sometimes want to migrate jobs to places where there’s LVM2 metadata pressure), and solving it makes it much easier for us to migrate jobs without ceremony. Maybe you have to have dealt with LVM2 PV metadata issues for them to be as interesting to you as they are to us. We’ll shut up now.
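If you want to check the metadata headroom on your own physical volumes, pvs can report it. Here’s a small sketch of a check (not our actual tooling); the field names are standard pvs report fields, but the 90%-full warning threshold is arbitrary, and you should consult pvs(8) on your system.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// Print LVM2 physical volumes whose metadata area is mostly used. Field names
// come from pvs(8); the 90% threshold is an arbitrary example.
func main() {
	out, err := exec.Command("pvs", "--noheadings", "--nosuffix", "--units", "b",
		"--separator", ",", "-o", "pv_name,pv_mda_size,pv_mda_free").Output()
	if err != nil {
		fmt.Println("pvs failed:", err)
		return
	}
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		fields := strings.Split(strings.TrimSpace(sc.Text()), ",")
		if len(fields) != 3 {
			continue
		}
		size, _ := strconv.ParseFloat(fields[1], 64)
		free, _ := strconv.ParseFloat(fields[2], 64)
		if size > 0 && free/size < 0.10 {
			fmt.Printf("%s: metadata area %.0f%% full\n", fields[0], 100*(1-free/size))
		}
	}
}
```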
Dusty is on a top-secret mission to increase the speed of OCI image pulls from containerd. Recall: you deploy, and push a Docker image to our registry. Then, a worker server, running containerd, pulls that image from the registry into its own local storage, and converts it to a block storage device we can boot a VM on. That containerd image pull is the dominant factor in how long it takes to create a Fly Machine, and we’d like create to be asymptotically as fast as start (which is so fast you can start a Fly Machine to handle an incoming HTTP request on the fly).
Peter shipped fallback routing in fly-proxy, and we can’t write it up any better than he did, so go follow that link.
Tom did a bunch of anti-abuse stuff we’re not allowed to talk about. In lieu of a fun writeup of the anti-abuse stuff Tom did, we’ve instead been asked to describe the on-call drama that kept him busy for much of the week:
ElasticSearch randomly exploded when we rolled over an index because of an incompatibility between our log ingestion (which expects JSON logs), Vector (which expects and manipulates JSON logs), and the feature flagging library we use, which does not log JSON.
Our adoption of OverlayFS for containers sharply increases the number of LVM2 volumes we need to track, which puts pressure on LVM2’s metadata storage (see above), which requires us to reprovision physical storage disks with increased metadata storage. This is especially painful because ongoing Machine migration has the side-effect of converting Fly Machines to OverlayFS backing store, and we’re migrating a lot of stuff.
Disks were getting full because of (a) bugs in our deployment tools leaving lots of junk around in /tmp, (b) side effects of migration and OverlayFS (see above) logging metadata deltas, and (c) a now-fixed bug in flyd that created extraneous LVs.