2024-06-15

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer incidents, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • June 11: WireGuard Connectivity Issues In California and Frankfurt (12:00EST): We spent a few hours debugging roughly ten minutes of widespread but intermittent WireGuard failures from flyctl (if you were impacted, you’d have seen a “failed probing… context deadline exceeded” error). This turned out to be a transient networking problem at an upstream network provider.

  • June 13: Networking and Deployment Failures in Singapore (10:30EST): We (and our customers) saw elevated packet loss and sporadic errors in Singapore. This too turned out to be a problem with an upstream networking provider, which was in turn having a problem with one of its own upstreams (Cogent); the fix was disabling Cogent.

Can’t complain too much. There may come a day when we are large enough not to experience transient failures somewhere in the world, but that day is not this day. Two things we’re working aggressively on:

  • Monitoring systems sufficient to be sure our infra team is the first to detect these things and call incidents, rather than our support team (we’re good at this, but the bar is high).

  • Eliminating cursed Golang error messages like “context deadline exceeded” and “context cancelled” from our flyctl output; these content-free errors are all essentially bugs we need to fix.
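
To make that concrete, here’s a minimal Go sketch of the kind of fix we mean (this is not flyctl’s actual code; the probe function and gateway name are invented for illustration): catch the raw context error where the operation happens and wrap it with what we were doing when the deadline hit.

```go
// Translate raw context errors into messages that say what timed out,
// instead of letting a bare "context deadline exceeded" reach the user.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probeGateway stands in for any flyctl operation that takes a context.
func probeGateway(ctx context.Context, addr string) error {
	select {
	case <-time.After(2 * time.Second): // pretend the probe takes 2 seconds
		return nil
	case <-ctx.Done():
		return ctx.Err() // this is where "context deadline exceeded" leaks out
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	if err := probeGateway(ctx, "fra-gateway.example"); err != nil {
		// Wrap the content-free error with what we were doing and for how long.
		switch {
		case errors.Is(err, context.DeadlineExceeded):
			err = fmt.Errorf("timed out probing WireGuard gateway fra-gateway.example after 500ms; this usually means the gateway is unreachable from your network: %w", err)
		case errors.Is(err, context.Canceled):
			err = fmt.Errorf("probe of fra-gateway.example was interrupted before it finished: %w", err)
		}
		fmt.Println(err)
	}
}
```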

This Week In Infra Engineering

Bunch of people out this week! It’s summer (for most of us)!

Andres shipped a long-overdue feature for flyctl: if you run a flyctl command that involves some physical host on our platform (most commonly: the worker server your Machine is on), we’ll warn you if we’re currently dealing with an issue on that host. We’ve had these notices in the UI for a bit, and Andres recently shipped email alerts for any host drama that impacts your Fly Machines, but we suspect this might be the more important reporting channel, since so many of our users are CLI users.
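
The flow is roughly this (a hypothetical sketch; none of these types or lookups are flyctl’s real API): figure out which physical host the Machine lives on, ask the platform whether that host has an open issue, and print a warning before doing anything else.

```go
// Hypothetical sketch of warning CLI users about host issues before a
// Machine command runs. The types and the lookup are invented for the example.
package main

import "fmt"

// HostIssue is an assumed shape for "we're currently dealing with an issue
// on that host"; the real notice comes from our internal status tooling.
type HostIssue struct {
	HostID  string
	Status  string // e.g. "maintenance", "degraded"
	Message string
}

// lookupHostIssue stands in for whatever call flyctl actually makes.
func lookupHostIssue(hostID string) *HostIssue {
	// Pretend the platform told us this worker is being drained.
	return &HostIssue{
		HostID:  hostID,
		Status:  "maintenance",
		Message: "host is being drained; Machines are being migrated off",
	}
}

func warnIfHostImpacted(machineID, hostID string) {
	if issue := lookupHostIssue(hostID); issue != nil {
		fmt.Printf("WARNING: Machine %s is on host %s, which has an ongoing issue (%s): %s\n",
			machineID, issue.HostID, issue.Status, issue.Message)
	}
}

func main() {
	warnIfHostImpacted("machine-1234", "worker-42") // then run the command as usual
}
```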

Ben integrated some work from Saleem on our ProdSec team that keeps the original Machine’s 6PN address working for other members of the same private network during a Fly Machine migration. Recall: our 6PN private networking feature works under the hood by embedding routing information into IPv6 addresses; moving a Machine from one physical worker to another breaks that routing. This is only a problem for the small subset of apps that embed literal IPv6 addresses in their configurations. Saleem’s work applies network address translation during and after migrations; Ben’s work hooks this capability into Corrosion, our global state sharing system, to keep everyone’s Machines updated.
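
Here’s a toy illustration of the problem and the shape of the fix. The byte layout below is invented for the example and is not the real 6PN format; the point is just that if part of the address identifies the physical worker, an address a peer has memorized keeps pointing at the old worker after a migration, and the NAT layer conceptually rewrites it to the new one.

```go
// Toy illustration of why statically-configured 6PN addresses break on
// migration, and what the NAT mitigation conceptually does.
package main

import (
	"fmt"
	"net/netip"
)

// rewriteHostBits pretends bytes 6-7 of the address identify the worker host.
// That layout is an assumption made purely for this example.
func rewriteHostBits(addr netip.Addr, newHost uint16) netip.Addr {
	b := addr.As16()
	b[6] = byte(newHost >> 8)
	b[7] = byte(newHost)
	return netip.AddrFrom16(b)
}

func main() {
	// A peer has this address hard-coded in its config.
	old := netip.MustParseAddr("fdaa:0:1:aaaa::2")

	// After migration, the Machine is reachable at an address with different
	// "worker" bits; the NAT rule maps traffic sent to the old address onto it.
	migrated := rewriteHostBits(old, 0xbbbb)
	fmt.Printf("peer still dials %s, NAT forwards to %s\n", old, migrated)
}
```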

Peter is working on stalking cluster apps people have deployed that use statically-configured 6PN addresses, and thus need the mitigation Ben is working on. He’s doing that by detecting connections that aren’t preceded by a DNS lookup (a tell that the address was hard-coded), and tracking them in SQLite databases, using a tool we call petertron3000.
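
In spirit, the detection looks something like this hypothetical sketch (the real petertron3000 is surely different): remember which addresses DNS recently handed out, and record any connection to an address that never showed up in a DNS answer.

```go
// Hypothetical sketch: flag connections to addresses that were never seen in
// a DNS answer, since those were probably statically configured.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3" // assumption: any SQLite driver would do
)

type tracker struct {
	db       *sql.DB
	resolved map[string]time.Time // addr -> last time a DNS answer contained it
}

func newTracker(path string) (*tracker, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS static_conns (
		app TEXT, src TEXT, dst TEXT, seen_at TIMESTAMP)`)
	return &tracker{db: db, resolved: map[string]time.Time{}}, err
}

// sawDNSAnswer is called for every DNS response we observe.
func (t *tracker) sawDNSAnswer(addr string) { t.resolved[addr] = time.Now() }

// sawConnection is called for every new 6PN connection we observe.
func (t *tracker) sawConnection(app, src, dst string) {
	if _, ok := t.resolved[dst]; ok {
		return // the app looked this address up; it will survive a migration
	}
	if _, err := t.db.Exec(`INSERT INTO static_conns VALUES (?, ?, ?, ?)`,
		app, src, dst, time.Now()); err != nil {
		log.Printf("recording static connection: %v", err)
	}
}

func main() {
	t, err := newTracker("petertron.db")
	if err != nil {
		log.Fatal(err)
	}
	t.sawDNSAnswer("fdaa:0:1:bbbb::3")
	t.sawConnection("cluster-app", "fdaa:0:1:aaaa::2", "fdaa:0:1:cccc::5") // recorded
}
```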

Akshit and Ben did a bunch of work this week updating and improving metrics, for internal vs. edge traffic, FlyCast traffic, gateways, and flyd. Ben also caught and fixed some flyd migration bugs.

Kaz did a bunch of bug fixing and ops work in the background, but this week we’ll call out the stuff he’s been doing with customer comms, in particular the Machine Create success rate metric on our public status page, which is now much more accurate.

Simon did some rocket surgery on flyd to ensure that when an app with multiple deployed instances is migrated, its Machines are migrated serially rather than concurrently, eliminating corner cases in distributed applications.
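
Conceptually the change looks like this simplified sketch (flyd’s real machinery is more involved): migrations take a per-app lock, so Machines belonging to the same app move one at a time even when many migrations are in flight.

```go
// Simplified sketch of serializing migrations per app.
package main

import (
	"fmt"
	"sync"
	"time"
)

type migrator struct {
	mu       sync.Mutex
	appLocks map[string]*sync.Mutex // one lock per app
}

func (m *migrator) lockFor(app string) *sync.Mutex {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.appLocks[app] == nil {
		m.appLocks[app] = &sync.Mutex{}
	}
	return m.appLocks[app]
}

// migrate moves one Machine; Machines of the same app are serialized.
func (m *migrator) migrate(app, machine string) {
	l := m.lockFor(app)
	l.Lock()
	defer l.Unlock()
	fmt.Printf("migrating %s/%s\n", app, machine)
	time.Sleep(100 * time.Millisecond) // stand-in for the actual migration
}

func main() {
	m := &migrator{appLocks: map[string]*sync.Mutex{}}
	var wg sync.WaitGroup
	for _, mach := range []string{"m1", "m2", "m3"} {
		wg.Add(1)
		go func(mach string) {
			defer wg.Done()
			m.migrate("distributed-app", mach) // these run one at a time
		}(mach)
	}
	wg.Wait()
}
```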

Steve spent some time talking to Oracle about cross connects, because we have users and partners that want especially fast and reliable connectivity to Oracle OCI. So that’ll happen.

Steve also spent a bunch of time this week refactoring parts of fcm, our bespoke, Bourne Shell based physical host provisioning tool, so that it can be run from arbitrary production hosts rather than the specially designated host that it runs from now. I mean, it can’t be, not yet, but we’re… steps closer to that? We don’t know why he did this work. Sometimes people just get nerd sniped. This page is all about transparency, and Steve is this week’s designated Victim Of Transparency.

Will is working with Shaun on our platform team on a volumes project so awesome that we don’t want to spoil it yet. (Similarly: Somtochi is still working on the huge Corrosion project from last week, which is also such a big deal that you won’t hear about it until it ships or fails.)