Update: Jul 13, 2024

Trailing 2 Weeks Incidents

A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and will generally be a superset of our status page events. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • July 8: Capacity Issues In ORD (10:00EST): For roughly an hour, Machine launches in ORD failed for lack of physical server capacity. This was a combination of issues: constrained capacity due to decommissioning older physicals and Machine migration, physical hosts still being marked ineligible due to maintenance that had completed a while ago, and also just user growth in the ORD region, which normally wouldn’t cause problems, but did in this case because of the preceding two problems. Fixing the eligibility status resolved the immediate incident, and we’ve provisioned additional capacity in ORD.

  • July 10: Elixir API Server Down (23:00EST): A failed deploy took down our Elixir API server. Most of our day-to-day APIs are served from our legacy Rails API server, and our Machines API server is served from a fleet of Golang API servers deployed around the world, but we have some internal APIs used by partners that are served from Elixir. This should have had minimal customer impact. A revert fixed the problem within a few minutes.

  • July 11: Request Routing Disruption in LAX (9:00EST): A failed deploy took Corrosion, our state-management system, down in the LAX region for 5-10 minutes. During that window of time and within the LAX region, request routing and deployment information may have been stale.

  • July 12: Redis Capacity Issues Disrupted APIs (18:00EST): For legacy reasons, our legacy Rails API, which serves the majority of our user-facing API calls (including our GraphQL API), is backed by a Redis server we manage ourselves on an ad-hoc basis. A change in how we track Sidekiq background jobs caused a spike in the amount of storage we demand from that Redis server, which got us to a place where Redis was erroring for about 5 minutes while we extended the underlying volume. During that window, deployments would have failed.

This Week In Infra Engineering

Will shipped bottomless storage volumes backed by Tigris. This is big! Last fall, Matthew Ingwersen announced log-structured virtual disks that cache blocks while writing them to object storage for durability; the net effect is a “bottomless volume” that is continuously in a snapshotted state. The tradeoff was that you had to write them to off-network object storage, like S3, which adds an order of magnitude of latency to uncached blocks. Tigris is S3-flavored object storage that is both directly attached to Fly.io and also localized to the regions we operate in, which drastically improves performance. It’s early days yet, and this feature is experimental, but we’d like to get this tuned well enough to be a sane default choice for general-purpose storage.
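To make the idea concrete, here is a toy sketch of a log-structured volume: a small local block cache in front of an append-only log in object storage. This is not the actual implementation; the `BottomlessVolume` class, the cache policy, and the dict standing in for an S3-style bucket are all assumptions made up for illustration.

```python
class BottomlessVolume:
    """Toy log-structured block device (hypothetical, for illustration).

    Writes append to a log in an object store (a dict here); reads try a
    small local cache first and fall back to the object store.
    """

    def __init__(self, object_store, cache_blocks=2):
        self.store = object_store      # stand-in for an S3-style bucket
        self.index = {}                # block number -> key of newest log entry
        self.cache = {}                # block number -> bytes (local, fast)
        self.cache_blocks = cache_blocks
        self.seq = 0                   # monotonically increasing log position
        self.remote_reads = 0          # counts slow, uncached reads

    def write(self, block, data):
        key = f"log/{self.seq:08d}"
        self.seq += 1
        self.store[key] = (block, data)    # append only; old entries remain
        self.index[block] = key
        self._cache_put(block, data)

    def read(self, block):
        if block in self.cache:
            return self.cache[block]
        self.remote_reads += 1             # the slow path the text describes
        _, data = self.store[self.index[block]]
        self._cache_put(block, data)
        return data

    def _cache_put(self, block, data):
        self.cache.pop(block, None)
        self.cache[block] = data           # newest entry goes last
        while len(self.cache) > self.cache_blocks:
            self.cache.pop(next(iter(self.cache)))   # evict the oldest entry
```

Because log entries are never overwritten, every log position is effectively a snapshot, and the volume can grow past the size of the local disk. The cost shows up in `remote_reads`: every uncached block is a round trip to object storage, which is why putting that storage close to the worker matters.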

Andres shipped a first cut of a new synthetic monitoring system (“synthetics” is the cool-kid way of saying “actually making requests and seeing if they complete”, as opposed to watching metrics). We had some synthetic monitoring, but now we have substantially more, broken out into regions, particularly for the APIs reachable from flyctl, our CLI.
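For anyone who hasn’t seen one, a synthetic check can be sketched in a few lines. This is a minimal illustration, not Andres’s system: the function names, the region-to-URL mapping, and the injectable `fetch` parameter (there so the check can be exercised without a network) are all assumptions.

```python
import time
import urllib.request


def http_fetch(url, timeout):
    """Return the HTTP status code for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def probe(url, fetch=http_fetch, timeout=5.0):
    """Make one real request and report whether it completed."""
    started = time.monotonic()
    try:
        status = fetch(url, timeout)
        ok = 200 <= status < 300
    except Exception:                 # timeouts, DNS errors, HTTP errors
        status, ok = None, False
    elapsed = time.monotonic() - started
    return {"url": url, "ok": ok, "status": status, "seconds": elapsed}


def run_checks(targets, fetch=http_fetch):
    """targets maps a region label to a URL; returns one result per region."""
    return {region: probe(url, fetch) for region, url in targets.items()}
```

The point of breaking results out by region is that a failure in one place is visible on its own, instead of being averaged away in a fleet-wide metric.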

Akshit and Steve worked on internal bandwidth tracking, in part to support the egress pricing work Akshit talked about a few weeks back. Steve’s work gives us improved visibility for our own internal traffic between all pairs of servers, regions, and data centers.

John worked on our continuing theme of migrating from and decommissioning older hardware, and, in the process, resolved a gnarly problem with LVM2 metadata stores running near capacity. LVM2 is the userland counterpart to devicemapper, the kernel’s block storage framework; if you think of LVM2 and devicemapper together as an implementation of a software RAID controller, you’re not far off. LVM2 virtualizes block storage devices on top of physical devices, and reserves space on each physical to track metadata about which sectors are being used where; if space runs out, all hell breaks loose, and extending metadata space is tricky to do, but is much less tricky now. This is one of those random backend infra engineering problems that make migrations tricky (to balance workloads between servers and migrate off old servers, you sometimes want to migrate jobs to places where there’s LVM2 metadata pressure) which, once solved, makes it much easier for us to migrate jobs without ceremony. Maybe you have to have dealt with LVM2 PV metadata issues for them to be as interesting to you as they are to us. We’ll shut up now.

Dusty is on a top-secret mission to increase the speed of OCI image pulls from containerd. Recall: you deploy, and push a Docker image to our registry. Then, a worker server, running containerd, pulls that image from the registry into its own local storage, and converts it to a block storage device we can boot a VM on. That containerd image pull is the dominant factor in how long it takes to create a Fly Machine, and we’d like create to be asymptotically as fast as start (which is so fast you can start a Fly Machine to handle an incoming HTTP request on the fly).

Peter shipped fallback routing in fly-proxy, and we can’t write it up any better than he did, so go follow that link.

Tom did a bunch of anti-abuse stuff we’re not allowed to talk about. In lieu of a fun writeup of the anti-abuse stuff Tom did, we’ve instead been asked to describe the on-call drama that kept him busy for much of the week:

  • ElasticSearch randomly exploded when we rolled over an index because of an incompatibility between our log ingestion (which expects JSON logs), Vector (which expects and manipulates JSON logs), and the feature flagging library we use, which does not log JSON.

  • Our adoption of OverlayFS for containers sharply increases the number of LVM2 volumes we need to track, which puts pressure on LVM2’s metadata storage (see above), which requires us to reprovision physical storage disks with increased metadata storage. This is especially painful because ongoing Machine migration has the side-effect of converting Fly Machines to OverlayFS backing store, and we’re migrating a lot of stuff.

  • Disks were getting full because of (a) bugs in our deployment tools leaving lots of junk around in /tmp, (b) side effects of migration and OverlayFS (see above) logging metadata deltas, and (c) extraneous LV creation by flyd, a bug, which is now fixed.
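On the ElasticSearch item above: one generic defense against a non-JSON logger sneaking into a JSON-only pipeline is to normalize every line before it is shipped. The sketch below is an assumption about how that could look, not a description of the fix we made; the `normalize_log_line` function and the `message` and `parse_error` field names are invented for the example.

```python
import json


def normalize_log_line(line):
    """Return a JSON object string for any input line.

    Lines that already parse as a JSON object pass through; anything else
    is wrapped so JSON-only consumers downstream never see raw text.
    """
    try:
        parsed = json.loads(line)
        if isinstance(parsed, dict):
            return json.dumps(parsed)
    except json.JSONDecodeError:
        pass
    return json.dumps({"message": line.rstrip("\n"), "parse_error": True})
```

Flagging the wrapped lines (rather than silently dropping them) keeps the output of the misbehaving library searchable, and makes it easy to find which component is not logging JSON.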

Update: Jul 6, 2024

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • July 1: Consul Template Tooling (Internal) (23:00EST): Following a system update earlier in the day, consul-templaterb began exploding on a small number of our edge servers. It’s been over a year since we used Consul to track user application state, so this tooling isn’t in the critical path for user applications; in other words, this incident had no customer impact. It turned out to be an incompatible system Ruby configuration (our software update zapped a Ruby gem consul-templaterb depended on).

  • July 2: Poor Network Performance for Tigris in IAD (4:00EST): Tigris is our object storage partner and you should definitely check them out. At 4AM on Tuesday, they reported that they were seeing slow downloads from east coast regions, especially IAD. This turned out to be an upstream networking issue, resolved by a transit provider by adjusting routes, roughly 2 hours later.

  • July 2: Connectivity Loss in IAD (15:00EST): A BGP change at an upstream provider broke connectivity to our IAD data center for several minutes; this was unrelated to the previous incident, but much more severe (and thankfully brief).

  • July 3: Hardware Failure Breaks Upstash Redis in IAD (12:00EST): Upstash is our Redis partner and you should definitely check them out. Upstash runs distributed clusters of Redis servers in each of our regions. A quorum of the Fly Machines running their IAD cluster were scheduled onto a single server, which, months later, noticed and blew up. The server was recovered several hours later, and during the interval the Upstash cluster was rebuilt with a different Fly Machine on a different IAD server. This problem impacted only the IAD region, but IAD is an important region.

  • July 4: LiteFS Cloud RAFT Cluster Failure in IAD (20:00EST): LiteFS Cloud is a managed LiteFS service we run for our customers. Our internal LiteFS clusters run a RAFT quorum scheme for leader election and cluster tracking. An open files rlimit configuration bug forced a node in the lfsc-iad-1 cluster to restart, which in turn tickled a bug in dragonboat, the Golang RAFT library the service used, which in turn forced us to rebuild the cluster. This incident had marginal customer impact and maximal Ben Johnson impact.

  • July 5: Elevated Machine Creation Alerts (1:00EST): Our infra team was alerted about elevated errors from the Fly Machines API. A different internal team had created a Fly Kubernetes cluster with an invalid name. Not a real incident, no customer impact, but we document everything here; that’s the rule.

This Week In Infra Engineering

The 4th of July hit on a Thursday this year, making this an extended holiday weekend for a big chunk of our team.

The big stories this week are mostly the same as last week. We continued deploying and ironing out bugs in Corrosion record compaction, we migrated off a bunch of old physical servers and continued building out migration tooling to make it even easier to drain workloads from arbitrary servers, and we improved incident alerting for customers in our UI and in flyctl.

The most important work that happened in this abbreviated week was all internal process stuff: we roadmapped out the next 12 weeks of infra work for networking, block storage, observability, hardware provisioning, and Corrosion. Lots of new projects hitting, which we’ll be talking about in upcoming posts.

Update: Jun 29, 2024

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • June 25: Authorization Errors With NATS Log Shipping (15:00EST): A customer informed our support team about an “Authorization Error” received when connecting NATS (via our internal API endpoint) to ship logs (this is a feature of the platform, normally used with the Fly Log Shipper, intended to allow users to connect their Fly.io platform logs to an off-network log management platform). As it turned out, we’d just done some work tightening up the token handling in our internal API server, and missed a corner case (users using fully-privileged Fly.io API tokens — don’t do this! — to ship logs). It took about 30 minutes to deploy a fix.

  • June 27: 502s on Some Edges Due To Corrosion Reseeding (13:00EST): Our monitoring picked up HTTP 502 errors from some of our apps, which we tracked down to stale data in Corrosion, our distributed state tracking system. We’d recently done major maintenance with Corrosion’s database, and it had knocked out Corrosion on a small number of our edges, causing it to miss updates for about 30 minutes. The underlying issue was resolved relatively quickly, but a corner-casey interaction with blue/green deploys stuck several apps (roughly 10) that deployed during the outage in a bad state that we had to reconcile manually over the next hour.

  • June 28: 502s in Sao Paulo (16:45EST): About 5 apps, including our own Elixir app, saw sharply elevated HTTP 502 errors, which we again traced to stale Corrosion data, possibly from the work done the day previously. We mitigated the issue by resyncing our proxy and Corrosion, which cut errors by 3000% but didn’t eliminate them; we narrowed errors to a particular GRU edge server, and stopped advertising it, which eliminated the problem. We’re still investigating what happened here.

This Week In Infra Engineering

Somtochi rolled out a major change to the way we track distributed state with Corrosion. Because Corrosion is a distributed system (based principally on SWIM gossip) and no distributed system is without sin, we have to carefully watch the amount of data it consumes; updates are relatively easy to propagate, but eliminating space for old, overridden data is difficult; this is the “compaction” problem. Somtochi and Jerome worked out a straightforward scheme for doing compaction, but it required adding an index to a table that had been growing without bound for many months, and would potentially trigger multi-minute startup lags everywhere Corrosion needed to get reinstalled. Instead of doing that, we “re-seeded” Corrosion, taking a known-good dataset from one of our nodes, compacting it, and then using it as the basis for new Corrosion databases. This was rolled out on many hundreds of hosts without event, and on a small number of edge servers (which have much slower disks) with some events, which you just read about above.
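If “compaction” is an unfamiliar term, the core idea fits in a few lines: given a history of versioned updates, keep only the newest version of each row and drop everything it overrides. This is a generic sketch, not Corrosion’s scheme; the `(table, key, version, value)` tuple shape is an assumption chosen for the example.

```python
def compact(changes):
    """Keep only the newest version of each (table, key) pair.

    `changes` is an iterable of (table, key, version, value) tuples, a
    hypothetical shape chosen for this example.
    """
    latest = {}
    for table, key, version, value in changes:
        current = latest.get((table, key))
        if current is None or version > current[0]:
            latest[(table, key)] = (version, value)
    return [(t, k, v, val) for (t, k), (v, val) in sorted(latest.items())]
```

Re-seeding, in these terms, means running something like this once on a known-good copy and handing the compacted result to every node as its starting database, rather than having each node do the expensive cleanup on its own at startup.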

Akshit worked on improving the metrics we’re using for bandwidth billing, putting us in a position to true up bandwidth accounting by more carefully tracking inter-region (like, Virginia to Frankfurt) traffic, especially for users with app clusters where only some of the apps have public addresses. You’ll hear more about this from us! This is just the infra side of that work.

After Peter wrote a brief postmortem of an incident from last week, Ben Ang worked out a system to more carefully track deployments of internal components, especially when those deployments happen piecemeal as opposed to full-system redeploys. Since the first question you ask when you’re handling an incident is “what changed”, anything that gives us quicker answers also gives us shorter incidents.

Dusty, John, Simon, and Peter all worked on draining old servers, migrating Fly Machines to newer, faster hardware. This is all we’ve been talking about here for the last month or so, and it’s happening at scale now.

Andres got tipped off by an I/O performance complaint on a Mumbai worker and ended up tracking down a small network of crypto miners. The hosting business; how do you not love it? Andres did other stuff this week, too, but this was the only one that was fun to write about.

Will wrapped up his NATS log-shipping work. We’ll let him tell the story.

Update: Jun 22, 2024

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • June 17: Southeast Asia Connectivity (15:00EST): We saw high packet loss and interface flapping on edge servers in Singapore. We stopped announcing SIN Anycast routes, redirecting SIN traffic to other nearby regions, while we investigated the problem, which resolved roughly an hour later. There would have been minimal customer impact (for the duration of the event, which didn’t impact worker servers, we would have had somewhat worse Anycast routes.)

  • June 17: Internal App Deployment Failure That Turned Out To Be Nothing (20:30EST): An infra team member made a change to our API server and later deployed it; the deployment took upwards of an hour, and in a “where there’s smoke there’s fire” move called an incident. The incident: a typo in the code they were deploying, along with a bug in our API server that exited with a non-failure status. No customer impact.

  • June 18: Volume Capacity in Brazil (10:00EST): The platform began reporting a lack of available space for new volumes in our GRU region. We were not in fact low on available volume space; rather, a change we pushed out to Corrosion, our internal state-sharing system, had a SQL bug that mis-sorted worker servers (on a condition that only occurred in GRU). We had a workaround published within 15 minutes (you could restart your “builder” Machine, the thing we run to build containers for you, and dodge the problem), and a sitewide fix within 90 minutes.

  • June 21: Midsummer Night’s Billing Outage (3:00EST): For an interval of about 2 hours, an upstream billing provider had an outage, which in turn broke some of our invoice reporting features; notably, if you had been issued a credit that you only tried to redeem during the outage, it would not have shown up (you wouldn’t have lost the credit, but you couldn’t have used it at 3:00EST).

This Week In Infra Engineering

Intra-region host migrations are unblocked again! This is huge for us.

Peter worked with our upstream providers to eliminate pathological AS-path routes impacted by recent APAC undersea cable cuts. This work started with us noticing relatively high packet loss in Asia regions, and resulted in us drastically reducing timeouts in our own telemetry and tooling, and improving network quality for users. A very big win that we’re looking to compound with better monitoring and tooling. He also figured out a configuration bug that was causing Fly Machines not to use BBR congestion control on private networking traffic, which is now fixed.

Dusty and Matt got all our multi-node Postgres clusters in condition to migrate (recall: multi-node Postgres clusters had been problematic for us, because they were configured to use literal IPv6 addresses for their peer configurations, and migration breaks those addresses, which embed routing information).

In addition to spending 30 working hours getting a single email announcement (about migrations) out to customers, John shipped our 6PN address forwarding tooling, along with Ben, out to the fleet, making it possible to migrate clusters that refer to literal IPv6 addresses. Dusty, Peter, John and Matt began draining hosts, moving the Machines running on them to more stable, modern, resilient systems on better upstreams, and lining us up to decom the much older machines. Ben drained an old server live during our internal Town Hall meeting. It was an emotional moment.

Still a bunch of people out this week! It’s summer (for most of us)!

Update: Jun 15, 2024

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • June 11: WireGuard Connectivity Issues In California and Frankfurt (12:00EST): We spent a few hours debugging roughly ten minutes of widespread but intermittent WireGuard failures from flyctl (if you were impacted, you’d have seen a “failed probing… context deadline exceeded” error). This turned out to be a transient networking problem at an upstream network provider.

  • June 13: Networking and Deployment Failures in Singapore (10:30EST): We (and our customers) saw elevated packet loss and sporadic errors in Singapore. This too turned out to be a problem with an upstream networking provider, who was in turn having a problem with one of their upstreams (Cogent, solved by disabling Cogent).

Can’t complain too much. There may come a day when we are large enough not to experience transient failures somewhere in the world, but that day is not this day. Two things we’re working aggressively on:

  • Monitoring systems sufficient to be sure our infra team are the first to detect these things and call incidents, rather than our support team (we’re good at this, but the bar is high).

  • Eliminating cursed Golang error messages like “context deadline exceeded” and “context canceled” from our flyctl output; these content-free errors are all essentially bugs we need to fix.

This Week In Infra Engineering

Bunch of people out this week! It’s summer (for most of us)!

Andres shipped a long-overdue feature for flyctl: if you run a flyctl command that involves some physical host on our platform (most commonly: the worker server your Machine is on), we’ll warn you if we’re currently dealing with an issue on that host. We’ve had these notices in the UI for a bit, and Andres recently shipped email alerts for any host drama that impacts your Fly Machines, but we suspect this might be the more important reporting channel, since so many of our users are CLI users.

Ben integrated some work from Saleem on our ProdSec team that, during a Fly Machine migration, makes the original Machine’s 6PN address still appear to work for other members of the same network. Recall: our 6PN private network feature works under the hood by embedding routing information into IPv6 addresses; moving a Machine from one physical worker to another breaks that routing. This is only a problem for a small subset of apps that embed literal IPv6 addresses in their configurations. Saleem’s work applies network address translation during and after migrations; Ben’s work links this capability into Corrosion, our global state sharing system, to keep everyone’s Machine updated.
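Conceptually, the forwarding piece is a lookup table from a Machine’s pre-migration address to wherever it lives now. The sketch below illustrates only that idea; it is not Saleem’s or Ben’s code, the `ForwardingTable` class is hypothetical, and the addresses in the usage note are made up.

```python
import ipaddress


class ForwardingTable:
    """Maps a Machine's pre-migration address to its current one."""

    def __init__(self):
        self._moves = {}

    def record_migration(self, old, new):
        self._moves[ipaddress.IPv6Address(old)] = ipaddress.IPv6Address(new)

    def translate(self, dest):
        """Follow recorded moves so stale literal addresses still resolve."""
        addr = ipaddress.IPv6Address(dest)
        seen = set()                      # guards against accidental loops
        while addr in self._moves and addr not in seen:
            seen.add(addr)
            addr = self._moves[addr]
        return str(addr)
```

If a Machine moves twice, say from `fdaa:0:1:a7b:1::2` to `fdaa:0:1:a7b:2::2` and then to `fdaa:0:1:a7b:3::2`, a peer still holding the first address gets translated to the third. The hard part in practice is the one the paragraph describes: keeping that table current on every host, which is where a global state-sharing system comes in.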

Peter is working on stalking cluster apps people have deployed that use statically-configured 6PN addresses, and thus need the mitigation Ben is working on. He’s doing that by detecting connections that originate prior to DNS lookups, and tracking them in SQLite databases, using a tool we call petertron3000.
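The detection logic can be sketched with two SQLite tables: one for DNS answers an app received, one for connections it opened. An app that connects to an address it never looked up beforehand is probably using a literal. This is a guess at the shape of the technique, not petertron3000 itself; the schema and function names are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dns_lookups (app TEXT, answer TEXT, at REAL);
    CREATE TABLE connections (app TEXT, dest TEXT, at REAL);
""")


def record_lookup(app, answer, at):
    db.execute("INSERT INTO dns_lookups VALUES (?, ?, ?)", (app, answer, at))


def record_connection(app, dest, at):
    db.execute("INSERT INTO connections VALUES (?, ?, ?)", (app, dest, at))


def apps_using_literal_addresses():
    """Apps that connected to an address they had not resolved beforehand."""
    rows = db.execute("""
        SELECT DISTINCT c.app FROM connections AS c
        WHERE NOT EXISTS (
            SELECT 1 FROM dns_lookups AS d
            WHERE d.app = c.app AND d.answer = c.dest AND d.at <= c.at
        )
        ORDER BY c.app
    """)
    return [app for (app,) in rows]
```

An app that resolves a name and then connects to the answer is left alone; an app that connects with no matching earlier lookup lands on the list of candidates for the forwarding mitigation.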

Akshit and Ben did a bunch of work this week updating and improving metrics, for internal vs. edge traffic, FlyCast traffic, gateways, and flyd. Ben also caught and fixed some flyd migration bugs.

Kaz did a bunch of bug fixing and ops work in the background, but this week we’ll call out the stuff he’s been doing with customer comms, in particular this Machine Create success rate metric on our public status page, which is now much more accurate.

Simon did some rocket surgery on flyd to ensure that applications that are migrated with multiple deployed instances are migrated serially rather than concurrently, to eliminate corner cases in distributed applications.
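The serial-versus-concurrent distinction is simple to state in code. This is a sketch of the general pattern only, not flyd’s logic; the `migrate_one` callback and the stop-on-first-failure behavior are assumptions.

```python
def migrate_app_serially(instances, migrate_one):
    """Migrate one instance at a time.

    Stops at the first failure, so the remaining instances keep serving
    from their original hosts (an assumed policy for this example).
    """
    migrated = []
    for instance in instances:
        if not migrate_one(instance):
            break
        migrated.append(instance)
    return migrated
```

With a concurrent migration, every member of a cluster can be mid-move at the same moment; going one at a time means the rest of the cluster is always up while any single member is in flight.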

Steve spent some time talking to Oracle about cross connects, because we have users and partners that want especially fast and reliable connectivity to Oracle OCI. So that’ll happen.

Steve also spent a bunch of time this week refactoring parts of fcm, our bespoke, Bourne Shell based physical host provisioning tool, so that it can be run from arbitrary production hosts rather than the specially designated host that it runs from now. I mean, it can’t be, not yet, but we’re… steps closer to that? We don’t know why he did this work. Sometimes people just get nerd sniped. This page is all about transparency, and Steve is this week’s designated Victim Of Transparency.

Will is working with Shaun on our platform team on a volumes project so awesome that we don’t want to spoil it yet. (Similarly: Somtochi is still working on the huge Corrosion project she was working on last week which is also such a big deal you won’t hear about it until it ships or fails).

Update: Jun 8, 2024

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • June 2: Network Outage in Bucharest (05:45EST): Our upstream provider had a network hardware outage, which took our OTP region offline for about 90 minutes.

  • June 3: Sporadic TLS Hangs From GitHub Actions (11:30EST): We spent about an hour diagnosing sporadic connection failures to Fly.io apps specifically from GitHub Actions. GitHub Actions run from VMs on Microsoft Azure. Something on the Azure network path causes repeated connections to reuse the same source port, which we think may have tripped a network flood control countermeasure. This should have been minimally (if at all) impactful to users, but it ate a bunch of infra time.

  • June 4: Single Host WireGuard Mesh Disruption (17:45EST): Depending on whether you ask Tom or not, either a bug in a script we use to decommission hosts or a bug in Consul resulted in two nodes in our WireGuard mesh being deleted, the intended host we were decommissioning and an extra-credit host we were not (the bug was unexpected prefix matching on a Consul KV path). This very briefly broke connectivity to the extra-credit host (low single-digit minutes). However, rather than restoring the backup WireGuard configuration we maintain, somebody (Tom) regenerated a WireGuard configuration, giving the victim host a new IPv4 address on our WireGuard mesh. This broke 6PN private networking on the host for about 20 minutes (for a small number of apps, whose operators we contacted).

  • June 5: Interruption In Machine Creation (11:45EST): A deployment picked up an unexpected change to our init binary, which broke boots for about 15 minutes for physical servers that got the init update.

  • June 6: Hardware Failure In IAD (22:00EST): A single machine in our old Equinix data center in IAD had an NVMe disk failure. Fly Machines without associated volumes were immediately migrated to our other, newer IAD data center deployment; over the course of several hours, Fly Machines with volumes attached were manually migrated. If you were affected, we’ve reached out directly. We’re in the process of decommissioning these hosts, in part because they have less-resilient disk configurations.
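The prefix-matching bug from the June 4 incident is a classic, and easy to reproduce with a plain dictionary standing in for a key-value store. The key paths below are hypothetical, not our real Consul layout, and `delete_by_prefix` only imitates what a recursive, prefix-based delete does.

```python
def delete_by_prefix(kv, prefix):
    """Delete every key starting with `prefix`, as a recursive delete would."""
    doomed = [key for key in kv if key.startswith(prefix)]
    for key in doomed:
        del kv[key]
    return sorted(doomed)


def node_prefix(node):
    """End the path with a separator so 'host-1' cannot match 'host-10'."""
    return f"wireguard/peers/{node}/"
```

Given keys for `host-1` and `host-10`, calling `delete_by_prefix(kv, "wireguard/peers/host-1")` removes both, because one name is a string prefix of the other. Passing `node_prefix("host-1")` removes only the intended host, since the trailing separator pins the match to a whole path component.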

This Week In Infra Engineering

This week’s series of small regional incidents kept the infra team hopping.

Apart from incident response, this week’s work looked a lot like the last week’s. Rather than break it out by person, we’ll just document the themes:

Physical host migration remains the biggest ticket item for the infra team. We’re pushing forward on decommissioning old Equinix data center deployments and moving them to newer, more resilient, more cost-effective hardware. The big obstacle we’re facing right now remains applications that may (sometimes surreptitiously) be saving and reusing literal 6PN IPv6 private networking addresses, rather than DNS names. Because 6PN addresses are bound to specific physical hardware, these apps may break when migrated, which isn’t acceptable. We’re doing lots of things, from careful manual migration of apps (like Fly Postgres) where we control the cluster, to alerting and eBPF-based fixes. We knocked out another dozen or two old physicals this week.

Better host status alerting is a big deal for us. We’re going to keep seeing regional and host-local outages, which is just the nature of running a large fleet of physical servers. We’re now doing email alerting for customers on impacted hosts, and have PRs in for displaying alerts whenever a user touches an app impacted by a host alert with flyctl to continue closing that loop.

Corrosion scalability and reliability work continues; Somtochi has some design changes that could further minimize the amount of state we have to share, which we’ll talk about more when they pay off. Corrosion is a super important service in our infrastructure (it’s the basis for our request routing) and reliability improvements since infra took it over have been a big win.

Update: Jun 1, 2024

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • May 31: Billing Issue With Upstash Redis (8:00EST): We’re in the middle of a transition from our old billing system to a new one based on Metronome. Billing is a gnarly problem. On Friday morning, someone called an incident after noticing a bug wherein a small number of Upstash Redis customers might have gotten double billed for something. We refunded them. This was an issue we detected internally, with no customer impact, but technically we called an incident for it and by the rules of this page we have to log it.

  • May 31: Network Filtering Breaks Flycast (13:00EST): As part of a project we’re running to do automatic authenticated connections between Fly Machines, our ProdSec team rolled out an nftables change. It was tested in our dev region, but had an unexpected interaction with our deployment tooling (something about the order in which tables are dropped and rebuilt). The net effect was that the fleetwide deployment broke FlyCast. Diagnosis and remediation took about 30 minutes.

This Week In Infra Engineering

Short week. Couple people out sick.

Kaz worked on getting Fly Machine creation success rates onto our status page, which you should see soon. The two most important things you can know about the Fly Machines API: “create” and “start” are two different operations (“start” is the fast one; you can pre-“create” a bunch of stopped machines and start them whenever you need them), and “create” can fail; for instance, you can ask for more resources than are available in the region you target. Read more about that here. We (well, Kaz, but we agree with him) want the success rates for this operation to be visible to customers.
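The create/start split is easier to see in a toy model. The `Region` class below is purely illustrative and is not the Machines API: it just captures that “create” reserves capacity and can fail, while “start” uses a reservation that already exists.

```python
class Region:
    """Toy model of a region with finite capacity (not the real Machines API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.machines = {}            # machine id -> "stopped" or "started"

    def create(self, machine_id):
        """Slow path: reserves resources, and can fail when the region is full."""
        if len(self.machines) >= self.capacity:
            raise RuntimeError("insufficient capacity in region")
        self.machines[machine_id] = "stopped"

    def start(self, machine_id):
        """Fast path: the resources were already reserved by create()."""
        if machine_id not in self.machines:
            raise KeyError(machine_id)
        self.machines[machine_id] = "started"
```

This is why pre-creating a pool of stopped Machines works: the operation that can fail happens ahead of time, and the operation on the hot path is the cheap one. It is also why the success rate of “create” specifically is the number worth publishing.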

Dusty and Simon spent the week heads-down on Postgres cluster migration. Read last week’s bulletin for more on that. We’re getting somewhere, but we’re not done until we can push a button and safely clear all the Machines of a physical server without having to worry too much about it.

Will won his next boss battle with NATS. We’ve successfully upgraded the whole fleet to current NATS (recall: the last attempt drove a terabit-scale message storm), on a custom branch with some of his fixes from last week. Metrics are down up to 90% across the board (a good thing) and problems we’ve been having with connection stability after network outages (inevitable at our scale!) seem to have resolved. Will’s writing a Fresh Produce release about this and we won’t steal any more of his thunder here.

Matt spent the week making log monitoring more resilient. “Logs” here mean “the platform feature we offer that ships logs off physical servers and to customers using NATS”. What Matt’s doing is, we run a Machine on every physical server in our fleet, the “debug app”, and it checks various things and freaks out and generates alerts when things go wrong. One more thing “debug” does now is track our server inventory, and make sure we’re getting NATS logs from all of them. In other words, another constantly-running, all-points end-to-end test of log shipping, from the vantage point of our customers.
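The inventory check boils down to a set comparison: which servers that should be sending logs have not been heard from recently? A minimal sketch follows, with the function name, the five-minute threshold, and the `last_seen` mapping all assumed for the example rather than taken from the debug app.

```python
def silent_hosts(inventory, last_seen, now, max_age_seconds=300):
    """Return hosts in the inventory with no recent log message.

    `last_seen` maps host name to the timestamp of its newest log line;
    hosts missing from the mapping count as never seen.
    """
    return sorted(
        host for host in inventory
        if now - last_seen.get(host, float("-inf")) > max_age_seconds
    )
```

Driving the check from the inventory, rather than from whichever hosts happen to be sending logs, is what catches the worst case: a server that has gone completely quiet.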

Tom is doing topological work on Corrosion. As we keep saying, we have “edge” servers and “worker” servers; the “edges” are much, much smaller than the “workers”, and we don’t want to tax them too much, so they can just do their thing terminating TLS and routing traffic. But that routing function depends on Corrosion, our gossip-based state tracking system, and Corrosion is expensive. One answer, which Tom is pursuing, is for (most) edges not to run it at all, but instead to be remote clients of it on other machines.

Dave (and Matt and Will and Simon) did a bunch of hiring work, including revamping our challenges and updating our internal processes for reviewing them. We should be much more responsive to infra candidates (we already were within tolerances, but we’re raising the bar for ourselves).

Update: May 25, 2024

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • May 23: Capacity Issues in FRA (7:00EST): FRA has of late become one of our busier regions, and we’ve been continuously adding edge capacity (recall: edge hosts take in traffic from the Internet and route it to worker hosts, which run Fly Machines for customers). We needed more edge capacity this morning, and added it. The annoyance here was compounded by telemetry issues: in our “current” configuration, capacity issues degrade the performance both of Corrosion, our internal gossip service catalog, and NATS, the messaging system we use to communicate load between proxies in our Anycast network. There’s a bunch of work happening around making those systems less sensitive to edge load.

So, yeah, pretty easy week, as far as the infra team is concerned.

This Week In Infra Engineering

The big news for the past several weeks has been intra-region Fly Machine migration: minimal-downtime migration of workloads, including large volumes, from one physical worker server to another. We hit a snag here: Fly Postgres wasn’t designed originally to be migrated, so many instances of it are intolerant to being moved and booted up on new 6PN IPv6 addresses. A bunch of work is happening to resolve this; we’ll certainly be migrating Fly Postgres instances in the near future, it’s just a question of “how”.

Simon is designing migration tooling work to make large-scale migration and host draining work for everything we can safely migrate, including single-node Postgres instances (do not run single-node Postgres instances in production! — but thanks for being easy to migrate).

Ben A. worked on migrating and draining workloads to balance workers. Fly Machines bill customers primarily for the time the Machine is actually running. When it’s stopped, a Machine is a commitment to some amount of resources on its associated worker, and a promise that we will start that Machine within some n-hundred millisecond time budget. This commitment/promise dance is drastically simpler and less expensive for us to honor now that we can migrate stopped Machines.

Now that Pet Semetary is up and running, Somtochi has switched up and is working with Sage on Corrosion, our Rust-based gossip statekeeping system. Corrosion is a (large) SQLite database managed by SWIM gossip updates. The work this week is primarily testing and bugfixing, but they may have figured out a way to reduce the size of our database by a factor of 3, which we’ll certainly write about next week if it pans out.

Dusty and Sage have been adding more edge hosts to keep up with capacity. Dusty also began trial migrations of Fly Postgres, using an IP mapping hack by Saleem, and built some internal dashboards to assist in the sort of manual host rebalancing work that Ben A. was doing this week.

Akshit cleaned up some log messages. Yawn. But also he graduated university! Congratulations to Akshit.

John rolled out a fleetwide fix for an interaction between Corrosion and our eBPF UDP forwarding path. You can run a fully-functional DNS server as a Fly Machine, because our Anycast network handles UDP as well as TCP. We do this by transparently encapsulating and routing UDP in the kernel using XDP and TC BPF. The routing logic for this scheme is written into BPF maps by a process (udpcatalogd) that subscribes to Corrosion. We decommissioned a large physical worker in AMS, which generated a big Corrosion update, which tickled a bug in a particular SQL query pattern only udpcatalogd uses, which caused ghost services for that AMS ex-worker to get stuck in our routing maps. There were a bunch of fixes for this, but the immediate thing that cleared the problem operationally involved… turning udpcatalogd off for a moment and then back on, fleetwide. Thanks, John! John also did a bunch of retrospective work on our learnings about Fly Postgres clusters. He’s also taken up residency on our public community site. Go say ‘hi’ to him (or complain about something that infra is involved in, and he’ll apparently show up.)
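
To illustrate why a restart can clear "ghost" entries, here is a sketch of a full reconciliation pass, where a routing table is rebuilt to match the current catalog instead of being patched by incremental updates. A plain dict stands in for the BPF map, and the function name, key shape, and addresses are assumptions for illustration; this is not how udpcatalogd is actually written:

```python
def reconcile(routing_map, desired):
    """Make routing_map match desired in place.

    Returns (stale keys removed, keys added or updated), both sorted.
    """
    # Anything in the map that the catalog no longer knows about is a ghost.
    stale = [key for key in routing_map if key not in desired]
    for key in stale:
        del routing_map[key]

    changed = []
    for key, target in desired.items():
        if routing_map.get(key) != target:
            routing_map[key] = target
            changed.append(key)
    return sorted(stale), sorted(changed)


# Keys are (service address, UDP port); values are the host serving them.
routing_map = {
    ("192.0.2.10", 53): "worker-ams-old",  # host has been decommissioned
    ("192.0.2.11", 53): "worker-fra-1",
}
desired = {
    ("192.0.2.11", 53): "worker-fra-1",
    ("192.0.2.12", 53): "worker-ams-new",
}
stale, changed = reconcile(routing_map, desired)
print(stale, changed)
```

A process that only applies deltas depends on every delete arriving and being handled correctly; a full pass like this one removes whatever the source of truth no longer lists.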

Will has been heads-down in NATS land for the past 2 weeks, after an attempted NATS upgrade briefly melted our network a bit over a week ago. Will has been chatting with the Synadia people about our topology, and, in the meantime, found two scaling issues in the NATS core code that drive excess system chatter in our current topology. He’s prepared a couple upstream PRs.

Steve spent some time this week building new features for Drift, our Elixir/Phoenix internal server hardware inventory tool. Drift now tracks server lifecycle (for things like decommissioning), and, where our upstreams support it, automatically adds new servers to our inventory.

Update: May 18, 2024

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • May 13: Metrics Outage (9:00EST): Every once in a while, on some of our AMD-based servers, with Linux IOMMU enabled, we see a weird lockup that forces us to hard-restart a host. Technically, we only need IOMMU stuff on hosts running GPU workloads, but right now we have it widely enabled. Anyways, that happened to an IAD host that ran part of our Metrics system, which meant that for about 20 minutes we had broken metrics ingestion while we rebooted that machine. You’d have seen a 20-30 minute gap in metrics on Grafana graphs.

  • May 16: Postgres Cluster Migration Failure (5PM EST): For the past several weeks we’ve been exercising machine+volume migration: the ability to move workloads, including storage, from one physical server to another by having the original server temporarily serve as a SAN server. For reasons having nothing to do with storage but rather the particulars of our IPv6 addressing scheme, migrations confused the repmgr process that manages Fly Postgres clusters. A limited number of Fly.io Postgres customers saw Postgres cluster outages after their underlying machines were migrated, and before we halted all migrations of Postgres clusters, over the course of about an hour.

  • May 8 (5:00 EST): Capacity Issues in DFW: We unexpectedly hit saturation on our “edge” servers (reminder: “edges” terminate HTTPS and serve our Anycast network, “workers” run VMs for Fly Machines), forcing us to quickly add additional edge servers in that region. This would be no big deal, but our Elixir web dashboards are served from this region, so for about 30 minutes we had degraded performance of our interface before we were able to add additional capacity.

This was an easier week than last week. The middle outage, migrating Postgres clusters, was very noticeable to impacted customers, but also quickly root-caused. The other two incidents were limited in scope. Unless you’re carefully watching Fly Metrics. Are you using Fly Metrics? An incident that broke them is a weird place to pitch them, we know, but they’re pretty neat and you get them for free. Hold our feet to the fire on them being reliable!

This Week In Infra Engineering

Dusty provisioned new hardware capacity in San Jose, Singapore, Warsaw, Sydney, Atlanta, and Seattle.

Will had a conversation with engineers at Synadia (last week’s NATS outage hit right during an all-hands meeting for them!) and got some advice on reconfiguring our internal NATS topology, shifting most of our hosts to “leaf” nodes and minimizing the number of “clustering” nodes we have per region; this should trade an imperceptible amount of latency (which doesn’t matter with our NATS use case) for drastically reduced chatter. Thanks, Synadians!

Akshit finished an upgrade to Firecracker 1.7 across our fleet. 1.7 does asynchronous block I/O with io_uring. We’ve noticed, since we rolled out Cloud Hypervisor for our GPU workloads (ask us about the security work we had to do here!) that Cloud Hypervisor was doing a better job handling busy disks than the version of Firecracker we were running. We’re optimistic that the new version will close the gap.

Steve finished up the provisioning tooling for the fou-tunnel-and-SNAT monstrosity that we talked about last week: giving Fly Machines static IP addresses, for people who talk to IP-restricted 3rd party APIs.

Tom and Ben A (ask us how many Bens work here!) completed the migration and draining of workloads from the cursed “edge worker” machines we mentioned last week. Edge-workers are no more. In the process, Tom debugged a bunch of draining tooling issues (being good at this is a big deal, because we’d like to be able to drain a sus server anywhere in the world at the drop of a hat), and Ben wrote up internal playbooks for draining hosts. Requiescat, Edge Workers, 2020-2024.

Simon continued low-level work on Machine/Volume migration, which is the platform kernel of the host draining stuff Tom and Ben were doing. This week’s work focused on large volume migration. Recall that our migration system causes the “source” physical server to temporarily serve as an ad-hoc SAN for the “target” physical, allowing us to “move” a Machine from one physical to another in seconds while the actual volume block clone happens in the background; Simon’s instrumentation work may have shaved about 10s off this process (about 1/3rd of the total time).

Andres got host alerting (notifying users of hardware issues with the specific hosts they’re using, both on their personal status page and directly via email) integrated with our internal support admin tool.

Somtochi rolled out the first iteration of Pet Semetary to our flyd orchestrator. We now have two (hardware-isolated) secret stores: Hashicorp Vault, and our internal Pet Semetary. The big thing here is, if we can’t read secrets, we can’t boot machines; now, if Vault has an availability issue, we “fall back” to Pet Semetary. Requiescat, Vault-related Outages, 2020-2024.
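
The fallback pattern itself is simple to sketch. The following is an illustration only: the exception type, function names, and the fetch callables are hypothetical stand-ins, and it assumes stores are tried in a fixed order, which may not match how flyd actually consults Vault and Pet Semetary:

```python
class SecretStoreUnavailable(Exception):
    """Raised by a store that cannot be reached (hypothetical error type)."""


def fetch_secrets(app, stores):
    """Try each (name, fetch) pair in order.

    Returns (store name, secrets) from the first store that answers, and
    raises SecretStoreUnavailable only if every store fails.
    """
    errors = []
    for name, fetch in stores:
        try:
            return name, fetch(app)
        except SecretStoreUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise SecretStoreUnavailable("; ".join(errors))


def vault_down(app):
    # Stand-in for a primary store that is having an availability issue.
    raise SecretStoreUnavailable("client certificate rejected")


def petsem_ok(app):
    # Stand-in for a healthy secondary store.
    return {"DATABASE_URL": "postgres://example.invalid/db"}


source, secrets = fetch_secrets("my-app", [("vault", vault_down), ("petsem", petsem_ok)])
print(source)  # petsem
```

With this shape, booting a Machine only fails when every store in the list is unavailable at once.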

Update: May 11, 2024

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • May 7: NATS Storm (5:30EST): Some components of our platform, most notably log-shipping, run on top of the NATS messaging system. We’ve been fighting with NATS reliability issues for the past several months, and one thing we’ve needed to do is upgrade the fleet NATS version; more recent NATS releases have a number of bug fixes. We did a staged deployment of 2.10; it looked fine; we rolled it out further; it generated a 1.7tb/s (that’s ‘t’, with a ‘t’) message storm. Server CPU (on a small number of servers) buckled long before the network did; some users would have seen increased CPU steal and degraded performance. Log shipping was totally disrupted for about an hour.

  • May 8: Vault Certificate Breakage (7AM EST): The primary backend for secret storage at Fly.io is currently Hashicorp Vault (which is great). When Fly Machines start up, flyd, our orchestrator, fetches secrets from Vault to merge into the configuration. Vault is locked down with mTLS across our fleet; you need a client cert to talk to it at all. Due to a leaf/intermediate certificate configuration issue (we’re not even going to attempt to explain it), client certs across our fleet were invalidated, preventing flyd from fetching secrets, which prevented Fly Machines from booting.

  • May 8 (5:30 EST): Registry Load Balancing in AMS: Every application deployed on Fly.io is shipped in Docker (OCI) containers, and most are stored in our own Docker registries. For the past 6 months, those registries have been geographically distributed using LiteFS, with an accelerated S3 storage backend. Under heavy deployment load (because of the time of day), deploys using the AMS registry began to sporadically time out. We investigated this with AWS, and with our upstream provider, and mitigated temporarily by forcing builds to other regions; the issue resolved itself (never good news) within an hour or so. It turned out to have been a side effect of a fly-proxy change that fixed a bug with large HTTP POST bodies.

A pretty straightforward week. The most painful incident was the Vault “outage”, in part because it happened on the eve of us cutting over to Pet Semetary, our Vault replacement; in our new post-petsem world, it’ll take an outage of both Vault and PetSem to disrupt deploys. The other two incidents were more limited in scope.

This Week In Infra Engineering

Dusty built out telemetry and monitoring for Fly Machine migration, in preparation for a regional migration of some Machines to a new upstream provider.

In addition to doing a cubic heckload of routine hiring work (do these updates sound fun? we’re hiring!), Matt and Tom revised one of our technical work sample tests, eliminating an inadvertent cheat code some candidates had discovered; a comprehensively broken environment we ask candidates to diagnose had a way to straightforwardly dump out the changes we had made to break it. Respect to those candidates for figuring that out, and helping us level up the challenge a bit.

Steve has had a fun week. He’s working on shipping (you heard it here first) static IP address assignments for individual Fly Machines. This means Fly Machines can make direct requests to the Internet (for instance, to internal on-prem APIs) with predictable IP addresses. The original plan was to run an IGP across our fleet, but Steve worked out a combination of fou tunnels and SNAT that keeps our routing discipline static while allowing addresses to float. It’s a neat trick.

Steve would also like us to tell you that he rebooted dev-pkt-dc10-9b7e.

Ben built out tooling for host draining. Last week we talked about Simon’s work shipping inter-server volume migrations. Now that we can straightforwardly move workloads between physicals, storage and all, we can rebuild the “drain” feature we had with Hashicorp Nomad back in 2020 (before we had storage), which means that when servers get janky (inevitable at our scale), or things need to be rebalanced, we can straightforwardly move all the Fly Machines to new physical homes, with minimal downtime. There are a lot of corner cases to this (for instance: not all the volumes on a physical are necessarily attached to Machines), so this is a tooling-intensive problem.

Andres and Kaz re-established telemetry, metrics, and alerting on our Rails API, after an incident last week - it didn’t directly impact deploys, but would have made incidents involving API server problems, which are not unheard of, harder to detect and more difficult to resolve.

Kaz worked on fly-proxy-initiated Fly Machine migration. True fact: you can start a Fly Machine with an HTTP request; if a request is routed to a Fly Machine in stopped state, it’ll start. Kaz is working towards automatic migration of Machines from hosts that are overloaded (i.e., exceeding our internal utilization thresholds): instead of starting on a busy machine, we can initiate a migration to a less-loaded machine. Recall that the core idea of our migration system is temporary SAN-style connections: a Machine can boot up on a new physical long before its entire volume has been copied over. Automatic migration isn’t happening yet, but it’s getting closer.
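
The decision described here can be sketched as a small function. Everything below is assumed for illustration: the function name, a single utilization number per host, the 0.8 threshold, and the "least-loaded host wins" rule are not taken from fly-proxy:

```python
def placement_for_request(machine_host, host_utilization, threshold=0.8):
    """Decide where a stopped Machine should start when a request arrives for it.

    Returns ("start", host) to start in place, or ("migrate", host) to move first.
    """
    if host_utilization[machine_host] <= threshold:
        return "start", machine_host

    # Current host is over the threshold: look for the least-loaded alternative.
    candidates = {
        host: load
        for host, load in host_utilization.items()
        if host != machine_host and load <= threshold
    }
    if not candidates:
        return "start", machine_host  # nowhere better to go
    return "migrate", min(candidates, key=candidates.get)


utilization = {"worker-1": 0.95, "worker-2": 0.40, "worker-3": 0.60}  # made-up numbers
print(placement_for_request("worker-1", utilization))  # ('migrate', 'worker-2')
print(placement_for_request("worker-3", utilization))  # ('start', 'worker-3')
```

Because a migrated Machine can boot before its volume has finished copying, the "migrate" branch does not have to wait for a full clone before serving the request.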

Akshit worked on cloud-hypervisor integration with our flyctl developer experience. cloud-hypervisor is like Firecracker except Intel ships it instead of AWS (they are both memory-safe Rust KVM hypervisors with minimal footprints; they even share a bunch of crates). We use cloud-hypervisor for GPU machines because it supports VFIO IOMMU device passthrough (ask us about the security work we did here, please). Operating cloud-hypervisor is similar enough to Firecracker that it’s almost a drop-in, but we’re still smoothing out the differences so they feel indistinguishable to users.

Tom and John are decommissioning our old, cursed “edge workers”. We run mainly two kinds of servers: edges that take traffic from the Internet and feed it into our proxy network, and workers that run actual Fly Machines. For historical reasons (those being: the founders made annoying decisions) we have on one of our upstreams a bunch of dual-role machines. Not for long. You may not like it, but this is what peak performance looks like:

root@edge-nac-fra1-558f: ~
$ danger-host-self-destruct-i-want-pain
!!!!!DANGER!!!!!  _____          _   _  _____ ______ _____   !!!!!DANGER!!!!!
!!!!!DANGER!!!!! |  __ \   /\   | \ | |/ ____|  ____|  __ \  !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |  | | /  \  |  \| | |  __| |__  | |__) | !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |  | |/ /\ \ | . ` | | |_ |  __| |  _  /  !!!!!DANGER!!!!!
!!!!!DANGER!!!!! | |__| / ____ \| |\  | |__| | |____| | \ \  !!!!!DANGER!!!!!
!!!!!DANGER!!!!! |_____/_/    \_\_| \_|\_____|______|_|  \_\ !!!!!DANGER!!!!!

This script will TOTALLY DECOMMISSION and DESTROY this host and REMOVE IT
PERMANENTLY from the Fly.io fleet.
To proceed, enter the hostname: edge-nac-fra1-558f

Correct, this host is edge-nac-fra1-558f.

To proceed, repeat verbatim "Yes, IRREVERSIBLY decommission"
-> Yes, IRREVERSIBLY decommission
This is your LAST CHANCE. Press ENTER to run away to safety. Press '4' to begin.

Migration is a theme of this bulletin; like we said last week, it has been kind of our “white whale”.

We have not forgotten last week’s promise to publish Matt’s incident handling process documents, but Matt wants to clean them up a bit more. We’ll keep mentioning it in updates until Matt lets us release them.

This is a small fraction of our infra team! These are just highlights; things that stuck out to us at the end of the week.

Update: May 5, 2024

This is a new thing we’re doing to surface the work our infra team does. We’re trying to accomplish two things here: 100% fidelity reporting of internal incidents, regardless of how impactful they are, and a weekly highlights reel of project work by infra team members. We’ll be posting these once a week, and bear with us while we work out the format and tone.

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact).

  • Apr 24: WireGuard Mesh Outage (12:30EST): All physical hosts on Fly.io are linked with a global full mesh of WireGuard connections. The control plane for this mesh is managed with Consul KV. A bad Consul input broke Consul watches across the fleet, disrupting our internal network. Brief but severe impact to fleetwide request routing; longer sporadic impact to logging, metrics, and API components, which found new ways to be susceptible to network outages.
  • Apr 25: Regional API Errors (2AM EST): We saw an uptick in 500 responses and found Machines API servers (flaps) in some regions had cached an unexpected nil during the previous WireGuard outage, which generated exceptions on some requests. Sporadic disruption to deployments in impacted regions.
  • Apr 25 (9:15 EST): Concurrency Issue With Logging: Under heavy load, a middleware component in our Rails/Ruby API server unsafely accessed an instance variable, corrupting logs. A small number of customers experienced some log disruption.
  • Apr 26 (5:30 EST): WireGuard Mesh Outage II: Automation code built and deployed to mitigate/prevent the WireGuard incident that occurred on Apr 24 exhibited a bug that effectively broke the WireGuard Mesh again, with the same impact and severity: brief but severe impact to request routing, longer sporadic impact to logging, metrics, and API.
  • Apr 30 (5:30 EST): Upstream Data Center Power Outage in SJC: One of our 2 data center deployments in SJC experienced a total loss of power, taking more than half our SJC workers (17 of 27) offline for about 30 minutes.
  • Apr 30 (2AM EST): Docker Registry Resource Exhaustion: The Docker registries that host customer containers are themselves Fly Machines applications, with their own resource constraints. A Registry machine unexpectedly reached a storage limit, disrupting deploys that pushed to that registry for about 20 minutes.
  • May 1 (8AM EST): Token Service Deployment Failure: Fly.io’s Macaroon tokens, which authenticate API calls for people with mandatory SSO enabled, or with special-purpose deploy tokens, are served by the tkdb service, which runs on isolated hardware in 3 regions, replicated with LiteFS. A bad deploy to the ams region’s tkdb jammed up LiteFS, disrupting deploys for SSO users.
  • May 2 (4:30 EST): Internal Metrics Failure for API Server: A Prometheus snag caused some instances of our API server to stop reporting metrics. This event had no customer impact (beyond tying up some infra engineers for an hour).
  • May 4 (4AM EST): Excessive Load In BOS: Network maintenance at our upstream provider for BOS broke connectivity between physical hosts, which in turn caused excessive queuing in our telemetry systems, which in turn drove up load. Brief performance degradation in BOS was resolved manually.

This was a difficult time interval, dominated by a pair of first-of-their-kind outages in the control plane for our global WireGuard mesh, which subjected us to several days of involuntary chaos testing, followed by a surprisingly long upstream power loss in one of our regions. “Incidents” for infra engineering occur somewhat routinely; these were atypically impactful to customers.

This Week In Infra Engineering

Somtochi completed an initial integration between flyd, our orchestrator, and Pet Semetary, our internal replacement for Hashicorp Vault. Fly Machines now read from both secret stores when they’re scheduled. This is the first phase of real-world deployment for Pet Semetary. Because Vault relies on a centralized Raft cluster with global client connections, and because secrets reads have to work in order to schedule Fly Machines, it has historically been a source of instability (though not within the last few months, after we drastically increased the resources we allocate to it). Pet Semetary has a much simpler data model, relying on LiteFS for leader/replica distribution, and is easier to operate. Somtochi’s work makes deployments significantly more resilient.

Simon got Fly Machine inter-server volume migrations working reliably, the payoff of a months-long project that is one of the “white whales” of our platform engineering. Volumes attached to Fly Machines are locally-attached NVMe storage; Fly Machines without Volumes can be trivially moved from one server to another, but Volumes historically could not be without an uptime-sapping snapshot restore. The new migration system exploits dm-clone, which effectively creates temporary SAN connections between our physical servers to allow Fly Machines to boot on one physical while reading from a Volume on another physical while the Volume is cloned. Simon’s work allows us to drain workloads from sus physical machines, and to rebalance workloads within regions.
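
To make the "boot now, copy in the background" idea concrete, here is a toy model in plain Python. It is not the kernel's dm-clone implementation, the class and method names are made up, and it simplifies by assuming every write covers a whole region:

```python
class CloneDevice:
    """Toy clone target: reads fall through to the source until a region is copied."""

    def __init__(self, source_regions):
        self.source = source_regions              # remote volume (read-only here)
        self.dest = [None] * len(source_regions)  # local volume, initially empty
        self.hydrated = [False] * len(source_regions)

    def read(self, region):
        # Serve locally once the region has been copied; otherwise read remotely.
        return self.dest[region] if self.hydrated[region] else self.source[region]

    def write(self, region, data):
        # A whole-region write lands locally and makes the remote copy irrelevant.
        self.dest[region] = data
        self.hydrated[region] = True

    def hydrate_next(self):
        """Copy one not-yet-hydrated region; return False when nothing is left."""
        for region, done in enumerate(self.hydrated):
            if not done:
                self.dest[region] = self.source[region]
                self.hydrated[region] = True
                return True
        return False


dev = CloneDevice(["a0", "b0", "c0"])
dev.write(1, "b1")                        # the Machine is already running and writing
before = [dev.read(i) for i in range(3)]  # regions 0 and 2 are still read remotely
while dev.hydrate_next():                 # background copy finishes later
    pass
after = [dev.read(i) for i in range(3)]
print(before, after, all(dev.hydrated))
```

The workload sees the same data before and after hydration completes; what changes is where each read is served from, and that is what lets the Machine start on the new host long before the copy is done.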

Andres built new internal tooling for host-specific customer alerts. At the scale we’re operating at, host failures are increasingly common; more hosts, more surface area for cosmic rays to hit. These issues generally impact only the small subset of customers deployed on that hardware, so we report them out in “personal status pages”. But we’re a CLI-first platform, and many of our customers don’t use our Dashboard. Andres has rolled out preemptive email notification, so customers get direct notification.

Dusty beefed up metrics and alerting around “stopped” Fly Machines. The premise of the Machines platform is that Machines are reasonably fast to create, but ultra-fast to start and stop: you can create a pool of Machines and keep them on standby, ready to start to field specific requests. Making this work reliably requires us to carefully monitor physical host capacity, so that we’re always ready to boot up a stopped Fly Machine. This is a capacity planning issue unique to our platform.
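
One way to picture the check: treat each stopped Machine as a reservation, and flag hosts where running plus stopped reservations exceed what the host has. This sketch assumes memory is the only resource and uses a made-up data shape and made-up numbers; the real accounting is certainly more involved:

```python
def hosts_at_risk(hosts):
    """Return (host, shortfall_mb) for hosts that could not start every stopped Machine."""
    at_risk = []
    for name, host in sorted(hosts.items()):
        committed = sum(host["running_mb"]) + sum(host["stopped_mb"])
        if committed > host["memory_mb"]:
            at_risk.append((name, committed - host["memory_mb"]))
    return at_risk


hosts = {
    "worker-1": {"memory_mb": 4096, "running_mb": [1024, 1024], "stopped_mb": [512, 512]},
    "worker-2": {"memory_mb": 4096, "running_mb": [2048, 1024], "stopped_mb": [1024, 512]},
}
print(hosts_at_risk(hosts))  # [('worker-2', 512)]
```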

Will continued our ongoing project to move all of Fly Metrics off of special-purpose hosts on OVH, which hosts have been flappy over the years, and onto Fly Machines running on our own platform. Metrics consumes an eye-popping amount of storage, and Will spent the week adding storage nodes to our new Fly Machine metrics cluster.

Matt capped off the week by, appropriately enough, fleshing out our incident response and review process documentation. We could say more, but what we’ll probably do instead is just make them public next week.