2024-06-22

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • June 17: Southeast Asia Connectivity (15:00EST): We saw high packet loss and interface flapping on edge servers in Singapore. While we investigated, we stopped announcing SIN Anycast routes, redirecting SIN traffic to other nearby regions; the problem resolved roughly an hour later. Customer impact would have been minimal: the event didn’t affect worker servers, but for its duration we would have had somewhat worse Anycast routes.

  • June 17: Internal App Deployment Failure That Turned Out To Be Nothing (20:30EST): An infra team member made a change to our API server and later deployed it; the deployment took upwards of an hour, and, in a “where there’s smoke there’s fire” move, they called an incident. The cause: a typo in the code they were deploying, along with a bug in our API server that made it exit with a non-failure status. No customer impact.

  • June 18: Volume Capacity in Brazil (10:00EST): The platform began reporting a lack of available space for new volumes in our GRU region. We were not in fact low on available volume space; rather, a change we pushed out to Corrosion, our internal state-sharing system, had a SQL bug that mis-sorted worker servers (on a condition that only occurred in GRU). We had a workaround published within 15 minutes (you could restart your “builder” Machine, the thing we run to build containers for you, and dodge the problem), and a sitewide fix within 90 minutes.

  • June 21: Midsummer Night’s Billing Outage (3:00EST): For about 2 hours, an upstream billing provider had an outage, which in turn broke some of our invoice reporting features; notably, if you had been issued a credit and first tried to redeem it during the outage, it would not have shown up (you wouldn’t have lost the credit, but you couldn’t have used it at 3:00EST).

This Week In Infra Engineering

Intra-region host migrations are unblocked again! This is huge for us.

Peter worked with our upstream providers to eliminate pathological AS-path routes impacted by recent APAC undersea cable cuts. This work started with us noticing relatively high packet loss in Asia regions, and resulted in drastically fewer timeouts in our own telemetry and tooling, and better network quality for users. A very big win that we’re looking to compound with better monitoring and tooling. He also figured out a configuration bug that was causing Fly Machines not to use BBR congestion control on private networking traffic, which is now fixed.
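For the curious, here is a minimal sketch of how you might check which congestion control algorithm a TCP socket is using. It assumes Linux, where Python exposes the `TCP_CONGESTION` socket option; the helper name `tcp_congestion_control` is ours, made up for illustration, and this is not the tooling Peter used.

```python
import socket
from typing import Optional


def tcp_congestion_control(sock: socket.socket) -> Optional[str]:
    """Return the congestion control algorithm name for a TCP socket,
    or None if it cannot be read on this platform."""
    # TCP_CONGESTION is only defined where the OS supports it (e.g. Linux).
    opt = getattr(socket, "TCP_CONGESTION", None)
    if opt is None:
        return None
    try:
        # The kernel returns the algorithm name as a NUL-padded byte string.
        raw = sock.getsockopt(socket.IPPROTO_TCP, opt, 16)
    except OSError:
        return None
    return raw.split(b"\x00", 1)[0].decode("ascii")


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(tcp_congestion_control(s))  # e.g. "cubic" or "bbr", depending on the host
```

A new socket reports the system default, so a check like this only tells you what a given host hands out by default, not what a particular existing connection negotiated.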

Dusty and Matt got all our multi-node Postgres clusters in condition to migrate (recall: multi-node Postgres clusters had been problematic for us, because they were configured to use literal IPv6 addresses for their peer configurations, and migration breaks those addresses, which embed routing information).
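To see why literal addresses are a problem, here is a small sketch. The bit layout below is hypothetical, invented purely for illustration (this bulletin doesn’t describe the real 6PN layout); the only thing it borrows from the text above is the idea that part of the address encodes where the Machine lives.

```python
import ipaddress

# HYPOTHETICAL layout, for illustration only (not the real 6PN format):
#   top 16 bits  : fixed prefix
#   next 32 bits : network ID
#   next 32 bits : host ID   <- the embedded routing information
#   low 48 bits  : machine ID
PREFIX = 0xFDAA


def make_addr(network_id: int, host_id: int, machine_id: int) -> ipaddress.IPv6Address:
    """Pack the three fields into a single IPv6 address."""
    value = (PREFIX << 112) | (network_id << 80) | (host_id << 48) | machine_id
    return ipaddress.IPv6Address(value)


def host_of(addr: ipaddress.IPv6Address) -> int:
    """Extract the host ID field from an address built by make_addr."""
    return (int(addr) >> 48) & 0xFFFFFFFF


# The same Machine, before and after moving from host 0xA to host 0xB.
before = make_addr(network_id=1, host_id=0xA, machine_id=0x1234)
after = make_addr(network_id=1, host_id=0xB, machine_id=0x1234)

# A peer config that hardcoded the old literal address no longer matches.
peer_config = [str(before)]
print(str(after) in peer_config)  # False: the peer still points at the old host
```

Under this assumed layout, the Machine’s identity is unchanged but its address isn’t, so anything that stored the literal address ends up pointing at a host the Machine no longer runs on.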

In addition to spending 30 working hours getting a single email announcement (about migrations) out to customers, John, along with Ben, shipped our 6PN address forwarding tooling out to the fleet, making it possible to migrate clusters that refer to literal IPv6 addresses. Dusty, Peter, John and Matt began draining hosts, moving the Machines running on them to more stable, modern, resilient systems on better upstreams, and lining us up to decom the much older machines. Ben drained an old server live during our internal Town Hall meeting. It was an emotional moment.

Still a bunch of people out this week! It’s summer (for most of us)!