Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
July 1: Consul Template Tooling (Internal) (23:00EST): Following a system update earlier in the day, `consul-templaterb` began exploding on a small number of our edge servers. It's been over a year since we used Consul to track user application state, so this tooling isn't in the critical path for user applications; in other words, this incident had no customer impact. It turned out to be an incompatible system Ruby configuration (our software update zapped a Ruby gem `consul-templaterb` depended on).

July 2: Poor Network Performance for Tigris in IAD (4:00EST): Tigris is our object storage partner and you should definitely check them out. At 4AM on Tuesday, they reported that they were seeing slow downloads from east coast regions, especially `IAD`. This turned out to be an upstream networking issue, resolved roughly 2 hours later when a transit provider adjusted routes.
July 2: Connectivity Loss in IAD (15:00EST): A BGP change at an upstream provider broke connectivity to our `IAD` data center for several minutes; this was unrelated to the previous incident, but much more severe (and thankfully brief).
July 3: Hardware Failure Breaks Upstash Redis in IAD (12:00EST): Upstash is our Redis partner and you should definitely check them out. Upstash runs distributed clusters of Redis servers in each of our regions. A quorum of the Fly Machines running their `IAD` cluster were scheduled onto a single server, which, months later, blew up. The server was recovered several hours later, and during the interval the Upstash cluster was rebuilt with a different Fly Machine on a different `IAD` server. This problem impacted only the `IAD` region, but `IAD` is an important region.
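The failure mode here is a placement invariant being violated: no single physical host should hold enough of a cluster's Machines to form a quorum on its own. We don't have visibility into how Fly's scheduler actually expresses anti-affinity, so this is just a minimal Go sketch of checking that invariant directly, against a hypothetical Machine-to-host mapping:

```go
package main

import "fmt"

// quorumSafe reports whether any single host holds enough members of a
// cluster to form a quorum on its own — the placement invariant that was
// violated in the Upstash incident. (Hypothetical sketch; Fly's real
// scheduler constraints are not public.)
func quorumSafe(hostByMachine map[string]string) error {
	total := len(hostByMachine)
	quorum := total/2 + 1

	perHost := make(map[string]int)
	for _, host := range hostByMachine {
		perHost[host]++
	}
	for host, n := range perHost {
		if n >= quorum {
			return fmt.Errorf("host %s holds %d of %d members (quorum is %d)", host, n, total, quorum)
		}
	}
	return nil
}

func main() {
	// Three Redis Machines that all landed on the same physical server.
	placement := map[string]string{
		"machine-a": "host-1",
		"machine-b": "host-1",
		"machine-c": "host-1",
	}
	if err := quorumSafe(placement); err != nil {
		fmt.Println("unsafe placement:", err) // fires: host-1 holds all 3
	}
}
```

The useful part is running a check like this continuously rather than once at deploy time, since (as this incident shows) a placement that starts out safe can converge onto a single host over months of rescheduling.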
July 4: LiteFS Cloud RAFT Cluster Failure in IAD (20:00EST): LiteFS Cloud is a managed LiteFS service we run for our customers. Our internal LiteFS clusters run a RAFT quorum scheme for leader election and cluster tracking. An open-files rlimit configuration bug forced a node in the `lfsc-iad-1` cluster to restart, which in turn tickled a bug in `dragonboat`, the Golang RAFT library the service uses, which in turn forced us to rebuild the cluster. This incident had marginal customer impact and maximal Ben Johnson impact.
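On the rlimit side, one defensive pattern is to check and raise `RLIMIT_NOFILE` at process startup and refuse to boot if the limit is still too low, instead of letting a Raft node fall over later under load. A minimal Linux-only sketch, assuming a made-up `minOpenFiles` floor (the actual misconfigured limit on `lfsc-iad-1` isn't something we know):

```go
package main

import (
	"fmt"
	"log"
	"syscall"
)

// minOpenFiles is a hypothetical floor; the real number a Raft node needs
// depends on peers, snapshots, and client connections.
const minOpenFiles = 65536

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		log.Fatalf("getrlimit: %v", err)
	}
	// Raise the soft limit to the hard limit; going past Max needs privilege.
	if lim.Cur < lim.Max {
		lim.Cur = lim.Max
		if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
			log.Fatalf("setrlimit: %v", err)
		}
	}
	// Refuse to start undersized rather than crash mid-flight later.
	if lim.Cur < minOpenFiles {
		log.Fatalf("RLIMIT_NOFILE is %d, need at least %d", lim.Cur, minOpenFiles)
	}
	fmt.Printf("open files limit ok: %d\n", lim.Cur)
}
```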
July 5: Elevated Machine Creation Alerts (1:00EST): Our infra team was alerted about elevated errors from the Fly Machines API. A different internal team had created a Fly Kubernetes cluster with an invalid name. Not a real incident, no customer impact, but we document everything here; that's the rule.
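The cheap fix for this class of noise is validating names where they're created, so a bad name is rejected immediately rather than surfacing as elevated Machines API errors. We don't know the exact naming rules Fly Kubernetes enforces; the sketch below assumes RFC 1123 DNS labels, the convention Kubernetes itself uses for object names:

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsLabel matches RFC 1123 labels: lowercase alphanumerics and hyphens,
// starting and ending with an alphanumeric, at most 63 characters.
var dnsLabel = regexp.MustCompile(`^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$`)

// validateClusterName is a hypothetical pre-flight check; the actual rules
// Fly Kubernetes enforces aren't spelled out in the incident note.
func validateClusterName(name string) error {
	if !dnsLabel.MatchString(name) {
		return fmt.Errorf("invalid cluster name %q: must be an RFC 1123 DNS label", name)
	}
	return nil
}

func main() {
	for _, name := range []string{"prod-cluster", "Bad_Name!"} {
		if err := validateClusterName(name); err != nil {
			fmt.Println(err)
		} else {
			fmt.Println(name, "ok")
		}
	}
}
```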
This Week In Infra Engineering
The 4th of July hit on a Thursday this year, making this an extended holiday weekend for a big chunk of our team.
The big stories this week are mostly the same as last week. We continued deploying and ironing out bugs in Corrosion record compaction, we migrated off a bunch of old physical servers and continued building out migration tooling to make it even easier to drain workloads from arbitrary servers, and we improved incident alerting for customers in our UI and in `flyctl`.
The most important work that happened in this abbreviated week was all internal process stuff: we roadmapped out the next 12 weeks of infra work for networking, block storage, observability, hardware provisioning, and Corrosion. Lots of new projects are kicking off, which we'll be talking about in upcoming posts.