A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and is a superset both of our status page events and of customer-impacting events on the platform. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact.)
We experienced a significant outage on Sunday the 16th, just missing the cutoff for this infra-log update. We'll have a postmortem for it in the next one. TL;DR: a primary-region data center experienced a hardware failure, far upstream of us, that cut the whole data center off.
February 10: flyctl ssh console Breakage (16:30EST): Customers report that they're unable to use flyctl ssh console to log into newly-created Fly Machines. After 2 hours of investigation, it turns out that Saleem, in fixing bugs in our guest-resident SSH server hallpass, broke bug compatibility with flyctl.
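For context, this is roughly the workflow that was breaking; the app name and image below are hypothetical stand-ins, not details from the incident:

    # Create a fresh Machine, then try to shell into it.
    # "my-app" and the nginx image are placeholders.
    flyctl machine run nginx --app my-app
    flyctl ssh console --app my-app
    # During the incident, the second command failed against newly-created
    # Machines, whose guest-side hallpass was no longer bug-compatible with flyctl.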
February 10: Depot Builder Outage (18:30EST): Internal users report that Docker container builders using Depot (which are the platform default) are failing; metrics and traces confirm widespread Depot builder failures. Depot is a 3rd-party firm that manages Docker builds for our users, because they are better at it than we are. Before Depot existed, we ran our own builders, and those still work fine, so when Depot hiccups we have a workaround: revert to our own builders. That workaround was published to our status page during the 45 minutes or so in which Depot experienced a database issue that disrupted their service.
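For reference, the builder fallback looks something like the sketch below. The --depot flag is, to our understanding, how flyctl chooses between Depot and our own remote builders; treat the exact flag as an assumption and check flyctl deploy --help on your version.

    # Deploy once using Fly-managed remote builders instead of Depot.
    # (Flag availability depends on your flyctl release.)
    flyctl deploy --depot=false

    # Or sidestep remote builders entirely and build with a local Docker daemon.
    flyctl deploy --local-only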
February 11: Internal Log Indexing Cluster Failure (02:00EST): For about an hour, our internal OpenSearch cluster falls into a degraded state; something about a shard, field limits on incoming Vector logs, the eternal battle between the Mystics and the Skeksis, and the Java runtime. Somebody does something, or hits something like Fonzie with the jukebox, and the cluster recovers. No known customer impact, but we infra-log everything here.
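For the curious: "field limits" here is the classic mapping-explosion failure mode, where an index hits its total-fields cap and starts rejecting documents. A minimal sketch of checking and raising that cap, with a placeholder index name and value rather than our actual settings:

    # Inspect current settings on a hypothetical log index.
    curl -s "$OPENSEARCH_URL/my-logs-index/_settings" | jq .

    # Raise the per-index field limit; 2000 is a placeholder, not our value.
    curl -s -X PUT "$OPENSEARCH_URL/my-logs-index/_settings" \
      -H 'Content-Type: application/json' \
      -d '{"index.mapping.total_fields.limit": 2000}'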
February 12: Network Disruption in JNB (14:30EST): Upstream network connectivity in our JNB region is flappy for about 45 minutes. While our upstream resolves the issue, the region loses connectivity to our HashiCorp Vault cluster, causing us to trigger an internal incident. That incident is largely mitigated by Pet Sematary, our internal Vault replacement, which runs side-by-side with Vault. Customers may have experienced some network instability with requests to and from JNB (but probably not platform instability).
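As an aside on the Vault piece: loss of connectivity there is easy to confirm, because clients can't even reach the health endpoint. A minimal sketch of that check, with a placeholder Vault address rather than our real one:

    # Probe a (hypothetical) Vault cluster from inside the region.
    export VAULT_ADDR=https://vault.internal.example:8200
    vault status                                # CLI view of seal/HA status
    curl -s "$VAULT_ADDR/v1/sys/health" | jq .  # same information via the HTTP API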
February 13: Network Disruption in BOG (08:00EST): The second in a series of about 5 very spicy days for one of our most important upstreams: an unannounced maintenance window occurs in BOG, apparently in response to an unexpected hardware failure, and the upstream manages to break our LACP groups in the process, turning what they expected to be a brief blip in connectivity into a 3-hour total outage for the region. Our experience of that outage is a complete loss of connectivity to our rack; their experience is discovering, several hours into the outage, that the problem is not our line cards but rather a routing error on their end.
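"LACP groups" here are the link-aggregation bonds between our rack and the upstream's gear; when the far end breaks them, the host side of that looks roughly like this. A sketch assuming Linux bonding and a hypothetical bond0 interface:

    # Inspect LACP/bonding state on a host (interface name is hypothetical).
    cat /proc/net/bonding/bond0
    # Confirm whether member links are still up and aggregating.
    ip -d link show bond0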
This Week In Engineering
We have an excuse every week for not updating this, don't we? There are two blog posts coming this week about infra work, but we're also doing investigative work with our upstream on the outages they've experienced; between that and a lot of after-hours on-call work, we're going to cut the infra team some slack this week, too.