A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and is a superset both of our status page events and of customer-impacting events on the platform. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.
Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact.)
We experienced a significant outage on Sunday the 16th, just missing the cutoff for this infra-log update. We'll have a postmortem for it in the next one. TL;DR: a primary-region data center experienced a hardware failure, far upstream of us, that cut the whole data center off.
February 10: flyctl ssh console Breakage (16:30EST): Customers report that they're unable to use flyctl ssh console to log into newly-created Fly Machines. After 2 hours of investigation, it turns out that Saleem, in fixing bugs in our guest-resident SSH server hallpass, broke bug compatibility with flyctl.
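For context, this is roughly the workflow that was breaking; the app name and image below are hypothetical stand-ins, not details from the incident:

    # Create a fresh Machine, then try to shell into it.
    # "my-app" and the nginx image are placeholders.
    flyctl machine run nginx --app my-app
    flyctl ssh console --app my-app
    # During the incident, the second command failed against newly-created
    # Machines, whose guest-side hallpass was no longer bug-compatible with flyctl.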
February 10: Depot Builder Outage (18:30EST): Internal users report that Docker container builders using Depot (which are the platform default) are failing; metrics and traces confirm widespread Depot builder failures. Depot is a 3rd-party firm that manages Docker builds for our users, because they are better at it than we are. Before Depot existed, we ran our own builders, and those still work fine, so when Depot hiccups we have a workaround: revert to our own builders. That workaround was published to our status page during the 45 minutes or so in which Depot experienced a database issue that disrupted their service.
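For reference, the builder fallback looks something like the sketch below. The --depot flag is, to our understanding, how flyctl chooses between Depot and our own remote builders; treat the exact flag as an assumption and check flyctl deploy --help on your version.

    # Deploy once using Fly-managed remote builders instead of Depot.
    # (Flag availability depends on your flyctl release.)
    flyctl deploy --depot=false

    # Or sidestep remote builders entirely and build with a local Docker daemon.
    flyctl deploy --local-only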
February 11: Internal Log Indexing Cluster Failure (02:00EST): For about an hour, our internal OpenSearch cluster falls into a degraded state; something about a shard, field limits on incoming Vector logs, the eternal battle between the Mystics and the Skeksis, and the Java runtime. Somebody does something, or hits something like Fonzie with the jukebox, and the cluster recovers. No known customer impact, but we infra-log everything here.
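For the curious: "field limits" here is the classic mapping-explosion failure mode, where an index hits its total-fields cap and starts rejecting documents. A minimal sketch of checking and raising that cap, with a placeholder index name and value rather than our actual settings:

    # Inspect current settings on a hypothetical log index.
    curl -s "$OPENSEARCH_URL/my-logs-index/_settings" | jq .

    # Raise the per-index field limit; 2000 is a placeholder, not our value.
    curl -s -X PUT "$OPENSEARCH_URL/my-logs-index/_settings" \
      -H 'Content-Type: application/json' \
      -d '{"index.mapping.total_fields.limit": 2000}'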
February 12: Network Disruption in JNB (14:30EST): Upstream network connectivity in our JNB region is flappy for about 45 minutes. While our upstream resolves the issue, the region loses connectivity to our HashiCorp Vault cluster, causing us to trigger an internal incident. That incident is largely mitigated by Pet Sematary, our internal Vault replacement, which runs side-by-side with Vault. Customers may have experienced some network instability with requests to and from JNB (but probably not platform instability).
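As an aside on the Vault piece: loss of connectivity there is easy to confirm, because clients can't even reach the health endpoint. A minimal sketch of that check, with a placeholder Vault address rather than our real one:

    # Probe a (hypothetical) Vault cluster from inside the region.
    export VAULT_ADDR=https://vault.internal.example:8200
    vault status                                # CLI view of seal/HA status
    curl -s "$VAULT_ADDR/v1/sys/health" | jq .  # same information via the HTTP API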
February 13: Network Disruption in BOG (08:00EST): The second in a series of about 5 very spicy days for one of our most important upstreams: an unannounced maintenance window occurs in BOG, apparently in response to an unexpected hardware failure, and the upstream manages to break our LACP groups in the process, turning what they expected to be a brief blip in connectivity into a 3-hour total outage for the region. Our experience of that outage is a complete loss of connectivity to our rack; their experience is discovering, several hours into the outage, that the problem is not our line cards but rather a routing error on their end.
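"LACP groups" here are the link-aggregation bonds between our rack and the upstream's gear; when the far end breaks them, the host side of that looks roughly like this. A sketch assuming Linux bonding and a hypothetical bond0 interface:

    # Inspect LACP/bonding state on a host (interface name is hypothetical).
    cat /proc/net/bonding/bond0
    # Confirm whether member links are still up and aggregating.
    ip -d link show bond0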
This Week In Engineering
We have an excuse every week for not updating this, don't we? There are two blog posts coming this week about infra work, but we're also doing investigative work with our upstream on the outages they've experienced; between that and a lot of after-hours on-call work, we're going to cut the infra team some slack this week, too.