2024-09-14

Trailing 2 Weeks Incidents

[A diagram of two weeks of incidents]

(Larger boxes are longer incidents, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents.)

  • September 9: Scheduler Instability On IAD Hosts (9:00 EST): Internal metrics alerts from trace telemetry showed a spike in latency for “add new machine” calls in our flyd orchestrator on a subset of IAD hosts; normally these calls are very fast, and now they weren’t. We’ve seen this pattern before, correlated with errors in our APIs, but none were evident this time. We called an incident and updated the status page with a warning about potential slowness in the API. After about 80 minutes of debugging, the culprit was identified as a set of internal apps running on Fly Machines that were caught in a particularly weird crash loop; deleting those apps resolved the immediate problem.

  • September 9: Fly-Metrics Log Cluster Outage (21:00 EST): The log retrieval interface for a Quickwit log cluster used for fly-metrics (but not for flyctl logs, our internal logs, or our log shipper) stopped returning logs; logs were still being ingested and saved, but not returned in queries. After roughly an hour of diagnosis, the culprit was determined to be a broken indexing service; destroying and recreating its Fly Machine resolved the problem.

  • September 13: Scheduler Outage (13:00 EST): For a period of about 10 minutes, a large fraction of our flyd scheduler services in multiple regions were locked in a crash loop. The proximate cause of the outage was our rotation of a Vault token used by the service; the root cause was an infrastructure orchestration bug in how we managed those tokens: a configuration management tool got into a state where it held onto, and repeatedly attempted to renew, a non-renewable token. (A sketch of the safer renewal pattern follows this list.)

  • September 13: DNS Failures in Europe (17:30 EST): For about 10 minutes, we observed internal DNS failures in European regions. We briefly stopped advertising FRA edges (which resolved the problem, though not to our satisfaction) and then bounced DNS services across edges in the region (which also resolved the problem). We identified some ongoing upstream networking issues and kept some edges unadvertised into the next week. Minimal, uncertain customer impact.
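
For the curious, the September 13 scheduler outage boils down to a well-known Vault footgun: a renewal loop that assumes its token is renewable. Here’s a minimal sketch in Go of the safer pattern, using the hashicorp/vault/api client. This is illustrative, not our actual configuration management code, and maintainToken is a made-up name; the point is the TokenIsRenewable check, which turns “this token is non-renewable” into a signal to re-authenticate rather than an endless (in our case, crash-looping) renewal retry.

    package main

    import (
        "errors"
        "log"
        "time"

        vault "github.com/hashicorp/vault/api"
    )

    // maintainToken keeps a Vault token alive, but only if the token is
    // actually renewable; a non-renewable token is handed back to the
    // caller to be replaced instead of being renewed in a loop.
    func maintainToken(client *vault.Client) error {
        for {
            secret, err := client.Auth().Token().LookupSelf()
            if err != nil {
                return err // can't even inspect the token; get a new one
            }

            renewable, err := secret.TokenIsRenewable()
            if err != nil {
                return err
            }
            if !renewable {
                // The failure mode from the incident: retrying RenewSelf
                // on a token like this fails every time, forever.
                return errors.New("token is not renewable; re-authenticate")
            }

            if _, err := client.Auth().Token().RenewSelf(3600); err != nil {
                log.Printf("renew failed (will retry): %v", err)
            }
            time.Sleep(30 * time.Minute)
        }
    }

    func main() {
        client, err := vault.NewClient(vault.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }
        if err := maintainToken(client); err != nil {
            // In a real service, this is where you'd log in again and
            // start over with a fresh token, rather than crash-looping.
            log.Fatalf("token maintenance stopped: %v", err)
        }
    }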

This Week In Infra Engineering

In the interests of getting this infra log update up in a timely fashion and also giving the infra log writer a break, we’re going to talk about this week in infra engineering… next week.