Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact; this week, unlike ordinary weeks, all 3 rows of the chart are “fresh.”)
January 6: Internal Metrics Outage (12:00EST): For metrics on our internal platform components, we operate a relatively large cluster of VictoriaMetrics servers. I could look up exactly how many, but it’s a lot. So much so that quickly recovering from storage exhaustion on the existing clusters would have required adding 15 new servers. That’s the position we found ourselves in; we did add new capacity, but not that dramatically, and instead limped forward with retention and configuration changes. That metrics cluster is now stabilized, but we’re almost certainly going to build out a new internal metrics system over the next several months. No customer impact.
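We won’t bore you with the full spreadsheet, but the shape of the capacity math is easy to sketch. The numbers below are made up, not our actuals; the point is just that retention is the lever that dominates disk footprint, which is why cutting it beats racking 15 servers in a hurry:

```go
package main

import (
	"fmt"
	"math"
)

// serversNeeded is the back-of-the-envelope calculation: how many storage
// nodes it takes to hold `retentionDays` of metrics at a given ingest rate,
// with each node contributing `usableTBPerServer` of disk.
// Every number here is illustrative, not a real Fly.io figure.
func serversNeeded(ingestTBPerDay, retentionDays, usableTBPerServer float64) int {
	totalTB := ingestTBPerDay * retentionDays
	return int(math.Ceil(totalTB / usableTBPerServer))
}

func main() {
	const ingest = 2.0  // TB/day of compressed samples (hypothetical)
	const usable = 10.0 // TB of usable disk per storage node (hypothetical)

	// Long retention needs a lot of nodes...
	fmt.Println("90d retention:", serversNeeded(ingest, 90, usable), "servers")
	// ...and shortening retention shrinks the footprint without new hardware,
	// which is roughly the trade we made while we stabilized things.
	fmt.Println("30d retention:", serversNeeded(ingest, 30, usable), "servers")
}
```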
January 10: Edge Server Overload In San Jose (20:30EST): Metrics alarms and synthetics for our Anycast proxies began firing in the SJC region. When we have region-local problems with edge servers, where we terminate incoming Anycast traffic from the Internet, we have a quick mitigation, which is to stop advertising the impacted region, redirecting the traffic elsewhere. We did that, worked to diagnose the issue for a while, and then did a workaround traffic shift from one edge provider to another, which resolved the immediate incident, leaving us with some capacity planning and debugging issues for the following days. Call this 5-10 minutes of traffic disruption in SJC, followed by 30 minutes of increased latency as traffic was routed out of the region. We’d like to shorten the gap between “disruption” and “increased latency”; that’s a process thing, so it shouldn’t be hard.
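For the curious, the mitigation is conceptually just a small control loop: watch region health, and when a region goes bad, withdraw its Anycast announcements so BGP converges traffic to the next-closest region. Here’s a minimal sketch of that idea; regionHealthy and withdrawRegion are hypothetical stand-ins for our alarms and router tooling, not real functions:

```go
package main

import (
	"fmt"
	"time"
)

// regionHealthy stands in for metrics alarms and synthetic probes.
func regionHealthy(region string) bool {
	return region != "sjc" // pretend SJC is the one misbehaving
}

// withdrawRegion stands in for telling the edge routers in a region to stop
// announcing the Anycast prefixes, so traffic lands somewhere else.
func withdrawRegion(region string) {
	fmt.Printf("withdrawing anycast announcements in %s\n", region)
}

func main() {
	regions := []string{"sjc", "ord", "lhr"}
	for range time.Tick(30 * time.Second) {
		for _, r := range regions {
			if !regionHealthy(r) {
				withdrawRegion(r)
			}
		}
	}
}
```

The trade-off with this mitigation is latency, not availability: users still get served, just from a farther-away region, which is the “30 minutes of increased latency” line item above.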
January 10: Machine Creation Failures in London (22:00EST): Fly Machines API metrics indicated a spike in failures for machine creations. Machine creations can fail; that’s part of the Machines API contract. But these failures were localized to a particular worker server, and customers were annoyed. It took about 40 minutes to debug the issue, which appeared at first to be a problem with how our flyd orchestrator communicates with tkdb, our hardware-segregated Macaroon token server, but turned out to be that an internal route that fly-proxy relies on had been dropped. That’s apparently a thing that just happens now? The author of this infra-log dug around in the incident channel and our internal message board and could not find a dispositive diagnosis for how that route got dropped, but reinstalling it fixed the problem. I’m only a little bit kidding. Mark this up for about an hour of sporadic API failures in London (we have lots of worker servers in London; this impacted just one).
This Week In Engineering
Your author has spent this week writing a blog post and is thus derelict in their duty to interview our infra team about the fun stuff they’ve been working on, and does not have the chutzpah to reach out to the team at 9:00PM. We’ll maybe cut a special interim update tomorrow, but we don’t like leaving you hanging on the incident reports.