2025-05-03

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact.)

  • April 28: Madrid Connectivity Loss (12:30EST): Y'all watch the news, right? Somebody unplugged all of Spain for several hours. The Internet noticed; our MAD region, particularly the edge servers, became flappy. Madrid had bigger problems than us, but the outage functioned as a Chaos Monkey and unearthed a couple of reboot issues in our eBPF networking code. We resolved those, but MAD stayed flappy, so we disfavored it in BGP4 to direct traffic out of the region. The incident was acute in MAD for about an hour, and fully resolved in two.

  • April 29: API Outage (02:30EST): Metrics and synthetics went nuts, alerting our on-call team to a loss of API availability (the first sketch after this list shows the basic shape of a synthetic check like that). Turned out: an upstream shared by both of our IAD providers had a 10-minute outage, and IAD is the region our primary API is deployed in. We continue to accept the risk of having a SPOF API region; the complexity of the alternatives is riskier, and we've taken steps to make IAD, our largest region, more resilient. This time, it was (briefly) out of even our upstreams' hands. The incident resolved in less time than it took to update the status page.

  • April 29: WireGuard Outage In CDG (20:00EST): We lost connectivity to our WireGuard gateway in CDG, meaning that users whose nearest gateway was CDG briefly lost WireGuard connectivity to the platform. The way our flyctl CLI works, if a WireGuard connection fails, we notice and immediately go pick up a new peer to use (there's a rough sketch of that failover shape below the list). Mitigating this outage thus only required us to remove CDG from the list of available gateways, which we did within a couple of minutes; the incident was then subacute (though our valued French customers may have been suboptimally routed to Frankfurt for a bit). The underlying cause turned out to be (wait for it) an upstream networking issue.

  • May 3: Anycast Routing Disruption in CDG (09:00EST): This was a fun one. We had just reorganized a big portion of our European edge network, which shifted traffic patterns. We'd also recently dealt with a European application that was exhibiting pathological request behavior, which stressed our edge routing. CDG synthetics began alerting, and while we investigated, customers in the region also began reporting unstable networking. Turns out: the pathological app behavior had queued up a gazillion retry requests, which were saturating our fly-proxy processes. That in turn brought down Vector (our telemetry router) on the CDG edges, which (along with the retries) set up a vicious cycle of queuing in the proxy and pressure on Vector. Bouncing the affected edge regions resolved the problem (inside of about 30 minutes). The Vector comorbidity of this incident blinded us to the retry loop we were experiencing in the region, a problem we have since resolved (the last sketch below shows the general idea for keeping retry storms like that bounded).
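For the curious, here's roughly what a synthetic check looks like: a probe that hits an API endpoint on a timer and pages someone after a few consecutive failures. This is an illustrative sketch, not our actual monitoring code; the endpoint, interval, and thresholds are all made up.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	probeURL     = "https://api.example.internal/health" // placeholder, not a real endpoint
	interval     = 30 * time.Second
	failsToAlert = 3 // consecutive failures before we page anyone
)

// probe makes one synthetic request and reports whether the API looked healthy.
func probe(client *http.Client) error {
	resp, err := client.Get(probeURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	failures := 0

	for range time.Tick(interval) {
		if err := probe(client); err != nil {
			failures++
			fmt.Printf("probe failed (%d in a row): %v\n", failures, err)
			if failures >= failsToAlert {
				fmt.Println("ALERT: API availability lost; paging on-call")
			}
			continue
		}
		failures = 0
	}
}
```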
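And here's the rough shape of the WireGuard failover behavior from the CDG gateway incident: walk the candidate gateways in preference order and use the first one that comes up. A loose sketch, not flyctl's actual code; the gateway names and the dial function are stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// Candidate gateways in preference order (nearest first). The real list is
// handed to the client by our API; these names are just for illustration.
var gateways = []string{"cdg", "fra", "lhr", "ams"}

// dialWireGuard stands in for actually bringing up a WireGuard peer. Here we
// pretend CDG's gateway is the one that's down, as in the incident.
func dialWireGuard(gw string) error {
	if gw == "cdg" {
		return errors.New("handshake timed out")
	}
	return nil
}

// connect walks the gateway list and returns the first one we can peer with.
func connect(candidates []string) (string, error) {
	for _, gw := range candidates {
		if err := dialWireGuard(gw); err != nil {
			fmt.Printf("gateway %s failed (%v); trying the next one\n", gw, err)
			continue
		}
		return gw, nil
	}
	return "", errors.New("no WireGuard gateway reachable")
}

func main() {
	// The server-side mitigation was simpler still: stop handing out "cdg"
	// at all, so clients never even try it.
	gw, err := connect(gateways)
	if err != nil {
		fmt.Println("could not establish WireGuard connectivity:", err)
		return
	}
	fmt.Println("connected via", gw) // falls through to fra while cdg is down
}
```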
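Finally, the retry loop. We won't reproduce the actual fly-proxy change here, but the general technique for keeping a misbehaving app's retries from snowballing is a retry budget: only allow retries in some fixed proportion to first-attempt requests, and shed the rest. A minimal sketch, with invented names and numbers:

```go
package main

import (
	"fmt"
	"sync"
)

// retryBudget allows retries only in proportion to first-attempt requests,
// so a retry storm can never multiply load without bound.
type retryBudget struct {
	mu       sync.Mutex
	requests float64 // first-attempt requests seen recently
	retries  float64 // retries spent recently
	ratio    float64 // max retries per request, e.g. 0.1 = 10%
}

func (b *retryBudget) recordRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.requests++
}

// allowRetry returns true only if spending a retry keeps us under the budget.
func (b *retryBudget) allowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.retries+1 > b.requests*b.ratio {
		return false // shed the retry instead of queuing it
	}
	b.retries++
	return true
}

func main() {
	budget := &retryBudget{ratio: 0.1}
	for i := 0; i < 100; i++ {
		budget.recordRequest()
	}
	granted := 0
	for i := 0; i < 1000; i++ { // an app going wild with retries
		if budget.allowRetry() {
			granted++
		}
	}
	fmt.Printf("granted %d of 1000 retry attempts\n", granted) // prints 10; the rest are dropped
}
```

A production version would decay those counters over a sliding window, but the invariant is the same: retries can never multiply load faster than real traffic grows.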