Trailing 2 Weeks Incidents
(Larger boxes are longer, darker boxes are higher-impact.)
April 28: Madrid Connectivity Loss (12:30 EST): Y'all watch the news, right? Somebody unplugged all of Spain for several hours. The Internet noticed; our MAD region, particularly the edge servers, became flappy. Madrid had bigger problems than us, but the outage functioned as a Chaos Monkey and unearthed a couple of reboot issues with our eBPF networking code. We resolved those, but MAD stayed flappy, so we disfavored it in BGP4 to direct traffic out of the region. The incident was acute in MAD for about an hour, and fully resolved in 2.

April 29: API Outage (02:30 EST): Metrics and synthetics went nuts, alerting our on-call team to a loss of API availability. Turned out: an upstream shared by both of our IAD providers had a 10-minute outage; IAD is the region our primary API is deployed in. We continue to accept the risk of having a SPOF API region; the complexity of the alternatives is riskier, and we've taken steps to make IAD, our largest region, more resilient. This time, it was (briefly) out of even our upstreams' hands. The incident resolved in less time than it took to update the status page.
April 29: WireGuard Outage in CDG (20:00 EST): We lost connectivity to our WireGuard gateway in CDG, meaning that users whose nearest gateway was CDG briefly lost WireGuard connectivity to the platform. The way our flyctl CLI works, if a WireGuard connection fails, we notice it and immediately go pick up a new peer to use. Mitigating this outage thus only required us to remove CDG from the list of available gateways, which we did within a couple of minutes; the incident was then subacute (though our valued French customers may have been suboptimally routed to Frankfurt for a bit). The underlying cause turned out to be (wait for it) an upstream networking issue.
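To make the mitigation concrete: the client-side behavior amounts to "try the nearest gateway; if it won't peer, fall through to the next one." Here's a minimal sketch of that kind of failover logic, in Go; the names, types, and endpoints are hypothetical and not flyctl's actual internals.

```go
// Hypothetical sketch of client-side WireGuard gateway failover.
// Names and types are illustrative, not flyctl's real implementation.
package main

import (
	"errors"
	"fmt"
)

// Gateway is a WireGuard gateway a client could peer with.
type Gateway struct {
	Region    string
	Endpoint  string
	Available bool // flipped off when a region (e.g. CDG) is pulled from rotation
}

// connect stands in for establishing a WireGuard peer session.
func connect(gw Gateway) error {
	if !gw.Available {
		return errors.New("gateway withdrawn from rotation")
	}
	// ... a real client would handshake with gw.Endpoint here ...
	return nil
}

// pickPeer walks the candidate list (nearest first) and returns the first
// gateway it can actually establish a session with.
func pickPeer(candidates []Gateway) (Gateway, error) {
	for _, gw := range candidates {
		if err := connect(gw); err != nil {
			// Connection failed or region was pulled: fall through to the
			// next-nearest gateway instead of surfacing an error.
			continue
		}
		return gw, nil
	}
	return Gateway{}, errors.New("no reachable WireGuard gateway")
}

func main() {
	candidates := []Gateway{
		{Region: "cdg", Endpoint: "cdg.example.net:51820", Available: false}, // pulled during the incident
		{Region: "fra", Endpoint: "fra.example.net:51820", Available: true},
	}
	gw, err := pickPeer(candidates)
	if err != nil {
		panic(err)
	}
	fmt.Println("peered with", gw.Region) // falls through to fra
}
```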
May 3: Anycast Routing Disruption in CDG (09:00 EST): This was a fun one. We had just reorganized a big portion of our European edge network, which shifted traffic patterns. We'd also recently dealt with a European application that was exhibiting pathological request behavior, which stressed our edge routing. CDG synthetics began alerting, and while we investigated, customers in the region also began notifying us of unstable networking. Turns out: the pathological app behavior had queued up a gazillion retry requests, which were saturating our fly-proxy processes. This in turn brought down Vector (our telemetry router) on the CDG edges, which (along with the retries) set up a vicious cycle of queuing in the proxy and pressure on Vector. Bouncing the affected edge regions resolved the problem (inside of about 30 minutes). The Vector comorbidity of this incident blinded us to the retry loop we were experiencing in the region, a problem we have since resolved.
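The general failure mode here is retries piling up behind a saturated worker faster than they drain. As an illustration only (not fly-proxy code, and not a description of the fix we shipped), here's a Go sketch of a bounded intake queue that sheds excess requests rather than letting them stack up; all names are made up for the example.

```go
// Illustrative sketch: a fixed-capacity intake queue that sheds excess
// retries instead of queuing them behind a slow downstream.
package main

import (
	"errors"
	"fmt"
)

var errShed = errors.New("queue full: request shed")

// Proxy holds a fixed-capacity buffered channel as its intake queue.
type Proxy struct {
	queue chan string
}

func NewProxy(capacity int) *Proxy {
	return &Proxy{queue: make(chan string, capacity)}
}

// Enqueue admits a request if there is room and sheds it otherwise.
// Shedding (or returning 429/503 upstream) breaks the retry feedback loop:
// callers back off instead of stacking more work on a saturated proxy.
func (p *Proxy) Enqueue(req string) error {
	select {
	case p.queue <- req:
		return nil
	default:
		return errShed
	}
}

func main() {
	p := NewProxy(2) // tiny queue so the shedding is visible
	for i := 0; i < 5; i++ {
		req := fmt.Sprintf("retry-%d", i)
		if err := p.Enqueue(req); err != nil {
			fmt.Println(req, "->", err)
			continue
		}
		fmt.Println(req, "-> queued")
	}
}
```

The design point is that rejecting early keeps a retry storm from coupling the proxy's health to whatever is misbehaving downstream, instead of letting unbounded queues turn one slow dependency into a region-wide backlog.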