A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and is a superset both of our status page events and of customer-impacting events on the platform. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.
Trailing 2 Weeks Incidents
(Incident visualization: larger boxes are longer incidents, darker boxes are higher-impact.)
May 19: BGP Issues In Europe (14:30EST): Synthetics and health checks flagged that `bird`, the BGP daemon we run, was flapping on edges in WAW. We escalated upstream, but the issue quickly resolved itself. Recorded here for completeness; a very short, very regional incident with unknown but probably very limited impact on customer networking.

May 23: WireGuard Gateway Outage In FRA (7:00EST): Our FRA gateway became unresponsive. Investigation turned up that it functioned fine with “native” UDP WireGuard, just not with WireGuard-over-WebSockets, which is our platform default. That in turn narrowed down the scope of what could have gone wrong; Lillian and Saleem then spotted a contention/bottleneck issue in the code we run to install “just-in-time” WireGuard peers by sniffing and parsing WireGuard INIT packets. For new peers, that code path involves the gateway making calls to our central API, which can be slow. The code in question had always been suboptimal, but was under new strain because of how our Fly Kubernetes orchestration for Managed Postgres uses WireGuard. We took the FRA gateway out of rotation immediately, mitigating the issue, and had a bugfix deployed fleetwide about 2 hours later.
May 8: Fleetwide WireGuard Gateway Load (13:00EST): This was an incident declared out of caution when a metric spiked. Per the previous incident, attempts to use new peers on a WireGuard gateway incur a call to our central API (to look up the public key for the peer). If that call fails, we log it. The metric tracking those logs shot up overnight. There was no apparent functional impact to the gateways, but they were generating a truly alarming number of queries every minute, all of which were failing. Investigation isolated the requests to FKS MPG clusters, which had a bug in the code they use to test availability of WireGuard connections; the bug had been fixed, but the clusters had not all been restarted. Bouncing them resolved the incident. No customer impact.
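For the curious, the signal here was purely a counter on failed peer-key lookups. A minimal sketch of that kind of instrumentation (the metric name, labels, and helper functions are made up for illustration, not our actual telemetry) might look like:

```go
package main

import (
	"errors"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter for failed peer public-key lookups against the
// central API; the name and label are illustrative.
var peerLookupFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "gateway_peer_lookup_failures_total",
		Help: "WireGuard JIT peer lookups that failed against the central API.",
	},
	[]string{"reason"},
)

func init() {
	prometheus.MustRegister(peerLookupFailures)
}

// resolvePeerKey stands in for the central-API call described above.
func resolvePeerKey(peerID string) (string, error) {
	return "", errors.New("unknown peer") // e.g. a stale test peer from an FKS cluster
}

// onInitPacket logs and counts the failure rather than failing loudly,
// which is why the only visible symptom was a metric shooting up.
func onInitPacket(peerID string) {
	if _, err := resolvePeerKey(peerID); err != nil {
		log.Printf("peer lookup failed for %s: %v", peerID, err)
		peerLookupFailures.WithLabelValues("api_error").Inc()
		return
	}
	// install the peer ...
}

func main() {
	onInitPacket("stale-fks-test-peer")
}
```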