# April 14: WireGuard wg0 one-way host connectivity (10:51 UTC)
Over the past few weeks, we observed individual pairs of hosts failing to send traffic over our global WireGuard mesh. Specifically, the tunnel between the hosts would appear up and handshaking correctly, but packets would only flow in one direction (or sometimes not at all). Neither the WireGuard configuration nor the firewall rules explained this behaviour.
This caused a few issues. fly-proxy instances on affected hosts couldn’t talk to each other, which broke load balancing in some cases. Static egress IPs were hit harder, since the return path depends on edge nodes being able to forward packets back to workers: if one edge node loses connectivity with a worker in this way, some packets may be silently dropped, depending on which edge node upstream flow hashing selects to forward them.
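To illustrate why only some static egress flows break, here is a minimal Python sketch. It assumes return traffic is spread across edge nodes by hashing each flow's 5-tuple. The hash (SHA-256), the node names, and the `pick_edge` and `delivered` helpers are all hypothetical; real upstream routers use their own hash functions and inputs.

```python
import hashlib

def pick_edge(flow, edges):
    """Map a flow 5-tuple to one edge node (illustrative hash, not a real router's)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return edges[int.from_bytes(digest[:8], "big") % len(edges)]

def delivered(flow, edges, broken_links, worker):
    """Return True if the return packet for `flow` can be forwarded to `worker`."""
    edge = pick_edge(flow, edges)
    # A broken (edge, worker) tunnel silently drops everything hashed to that edge.
    return (edge, worker) not in broken_links

# Hypothetical topology: three edge nodes, one of which cannot reach the worker.
edges = ["edge-1", "edge-2", "edge-3"]
broken_links = {("edge-2", "worker-1")}

# 200 return flows that differ only in the remote port.
flows = [("203.0.113.7", port, "198.51.100.9", 443, "tcp") for port in range(40000, 40200)]
results = [delivered(f, edges, broken_links, "worker-1") for f in flows]
print(f"{results.count(False)} of {len(flows)} flows dropped")
```

Because the hash is deterministic, a given flow either always works or always fails, which is part of what makes this failure mode hard to spot from the outside.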
Eventually, we tracked this issue down to a regression in the 5.15 stable kernel tree. We attempted to resolve it by removing and re-adding the peer, but that caused Netlink in the kernel to hang, as described in this LKML thread. Fortunately, we later realized that even though resetting a single peer would hang, restarting the entire WireGuard interface (by downing the interface and re-initializing it) did not. This causes much less disruption to customer workloads on affected hosts, and we quickly fixed every instance we could find.
To close out the incident, we added an alert for any WireGuard peers stuck in this way, and scheduled an upgrade to a later kernel version.
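The alert condition itself isn't described here. As one possible shape for such a check, the sketch below compares two snapshots of `wg show wg0 dump` output and flags peers whose handshake is fresh and whose transmit counter is growing while the receive counter barely moves. The thresholds, the `PeerStats` type, and the `parse_dump` and `suspect_one_way` helpers are assumptions made for illustration, not the production alert.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PeerStats:
    public_key: str
    latest_handshake: int  # Unix seconds; 0 means no handshake yet
    rx_bytes: int
    tx_bytes: int

def parse_dump(text):
    """Parse `wg show <iface> dump` output into {public_key: PeerStats}."""
    peers = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # the first line describes the interface itself
        fields = line.split("\t")
        if len(fields) != 8:
            continue  # skip anything that is not a peer line
        pub, _psk, _endpoint, _allowed, handshake, rx, tx, _keepalive = fields
        peers[pub] = PeerStats(pub, int(handshake), int(rx), int(tx))
    return peers

def suspect_one_way(before, after, now,
                    max_handshake_age=180,   # assumed: seconds
                    min_tx_delta=10_000,     # assumed: bytes sent between snapshots
                    max_rx_delta=1_000):     # assumed: bytes received between snapshots
    """Return peers that look up (fresh handshake) but receive almost nothing."""
    suspects = []
    for pub, new in after.items():
        old = before.get(pub)
        if old is None:
            continue  # peer appeared between snapshots; nothing to compare
        fresh = new.latest_handshake > 0 and now - new.latest_handshake <= max_handshake_age
        tx_delta = new.tx_bytes - old.tx_bytes
        rx_delta = new.rx_bytes - old.rx_bytes
        if fresh and tx_delta >= min_tx_delta and rx_delta <= max_rx_delta:
            suspects.append(pub)
    return sorted(suspects)
```

A heuristic like this only catches the side that is sending without receiving, and legitimately one-directional traffic could trip it, so the thresholds would need tuning against real traffic before paging anyone.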