Duplicate Wireguard Mesh IPs Wreaking Havoc

# April 20: Duplicate Wireguard Mesh IPs Wreaking Havoc (14:53 UTC)

Some background: at Fly.io, we run a fleet of bare-metal servers hosting your workloads, be they Machines or Sprites, all connected over a Wireguard mesh. When we provision a new server, something has to set up Wireguard so that the server is reachable by the rest of the fleet. This is done by something we call flywire. It generates a Wireguard public/private key pair, sends the public key to our Consul cluster to be read by other nodes, and picks an IP in a private /8 range.

You have probably read about our last incident, where some nodes simply lost connectivity over this Wireguard mesh. This incident began as what looked like a recurrence of that one: some nodes could not talk to others. Only this time, resetting the wg0 interface did nothing to fix the issue. On one of the affected edge servers, we also noticed that NATS (used to propagate app load information, logs, etc.) was using an abnormally high amount of CPU. This gave us a clue, since its logs kept complaining that some of its peers did not report the expected regions (they should have been in sin but reported as fra, for example).

We went to check on those nodes in fra as well. It turned out they had the same IPs as the problematic nodes in sin! In fact, after a quick sweep of our entire fleet using a script, we found a couple more pairs and triples of servers with this exact same problem. They had all been provisioned recently, and we were lucky that many of them were not yet set up to accept new Machines. Duplicate IPs are problematic because other nodes may end up selecting one but not the other as the “active” peer, causing partial connectivity. Most of our platform components also assume that Wireguard IPs are unique. We quickly took all of them out of production to investigate.
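
A sweep like the one described can be sketched roughly as follows. This is a minimal illustration, not the actual script: the `find_duplicate_ips` helper, the hostnames, and the addresses are all hypothetical, and it assumes the hostname-to-IP mapping has already been collected from each server.

```python
from collections import defaultdict


def find_duplicate_ips(host_ips):
    """Return {ip: sorted hostnames} for every IP claimed by more than one host."""
    hosts_by_ip = defaultdict(list)
    for host, ip in host_ips.items():
        hosts_by_ip[ip].append(host)
    return {ip: sorted(hosts) for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


# Hypothetical inventory: hostnames and addresses are made up for illustration.
fleet = {
    "edge-sin-1": "10.0.0.1",
    "edge-fra-1": "10.0.0.1",
    "worker-sin-2": "10.0.0.7",
}

for ip, hosts in find_duplicate_ips(fleet).items():
    print(f"duplicate {ip}: {', '.join(hosts)}")
```

The same grouping could back a periodic duplicate alert: any non-empty result means at least two servers share a mesh IP.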

It turned out that there were two bugs in the provisioning process that caused this:

First, when generating new Wireguard key pairs and IPs, we acquire a Consul lock on the respective resources, but the lock is held only while the IP is generated. We do check for duplicate IPs at this stage, but by the time we write the IP into Consul, the lock has already been released. Any parallel writers could therefore hit a classic TOCTOU (time-of-check to time-of-use) race: two provisioning runs can both see the same IP as free and both write it.
Second, in some cases several nodes all got the first IP available in the /8 range. That is too unlikely to be explained by pure chance. Rather, the bug here was that our code for generating the next IP by checking Consul ignored errors returned by Consul and simply defaulted to the first IP in that case.
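
The shape of the two fixes can be sketched together as below. This is a simplified illustration under stated assumptions, not the real provisioning code: `FakeStore` is a hypothetical in-memory stand-in for Consul (its `threading.Lock` plays the role of the Consul lock), and `allocate_ip`, `StoreError`, and the `10.0.0.0/8` range are likewise made up. It shows two things: the lock is held across both the duplicate check and the write, and a failed read raises instead of silently falling back to the first address.

```python
import ipaddress
import threading


class StoreError(Exception):
    """Raised when the (hypothetical) store cannot be read."""


class FakeStore:
    """In-memory stand-in for Consul; not a real Consul client."""

    def __init__(self):
        self.lock = threading.Lock()  # plays the role of the Consul lock
        self.ips = {}                 # hostname -> assigned IP
        self.fail_reads = False       # flip to simulate a read error

    def read_ips(self):
        if self.fail_reads:
            raise StoreError("could not list assigned IPs")
        return set(self.ips.values())

    def write_ip(self, host, ip):
        self.ips[host] = ip


def allocate_ip(store, host, network="10.0.0.0/8"):
    """Pick the lowest free host address and record it under one lock."""
    with store.lock:
        # An error here propagates to the caller; there is no default
        # to the first IP in the range.
        used = store.read_ips()
        for candidate in ipaddress.ip_network(network).hosts():
            if candidate not in used:
                # Still holding the lock, so no other writer can claim
                # this address between the check and the write.
                store.write_ip(host, candidate)
                return candidate
    raise RuntimeError("no free addresses left")


# Provision eight hypothetical nodes in parallel.
store = FakeStore()
threads = [
    threading.Thread(target=allocate_ip, args=(store, f"node-{i}"))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(str(ip) for ip in store.ips.values()))
```

Because the check and the write happen under the same lock, parallel callers each end up with a distinct address, and a provisioning run that cannot read the existing assignments fails loudly rather than claiming an address that may already be in use.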

As an immediate measure, we have reset the wg0 IP addresses of all these servers and added an alert when we detect duplicates. We are also going to fix the two bugs in our provisioning script to avoid this in the future.