Our User-Mode WireGuard Year


We’re Fly.io. We run container images on our hardware around the world, linked to our Anycast network with a WireGuard mesh. It’s pretty neat, and you should check it out. You can be up and running on Fly.io in single-digit minutes. That’s the last thing we’re going to say about how great Fly.io is in this post.

WireGuard is fundamental to how Fly.io works.

Practically everything that talks to anything else at Fly.io does so over WireGuard.

That goes for users, too. F’rinstance: to SSH into an instance of an app running on Fly.io, you bring up a WireGuard peer to one of our gateways and SSH to an IPv6 private network address reachable only over that WireGuard session.

This kind of stuff is mostly hidden by flyctl, our command-line interface, which is how users interact with their apps on Fly.io. On most platforms, “how do you SSH to an instance” is a boring detail. But flyctl isn’t boring.

I’m about to explain why, for better and worser, flyctl is exciting. But first, a disclaimer. In the ordinary course, a company writes a blog post such as this to crow about some achievement or “unique selling proposition”. This post is not that. This is an exercise in radical candor.

A Houdini With The Manacles Of Sound Engineering

Recap: It’s February 2021, and you can deploy a Docker container to Fly.io, with flyctl, and our platform will dutifully convert it to a Firecracker VM running on a worker in our network and connected (again, via WireGuard) to our Anycast proxy network so that your customers can reach it on the Internet.

What you couldn’t do is easily log into that VM to monkey around with it.

That wouldn’t do; what’s the point of having a fleet of VMs if you can’t monkey with them? The team agreed: being able to pop a shell on a running app was table stakes for the platform.

We noodled a bit about how to do it. For a while, we thought about building a remote-access channel into our Rust Anycast proxies. But we’d just rolled out 6PN private networking, making it easy for Fly.io apps to talk to each other. SSH seemed like an obvious example of a service you might run over a private network.

The trick was how to get users access to 6PN networks from their laptops. It was pretty easy to build an SSH server to run on our VMs, and APIs for certificate-based access control and building WireGuard peers for 6PN networks. But the client side is tricky: WireGuard changes your network configuration, and mainstream operating systems won’t let you do that without privileges. It wouldn’t do to require root access to run flyctl, and having to do a bunch of system administration to set up system WireGuard before you could run flyctl wasn’t attractive either.

So we came up with a goofy idea. You have to be root to add a network interface, but any old user can speak the WireGuard protocol, and once you’ve got a WireGuard session, TCP/IP networking is just building and shuttling a bunch of packet buffers around. Your operating system doesn’t even have to know you’re doing it. All you need is a TCP stack that can run in userland.

I said this within earshot of Jason Donenfeld. The next morning, he had a working demo. As it turns out, the gVisor project had exactly the same user-mode TCP/IP problem, and built a pretty excellent TCP stack in Golang. Jason added bindings to wireguard-go, and an enterprising soul (Ben Burkert, pay him all your moneys, he’s fantastic) offered to build the feature into flyctl. We were off to the races: you could flyctl ssh console into any Fly.io app, with no client configuration beyond just installing flyctl.
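
If you haven’t seen user-mode WireGuard up close, it’s worth a look at how little code it takes. Here’s a minimal sketch using the netstack bindings that now ship with wireguard-go; the keys and addresses are placeholders, and this isn’t flyctl’s actual code:

```go
package main

import (
	"log"
	"net/http"
	"net/netip"

	"golang.zx2c4.com/wireguard/conn"
	"golang.zx2c4.com/wireguard/device"
	"golang.zx2c4.com/wireguard/tun/netstack"
)

func main() {
	// A virtual TUN device plus a gVisor TCP/IP stack, entirely in-process.
	tun, tnet, err := netstack.CreateNetTUN(
		[]netip.Addr{netip.MustParseAddr("fdaa:0:18:a7b::2")}, // placeholder 6PN address
		[]netip.Addr{netip.MustParseAddr("fdaa:0:18::3")},     // placeholder DNS server
		1420, // MTU
	)
	if err != nil {
		log.Fatal(err)
	}

	// Speak the WireGuard protocol from userland; no root, no interfaces.
	dev := device.NewDevice(tun, conn.NewDefaultBind(), device.NewLogger(device.LogLevelError, ""))

	// UAPI config; the hex keys here are placeholders, not real credentials.
	err = dev.IpcSet(`private_key=a8dac1d8a70a751f0f699fb14ba1cff7b79cf4fbd8f09f44c6e6a90d0369604f
public_key=25123c5dcd3328ff645e4f2a3fce0d754400d3887a0cb7c56f0267e20fbf3c5b
endpoint=203.0.113.1:51820
allowed_ip=::/0`)
	if err != nil {
		log.Fatal(err)
	}
	if err := dev.Up(); err != nil {
		log.Fatal(err)
	}

	// tnet looks just like the net package, but every dial rides the tunnel.
	client := http.Client{Transport: &http.Transport{DialContext: tnet.DialContext}}
	resp, err := client.Get("http://[fdaa:0:18:a7b::3]/") // placeholder peer address
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}
```

Your operating system never finds out any of this happened.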

I want to pause a second here and make sure what we ended up doing makes sense to you, because it still seems batshit to me a year later. Want to SSH into a Fly.io app instance you started? Sure. Just run flyctl ssh console, and it will:

  1. Kick off a WireGuard VPN session from within flyctl, using the wireguard-go library.
  2. Run an entire TCP/IP stack, in userland, inside flyctl, to make an IPv6 TCP connection over that WireGuard connection.
  3. Using the golang.org/x/crypto/ssh library, run an SSH session over that synthetic TCP connection.
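
Step 3 falls out of the fact that golang.org/x/crypto/ssh will happily run over any net.Conn you hand it, including the synthetic ones tnet mints. A sketch, reusing tnet from the example above, with a placeholder address and hand-waved credentials:

```go
import (
	"context"

	"golang.org/x/crypto/ssh"
	"golang.zx2c4.com/wireguard/tun/netstack"
)

// sshOverNetstack dials an instance's private address through the userland
// stack and runs SSH over the resulting connection. Not flyctl's real code.
func sshOverNetstack(ctx context.Context, tnet *netstack.Net, signer ssh.Signer) (*ssh.Session, error) {
	addr := "[fdaa:0:18:a7b::3]:22" // placeholder 6PN address

	raw, err := tnet.DialContext(ctx, "tcp6", addr)
	if err != nil {
		return nil, err
	}

	conf := &ssh.ClientConfig{
		User:            "root",
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)}, // signer: your issued key
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),              // demo only; verify for real
	}
	c, chans, reqs, err := ssh.NewClientConn(raw, addr, conf)
	if err != nil {
		return nil, err
	}
	return ssh.NewClient(c, chans, reqs).NewSession()
}
```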

And this works! You can do it right now! Why am I the only person that thinks it’s bananas that this works?

It Is Bananas That This (Mostly) Works

Alright, radical candor.

The nerd quotient on flyctl ssh console is extreme, which is a strong argument in favor of it. But there are countervailing reasons, and we ran into them.

Here’s a simple problem. When you tell flyctl ssh console to bring up a WireGuard session like this, that running instance of flyctl on your machine — you know, the one that shows up in ps — is effectively another computer on the Internet. It has an IPv6 address. It is the only machine on the Internet that can have that IPv6 address. So what happens when you open up an SSH session in one window, and then another session in a different window?

In March of 2021, the answer was “it knocked the first SSH session off the Internet”. That’s how WireGuard works! Your peer keeps track of the source socket address that’s talking to it, and when a new source appears, that’s the host it starts talking to. It’s one of the great things about WireGuard, and why you can bring up a WireGuard connection, close your MacBook, walk to the coffee shop, open your MacBook back up, and still be connected to WireGuard.

I tried to rationalize this “one SSH session at a time” behavior for a couple weeks, but, come on.

There were two paths we could have taken to address this problem. The easy-seeming way would be to have each flyctl instance make a new WireGuard peer, each with its own IPv6 address and public key pair. There were things I didn’t like about that, like the fact that it would crud our WireGuard gateways up with zillions of random ephemeral WireGuard sessions. But the dealbreaker in Spring 2021 was that creating a new WireGuard peer configuration was slow. We will return to this point shortly.

The other way forward was to not have multiple instances of flyctl speaking WireGuard. Instead, when you made a WireGuard connection, we’d spawn a background process — the flyctl agent. flyctl ssh console runs would come and go, each talking to the agent, which would stick around holding open WireGuard sessions. Sure, why not!

I know how much you all love random background agent processes. I’m here to tell you that my Spring 2021 flyctl agent was all you could have imagined it would be. It only worked on Unix. Concurrency management? Try to start a new agent, and it’ll just ask the old one to die over its Unix domain socket, and then take over. Configuration changes? I’m just a simple unfrozen caveman agent, what changes would I need to know about?
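
For flavor, the takeover dance was roughly this shape. This is a hypothetical sketch; the socket path handling and the quit message are illustrative, not the agent’s real protocol:

```go
import (
	"net"
	"os"
	"time"
)

// startAgent asks any incumbent agent to exit over its Unix socket, then
// claims the socket for itself. Hypothetical; not flyctl's actual handshake.
func startAgent(path string) (net.Listener, error) {
	if c, err := net.Dial("unix", path); err == nil {
		c.Write([]byte("quit\n"))          // politely ask the old agent to die
		c.Close()
		time.Sleep(500 * time.Millisecond) // give it a beat to shut down
	}
	os.Remove(path) // clear a stale socket left behind by a dead agent
	return net.Listen("unix", path)
}
```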

Fun fact: custom resolver Dialers don’t work on Windows in Go.

Fortunately for everyone else, I’m not the only developer on this team, and the agent got some professional help. The team got Unix domain sockets working on Windows. They wrote a new DNS resolver that worked on Windows as well. The agent will only run one of itself at a time. It notices configuration changes after it starts, and doesn’t get out of sync and stale. If you use flyctl today, you’re missing a whole lot of debugging fun.

Doubling Down On Banana Futures

User-mode WireGuard and TCP/IP via IPC with a background agent is an awful lot of mechanism just to run an SSH session. A lesser engineer might look at this and say “the thing to do here is to get rid of some of this mechanism”. We chose instead “do more stuff with that mechanism”. I mean, I say “we chose”. But I was the last to know; I arose from a long slumber at some point in the middle of the year to find that our deploys were running over user-mode WireGuard.

Here’s another challenge users run into when deploying Docker apps on Fly.io: they’re often not running Docker. An engineer of limited imagination such as myself would look at this as a documentation problem, the solution to which would be an instruction to users to “install Docker”. But we’re a clever lot, we is.

Flip back to that 1-2-3 process for popping a shell over user-mode WireGuard. Here’s a new problem: “from a directory with a Dockerfile, deploy a Docker image on Fly.io if you’re not running Docker locally”. Here’s what you do:

  1. Use our GraphQL API to tell Fly.io to boot up a “builder” instance that does almost nothing but run Docker, because, hammer, nail, only tool, &c. 
  2. Kick off a WireGuard VPN session from within flyctl, using the wireguard-go library.
  3. Run an entire TCP/IP stack, in userland, inside flyctl, to make an IPv6 TCP connection over that WireGuard connection.
  4. Using the github.com/docker/docker/client libraries, build a Docker container on the remote builder instance (which really just means connecting to a random IPv6 address rather than 127.0.0.1). 
  5. Tell the builder to push the image to our Docker registry, and our API to deploy it.

It’s just 5 steps! It all happens in the background; you just run flyctl deploy! What could go wrong?
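
What’s fun about step 4 is how little it asks of the Docker SDK: a stock client, a strange-looking host, and a dialer that happens to be a WireGuard tunnel. A sketch, with a made-up builder address, again assuming the tnet stack from earlier:

```go
import (
	"github.com/docker/docker/client"
	"golang.zx2c4.com/wireguard/tun/netstack"
)

// builderClient points a stock Docker client at a remote builder's 6PN
// address, dialing through the userland stack instead of the host network.
// The address is a placeholder.
func builderClient(tnet *netstack.Net) (*client.Client, error) {
	return client.NewClientWithOpts(
		client.WithHost("tcp://[fdaa:0:18:a7b:d0ce:1:2:3]:2375"),
		client.WithDialContext(tnet.DialContext),
		client.WithAPIVersionNegotiation(),
	)
}
```

From there, ImageBuild and friends behave exactly as they would against a local daemon.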

This pattern repeats. The horror of user-mode WireGuard and TCP/IP is that it is a whole lot of mechanism. But the beauty of it is that it’s mind-bogglingly flexible. A little later in the year, we launched Fly.io Postgres. Want to bring up a psql shell on your database? flyctl pg connect. Do I need to rattle off the 1-2-3 of that? For that matter, what if you have a cool client like Postico you want to use? No problem! flyctl proxy 5432:5432. A proxy isn’t even 3 whole steps!
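
A proxy over the userland stack really is just a listen, a dial, and two copies. A sketch, with a placeholder Postgres address; the real flyctl proxy is more careful about errors and shutdown than this:

```go
import (
	"context"
	"io"
	"net"

	"golang.zx2c4.com/wireguard/tun/netstack"
)

// proxy accepts plain TCP on localhost and splices each connection onto a
// dial made through the userland WireGuard stack.
func proxy(tnet *netstack.Net) error {
	ln, err := net.Listen("tcp", "127.0.0.1:5432")
	if err != nil {
		return err
	}
	for {
		local, err := ln.Accept()
		if err != nil {
			return err
		}
		go func() {
			defer local.Close()
			remote, err := tnet.DialContext(context.Background(), "tcp6", "[fdaa:0:18:a7b::5]:5432")
			if err != nil {
				return
			}
			defer remote.Close()
			go io.Copy(remote, local) // client -> database
			io.Copy(local, remote)    // database -> client
		}()
	}
}
```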

Here’s where shit gets real. One can rationalize the occasional SSH connection janking out. SSH wasn’t even a feature we had at the beginning of the year. But deploys? Deploys have to work.

The Deploys, They Were Not Always Working

More radical candor.

We have some good ideas on what made remote builds over WireGuard shaky, and builds have gotten a lot better. But I can’t tell you we’ve nailed down every failure mode. Here are two big ones.

First: bringing up new WireGuard peers was slow. Real, real slow.

It’s Fall of 2021 and here’s what happened when you asked us to create a new WireGuard peer:

  1. You’d trigger a mutation in our GraphQL API to add a WireGuard peer.
  2. Our API would generate a WireGuard configuration and send it back to you.
  3. Meanwhile, it’d trigger a Consul KV write, adding the configuration to a KV tree that I did not expect to get as big as it got.
  4. The Consul cluster would hold an Entmoot.
  5. 45-95 seconds later, consul-templaterb on our gateway would get wind of a KV change, and download every single peer that had ever been configured for the gateway.
  6. About 10 seconds earlier, your flyctl command gave up trying to bring up a connection to a WireGuard peer that did not yet exist on the gateway.
  7. consul-templaterb would write wg1.conf and then run a shell program that would resync the WireGuard configuration and then re-install routes for each one of the tens of thousands of WireGuard peers for that gateway.
  8. 10 seconds later, you’re good to go! Wait, where’d you go?

This is very bad. It only happens the first time you use WireGuard on a machine; the next time you go light up that WireGuard peer, it’ll come right up, because it’s already installed. But guess who’s making a WireGuard connection with flyctl for the first time? That’s right: someone who just decided to try out Fly.io and followed our Speedrun instructions. No fair! It looks like all of Fly.io isn’t working, when in fact the only part of Fly.io that isn’t working is the part that allows you to use it.

That last observation comes from Will Jordan, on our SRE team.

There was low-hanging fruit to pick here. For instance, Will took one look at our WireGuard resync shell script and cut its runtime from 10 seconds to a few dozen milliseconds. But consul-templaterb — well, it is what it is. “Things will go as they will, and there’s no need to hurry to meet them”, it says. “I am on nobody’s side, because nobody is on my side, little orc.”

We have, in our prod infrastructure, two basic ways of communicating state changes: Consul and NATS. Consul is slow and “reliable”; NATS is fast, but doesn’t guarantee delivery. A few weeks ago, we switched from consul-templaterb to a system we call attache, which, among other things, does NATS transactions to update WireGuard peers. In the new system, creating a new WireGuard peer looks like this:

  1. You trigger a mutation in our GraphQL API to add a WireGuard peer.
  2. Our API generates a WireGuard configuration and sends it to attache on the gateway.
  3. A couple dozen milliseconds later, the gateway has installed the new WireGuard peer, and acknowledges the update to our API.
  4. The API replies to your GraphQL request with the WireGuard configuration.
  5. Your flyctl connects to the WireGuard peer, which works, because if you’ve received the configuration, it’s already installed on the gateway.

The whole process might take a second or two. It’s fast enough that you could imagine revisiting the decision to have a flyctl agent; with a little bit of caching, you could just make new WireGuard peers whenever you need them, and we could garbage collect the old ones.
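
I can’t show you attache itself, but the request/acknowledge step at the heart of it has an obvious shape in nats.go. The subject name and payload below are hypothetical; the real protocol is internal to Fly.io:

```go
import (
	"time"

	"github.com/nats-io/nats.go"
)

// addPeer publishes a peer config to a gateway and waits for the install
// acknowledgment before the API answers the GraphQL request. If no ack
// arrives within the deadline, the client never sees the config.
func addPeer(nc *nats.Conn, gateway string, conf []byte) error {
	_, err := nc.Request("wireguard.peer.add."+gateway, conf, 2*time.Second)
	return err
}
```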

There Is A More Ominous Problem I Don’t Like Talking About

The remote builds, they’re way better. You can stick flyctl in your GitHub Actions for CI and it’ll work.

But I have a nagging feeling it can’t be working perfectly for everybody, because it involves running WireGuard, and, as you may already know, WireGuard doesn’t run over 443/tcp.

If you’re on a corporate network with a proxy firewall, or on a janky VPN, or some random CI VM, 51820/udp might be blocked. In fact, for all we know, all UDP might be blocked. I’ve tried telling Kurt “that’s their problem”, but I haven’t won the argument.

There is, in the GitHub project for flyctl, a branch that addresses this problem. Our WireGuard gateways all run a program called wgtcpd. It is as elegant as it is easy to pronounce. It runs an HTTPS server (with a self-signed certificate, natch!) with a single endpoint that upgrades to WebSockets and proxies WireGuard. The flyctl tcp-proxy branch will run WireGuard over that, instead of UDP.

I’m here to tell you that for all the nattering about how problematic UDP-only WireGuard is, it turns out not to involve a lot of code to fix; the WebSockets protocol for this is just “send a length, then send a packet, read a length, then read a packet”.
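
Here, in fact, is roughly all of it. Whether wgtcpd uses a two-byte or a four-byte length field is my assumption; WireGuard packets fit comfortably in 16 bits either way:

```go
import (
	"encoding/binary"
	"fmt"
	"io"
)

// writePacket frames one WireGuard datagram onto a reliable stream.
func writePacket(w io.Writer, pkt []byte) error {
	var hdr [2]byte
	binary.BigEndian.PutUint16(hdr[:], uint16(len(pkt)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(pkt)
	return err
}

// readPacket reads one length-prefixed datagram back off the stream.
func readPacket(r io.Reader, buf []byte) (int, error) {
	var hdr [2]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return 0, err
	}
	n := int(binary.BigEndian.Uint16(hdr[:]))
	if n > len(buf) {
		return 0, fmt.Errorf("packet of %d bytes exceeds buffer", n)
	}
	_, err := io.ReadFull(r, buf[:n])
	return n, err
}
```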

We could do something even more clever here. For instance, our friends/archnemeses at Tailscale run a global network of something they call “DERP”, which is part of their NAT-traversal proxy system. We could have our gateways connect to their DERP servers and register our public keys, and then you’d be able to connect to the same DERP servers and talk to us. That seems like a fun project, because there’s apparently nothing they can do to stop us.

(I originally called DERP “WebRTC”, and Tailscale took offense, saying the proper term would be “not-WebRTC not-ICE not-TURN but-kinda-similarish DERP NAT Traversal thingy”.)

But we’re still in denial about this problem and waiting for it to smack us in the face; we haven’t even merged the WebSockets branch of flyctl, because maybe it’s just not an issue? We only just solved the peer creation lag problem, and we’re waiting for things to even out. But if you needed to, you could run a WebSockets build of flyctl today.

Where This Leaves Us

I’ve painted a picture here, and you might infer from it that I regret user-mode WireGuard and TCP/IP. But the truth is, I love it very much; it is one of those Fly.io architectural features that makes me happy to work here. I’d say that for the first half of 2021, it probably wasn’t paying its way in complexity and operational cost, but that it’s opened up a bunch of possibilities for us that will let us build other bananas features without having to change anything in our prod infrastructure.

There’s a fun side to flyctl WireGuard. For instance, it has its own dig command, which talks directly to our private .internal nameservers. What’s that, you say? The same feature would be a couple dozen lines of Ruby in a GraphQL API? Shut up!
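
For the curious, the dig trick is nothing more exotic than a stock Go net.Resolver with its Dial pointed into the tunnel. A sketch, with a placeholder nameserver address:

```go
import (
	"context"
	"net"

	"golang.zx2c4.com/wireguard/tun/netstack"
)

// internalResolver sends every DNS query to a private .internal nameserver
// over the userland stack. PreferGo forces Go's built-in resolver so that
// our custom Dial is actually used.
func internalResolver(tnet *netstack.Net) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			return tnet.DialContext(ctx, network, "[fdaa:0:18::3]:53") // placeholder nameserver
		},
	}
}
```

Then r.LookupHost(ctx, "my-app.internal") does what you’d hope.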

Or, how about this: you can ping things now. Ping! Of all things! You can flyctl ping my-app.internal and we’ll ping each instance of my-app for you. I know how much you love pinging things. And what’s the fun of using a hosting platform if you don’t get to pilot Howl’s Moving Castle Of Weird Network Engineering to check the latency on your app instances?
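
If you’re wondering how an unprivileged process pings anything at all: the netstack bindings expose ping sockets as dialable networks, so no raw sockets are involved. A sketch, with x/net/icmp building the echo; assume the tnet from way back up top, and that flyctl’s real code is more careful:

```go
import (
	"time"

	"golang.org/x/net/icmp"
	"golang.org/x/net/ipv6"
	"golang.zx2c4.com/wireguard/tun/netstack"
)

// ping6 sends one ICMPv6 echo through the userland stack and waits for the
// reply, returning the round-trip time. The target is a placeholder.
func ping6(tnet *netstack.Net, addr string) (time.Duration, error) {
	socket, err := tnet.Dial("ping6", addr) // e.g. an instance's 6PN address
	if err != nil {
		return 0, err
	}
	defer socket.Close()

	echo := icmp.Echo{ID: 1, Seq: 1, Data: []byte("fly?")}
	msg, err := (&icmp.Message{Type: ipv6.ICMPTypeEchoRequest, Body: &echo}).Marshal(nil)
	if err != nil {
		return 0, err
	}

	start := time.Now()
	if _, err := socket.Write(msg); err != nil {
		return 0, err
	}
	socket.SetReadDeadline(time.Now().Add(5 * time.Second))
	reply := make([]byte, 1500)
	if _, err := socket.Read(reply); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}
```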

Also, I blame Jason Donenfeld and Ben Burkert for all of this.

As I said at the top, this is one of those posts that isn’t trying to sell Fly.io, but just provide a somewhat honest accounting of the experience of building it, and a peek into the way we think about stuff. Having said all that: you should take Fly.io for a spin, because when flyctl ssh console is working — and it’s working pretty much all the time now — it is slick as hell.