Platform Engineering: Proxy

Now Hiring: Intern, Level 1, Level 2, Senior

Fly.io’s platform code, the engine that makes all our stuff work, is written in two different systems languages: Rust and Go. Go code powers our orchestration; it’s what converts Docker images into VMs and provisions them. Rust code drives fly-proxy, our Anycast network.

We’re looking for people who want to work on things like fly-proxy.

fly-proxy is an interesting piece of code. A request for some Fly.io app running in Dallas and Sydney arrives at our edge in, say, Toronto. We need to get it to the closest VM (yes, in this case, Dallas). So fly-proxy needs a picture of where all the VMs are running, so it can make quick decisions about where to bounce the request. This happens billions of times a day.
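To make that concrete, here’s a minimal sketch of the “pick the closest place the app actually runs” decision. The region codes, RTT numbers, and function names are made up for illustration; the real routing table carries a lot more state (health, load, concurrency).

```rust
use std::collections::HashMap;

// Hypothetical routing state: for each region the app runs in, the RTT
// (in milliseconds) from this edge to that region.
fn pick_region<'a>(
    app_regions: &'a [&'a str],
    rtt_from_edge_ms: &HashMap<&str, u32>,
) -> Option<&'a str> {
    app_regions
        .iter()
        .filter_map(|region| rtt_from_edge_ms.get(region).map(|rtt| (*region, *rtt)))
        .min_by_key(|&(_, rtt)| rtt)
        .map(|(region, _)| region)
}

fn main() {
    // The example from the text: the app runs in Dallas and Sydney,
    // and the request lands on an edge in Toronto.
    let rtt_from_toronto: HashMap<&str, u32> =
        [("dfw", 35), ("syd", 210)].into_iter().collect();
    let app_regions = ["dfw", "syd"];

    // Dallas wins: it's the closest region that actually runs the app.
    assert_eq!(pick_region(&app_regions, &rtt_from_toronto), Some("dfw"));
}
```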

We don’t simply forward those requests, though. fly-proxy runs on both our (external-facing) edge hosts and our (internal) workers, where the VMs are. It builds on-the-fly, multiplexed HTTP2 transports (running over our internal WireGuard mesh) to move requests over. The “backhaul” configuration running on the worker demultiplexes and routes the request over a local virtual interface to the VM.
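Here’s a rough sketch of the edge side of that, assuming hyper 0.14’s high-level Client (plus Tokio). The worker address and header name are invented; the point is just that an http2-only client gives you a multiplexed transport to a single worker.

```rust
use hyper::{Body, Client, Request, Uri};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical worker address on the internal WireGuard mesh.
    let backhaul: Uri = "http://[fdaa::3]:8080/".parse()?;

    // An http2-only client: many in-flight requests from this edge get
    // multiplexed over one connection to the worker.
    let client = Client::builder().http2_only(true).build_http::<Body>();

    // Forward a stand-in request over the backhaul. The real proxy
    // rewrites the incoming request and streams its body through.
    let req = Request::builder()
        .uri(backhaul)
        .header("fly-forwarded-proto", "https") // invented header name
        .body(Body::empty())?;

    let resp = client.request(req).await?;
    println!("worker answered: {}", resp.status());
    Ok(())
}
```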

It gets more interesting in every direction you look. For instance: we don’t just route simple HTTP requests; we also do raw TCP (no HTTP2 for that forwarding path). And WebSockets. All these requests get balanced (most of our users run a bunch of instances, not just one). And in the HTTP case, we automatically configure HTTPS and get certificates issued with the Let’s Encrypt ALPN challenge.
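The balancing part, stripped to its simplest form, looks something like this. The instance addresses and in-flight counts are invented, and the real proxy weighs a lot more than a single counter:

```rust
// Hypothetical view of one app instance, as the proxy might track it:
// an address on the internal network plus how many requests are
// currently in flight to it.
struct Instance {
    addr: &'static str,
    in_flight: u32,
}

// Least-loaded balancing: send the next request wherever the fewest
// requests are already in flight. The real proxy folds in more signals
// (region, health checks, per-instance concurrency limits).
fn pick_instance(instances: &[Instance]) -> Option<&Instance> {
    instances.iter().min_by_key(|i| i.in_flight)
}

fn main() {
    let instances = [
        Instance { addr: "[fdaa::10]:8080", in_flight: 12 },
        Instance { addr: "[fdaa::11]:8080", in_flight: 3 },
        Instance { addr: "[fdaa::12]:8080", in_flight: 7 },
    ];
    let chosen = pick_instance(&instances).unwrap();
    println!("forwarding to {}", chosen.addr); // the one with 3 in flight
}
```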

Zoom in on the raw request routing and there’s still more stuff going on. fly-proxy is built on Tokio, Hyper, and Tower. A single fly-proxy manages connectivity for lots and lots of Fly.io apps, and isolates the concurrency budget for each of those apps, so a busy app can’t starve the others. We’re tracking metrics for each of those apps and making them accessible to users.
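Here’s a hedged sketch of the per-app isolation idea, using Tower’s off-the-shelf concurrency limiting (this assumes the tower crate with its limit and util features; the app name, the limit of 16, and the String request type are placeholders, not how fly-proxy actually models requests):

```rust
use std::collections::HashMap;
use std::convert::Infallible;

use tower::limit::ConcurrencyLimitLayer;
use tower::{service_fn, Service, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    // Stand-in for the real routing/forwarding service; it just echoes.
    let handler = service_fn(|req: String| async move {
        Ok::<_, Infallible>(format!("handled: {req}"))
    });

    // One concurrency-limited service per app: a busy app queues on its
    // own budget instead of starving everyone else. The limit of 16 is
    // a made-up number.
    let mut per_app = HashMap::new();
    per_app.insert(
        "some-app",
        ServiceBuilder::new()
            .layer(ConcurrencyLimitLayer::new(16))
            .service(handler),
    );

    let svc = per_app.get_mut("some-app").unwrap();
    let resp = svc
        .ready()
        .await
        .unwrap()
        .call("GET /".to_string())
        .await
        .unwrap();
    println!("{resp}");
}
```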

fly-proxy also has some fun distsys problems. That global picture of where the VMs are is updating all the time. So too are the load stats for all those VMs, which impact how we balance requests. Requests can fail and get retried automatically elsewhere; in fact, that’s the core of how we do distributed Postgres.
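Boiled down to a sketch, the retry-elsewhere behavior looks like this. The instance names are invented, and deciding which failures are actually safe to retry is the real work:

```rust
// Hypothetical error for a single forwarding attempt.
#[derive(Debug)]
struct ForwardError(&'static str);

// Try candidate instances in preference order (closest / least loaded
// first), falling through to the next one when an attempt fails.
// Deciding which failures are safe to retry (connection refused vs. a
// half-written response) is the hard part in the real proxy; this
// sketch just retries everything.
fn forward_with_retry<F>(
    candidates: &[&'static str],
    mut attempt: F,
) -> Result<String, ForwardError>
where
    F: FnMut(&'static str) -> Result<String, ForwardError>,
{
    let mut last_err = ForwardError("no candidates");
    for &addr in candidates {
        match attempt(addr) {
            Ok(resp) => return Ok(resp),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

fn main() {
    // Pretend the Dallas instance is down; the request lands in Sydney.
    let candidates = ["dfw-instance:8080", "syd-instance:8080"];
    let resp = forward_with_retry(&candidates, |addr| {
        if addr.starts_with("dfw") {
            Err(ForwardError("connection refused"))
        } else {
            Ok(format!("200 OK from {addr}"))
        }
    });
    assert_eq!(resp.unwrap(), "200 OK from syd-instance:8080");
}
```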

What We’re Up To

As an engineering team, we’ve barely scratched the surface of what we can do for our users with this code. Now we’re getting our pickaxes out:

  • We might allow users to configure custom routing or middleware processors per-app or per-machine.
  • We’ll be adding new handlers to the proxy, beyond HTTP, HTTP2, and TCP.
  • We’re working on a source-to-edge static asset caching system, so that our Anycast proxy network behaves more like a CDN for the kinds of content that benefit from CDN behavior.
  • We allocate a routable IPv4 address to every app we run, which puts a cost floor on new apps on Fly.io; we’ll build a system to do shared IPs soon.
  • We do internal (OpenTelemetry-style) tracing today, but we’ll ultimately want to make this end-to-end, linking browsers and customer VMs into request traces.

And, of course, whatever you come up with, too. Fly.io runs on small, single-pizza teams that largely set their own direction. If that sounds fun to you, we should talk.

More About Us And How We Hire

There’s more about us than you probably want to know in our hiring documentation.

Make sure you read the part about our hiring process! We do hiring differently here.

The salary range for this role is $120k-$200k, plus equity.

Interested? Have questions? Hit us up at jobs+platform@fly.io.