A Foolish Consistency: Consul at Fly.io

Fly.io runs applications by transmogrifying Docker containers into Firecracker micro-VMs running on our hardware around the world, connected with WireGuard to a global Anycast network. Yours could be one of them! Check us out: from a working container, your app can be running worldwide in minutes.

We set the scene, as usual, with sandwiches. Dig if you will the picture: a global Sandwich Bracket application, ascertaining once and for all the greatest sandwich on the planet.

Fly.io wants our app, sandwich-bracket, deployed close to users around the world. Chicago users vote for Italian beefs on an instance of sandwich-bracket in Chicago; people who love bánh mì are probably voting on a Sydney instance, egg salad on white bread, Tokyo.

To run a platform that makes this kind of thing work, we need a way to route incoming traffic to instances. The way we do that is with service discovery: a distributed catalog of all services running at Fly.io. The Fly.io service catalog lives in Consul. The catalog expands consciousness. The catalog is vital to space travel.

The catalog occupies more of our mental energy than just about anything at Fly.io. We've sunk a huge amount of energy into keeping São Paulo, Sydney, Singapore, and points between consistent in their view of what's running on Fly.io, scaling a single global Consul cluster. What we think we've learned is that keeping São Paulo and Sydney on exactly the same page about what's running in Mumbai is a mug's game, and we shouldn't be playing it.

And so, to begin, it is my privilege to inflict Consul on you.

What the Hell Is Consul?

Consul is a distributed database that attempts to be a source of truth for which services are currently running. It’s one of several “service coordination” or “service discovery” databases; the other popular ones are Etcd, which once powered Kubernetes, and Zookeeper, the original service coordinator, important in the Java/Hadoop ecosystem.

The challenge of these databases, the reason they’re not just trivial MySQL instances, is that you can’t just have one of them. Once you start relying on service discovery, it can't go down, or your applications all break. So you end up with a cluster of databases, which have to agree with each other, even as services come and go. These systems all expend a lot of effort, and make a lot of compromises, in order to cough up consistent answers on flappy networks with fallible servers where individual components can fail.

How Consul works is that you have a cluster of “Consul Servers” — maybe 3, 5, or 7 — and then all the rest of your machines run a “Consul Agent” that talks to the Servers. The Servers execute the Raft consensus protocol, maintaining a log of updates that form the basis for the database. Agents, in turn, inform the servers about events, such as an instance of a service terminating. An Agent can talk to any Server, but the Servers elect a leader, and updates are routed to the leader, which coordinates the Raft update to the log.

With me so far? Neither am I. But the specifics don’t matter much, as long as you understand that every machine in our fleet runs a lightweight Consul Agent that relays events to Consul Servers, which we only have a few of, locked in an unending and arcane ritual of consensus-tracking, producing: a map of every service running on Fly.io, and exactly where it’s running. Plus some other stuff.

Let's see it in action.

Consul at Fly.io

A Consul "service" at Fly.io, in the main, is an exposed port on an app a user deployed here. An "instance" of a service is a VM exposing that port. A "node" is one of Fly.io's own servers.

We run a couple different kinds of servers, among them lightweight “edges” handling Internet traffic, and chonky “workers” running customer VMs. Both run fly-proxy, our Rust+Tokio+Hyper proxy server.

Every app running on Fly.io gets a unique, routable IPv4 address.

That’s how our CDN works: we advertise these addresses, the same addresses, from dozens of data centers around the world, with BGP4 (this is “Anycast”). Backbone routing takes you to the closest one. Say sandwich-bracket is currently deployed in Frankfurt and Sydney. Anycast means the votes for doner land on a Fly.io edge in Frankfurt, right next to a worker running sandwich-bracket; A bánh mì vote lands on a Sydney edge, and is routed to a Sydney worker. We're not deployed in Tokyo, so a vote for egg salad hits a Tokyo edge, and gets routed… out of Japan.

The problem facing fly-proxy is, “where do I send this egg salad vote”. “The garbage” being, unfortunately, not a valid answer, fly-proxy needs to know which workers your sandwich-bracket app is running on, and then it needs to pick one to route to.

Here's the data we're working with:

data stability what
nodes very stable network locations of workers and edges
app IPs stable match incoming traffic to apps
services changing pinpoint instances to route to
health flappy pinpoint instances to avoid routing to
load untenably flappy enable load balancing

Enter Consul. When you first created sandwich-bracket with our API, we:

  1. allocated an IPv4 address for it 
  2. wrote it to Consul’s KV store
  3. deployed it to, say, a worker node in Frankfurt
  4. when that instance came up, our orchestration code registered the service instance with Consul. 

The simplest way to integrate all that information, and what we did until a couple months ago, is: we’d run consul-templaterb and ask it to track every service in Consul and sync a JSON file with the data; when the file is updated, fly-proxy gets a signal and re-reads it into memory.

Theoretically, Consul can tell us how “close” each of those services are – Consul puts a bunch of work into network telemetry — but we do that bit ourselves, and so there’s another JSON file that fly-proxy watches to track network distance to every server in our fleet.

Consul doesn't give us the load (in concurrent requests) on all the services. For a long time, we abused Consul for this, too: we tracked load in Consul KV. Never do this! Today, we use a messaging system to gossip load across our fleet.

Specifically, we use NATS, a simple "brokered" asynchronous messaging system, the pretentious way to say "almost exactly like IRC, but for programs instead of people".

Unlike Consul, NATS is neither consistent nor reliable. That's what we like about it. We can get our heads around it. That's a big deal: it's easy to get billed for a lot of complexity by systems that solve problems 90% similar to yours. It seems like a win, but that 10% is murder. So our service discovery will likely never involve an event-streaming platform like Kafka.

Putting it all together, you have a sense of how our control plane works. Say the World Sandwich Authority declares doner is no longer a sandwich, and Japanese biochemists invent an even fluffier white bread. Traffic plummets in Frankfurt and skyrockets in Tokyo. We move our Frankfurt instance to Tokyo (flyctl regions set syd nrt). This kills the Frankfurt instance, and Frankfurt's Consul Agent deregisters it. JSON files update across the fleet. The Tokyo instance comes up and gets registered; more JSON churn.

We use Consul for other stuff!

  • Apps on Fly.io belong to “organizations”, and apps within an organization, all over the world, can talk to each other on 6PN private networks. The DNS records for those 6PN addresses are derived (mostly) from Consul.
  • One of Consul’s big features is “health checks” — it’ll report the health of a service based on configurable checks, so we don’t route traffic to instances that are having problems.
  • Consul used to be how we propagated WireGuard information for our users, which is how flyctl works. 
  • Consul is also the source of truth for the WireGuard mesh network that links every machine in our fleet.

It Burns

It looks like textbook Consul. But it's not, really.

Consul is designed to make it easy to manage a single engineering team's applications. We're managing deployments for thousands of teams. It's led us to a somewhat dysfunctional relationship.

To start with, we have a single Consul namespace, and a single global Consul cluster. This seems nuts. You can federate Consul. But every Fly.io data center needs details for every app running on the planet! Federating costs us the global KV store. We can engineer around that, but then we might as well not use Consul at all.

Consul's API was also not designed with our needs in mind (nor should it have been). It's got a reasonable API for tracking some things, but not quite the things we need. So, for instance, there’s an HTTP endpoint we can long-poll to track the catalog of services. But:

  1. For any endpoint in the Consul HTTP API, if something changes, we get a refresh of all the data for that endpoint, which means incremental changes, which happen every few seconds, are expensive.
  2. What’s worse (and not Consul’s fault), tooling like consul-templaterb means we’re constantly rewriting and rereading large JSON files to register those small changes.
  3. The “catalog of services” endpoint returns information about the services (the sandwich-brackets app), but not the metadata for individual instances of those services (the instances of sandwich-brackets in nrt and syd).

That second problem is kind of a nightmare. We have tens of thousands of distinct services. We need per-instance metadata for every instance of those services.

Consul can give us that metadata in two ways: by asking it about individual services, one-by-one, or by asking for service catalogs from each of our servers. We can't long poll tens of thousands of endpoints. So, the way we get instance metadata from Consul is to ask it about servers, not services. We long-poll an API endpoint for each individual server. There’s no one endpoint that we can long-poll for all the nodes.

You might at this point ask why we’re storing this kind of stuff in instance metadata at all. That’s a good question with a complicated answer. Some of it has to do with Nomad, the orchestration service we use (and are gradually moving away from) to actually run VM jobs. Information about, say, what ports an app listens on percolates from a Nomad task description into its Consul service registration; that’s just how Nomad works.

You also can’t just factor the information out into, say, a Consul KV tree, because apps have versions, and different versions of apps listen on different ports, and fly-proxy needs to track them.

This stuff is all solvable! But, like, are you going to solve it by using Consul more carefully, or are you going to solve it by using Consul less?

Which is how it came to be that we found ourselves driving over 10 (t-e-n) gb/sec of Consul traffic across our fleet. Meanwhile, and I haven’t done the math, but it’s possible that the underlying data, carefully formatted and compressed, might fit on a dialup modem.

This, it turns out, was Not Entirely Our Fault. Long-suffering SRE Will Jordan, his brain shattered by ten gigabits of sustained Consul traffic, dove into the Consul codebase and discovered a bug: updates anywhere in Consul un-blocked every long-polling query. We had tens of thousands, N^2 (don’t email me!) in the number of nodes, all of which return a full refresh of the data they’re tracking when they unblock. Anyways, Will wrote a couple dozen lines of Go, and:

A gruesome bandwidth graph

Extricating Ourselves

So, consul-templaterbis easy, but rough at the scale we work at. It runs Ruby code to to track updates that happen multiple times per second, each time writing giant JSON blobs to disk.

We felt this acutely with private DNS. Consul propagates the data that our DNS servers use (it's similar to the data that fly-proxy uses). Fly.io Postgres depends on these DNS records, so it needs to work.

Our DNS server was originally written in Rust, and used consul-templaterb the way fly-proxy did, but wrote its updates to a sqlite database. At certain times, for certain workers, we’d experience double-digit second delays after instances came up — or worse, after they terminated. This is a big deal: it's a window of many seconds during which internal requests get routed to nonexistent hosts; worse, the requests aren't being handled by our smart proxy, but by people's random app networking code.

We blamed Consul and consul-templaterb.

To fix this, we rewrote the DNS server (in Go), so that it tracked Consul directly, using Consul’s Go API, rather than relying on consul-templaterb. We also had it take “hints” directly from our orchestration code, via NATS messages, for instances starting and stopping.

Rewriting a Rust program in Go is sacrilege, we know, but Go had the Consul libraries we needed, and, long term, that Go server is going to end up baked into our orchestration code, which is already in Go.

It turns out that what we really want (if not our dream Consul API) is a local sqlite cache of all of Consul’s state. That way our proxy, WireGuard code, DNS servers, and everything else can track updates across our fleet without a lot of complicated SRE work to make sure we’re interfacing with Consul properly.

By rewriting our DNS server, we'd inadvertently built most of that. So we extracted its Consul-tracking code and gave it an identity of its own, attache. attache runs on all our hosts and tracks most of Consul in sqlite. In theory, infra services at Fly.io don’t need to know anything about Consul anymore, just the schema for that database.

New architectural possibilities are becoming apparent.

Take that horrible N^2 polling problem. Since we've abstracted Consul out, we really don't need a better Consul API; we just need an attache API, so a "follower" attache can sync from a "leader". Then we could run a small number of leaders around the world — maybe just alongside Consul Servers, so that almost all our Consul read traffic would be from machines local to the Consul Servers. We'd chain lots of followers from them, and possibly scale Consul indefinitely.

We can get clever about things, too. What we'd really be doing with "leader" and "follower" attache is replicating a sqlite database. That's already a solved problem! There's an amazing project called Litestream that hooks sqlite's WAL checkpointing process and ships WAL frames to storage services. Instead of building a new attache event streaming system, we could just set up "followers" to use Litestream to replicate the "leader" database.

You don't really have to think about any of this.

It's just a couple commands to get an app deployed on Fly.io; no service discovery required.

Try Fly for free  

We Don't Want Raft

We're probably not going to do any of that, though, because we're increasingly convinced that's sinking engineering work into the wrong problem.

First, if you haven't read the Google Research paper "The Tail At Scale", drop everything and remedy that. It's amazing, an easy read, and influential at Fly.io.

Then: the problem we're solving with attache is "creating a globally consistent map of all the apps running across our fleet". But that's not our real problem. Our real problem is "generate fast valid responses from incoming requests". Solving the former problem is hard. It's one answer to the real problem, but not necessarily the optimal one.

No matter how we synchronize service catalogs, we're still beclowned by the speed of light. Our routing code always races our synchronization code. Routing in Tokyo is impacted by events happening in São Paolo (sandwich: mortadella on a roll), and events in São Paolo have a 260ms head start.

A different strategy for our real problem is request routing that's resilient to variability. That means a distributing enough information to make smart first decisions, and smart routing to handle the stale data. We've already had to build some of that. For instance, if we route a request from an edge to a worker whose instances have died, fly-proxy replays it elsewhere.

You can finesse this stuff. "The Tail At Scale" discusses "hedged" requests, which are sometimes forwarded to multiple servers (after cleverly waiting for the 95th percentile latency); you take the first response. Google also uses "tied" requests, which fan out to multiple servers that can each call dibs, canceling the other handlers.

We'd do all of this stuff if we could. Google works on harder problems than we do, but they have an advantage: they own their applications. So, for instance, they can engineer their protocols to break large, highly-variable requests into smaller units of work, allowing them to interleave heavyweight tasks with latency-sensitive interactive ones to eliminate head-of-line blocking. We haven't figured out how to do that with your Django POST and GET requests. But we're working on it!

A lot of our problems are also just simpler than Raft makes them out to be. We used to use Consul to synchronize load information. But that's kind of silly: Consul is slower than the feed of load events, and with events sourced globally, we never could have had a picture that was both fresh and accurate. What we needed were hints, not databases, and that's what we have now: we use NATS (which isn't necessarily even reliably delivered, let alone consensus-based) to gossip load.

Same goes for health checks. We already have to be resilient to stale health check information (it can and does change during request routing). Consul keeps a picture of health status, and we do still use it to restart ailing instances of apps and report status to our users. But fly-proxy doesn't use it anymore, and shouldn't: it does a fine job generating and gossiping health hints on its own.

But, as Always: DNS

We can make our routing resilient to stale health and load information, and push orchestration decisions closer to workers to be less reliant on distributed health events. We have a lot of flexibility with our own infrastructure.

That flexibility stops at the doors of the VM running your Django app. We don't own your app; we don't really even know what it does. We're at the mercy of your socket code.

So one place we're kind of stuck with Consul-style strongly consistent service maps is DNS. Normal apps simply aren't built to assume that DNS can be a moving target. If your app looks up sandwich-postgres.internal, it needs to get a valid address; if it gets nothing, or, worse, the address of a VM that terminated 750ms ago, it'll probably break, and break in hyper-annoying ways that we don't yet have clever ways to detect.

We've spent the better part of 9 months improving the performance of our .internal DNS, which is a lot for a feature that was an afterthought when I threw it together. We're in a sane place right now, but we can't stay here forever.

What we're going to do instead – you'll see it soon on our platform – is play the ultimate CS trump card, and add another layer of indirection. Apps on Fly.io are getting, in addition to the DNS names they have today, "internal Anycast": a stable address that routes over fly-proxy, so we can use the same routing smarts we're using for Internet traffic for Postgres and Redis.

You'll still get to be fussy about connectivity between your internal apps! Internal Anycast is optional. But it's where our heads are at with making internal connectivity resilient and efficient.

And So

We mostly like Consul and would use it again in new designs. It’s easy to stand up. It’s incredibly useful to deploy infrastructure configurations. For example: we write blog posts like this and people invariably comment about how cool it is that we have a WireGuard mesh network between all of our machines. But, not to diminish Steve’s work on flywire, that system falls straight out of us using Consul. It’s great!

But we probably wouldn’t use Consul as the backing store for a global app platform again, in part because a global app platform might not even want a single globally consistent backing store. Our trajectory is away from it.