Building clusters with serf, my new favorite thing

Author

Name: Thomas Ptacek
@tqbf: @tqbf

Assume for a second we’d like to see what happens when a web page loads in a browser in Singapore. Easy enough; Fly.io will take a container image you throw at it, transform it into a Firecracker VM, and run it in Singapore.

Getting Up And Running

We want a container that loads a web page in a browser; that sounds like a job for Headless Chromium. Here’s a Dockerfile; actually, don’t bother, it’s just one of those Dockerfiles that installs the right apt-get packages and then downloads the right Chromium distribution; the entrypoint runs Chromium with the right arguments.

Deploy the app on Fly:

    $ flyctl apps create # accept the defaults
$ flyctl regions set sin
$ flyctl deploy

  

And this will pretty much just work. Say we named the Fly app ichabod-chrome. When flyctl deploy finishes, our image will be running as VM somewhere near Singapore, and reachable from around the world as ichabod-chrome.fly.dev. You could drive the Chrome instance running in Singapore using the Chrome debug protocol, which has implementations in a ton of languages; for instance, if we want to screenshot a page in Ruby, we could just install the ferrum gem, and then:

    require 'ferrum'
c = Ferrum::Browser.new :base_url => "http://ichabod-chrome.fly.dev:9222/"
c.goto "https://news.ycombinator.com"
c.screenshot :path => "hn.webp"

  

Super boring! Neat that it works, though! But there’s, like, an obvious problem here: Chrome Debug Protocol isn’t authenticated, so we’re just kind of hanging out on the Internet hoping nobody does something dumb with the browser proxy we’ve created on this public URL.

Let’s fix that. We’ll run our Chrome as a 6PN application, and talk to it over WireGuard. We crack open the fly.toml that flyctl generated for us, and add:

    [experimental]
   private_network = true

We also yank out the whole [[services]] section, because we’re not exposing any public services to health-check anymore. And we change our entrypoint to bind to its private IPv6 address.

A flyctl deploy run loads our “new” application, which speaks CDP only over private IPv6 addresses. But: now we can’t talk to it! We’re not on the private IPv6 network.

That’s easy to fix: install WireGuard (it runs everywhere). Then run flyctl wireguard create, which will generate a WireGuard configuration for us that we can load in our client. Hit the connect button, and we’re good to go again, this time with a cryptographically secure channel to run CDP over. On our internal DNS, which is included in the WireGuard configuration we generate, our app is now reachable at ichabod-chrome.internal.

Clusters and DNS

Let’s say we want a bunch of Headless Chromiums, in a bunch of different locations. Maybe we want to screenshot the CNN front page from different countries, or run Lighthouse tests from around the world. I’m not here to judge your life decisions.

Getting those Chromium instances up and running is mercifully boring. Let’s say we want roughly to run in Singapore, Sydney, Paris, and Chile:

flyctl regions add sin syd cdg scl
flyctl scale count 4

… and that’s it; Fly will figure out how to satisfy those constraints and deploy appropriately (we’re asking now for 4 instances, and Fly will try to spread those instances around as many data centers as it can).

Now, we want to drive these new instances, and do so selectively. To do that, we have to be able to find them. We can use the DNS to do that:

$ dig txt regions.ichabod-chrome.internal +short
sin,syd,cdg,scl
$ dig aaaa syd.ichabod-chrome.internal +short
fdaa:0:18:a7b:aa4:3a64:c0ba:2

And this pretty much works, and you can probably get a long ways just using DNS for instance discovery, especially if your cluster is simple.

But for me, for this app, this is kind of an annoying place to leave off. I could pick a bunch of nits, but the big one is that there isn’t a good way to automatically get updates when the DNS changes. I can get a pretty good picture of the world when an instance starts up, but I have to go through contortions to update that picture as time ticks on.

When we were putting DNS together at Fly, we had the same thoughts. And yet we did nothing about them! We quickly concluded that if people wanted “interesting” service discovery, they could B.Y.O.

Let’s see how that plays out with this cluster. I’m going set up HashiCorp Serf to make all the components of this cluster aware of each other.

Running HashiCorp Serf

They do somewhat similar things, but Serf gets less attention than its HashiCorp sibling Consul. Which is a shame, because Serf is a simpler, more approachable system that does 80% of what a lot of people use Consul for.

A reasonable mental model of Consul is that it’s a distributed system that solves 3 problems:

Serving an up-to-date catalog of available services
Storing configuration for those services
Removing unhealthy services from the catalog

Unlike Consul, Serf handles just one of these problems, #1. In fact, Consul uses Serf under the hood to solve that problem for itself. But Consul is much more complicated to set up. Serf runs without leaders (cluster members come and go, and everybody just figures things out) and no storage requirements.

Serf is easy. In a conventional configuration — one where we run Serf as a program and not as a library embedded into our application — every node in the cluster runs a Serf agent, which, after installing Serf, is just the serf agent command. All the configuration can get passed in command line arguments:

      name="${FLY_APP_NAME}-${FLY_REGION}-$(hostname)"
  paddr=$(grep fly-local-6pn /etc/hosts | cut -f 1)

  serf agent                        \
    -node="$name"                   \ 
  -profile=wan                  \
    -bind="[$paddr]:7777"           \
    -tag role="${FLY_APP_NAME}"     \
    -tag region="${FLY_REGION}"     

  

There’s not much to it. We give every node a unique name. Serf by default assumes we’re running on a LAN and sets timers accordingly; we switch that to WAN mode. Importantly, we bind Serf to our 6PN private address. Then we set some tags, for our convenience later when selecting members.

To help Serf find other members in the cluster and converge on the complete picture of its membership, can make some quick introductions:

        dig aaaa "${FLY_APP_NAME}.internal" +short | while read raddr ; do
      if [ "$raddr" != "$paddr" ]; then
        serf join "$raddr"
      fi
    done

  

Here we’re just dumping the current snapshot of the cluster from DNS and using serf join to introduce those members. Now, if we have nodes Alice, Bob, and Chuck, and Alice introduces herself to Bob and Bob introduces herself to Chuck, Bob will make sure Alice knows about Chuck as well. We’ll talk about how that works in a second.

I wrap these two actions, running the agent and making introductions, up in a little shell script. Because I’m now running multiple thingies in my Docker image, I use overmind as my new entrypoint, which drives a Procfile. Here’s the whole Dockerfile.

What did this get me? Well, from now on, if I’m on the private IPv6 network for my organization, I can find any node and instantly get a map of all the other nodes:

    $ serf members -status=alive
ichabod-yyz-3a64c0ba   [fdaa:0:18:a7b:aa4:3a64:c0ba:2]:7777  alive   role=ichabod,region=yyz
ichabod-iad-bdc999ff   [fdaa:0:18:a7b:ab8:bdc9:99ff:2]:7777  alive   region=iad,role=ichabod
ichabod-ord-401dbe36   [fdaa:0:18:a7b:7d:401d:be36:2]:7777   alive   role=ichabod,region=ord

  

I can integrate this information with a shell script, but I can also just bring it into my application code directly (here with the relatively simple serfx gem:

    require 'ferrum'
require 'serfx'

serf = Serfx.connect
raddr = IPAddr.new_ntoh serf.members.body["Members"].first["Addr"]

c = Ferrum::Browser.new :base_url => "http://[#{raddr}]:9222/"

I could easily filter this down by location (via the “region”) tag, role, or, as we’ll see in a sec, network closeness. This interface is simpler than DNS, it’s lightning fast, and it’s always up-to-date.

A Quick Word About Security

Serf has a security feature: you can key your Serf communications statically, so rogue nodes without the key can’t participate or read messages.

It’s fine, I guess. I’d be nervous if I was deploying Serf in an environment where I was really depending on Serf’s encryption for security. But, frankly, it doesn’t matter to us here, because we’re already running on a private network, and our external connectivity to that network takes advantage of the vastly more sophisticated cryptography in WireGuard.

What Serf Is Doing

The first bit of distributed systems jargon that comes up when people describe Serf is SWIM, the “Scalable Weakly-Consistent Infection Membership” protocol. Distributed systems are full of protocols with acronymical names that are hard to get your head around, and SWIM is not one of those; I don’t think you even need a diagram to grok it.

You can imagine the simplest possible membership protocol, where you make introductions (like we did in the last section) and every member simply relays messages and tries to connect to every new host it learns about. That’s probably what you’d come up with if you ran into the membership problem unexpectedly in a project and just needed to bang something out to solve it, and it works fine to a point.

SWIM is just a couple heuristic steps forward from that naive protocol, and those steps make the protocol (1) scale better, so you can handle many thousands of nodes, and (2) quickly detect failed nodes.

First, instead of spamming every host we learn about with heartbeats on an interval, we instead select a random subset of them. We essentially just ping each host in that subset; if we get an ACK, we’ve confirmed they’re still members (and, when new nodes connect up to us, we can share our total picture of the world with them to quickly bring them up to date). If we don’t get an ACK, we know something’s hinky.

Now, to keep the group membership picture from flapping every time a ping fails anywhere in the network, we add one more transaction to the protocol: we mark the node we couldn’t ping as SUS, we pick another random subset of nodes, and we ask them to ping the SUS node for us. If they succeed, they tell us, and the node is no longer SUS. If nobody can ping the node, we finally conclude that the node is the impostor, and eject them from the ship.

Serf’s SWIM implementation has some CS grace notes, but you could bang the basic protocol out in an hour or two if you had to.

Serf isn’t just a SWIM implementation, and SWIM isn’t the most interesting part of it. That honor would have to go to the network mapping algorithm Vivaldi. Vivaldi, which was authored by a collection of my MIT CSAIL heroes including Russ Cox, Frans Kaashoek, and (yes, that) Robert Morris, computes an all-points pairwise network distance map for a cluster. Here’s a funny thread where Russ Cox finds out, years later, that HashiCorp implemented his paper for Serf.

Here’s roughly how Vivaldi works:

We model the members of our cluster as existing in some space. To get your head around it, think of them as having Cartesian 3D coordinates. These coordinates are abstract; they have no relation to real 3D space.

To assign nodes coordinates in this space, we attach them to each other with springs of varying (and, to start with, indeterminate) lengths. Our job will be to learn those lengths, which we’ll do by sampling network latency measurements.

To begin with, we’ll take our collection of spring-connected nodes and squish them down to the origin. The nodes are, to begin with, all sitting on top of each other.

Then, as we collect measurements from other nodes, we’ll measure error, comparing our distance in the model to the distance reflected by the measurement. We’ll push ourselves away from the nodes we’re measuring in some random direction (by generating a random unit vector), scaled by the error and a sensitivity factor. That sensitivity factor will itself change based on the history of our error measurements, so that we update the model more or less confidently based on the quality of our measurements.

Our cluster converges on a set of network coordinates for all the nodes that, we hope, relatively accurately represents the true network distance between the nodes.

This all sounds complicated, and I guess it is, but it’s complicated in the same sense that TCP congestion control (which was originally also based on a physical model) is complicated, not in the sense that, say, Paxos is: the complexity is mostly not exposed to us and isn’t costing meaningful performance. Serf sneaks Vivaldi data into its member updates, so we get them practically for free.

We can now ask Serf to give us the RTT’s between any two points on the network:

> $ serf rtt ichabod-yyz-3a64c0ba ichabod-iad-bdc999ff                                       
Estimated ichabod-yyz-3a64c0ba <-> ichabod-iad-bdc999ff rtt: 19.848 ms

If you’re like me, you read Serf’s description of their Vivaldi implementation and have a record scratch moment when they say they’re using an 8-dimensional coordinate system. What do those coordinates possibly represent? But you can sort of intuitively get your head around it this way:

Imagine that network performance was entirely dictated by physical distance, so that by sampling RTTs and updating a model what we were effectively doing was recapitulating the physical map of where nodes where. Then, a 2D or 3D coordinate space might effectively model network distance. But we know there are many more factors besides physical distance that impact network distance! We don’t know what they are, but they’re embedded somehow in the measurement data we’re collecting. We want enough dimensionality in our coordinates so that by iteratively and randomly sproinging away from other nodes, we’re capturing all the factors that determine our RTTs, but not so much that the data that we’re collecting is redundant. Anyways, 8 coordinates, plus (again) some grace notes.

Armon Dadger, the CTO of HashiCorp, has a really excellent talk on Vivaldi that you should just watch if this stuff is interesting to you.

Frankly, I’m writing about Vivaldi because it’s neat, not because I get a huge amount of value out of it. In theory, Serf’s Vivaldi implementation powers “nearness” metrics in Consul, which in our experience have been good but not great; I’d trust relative distances and orders of magnitude. But RTT’s aside, you could also theoretically take the 8D coordinates themselves and use them to do more interesting modeling, like automatically creating clusters of nearby or otherwise similar nodes.

A last Serf thing to point out: membership is interesting, but if you have a membership protocol, you’re epsilon away from having a messaging system, and Serf does indeed have one of those. You can send events to a Serf cluster and tell your agent to react to them, and you can define queries, which are events that generate replies. So, I can set up Serf this way:

      serf agent                        \
    -node="$name"                   \
  -profile=wan                  \
    -bind="[$paddr]:7777"           \
    -tag role="${FLY_APP_NAME}"     \
    -tag region="${FLY_REGION}"     \
    -event-handler=query:load=uptime

  

And now we can query the load on all our nodes:

    > $ serf query load                                                                          
Query 'load' dispatched
Ack from 'ichabod-iad-bdc999ff'
Response from 'ichabod-iad-bdc999ff':  23:51:58 up 1 day, 19:25,  0 users,  load average: 0.00, 0.00, 0.00
Ack from 'ichabod-yyz-3a64c0ba'
Ack from 'ichabod-ord-401dbe36'
Response from 'ichabod-ord-401dbe36':  23:51:47 up 1 day, 19:25,  0 users,  load average: 0.00, 0.00, 0.00
Response from 'ichabod-yyz-3a64c0ba':  23:51:59 up 1 day, 19:26,  0 users,  load average: 0.00, 0.00, 0.00
Total Acks: 3
Total Responses: 3

  

Under the hood, Serf is using logical timestamps to distribute these messages somewhat (but imperfectly) reliably. I love logical timestamps so much.

Easily run clusters on Fly

Painlessly boot up clusters on Fly and link them with WireGuard to your computer, AWS, or GCP.
Try Fly for free →

& Scene!

So anyways, my point here is, you can do a lot better than DNS for service discovery in a 6PN setup on Fly. Also my point is that Serf is really useful and much, much easier to set up and run than Consul or Zookeeper; you can bake it into a Dockerfile and forget about it.

Also, my point is that Fly is a pretty easy way to take distributed systems tools like this out for a spin, which you should do! You can boot a Consul cluster up on Fly, or, if you’re an Elixir person, you could use Partisan instead of Serf, which does roughly the same kind of thing.

Next post ↑: Fly In 2020 - A year in features (and articles)
Previous post ↓: How to build a global message service with NATS