Fly's Prometheus Metrics

Fly.io transforms container images into fleets of micro-VMs running around the world on our hardware. It’s bananas easy to try it out; if you’ve got a Docker container, it can be running on Fly in single-digit minutes.

We should talk a bit about metrics and measurement and stuff, because they’re how we all know what’s going on.

There’s two reasons we’ve written this post. The first is just that we think this stuff is interesting, and that the world can always use another detailed case study. The second, though, is that the work we’re going to describe is now part of our public API, and you can piggyback your own application metrics on top of ours. Our hope is, the more we describe this stuff, the more likely you are to get value out of it.

Counting

Here’s where we’ll start: an HTTP request arrives for an app running on Fly. It lands on one of our “edge nodes”, at fly-proxy, our Rust request router. The proxy’s job is to deliver the request to the closest unloaded “worker node”, which hosts Firecracker VMs running user applications.

Somewhere in fly-proxy’s memory there’s a counter, for the number of incoming bytes the proxy has handled. While routing the incoming request, the proxy bumps the counter.

Bumping a counter is cheap. So there’s counters, and things that act on them, all over the edge node. The operating system counts packets processed, TCP SYN segments seen, bytes processed per interface. CPU load. Free memory. Naturally, the proxy counts a bunch of things: a histogram of the times taken to complete TLS handshakes, number of in-flight connections, and so on. Some counts are global, others per-user.
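
Concretely, "bumping a counter" amounts to a single atomic add on a shared integer, which is why it's cheap enough to sprinkle everywhere. A minimal sketch of the idea (illustrative, not fly-proxy's actual code):

    use std::sync::atomic::{AtomicU64, Ordering};

    // A process-wide counter for incoming bytes. The name is made up
    // for illustration; it is not fly-proxy's real internal state.
    static HTTP_BYTES_IN: AtomicU64 = AtomicU64::new(0);

    fn account_incoming(bytes: usize) {
        // "Bumping" the counter is one relaxed atomic add.
        HTTP_BYTES_IN.fetch_add(bytes as u64, Ordering::Relaxed);
    }

    fn current_total() -> u64 {
        HTTP_BYTES_IN.load(Ordering::Relaxed)
    }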

The proxy directs the request over a WireGuard interface (more counters) to the worker node, which also runs fly-proxy (still more). The worker’s proxy finds the tap interface associated with the application and routes the request to its address (you guessed it). Inside the VM, where the user’s app is running, the guest OS counts a series of arriving Ethernet frames, a newly opened TCP connection, memory used, CPU cycles burned. Your app running in that VM might count a new HTTP connection.

We count everything we can think to count, because computers like counting stuff, so it’s cheap. You’d do the same thing even if you were building on a potato-powered MSP430 microcontroller.

Those who cannot remember the Borgmon are doomed to repeat it

To do anything with this information we need to collect it. There was once a lively debate in our industry about how to accomplish this, but Google settled it. Services expose HTTP endpoints that dump their stats; if a service can't do that, something else is arranged to do it on its behalf.

Google called these web pages /varz, and filled them with a line-by-line human-readable format (apocryphally: intended for human readers) that Google’s own tools learned to parse. In the real world, we’ve kept the text format, and call the pages /metrics. Here’s what it looks like:

fly_proxy_service_egress_http_responses_count{status="200",app_id="3146",app_name="debug",alloc="f3e39e04",service="app-3146-tcp-8080",org_id="17"} 1586

If you’re an Advent of Code kind of person and you haven’t already written a parser for this format, you’re probably starting to feel a twitch in your left eyelid. Go ahead and write the parser; it’ll take you 15 minutes and the world can’t have too many implementations of this exposition format. There’s a lesson about virality among programmers buried in here somewhere.
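
If you'd rather just see the shape of it, here's a rough sketch of a parser for a single sample line. It handles the name{labels} value form shown above and nothing fancier (no HELP/TYPE comments, no escapes, no trailing timestamps):

    // Parse one exposition-format sample line like:
    //   metric_name{label="value",other="x"} 1586
    // Returns (name, labels, value). Deliberately minimal.
    fn parse_sample(line: &str) -> Option<(String, Vec<(String, String)>, f64)> {
        let line = line.trim();
        let (head, value) = line.rsplit_once(' ')?;
        let value: f64 = value.parse().ok()?;

        let (name, labels) = match head.split_once('{') {
            Some((name, rest)) => {
                let rest = rest.strip_suffix('}')?;
                let labels = rest
                    .split(',')
                    .filter(|s| !s.is_empty())
                    .map(|pair| {
                        let (k, v) = pair.split_once('=')?;
                        Some((k.to_string(), v.trim_matches('"').to_string()))
                    })
                    .collect::<Option<Vec<_>>>()?;
                (name.to_string(), labels)
            }
            None => (head.to_string(), Vec::new()),
        };
        Some((name, labels, value))
    }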

To collect metrics, you scrape all the /metrics pages you know about, on an interval, from a collector with some kind of indexed storage. Give that system a query interface and now it's a time-series database (a TSDB). We can now generate graphs and alerts.
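
In miniature, the collector's inner loop is roughly this (reusing the toy parse_sample above, and leaving out the HTTP fetch, retries, staleness handling, and any real storage engine):

    use std::collections::HashMap;
    use std::time::{SystemTime, UNIX_EPOCH};

    // A toy in-memory "TSDB": metric name -> (timestamp, value) samples.
    // Real collectors index by the full label set and persist to disk.
    type Store = HashMap<String, Vec<(u64, f64)>>;

    fn scrape_once(metrics_page_body: &str, store: &mut Store) {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs();
        for line in metrics_page_body.lines() {
            if line.starts_with('#') {
                continue; // skip HELP/TYPE comment lines
            }
            if let Some((name, _labels, value)) = parse_sample(line) {
                store.entry(name).or_default().push((now, value));
            }
        }
    }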

What we’re describing is, obviously, Prometheus, which is an escaped implementation of Borgmon, Google’s cluster monitoring system. Google SREs have a complicated relationship with Borgmon, which was once said to have been disclosed in an “industry-disabling paper”. A lot of people hate it.

None of us have ever worked for Google, let alone as SREs. So we're going out on a limb: this is a good design. It's effective, it's so easy to implement that every serious programming environment supports it, and it neatly separates concerns: things that generate metrics just need to keep counters and barf them out into an HTTP response, which makes it easy to generate lots of them.

Metrics Databases

The division of labor between components in this design is profoundly helpful on the database side as well. What makes general-purpose database servers hard to optimize is that they’re general-purpose: it’s hard to predict workloads. TSDBs don’t have this problem. Every Prometheus store shares a bunch of exploitable characteristics:

  1. Lots of writes, not a lot of reads.
  2. Reads return vectors of data points — columns, not rows.
  3. Simply formatted numeric data.
  4. Readers tend to want data from 5 minutes ago, not 5 days ago.

Knowing this stuff ahead of time means you can optimize. You’ll use something like an LSM tree to address the bias towards writes, jamming updates into a write-ahead log, indexing them in memory, gradually flushing them to immutable disk chunks. You can compress; timestamps recorded on a relatively consistent interval are especially amenable to compression. You can downsample older data. You can use a column store instead of a row store.
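
The timestamp point deserves a tiny worked example. Samples scraped every 15 seconds have nearly constant deltas, so storing the delta-of-deltas (the trick Facebook's Gorilla paper popularized, and which shows up in Prometheus-style chunk encodings) turns most timestamps into zeros, which compress down to almost nothing. A sketch:

    // Delta-of-delta encoding for timestamps scraped on a steady interval.
    // With a 15s scrape interval, almost every encoded value is 0.
    fn encode_timestamps(ts: &[u64]) -> Vec<i64> {
        let mut out = Vec::with_capacity(ts.len());
        let mut prev = 0u64;
        let mut prev_delta = 0i64;
        for (i, &t) in ts.iter().enumerate() {
            match i {
                0 => out.push(t as i64),            // first timestamp stored raw
                1 => {
                    prev_delta = (t - prev) as i64; // then a single delta
                    out.push(prev_delta);
                }
                _ => {
                    let delta = (t - prev) as i64;
                    out.push(delta - prev_delta);   // then delta-of-deltas: mostly 0
                    prev_delta = delta;
                }
            }
            prev = t;
        }
        out
    }

    // [1000, 1015, 1030, 1045, 1061] encodes to [1000, 15, 0, 0, 1]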

And, of course, you can scale this stuff out in a bunch of different ways; you can have Prometheus servers scrape and aggregate other Prometheus servers, or spill old chunks of data out to S3, or shard and hash incoming data into a pool of servers.
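
The "shard and hash" part is less exotic than it sounds. A deliberately naive sketch of the idea (real setups layer consistent hashing and replication on top of this):

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Hash the series identity (name plus labels) to pick one of N
    // storage nodes, so each series always lands on the same node.
    fn shard_for(series_key: &str, n_shards: usize) -> usize {
        let mut h = DefaultHasher::new();
        series_key.hash(&mut h);
        (h.finish() as usize) % n_shards
    }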

Point being, keeping an indefinite number of time-series metrics available for queries is a tractable problem. Even a single server will probably hold up longer than you think. And when you need to, you can scale it out to keep up.

Which is what we’ve done at Fly. And, since we’ve built this, there’s no sense in keeping it to ourselves; if you’re running an app on Fly, you can hand us your metrics too. Let’s describe how that works.

Our Metrics Stack

A typical Fly.io POP will run some small number of lightweight edge nodes, a larger number of beefier worker nodes, and, in some cases, an infra host or two. All of these nodes — within and between POPs — are connected with a WireGuard mesh.

Our metrics stack — which is built around Prometheus-style metrics — features the following cast of characters:

  • Victoria Metrics (“Vicky”, for the rest of this post), in a clustered configuration, is our metrics database. We run a cluster of fairly big Vicky hosts.
  • Telegraf, which is to metrics sort of what Logstash is to logs: a swiss-army knife tool that adapts arbitrary inputs to arbitrary output formats. We run Telegraf agents on our nodes to scrape local Prometheus sources, and Vicky scrapes Telegraf. Telegraf simplifies the networking for our metrics; it means Vicky (and our iptables rules) only need to know about one Prometheus endpoint per node.
  • Vector, which is like a Rust version of Telegraf — we use Vector extensively for our log pipeline (a whole other blog post), and also as a remote-writer for some Prometheus metrics. Our Vector is likely to eat our Telegraf this year.
  • Prometheus exporters, which Telegraf is wired up to scrape, expose metrics for our services (and, in some cases, services push stats directly to Vicky with the Prometheus remote_write API).
  • Consul, which manages the configuration for all these things, so each host’s Telegraf agent can find local metrics sources, and Vicky can find new hosts, &c.

Steve here

For metrics nerds, the only interesting thing about this stack is Vicky. The story there is probably boring. Like everyone else, we started with a simple Prometheus server. That worked until it didn’t. We spent some time scaling it with Thanos, and Thanos was a lot, as far as ops hassle goes. We’d dabbled with Vicky just as a long-term storage engine for vanilla Prometheus, with promxy set up to deduplicate metrics.

Vicky grew into a more ambitious offering, and added its own Prometheus scraper; we adopted it and scaled it as far as we reasonably could in a single-node configuration. Scaling requirements ultimately pushed us into a clustered deployment; we run an HA cluster (fronted by haproxy). Current Vicky has a really straightforward multi-tenant API — it’s easy to namespace metrics for customers — and it chugs along for us without too much minding.

Assume, in the abstract, an arbitrarily-scaling central Vicky cluster. Here’s a simplified view of an edge node:

An Edge node

So far so simple. Edge nodes take traffic from the Internet and route it to worker nodes. Our edge fly-proxy exports Prometheus stats, and so does a standard Prometheus node-exporter, which provides system stats (along with a couple of other exporters that you can sort of mentally bucket into the node-exporter).

Jerome here

Fly-proxy uses the excellent metrics Rust crate with its sibling metrics-exporter-prometheus.

The proxy hosts further logic for transforming internal metrics into user-relevant metrics. We built our own metrics::Recorder that sends metrics through the default Prometheus Recorder, but also tees some of them off for users, rewriting labels in the process.
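
A very loose sketch of the teeing idea, using a made-up Sink trait rather than the real metrics::Recorder interface (whose exact signatures vary by crate version), and a hypothetical internal metric name:

    // Illustrative only: a stand-in trait, not the real metrics::Recorder.
    trait Sink {
        fn record(&self, name: &str, labels: &[(String, String)], value: u64);
    }

    // Tee: everything goes to the internal sink; a filtered, relabeled copy
    // of user-relevant metrics also goes to the per-customer sink.
    struct Tee<A: Sink, B: Sink> {
        internal: A,
        user_facing: B,
    }

    impl<A: Sink, B: Sink> Sink for Tee<A, B> {
        fn record(&self, name: &str, labels: &[(String, String)], value: u64) {
            self.internal.record(name, labels, value);

            if let Some(user_name) = user_visible_name(name) {
                let relabeled: Vec<_> = labels
                    .iter()
                    .filter(|(k, _)| k == "app_name" || k == "status" || k == "region")
                    .cloned()
                    .collect();
                self.user_facing.record(user_name, &relabeled, value);
            }
        }
    }

    // Hypothetical mapping from an internal metric name to a user-facing one.
    fn user_visible_name(name: &str) -> Option<&'static str> {
        match name {
            "proxy_http_responses_total" => Some("fly_app_http_responses_count"),
            _ => None,
        }
    }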

Here’s a worker node:

A Worker node

There’s more going on here!

A worker node hosts user app instances. We use HashiCorp Nomad to orchestrate Firecrackers; one of the three biggest components of our system is the driver we wrote for Nomad that manages Firecrackers. Naturally, Nomad exports a bunch of metrics to Telegraf.

Real quick though, one more level of detail:

A Firecracker

In each Firecracker instance, we run our custom init, which launches the user’s application. Our Nomad driver, our init, and Firecracker conspire to establish a vsock — a host Unix domain socket that presents as a synthetic Virtio device in the guest — that allows init to communicate with the host; we bundle node-exporter-type JSON stats over the vsock for Nomad to collect and relay to Vicky.
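
On the host side, that boils down to reading stat blobs off a Unix domain socket. A rough sketch, assuming serde_json and newline-delimited JSON (the actual socket path, message shape, and protocol are ours and not shown here):

    use std::io::{BufRead, BufReader};
    use std::os::unix::net::UnixListener;

    // Host-side sketch: accept the per-VM stats socket and read
    // newline-delimited JSON stat blobs sent by the guest's init.
    fn collect_guest_stats(path: &str) -> std::io::Result<()> {
        let listener = UnixListener::bind(path)?;
        for stream in listener.incoming() {
            let reader = BufReader::new(stream?);
            for line in reader.lines() {
                let line = line?;
                // e.g. {"metric":"guest_memory_used_bytes","value":104857600}
                if let Ok(stat) = serde_json::from_str::<serde_json::Value>(&line) {
                    println!("got stat: {stat}");
                }
            }
        }
        Ok(())
    }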

The Firecracker picture includes one of the more important details in our system. These VMs are all IP-addressable. If you like (and you should!), you can expose a Prometheus exporter in your app, and then tell us about it in your fly.toml; it might look like this:

[metrics]
    port = 9091
    path = "/metrics"

When you deploy, our system notices those directives, and will arrange for your metrics to end up in our Vicky cluster.
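
On the app side, any HTTP handler that returns the text format on that port and path will do. In practice you'd reach for a Prometheus client library in your language, but a deliberately bare-bones sketch (made-up metric name, and it doesn't even read the request) shows there's no magic involved:

    use std::io::Write;
    use std::net::TcpListener;
    use std::sync::atomic::{AtomicU64, Ordering};

    static SCRAPES: AtomicU64 = AtomicU64::new(0);

    // Serve a minimal metrics page on the port declared in fly.toml.
    // A sketch, not production code: real apps use a client library.
    fn main() -> std::io::Result<()> {
        let listener = TcpListener::bind("0.0.0.0:9091")?;
        for stream in listener.incoming() {
            let mut stream = stream?;
            let count = SCRAPES.fetch_add(1, Ordering::Relaxed) + 1;
            let body = format!("myapp_scrapes_total {count}\n");
            let response = format!(
                "HTTP/1.1 200 OK\r\nContent-Type: text/plain; version=0.0.4\r\nContent-Length: {}\r\n\r\n{}",
                body.len(),
                body
            );
            stream.write_all(response.as_bytes())?;
        }
        Ok(())
    }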

Zooming out, our metrics architecture logically looks something like this:

The global view

Metrics > Checks

When it comes to automated monitoring, there are two big philosophies, “checks” and “metrics”. “Checks” is what many of us grew up with: you write a script that probes the system and makes sure it’s responding the way you expect it to. Simple.

“Metrics” is subtler. Instead of running scripts, you export metrics. Metrics are so much cheaper than checks that you end up tracking lots more things. You write rules over metrics to detect anomalies, like “more 502 responses than expected”, or “sudden drop in requests processed”, or “95th percentile latency just spiked”. Because you’ve got a historical record of metrics, you don’t even need to know in advance what rules you need. You just write them as you think of them.
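
In the Prometheus world those rules are usually written as queries rather than application code, but the arithmetic behind a rule like "more 502s than expected" is nothing deeper than this sketch (names and threshold are made up):

    // Two readings of a monotonically increasing counter, e.g. a 502
    // response count, taken `seconds` apart.
    fn rate_per_second(earlier: u64, later: u64, seconds: f64) -> f64 {
        later.saturating_sub(earlier) as f64 / seconds
    }

    fn should_alert(earlier: u64, later: u64, seconds: f64) -> bool {
        // Made-up threshold: more than one 502 per second over the window.
        rate_per_second(earlier, later, seconds) > 1.0
    }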

That’s “white box monitoring”. Use metrics to open up your systems as much as you can, then build monitoring on top, so you can ramp up the monitoring over time without constantly redeploying new versions of your applications.

What’s more, you’re not just getting alerting; the same data flow gives you trending and planning and pretty graphs to put on LCD status boards on your wall.

Once again, the emerging standard for how to accomplish this appears to be Prometheus. It’s easy to add Prometheus exporters to your own applications, and you should do so, whether or not you run your app on Fly. Something will eventually collect those metrics and put them to good use!

At Any Rate

So that’s the metrics infrastructure we built, and the observability ideas we shoplifted to do it.

The neat thing about all of this is that Prometheus metrics are part of our public API. You can take them out for a spin with the free-plan Grafana Cloud right now; sign up with your GitHub account and you can add our API as a Prometheus data source. The URL will be https://api.fly.io/prometheus/$(your-org) — if you don’t know your org, personal works — and you’ll authenticate with a custom Authorization header: Bearer $(flyctl auth token).

Add a dashboard (the plus sign on the left menu), add an empty panel, click “Metrics”, and you’ll get a dropdown of all the current available metrics. For instance:

sum(rate(fly_app_http_responses_count[$__interval])) by (status)

You can display any metric as a table to see the available labels; you can generally count on “app”, “region”, “host” (an identifier for the worker node your app is running on) and “instance”. Poke around and see what’s there!

Remember, again, that if you’re exporting your own Prometheus metrics and you’ve told us about it in fly.toml, those metrics will show up in Grafana as well; we’ll index and store and handle queries on them for you, with the networking and access control stuff taken care of automatically.