World Page Speed Test – planet-wide elastic scale with FLAME

A balloon saying hello to different regions across the globe.
Image by Annie Ruygt

We’re Fly.io and we transmute containers into VMs, running them on our hardware around the world. We have fast-booting VMs, so why not take advantage of them?

I closed out the year by publishing a new library and programming model called FLAME. If you missed it, FLAME replaces the need for AWS Lambda or Cloudflare Workers, along with the fractal of proprietary-services-calling-proprietary-services architecture patterns that comes with them.

Instead, imagine if you could auto scale by wrapping any existing app code in a function and have that block of code run in a temporary copy of your app. That’s what FLAME is all about.

Since the release, I’ve built a few things using this pattern, and the most remarkable thing is just how unremarkable the solutions are. I can write my naive Elixir application on my laptop, deploy it to Fly.io running across the planet in a single command, then scale it out elastically by changing almost no code.

Measuring Page Speed

I’ve wanted to build a “World Page Speed Test” for a long time. Think Google’s Page Speed Insights, but as viewed from various observers around the globe. To do this correctly, you need to be running a full browser. And those browsers need to be running on servers around the planet. The client must download all scripts, styles, and images to reflect a real-world page load. Doing this at scale is usually Hard™. Doing this with Elixir, FLAME, and Fly.io is an afternoon of tinkering.

🔥🔥 Try World Page Speed Now! 🔥🔥

So we need a browser. Fortunately, the major browsers support headless drivers through the W3C WebDriver standard. This lets us start a headless browser and communicate with it over local HTTP: we can drive page navigation, evaluate page JavaScript, and simulate user interaction.
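To make that concrete, here’s a rough sketch of the raw protocol, assuming chromedriver is already running on its default port 9515 and using the Req HTTP client. The app wraps this glue up elsewhere; the endpoint paths follow the W3C spec:

base = "http://localhost:9515"

# Create a headless chrome session (sketch; chromedriver must already be running).
%{"value" => %{"sessionId" => session_id}} =
  Req.post!(base <> "/session",
    json: %{capabilities: %{alwaysMatch: %{"goog:chromeOptions" => %{args: ["--headless"]}}}}
  ).body

# Drive a navigation...
Req.post!(base <> "/session/#{session_id}/url", json: %{url: "https://fly.io"})

# ...and evaluate JavaScript on the loaded page, e.g. the Navigation Timing API.
Req.post!(base <> "/session/#{session_id}/execute/sync",
  json: %{
    script: "return performance.timing.loadEventEnd - performance.timing.navigationStart;",
    args: []
  }
)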

Our Elixir app can start with the basics. We’ll need:

  • A process group of nodes and where they are located geographically
  • A LiveView page that accepts a URL
  • A running chromedriver process that can launch headless chrome sessions for us
  • The ability to tell chrome to visit a URL and monitor network performance events
  • When the page starts loading, we display a loading status
  • When the page finishes loading, we display the loading time

This doesn’t take much code. I’ll crib the highlights.
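For orientation, all of those pieces hang off the application’s supervision tree. A hypothetical sketch of the layout (the WPS.Browser child and exact arrangement are illustrative, not the app’s actual code):

# In lib/wps/application.ex (sketch)
children = [
  {Phoenix.PubSub, name: WPS.PubSub},
  # A Phoenix.Tracker instance, shared by every node in the cluster
  {WPS.Tracker, name: WPS.Tracker, pubsub_server: WPS.PubSub},
  # Registers this node's Fly region and machine id with the tracker
  WPS.Members,
  # Owns the local chromedriver OS process (hypothetical name)
  WPS.Browser,
  WPSWeb.Endpoint
]

Supervisor.start_link(children, strategy: :one_for_one)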

First, we need a way to locate all the nodes in the cluster and the Fly region each belongs to. This tells us where the chrome requests originate. The built-in Node.list() returns all reachable nodes, but we also need metadata alongside each name showing where it is located.

Fortunately, we can use Phoenix.Tracker for this, which provides a process group with metadata:

defmodule WPS.Members do
  use GenServer

  @tracker WPS.Tracker

  def list(group_name \\ __MODULE__) do
    Phoenix.Tracker.list(@tracker, group_name)
  end

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  def init(opts) do
    my_region = System.get_env("FLY_REGION") || "ord"
    group_name = Keyword.get(opts, :name, __MODULE__)

    {:ok, ref} =
      Phoenix.Tracker.track(@tracker, self(), group_name, my_region, %{
        node: Node.self(),
        machine_id: System.get_env("FLY_MACHINE_ID")
      })

    {:ok, %{group_name: group_name, ref: ref}}
  end
end

We define a WPS.Members module which starts a process and calls Phoenix.Tracker.track/5 to register the current node’s FLY_REGION.
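The WPS.Tracker module referenced by @tracker isn’t shown above. Phoenix.Tracker only requires a handle_diff/2 callback, so a minimal sketch of it could look like this:

defmodule WPS.Tracker do
  use Phoenix.Tracker

  def start_link(opts) do
    opts = Keyword.merge([name: __MODULE__], opts)
    Phoenix.Tracker.start_link(__MODULE__, opts, opts)
  end

  @impl true
  def init(opts) do
    server = Keyword.fetch!(opts, :pubsub_server)
    {:ok, %{pubsub_server: server}}
  end

  @impl true
  def handle_diff(_diff, state) do
    # We only need presence listing here, so join/leave diffs can be ignored.
    {:ok, state}
  end
end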

Now we need a bit of HTML inside a LiveView which drives our headless chrome when the user hits “go”:

def render(assigns) do
  ~H"""
  <h1>World Page Speed</h1>
  <form
    :if={!@ref}
    id="url-form"
    phx-change="validate"
    phx-submit="go"
  >...
  """
end

def handle_event("go", %{"url" => url}, socket) do
  validated_url = URI.to_string(uri)
  ref = make_ref()
  parent = self()

  node_times =
    for {region, meta} <- WPS.Members.list(), %{node: node} = meta, into: %{} do
      {node, Browser.Timing.build(validated_url, region)}
    end

  {:noreply,
  socket
  |> clear_flash()
  |> assign(ref: ref, uri: uri, form: to_form(%{"url" => validated_url}))
  |> stream(:timings, Map.values(node_times), reset: true)
  |> start_async(:timing, fn ->
    nodes = Enum.map(node_times, fn {node, _} -> node end)
    :erpc.multicall(nodes, fn -> timed_nav(node_times[node()], parent, ref) end)
  end)}
end

When the user hits the “go” button on the web page, our goal is to kick off some timed headless chrome navigations.

We’ll simulate a multi-node setup locally for now, with just a single member. We aren’t going to worry about going planet-wide multi-node just yet – though it won’t take any code changes to get there.

After validating the user’s URL, we build a %Browser.Timing{} struct for each member’s region in the cluster. Next, we asynchronously navigate to the page inside a start_async call.

Within the start_async, we can see some built-in Erlang standard library treasures. :erpc.multicall accepts a list of nodes and a function to run. The Erlang VM will run the function on the passed nodes and block until it gets a result from all of them. Any process id (pid) we pass to the closure can simply be messaged across the cluster as if it were local.

Processes on the Erlang VM are our messaging, concurrency, and state primitive. They’re used everywhere, and you can run millions per node. It’s a bit like if any object in your OO runtime had a globally addressable reference that allowed you to call methods on a given instance from anywhere in the cluster. And each object ran in its own lightweight, preemptable thread.
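A tiny, self-contained illustration of that location transparency (not from the app):

parent = self()
[remote | _] = Node.list()

# The closure, and the pid inside it, are shipped to the remote node.
Node.spawn(remote, fn ->
  # This runs on `remote`, but send/2 routes straight back to our node.
  send(parent, {:hello_from, Node.self()})
end)

receive do
  {:hello_from, node} -> IO.puts("greeted from #{node}")
end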

In dev, our local node is the only one running, so the function will just run locally. The timed_nav/3 function looks like this:

def timed_nav(%Browser.Timing{} = timing, parent, ref) do
  timing = Browser.Timing.loading(timing)
  send(parent, {ref, {:loading, timing}})

  case Browser.time_navigation(timing, @browser_timeout) do
    {:ok, %Browser.Timing{} = timing} ->
      send(parent, {ref, {:complete, timing}})

    {:error, {reason, %Browser.Timing{} = timing}} ->
      send(parent, {ref, {:error, {reason, timing}}})

    {:error, reason} ->
      send(parent, {ref, {:error, {reason, Browser.Timing.error(timing)}}})
  end
end

All we do here is send a message to the parent process (our LiveView), which we’ll handle in a moment.

We first tell it we’re about to start loading. Next, we call into a Browser.time_navigation function, which asks headless chrome to navigate to the webpage and give us the timing details. The headless chromedriver glue isn’t interesting here, but you can check the source if you’re interested.

If the navigation is successful, we send a {:complete, timing} message to the LiveView. If it fails, we send an error message. That’s it!

We now have a working UI.

We’re not really multi-node yet, but the moment we have a cluster and provide a list of real nodes, our :erpc.multicall will Just Work™ across the cluster and message our remote parent LiveView thanks to the distributed, location-transparent nature of Erlang’s process messaging.

For the LiveView to handle the timing messages, we only need to implement a few functions:

def handle_info({_ref, {:loading, %Browser.Timing{} = timing}}, socket) do
  {:noreply, stream_insert(socket, :timings, timing)}
end

def handle_info({_ref, {:complete, %Browser.Timing{} = timing}}, socket) do
  {:noreply, stream_insert(socket, :timings, timing)}
end

def handle_info({_ref, {:error, {_, %Browser.Timing{} = timing}}}, socket) do
  {:noreply, stream_insert(socket, :timings, timing)}
end

Here we pattern match on the :loading, :complete, or :error tuples and call stream_insert to update the UI. The rest is some HTML markup and Tailwind classes in our template to make it look pretty.
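That markup follows LiveView’s usual stream pattern. A rough sketch of the container (the Browser.Timing fields rendered here are illustrative):

<div id="timings" phx-update="stream">
  <div :for={{dom_id, timing} <- @streams.timings} id={dom_id}>
    <%= timing.region %>: <%= timing.status %>
  </div>
</div>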

Going Multi-node

So what about going multi-node? Is it really that easy with the Erlang VM? Let’s take our dev prototype and deploy it on Fly.io:

fly launch

Once our app is deployed, we can scale out to any number of regions. For good geographic coverage, let’s spread things out across the planet:

fly scale count 8 --max-per-region 1 --region bom,fra,gru,hkg,nrt,ord,scl,syd

We use fly scale count to scale our app to 8 regions across the planet, ensuring we only have a single instance per region. We’ll start with Mumbai, Frankfurt, São Paulo, Hong Kong, Tokyo, Chicago, Santiago, and Sydney for nice worldwide coverage.

SSH'ing into any one of our Fly machines will show the Elixir nodes discovered themselves automatically:

fly ssh console --pty --command="/app/bin/wps remote"
Connecting to fdaa:0:36c9:a7b:98:e121:5775:2... complete

iex(worldpagespeed@fdaa:0:36c9:a7b:98:e121:5775:2)1> Node.list()
[:"worldpagespeed-01HW66QY8SHX7RF7S1A3519S4J@fdaa:0:36c9:a7b:e3:a239:a2f1:2",
 :"worldpagespeed-01HW66QY8SHX7RF7S1A3519S4J@fdaa:0:36c9:a7b:b4f1:eef5:616f:2",
 :"worldpagespeed-01HW66QY8SHX7RF7S1A3519S4J@fdaa:0:36c9:a7b:1d7:533d:e9ea:2",
 :"worldpagespeed-01HW66QY8SHX7RF7S1A3519S4J@fdaa:0:36c9:a7b:177:a43d:b898:2",
 :"worldpagespeed-01HW66QY8SHX7RF7S1A3519S4J@fdaa:0:36c9:a7b:fb:a2df:b6a1:2",
 :"worldpagespeed-01HW66QY8SHX7RF7S1A3519S4J@fdaa:0:36c9:a7b:f1:7795:8435:2"]
iex(worldpagespeed@fdaa:0:36c9:a7b:98:e121:5775:2)2> WPS.Members.list()
[
  {"bom",
   %{
     node: :"worldpagespeed@fdaa:0:36c9:a7b:177:a43d:b898:2",
     phx_ref: "F8j_UQe5Z2-nXQDh",
     machine_id: "9185750da6ddd8"
   }},
  {"ord",
   %{
     node: :"worldpagespeed@fdaa:0:36c9:a7b:98:e121:5775:2",
     phx_ref: "F8j_T5qu7dN9JQDh",
     machine_id: "5683566b095058"
   }},
   ...
]

We’re in business. Let’s try it out at https://worldpagespeed.fly.dev/.

It works! Our nodes clustered automatically, and now when our LiveView hits the start_async call, :erpc.multicall will call each node to visit the provided URL. It’s simply message sending and receiving from there.

The code we wrote on our laptop works across the planet now – without changes.

This is great! But your webscale alarm bells might be going off. We’re putting our web UI, APIs, and everything else in the hot path of a very resource-intensive operation. Headless chrome eats hundreds of MB of memory and will churn CPU loading web pages. It will do this concurrently across requests to our app. How can we possibly scale this?

Surely AWS Lambda or Cloudflare Workers have some proprietary APIs to sell us for exactly this task? We’ll need to pay to configure an API gateway of course. We’ll also need to put the results somewhere like SQS, or use SNS to receive updates elsewhere. And pay for those too.

Or we can change two LOC and keep shipping.

Elastic scaling with FLAME

I often tell people that FLAME is remarkable in how unremarkable it is to use in practice. What is effectively an amazing elastic-scale primitive turns into a boring decision.

“Do I want elastic scale here?”

If the answer is yes, you wrap your code in a FLAME.call and you carry on with life. That’s really the beginning and end of the decision process. You aren’t thinking about infrastructure or fractals of AWS glue to get the results back into your UI. You aren’t over-provisioning to sustain bursts, or adopting Kubernetes and microservices to service elastic load.

Let’s make this thing scale:

def timed_nav(%Browser.Timing{} = timing, parent, ref) do
+ FLAME.call(WPS.BrowserRunner, fn ->
    timing = Browser.Timing.loading(timing)
    send(parent, {ref, {:loading, timing}})

    case Browser.time_navigation(timing, @browser_timeout) do
      {:ok, %Browser.Timing{} = timing} ->
        send(parent, {ref, {:complete, timing}})

      {:error, {reason, %Browser.Timing{} = timing}} ->
        send(parent, {ref, {:error, {reason, timing}}})

      {:error, reason} ->
        send(parent, {ref, {:error, {reason, Browser.Timing.error(timing)}}})
    end
+ end)
end

And we’re done.

We wrapped our timed_nav/3 function body in a FLAME.call/2. This will find or launch a fleeting instance of our application whose only job is to run this little slice of our app. Any state our function closes over is sent along to the ephemeral FLAME node. Any messaging we do inside via send(parent, ...) to the parent Just Works across the cluster because of course it does.

The fact that we sent our parent LiveView pid into a multicall, which sent it across the cluster to another node, which took the parent and sent it again to yet another node, isn’t an issue. The message will make it back to the parent by design – it’s what the Erlang VM does.

Hot vs cold starts

We can start our FLAME pools hot or cold. For argument’s sake, let’s say we have scale-to-zero behavior and an infrequently accessed app. What does our cold start time look like? We can update our app to show Starting browser on the local node when the user hits go, which will change to Loading page when the FLAME starts up with headless chrome. Here’s what that looks like:

Fly.io can start a fresh copy of our app in 3-5s, which includes headless chrome. We can also see how resubmitting the form catches the hot runners before they idle down from inactivity.

For this use case, cold starts aren’t an issue, but what if a three-second cold start is too slow for your problem?

To avoid cold starts, we could configure our FLAME pool to always keep a single runner alive. Or we could configure our pool to warm up a number of runners at app start, so our own cold deploys don’t cause users to hit a cold FLAME pool. Those warmed-up runners can then idle down if no work is needed. The configuration to do that looks like this:

{FLAME.Pool,
 name: WPS.BrowserRunner,
 min: 1,
 max: 50,
 max_concurrency: 20,
 min_idle_shutdown_after: :timer.seconds(30),
 idle_shutdown_after: :timer.seconds(30)}

Here we start the app with a single warmed-up BrowserRunner, and up to a maximum of 50 runners that can each service 20 concurrent web page visits. We also configure a 30s idle shutdown for general pool operation, as well as for the warmed-up initial runner. The pool will grow and shrink elastically as load comes and goes.
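For completeness, that tuple is an ordinary child spec, so it sits in the application’s supervision tree next to everything else (same illustrative names as the earlier sketch). FLAME’s docs also show using FLAME.Parent.get() to conditionally skip children like the web endpoint on runner nodes, so the ephemeral copies don’t serve traffic.

# In lib/wps/application.ex (sketch)
children = [
  {Phoenix.PubSub, name: WPS.PubSub},
  {WPS.Tracker, name: WPS.Tracker, pubsub_server: WPS.PubSub},
  WPS.Members,
  {FLAME.Pool,
   name: WPS.BrowserRunner,
   min: 1,
   max: 50,
   max_concurrency: 20,
   min_idle_shutdown_after: :timer.seconds(30),
   idle_shutdown_after: :timer.seconds(30)},
  WPSWeb.Endpoint
]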

Solving the problem vs removing the problem

Just imagine what this would have taken on your FaaS of choice.

To achieve elastic scale with “only pay for what you use!” pricing, you would stand up separate deployment pipelines for various lambdas, configure SNS or SQS queues, and write the glue to get results back into your application.

You would also need to configure god knows what kind of availability zones and knobs to run your functions at the edge.

Or you could use FLAME and remove these problems entirely. With Elixir and FLAME you write the code and carry on with life. If you need to scale, you put it into a FLAME and start shipping your next features. FLAME machines on Fly.io launch by default in the same region as their parent, because of course they do.

It’s really as simple as that.

My ElixirConfEU talk covers the motivations for FLAME, as well as a few other demos that are worth checking out.

You can also read more about FLAME in the original post, or check out the documentation for integration in your own applications.

Happy hacking!

–Chris

Fly.io ❤️ Elixir

Fly.io is a great way to run your Phoenix LiveView apps. It’s really easy to get started. You can be running in minutes.

Deploy a Phoenix app today!