The Serverless Server

An anthropomorphic bird wearing a vest, shocked to be revealed, standing alongside a large computer-like console, by the dramatic opening of emerald-green curtains. It's reminiscent of The Wizard of Oz but with slightly less-steampunk machinery. And, you know, a bird.
Image by Annie Ruygt

I’m Will Jordan, and I work on SRE at Fly.io. We transmogrify Docker containers into lightweight micro-VMs and run them on our own hardware in racks around the world, so your apps can run close to your users. Check it out—your app can be up and running in minutes. This is a post about how services like ours are structured, and, in particular, what the term “serverless” has come to mean to me.

Fly.io isn’t a “Gartner Magic Quadrant” kind of company. We use terms like “FaaS” and “PaaS” and “serverless”, but mostly to dunk on them. It’s just not how we think about things. But the rest of the world absolutely does think this way, and I want to feel at home in that world.

I think I understand what “serverless” means, so much so that I’m going to stop putting quotes around the term. Serverless is a magic spell. Set a cauldron to boil. Throw in some servers, a bit of code, some eye of newt, and a credit card. Now behold, a bright line appears through your application, dividing servers from services…and, look again, now the servers have disappeared. Wonderful! Servers are annoying, and services are just code, the same kind of code that runs when we push a button in VS Code. Who can deny it: “No server is easier to manage than no server.”

But, see, I work on servers. I’m a fan of magic, but I always have a guess at what’s going on behind the curtain. Skulking beneath our serverless services are servers. The work they’re doing isn’t smoke and mirrors.

Let’s peek behind the curtain. I’d like to perform the exercise of designing the most popular serverless platform on the Internet. We’ll see how close I can get. Then I want to talk about what the implications of that design are.

Close your eyes, tap your keyboard three times and think to yourself, “There’s no place like us-east-1”.

Let’s Start Building

The year is 2014, and the buzzword “elastic” is still on-trend. Our goal: liberate innocent app developers from the horrors of server management, abstracting services written in Python or JavaScript from the unruly runtimes they depend on. You’ll give us a function, and we’ll run it.

Once this is invented, you’ll probably want to use it to optimize sandwich photos uploaded by users of your social sandwich side project.

Elasticsearch, the company, would shorten its name to Elastic in 2015, marking the peak of this fad.

The first tool in our toolbox is the virtual machine. VMs were arguably “serverless” avant la lettre, and Lambda itself literally stood on the shoulders of EC2, so that’s where we’ll begin.

Take a big, bare-metal x86 server sitting in a datacenter with all the standard hookups. Like every server, it has an OS. But instead of running apps on that OS, install a Type 1 (bare-metal) hypervisor, like the open-source Xen.

The hypervisor is itself like a tiny operating system, but it runs guest OSs the way a normal OS would run apps. Each guest runs in a facsimile of a real machine; when the guest OS tries to interact with hardware, the hypervisor traps the execution and does a close-up magic trick to maintain the illusion. It seems complicated, but in fact the hypervisor code can be made a good deal simpler than the OSs it runs.

Now give that hypervisor an HTTP API. Let it start and stop guests, leasing out small slices of the bare metal to different customers. To the untrained eye, it looks a lot like EC2.
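
A minimal sketch of what that control API might look like, assuming a small Python daemon that shells out to Xen’s xl toolstack. The endpoints and the /etc/xen config layout are invented for illustration, not a real EC2 internal.

```python
# A toy control plane for our Xen host: HTTP in, `xl` commands out.
# The endpoints and the /etc/xen config layout are invented for illustration.
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class GuestAPI(BaseHTTPRequestHandler):
    def do_POST(self):
        # POST /guests/<name> boots a guest from a prebuilt Xen config file.
        name = self.path.rsplit("/", 1)[-1]
        result = subprocess.run(["xl", "create", f"/etc/xen/{name}.cfg"])
        self.send_response(201 if result.returncode == 0 else 500)
        self.end_headers()

    def do_DELETE(self):
        # DELETE /guests/<name> tears the guest down, freeing its slice of the box.
        name = self.path.rsplit("/", 1)[-1]
        result = subprocess.run(["xl", "destroy", name])
        self.send_response(204 if result.returncode == 0 else 500)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), GuestAPI).serve_forever()
```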

Even back in 2014, EC2 was boring. What we want is Lambda: we want to run functions, not a guest OS. We need a few more components. Let’s introduce some additional characters:

  • The Placement service, with an API, that can start and stop VMs across a pool of servers.
  • The Manager is a service with an API that tracks the VMs — we’ll start calling them Workers — running across that pool, and can tell us how to reach them.
  • The Frontend handles requests for things our Workers will actually do.
  • The function is the code the customer wants us to run. For your sandwich app, the function resizes and optimizes an image, and sends it to an S3 bucket (there’s a sketch of one right after this list).
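
For concreteness, here’s roughly what the customer’s half of the deal might look like: a Lambda-style Python handler that crunches a sandwich photo with Pillow and writes it back to S3. The bucket names, event shape, and sizes are all made up.

```python
# A hypothetical image-optimizing function for the sandwich app.
# Bucket names, the event shape, and sizes are placeholders.
import io
import boto3
from PIL import Image

s3 = boto3.client("s3")

def handler(event, context):
    # The event tells us which freshly-uploaded photo to crunch.
    bucket = event["bucket"]
    key = event["key"]

    original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    image = Image.open(io.BytesIO(original)).convert("RGB")  # JPEG can't do alpha
    image.thumbnail((1024, 1024))  # shrink in place, preserving aspect ratio

    out = io.BytesIO()
    image.save(out, format="JPEG", quality=80, optimize=True)
    out.seek(0)

    s3.put_object(Bucket="sandwich-photos-optimized", Key=key, Body=out)
    return {"optimized": key, "bytes": out.getbuffer().nbytes}
```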

The Frontend reads an Invoke request for a function we want to run. (Someone’s just uploaded an image to your S3 sandwich bucket through your app.) Frontend asks a Manager to provide the network address of a Worker VM containing an instance of your function, where it can forward the request. The Manager either quickly returns an existing idle function instance, or if none are currently available, asks a Placement service to create a new one.

This is all easier said than done. For instance, we don’t want to send multiple requests racing toward a single idle instance, and so we need to know when it’s safe to forward the next request. At the same time, we need Manager to be highly available; our Manager can’t just be a Postgres instance. Maybe we’ll use Paxos or Raft for strongly-consistent distributed consensus, or maybe gossiping load and health hints will be more resilient.
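
One way to keep requests from racing each other is to have the Manager hand out exclusive leases on idle instances. Here’s a deliberately naive, single-node sketch; the hard part it skips (replication, health checks, lease expiry) is exactly what the Paxos-or-gossip question above is about.

```python
# A deliberately naive Manager: hands out exclusive leases on idle function
# instances so two requests never race toward the same one. A real version
# needs replication, health checks, and lease expiry, which is the hard part.
import threading
from collections import defaultdict, deque

class Manager:
    def __init__(self):
        self._lock = threading.Lock()
        self._idle = defaultdict(deque)  # function name -> idle instance addresses

    def checkout(self, function):
        """Claim an idle instance, or return None so the caller asks Placement."""
        with self._lock:
            idle = self._idle[function]
            return idle.popleft() if idle else None

    def checkin(self, function, address):
        """Mark an instance idle again once its request finishes."""
        with self._lock:
            self._idle[function].append(address)
```

The Frontend’s loop against this is checkout, forward the request, checkin; a None means a trip to Placement for a fresh instance.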

We can straightforwardly run a function instance on a Worker VM. But we can’t just use any old VM; we can’t trust a shared kernel with multitenant workloads. So: give each customer its own collection of dedicated EC2-instance Workers. Have Placement bin-pack function instances onto them. Boot up new Workers as needed.

Another catch: it takes seconds or even minutes to boot a new Worker. This means some of our requested functions have unacceptably (and unpredictably) high “cold start” time. (Imagine, in 2022, holding on to your excitement for over a minute waiting for your image of the local sandwicherie’s scorpion-pepper grilled cheese to insert itself into your chat.) Have Placement manage a “warm pool” of running VMs, shared across all customers. Now functions can scale up quickly. To scale down, Manager periodically vacuums idle VMs, returning them to the warm pool for reuse.

Scale is our friend. We have lots of customers, so the warm pool smooths out unpredictable workloads, reducing the total number of EC2 instances we need. But we’re not out of the woods yet. We can get huge spikes of consumption: say, an accidentally-recursive function. One broken customer brings everyone else back to cold-start latency. The easiest fix: soft limits (“contact us if you need more than 100 concurrent executions”). Beyond that, the service could adopt a token bucket rate-limiting mechanism to allow a controlled amount of sustained/burst scaling per customer or function.
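
The token bucket itself is simple enough to sketch. The refill rate and burst numbers here are invented for illustration, not Lambda’s real limits.

```python
# Per-customer token bucket: a steady refill rate plus a burst allowance.
# The numbers are illustrative, not Lambda's real limits.
import time

class TokenBucket:
    def __init__(self, rate_per_sec=500.0, burst=1000.0):
        self.rate = rate_per_sec   # sustained scale-ups per second
        self.capacity = burst      # how far a spike can get ahead of the rate
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill for the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # throttle: queue the scale-up or return a 429
```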

We’ve sketched most of orchestration, but hand-waved the actual function invocation. It’s not all that complicated, though.

Once Placement allocates enough resources on a Worker, it can load up the function instance there. Remember, it’s still 2014, and Docker only just became production-ready, so we’ll roll our own container the old-fashioned way. A daemon on the Worker VM (sketched in code after this list):

  • handles the function initialization request,
  • fetches the application code .zip file from object storage (S3),
  • unpacks it on top of a ready-made runtime environment,
  • launches the function-handler process in a chroot,
  • drops privileges,
  • uses namespaces and seccomp profiles to run in Docker-like incarceration,
  • enforces configured CPU and memory resource limits with cgroups,
  • uses the cgroup freezer to ensure that idle functions consume no resources outside of active requests proxied to the function instance.
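
Stitched together, the daemon’s setup path looks roughly like this sketch. It needs root; namespace and seccomp setup are elided, and the bucket name, paths, uid/gid, and memory limit are all placeholders.

```python
# Roughly the setup path the Worker daemon walks for a new function instance.
# Needs root; namespace and seccomp setup are elided, and the bucket name,
# paths, uid/gid, and memory limit are placeholders. Assumes the rootfs under
# `root` already ships the language runtime, and a cgroup v2 layout.
import os
import zipfile
import boto3

def launch_instance(function_id, code_key, mem_bytes=128 * 1024 * 1024):
    root = f"/srv/instances/{function_id}"  # per-instance root filesystem
    os.makedirs(root, exist_ok=True)

    # 1. Fetch the customer's .zip and unpack it on top of the runtime image.
    boto3.client("s3").download_file("function-code", code_key, "/tmp/code.zip")
    with zipfile.ZipFile("/tmp/code.zip") as z:
        z.extractall(os.path.join(root, "var/task"))

    # 2. Give the instance its own cgroup with a memory ceiling.
    cg = f"/sys/fs/cgroup/functions/{function_id}"
    os.makedirs(cg, exist_ok=True)
    with open(os.path.join(cg, "memory.max"), "w") as f:
        f.write(str(mem_bytes))

    pid = os.fork()
    if pid == 0:
        # Child: join the cgroup (while /sys is still visible), jail ourselves
        # in the chroot, drop privileges, and exec the handler process.
        with open(os.path.join(cg, "cgroup.procs"), "w") as f:
            f.write(str(os.getpid()))
        os.chroot(root)
        os.chdir("/")
        os.setgid(1000)
        os.setuid(1000)
        os.execv("/usr/bin/python3", ["python3", "/var/task/bootstrap.py"])
    return pid
```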

Google Cloud Functions originally didn’t freeze its function instances and only billed for function execution, so you could run a Bitcoin miner in a background process on an idle function without paying a dime.

Iterating On The Design

We’ve come up with a relatively naive design for Lambda. That’s OK! We’re Amazon and we can paper over the gaps with money and still have enough left over to make hats. More importantly, we’re out in front of customers, and we can start learning.

Fast forward to 2018. We made it. “Serverless” is the new “elastic” and it’s all the rage. Now let’s make it fast.

What’s killing us in our naive design is Xen. Xen is a bare-metal hypervisor designed to run arbitrary operating systems on arbitrary hardware. But our customers don’t want that. They’re perfectly happy running arbitrary Linux applications on a specific, simplified Linux configuration.

Enter Firecracker.

Firecracker is a modern hypervisor built on KVM that exploits paravirtualization: the guest and the hypervisor are aware of each other, and cooperate. Unlike Xen, we don’t emulate arbitrary devices; instead we expose virtio devices designed to be efficient to implement in software. With no wacky device support, we lose hundreds of milliseconds of boot-time probing. We can be up and running in under 125ms.

Firecracker can fit thousands of micro-VMs on a single server, paying less than 5MB per instance in memory.
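
Booting one of these micro-VMs is a handful of calls against Firecracker’s API socket. Here’s a rough Python sketch; the kernel, rootfs, and socket paths are placeholders, and the exact required fields vary a bit between Firecracker versions.

```python
# Booting a Firecracker micro-VM over its API socket: size the machine,
# point it at a kernel and rootfs, pull the trigger. Paths are placeholders.
import json
import socket
import http.client

SOCKET_PATH = "/tmp/firecracker.socket"

class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client, but over a Unix domain socket instead of TCP."""
    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)

def api_put(path, body):
    conn = UnixHTTPConnection(SOCKET_PATH)
    conn.request("PUT", path, json.dumps(body), {"Content-Type": "application/json"})
    resp = conn.getresponse()
    resp.read()
    conn.close()
    assert resp.status < 300, f"{path}: {resp.status} {resp.reason}"

api_put("/machine-config", {"vcpu_count": 1, "mem_size_mib": 128})
api_put("/boot-source", {"kernel_image_path": "/srv/vmlinux",
                         "boot_args": "console=ttyS0 reboot=k panic=1"})
api_put("/drives/rootfs", {"drive_id": "rootfs",
                           "path_on_host": "/srv/rootfs.ext4",
                           "is_root_device": True,
                           "is_read_only": False})
api_put("/actions", {"action_type": "InstanceStart"})
```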

This has profound implications. Before, we were carefully stage-managing how function instances made their way onto EC2 VMs, and the lifecycle of those EC2 VMs. But now, function instances can potentially just be VMs. It’s safe for us to mix up tenants on the same hardware.

We can oversubscribe.

Oversubscription is a way of selling the same hardware to many people at once. We just bet they won’t all actually ask to use it at the same time. And, at scale, this works surprisingly well. The trick: get really good at spreading load across machines to keep resource usage as uncorrelated as possible. We want to maximize server usage, but minimize contention.
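
Back-of-the-envelope, with invented numbers: sell 640 single-vCPU instances against 64 physical cores, and as long as each instance is busy independently about 7% of the time, demand almost never exceeds what the box can actually deliver. Correlation is what breaks this.

```python
# Back-of-the-envelope oversubscription, with invented numbers: 640 one-vCPU
# instances sold against 64 physical cores, each busy independently 7% of
# the time. How often does demand actually exceed the hardware?
import random

CORES, INSTANCES, BUSY_P, TRIALS = 64, 640, 0.07, 20_000

hot = 0
for _ in range(TRIALS):
    demand = sum(random.random() < BUSY_P for _ in range(INSTANCES))
    if demand > CORES:
        hot += 1

print(f"{INSTANCES / CORES:.0f}x oversubscribed, "
      f"contention in {100 * hot / TRIALS:.2f}% of samples")
```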

Firecracker lets us spread load more evenly, because we can run thousands of different customers on the same server.

Our Workers are now bare-metal servers, not EC2 VMs. We need a warm pool of them, too. It’s a lot of extra micro-management. And it’s worth it. The resource-sharing shell game is way more profitable. Reportedly, Lambda runs in production with CPU and memory oversubscription ratios as high as 10x. Not too shabby!

There’s a tradeoff to this. We’ve aggressively decorrelated our server workloads, shuffling customers onto machines like suits in a deck of cards. But now we can’t share memory across function instances the way a classic pre-forking web server shares it across its workers.

On a single server, a function with n concurrent executions might consume only slightly more memory than a single function. Shuffled onto n machines, those executions cost n times more. Plus, on the single server, instances can fork instantly from a parent, effectively eliminating cold-start latency.

And now we have a network-sized hole in performance. Functions are related; they’re intrinsically correlated. Think about serverless databases, or map-reduce functions, or long chains of functions in a microservice ensemble. What we want is network locality, but we also want related loads spread across different hardware to minimize contention. Our goals are in tension.

So some functions might perform best packed tightly to optimize performance, while others are best spread thin for more distributed resource usage across servers. Some kind of hinting along the lines of EC2 placement groups could help thread the needle, but it’s still a hard problem.

At any rate, we have a design, and it works. Now let’s start thinking about the ramifications of the decisions we’ve made so far, and the decisions that we have yet to make.

Roll your own FaaS

Fly Machines are Firecracker VMs with a fast REST API that can boot instances in about 300ms, in any region supported by Fly.io. Care to craft your own twist on serverless?

Learn more

Ramifications for Concurrency

Lambda’s one-request-per-instance concurrency model is simple and flexible: each function instance handles a single request at a time. More load, more instances.

This works like the Common Gateway Interface (CGI) of yore, or, more precisely, like implementations of its successor FastCGI, which reuse instances across requests.

Scaling is simple and straightforward. Each request is handled in its own instance, separate from all other concurrent requests. No locks, no thread safety, no other parallel-programming concerns.

But handling concurrent requests in a single instance can be more efficient, especially for high-performance web application servers that can leverage asynchronous I/O event loops and user-space threads to minimize context-switching overheads. Google’s Cloud Run product supports configurable per-instance concurrency. Lambda’s design makes it harder for us to pull off tricks like that.

Ramifications for Pricing

If we’re Lambda, we bill per-second duration based on the memory you provision, with a per-request surcharge; like a taxi meter, we have a base fee, and then the meter ticks up as long as we’re working.
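
A worked example of the meter, using illustrative rates in the ballpark of Lambda’s published x86 pricing (check the real price sheet before trusting these numbers):

```python
# Taxi-meter math for a month of the image cruncher. The rates are
# illustrative (roughly Lambda's published x86 prices; check the real
# price sheet before trusting them).
GB_SECOND = 0.0000166667    # $ per GB-second of duration
PER_REQUEST = 0.20 / 1e6    # $ per request

def monthly_cost(requests, avg_seconds, memory_gb):
    duration = requests * avg_seconds * memory_gb * GB_SECOND
    surcharge = requests * PER_REQUEST
    return duration + surcharge

# 5M sandwich photos a month, ~800ms each, at 512MB:
print(f"${monthly_cost(5_000_000, 0.8, 0.5):,.2f} / month")
```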

Two ways of looking at the request fee. First, it’s a fudge factor representing the aggregate marginal costs of the various backends involved in handling the request.

But if you’re an MBA, it’s also a way to shift to “pay-for-value” or value-based pricing, a founding tenet of Lambda. Value pricing says that you pay based on how useful the service is; if we figure out ways to deliver the service more cheaply, that’s gravy for us. Without the surcharge, we’re doing cost-plus pricing. You’d just pay for the resources we allocated to you.

(Remember, we’re up to 10x oversubscribed. Customers are, on average, utilizing only 10% of the resources they pay for.)

We combine CPU and memory into a single duration-based price. Simple is good, but it costs our users flexibility if they have lopsided CPU-heavy or memory-heavy functions. For that problem, there’s Fargate, Lambda’s evil twin.

This pricing seems simple! But it’s actually a little bit complicated, if you are sensitive to cost.

Your image-cruncher function might be making good use of its resources for most of its running time. But what if a function is really fast? It might skew cheap in resources and expensive in requests.

And now, you’ve added a function to periodically scrape the major socials for new pictures tagged with any sandwich, artisanal sandwich stockist, or vending machine known to your database. Or, better, say you’re Max Rozen, doing uptime checks on every endpoint in your database. Now you’re paying full whack for CPU and RAM usage the whole time you wait (up to 10s) for a response from each one, to, you know, see if it’s online.
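
Run the same meter (same illustrative rates as before) over those two workloads and the skew is easy to see:

```python
# Same meter, same illustrative rates, two less flattering workloads.
GB_SECOND = 0.0000166667
PER_REQUEST = 0.20 / 1e6

def cost(requests, avg_seconds, memory_gb):
    return requests * (avg_seconds * memory_gb * GB_SECOND + PER_REQUEST)

# A 5ms function at 128MB: the request surcharge dwarfs the compute.
fast_total = cost(10_000_000, 0.005, 0.125)
fast_requests = 10_000_000 * PER_REQUEST
print(f"fast function: ${fast_total:,.2f}, of which ${fast_requests:,.2f} is requests")

# Uptime checks that mostly sit waiting up to 10s on slow endpoints:
# you pay for all of that idle wall-clock time.
print(f"uptime checker: ${cost(1_000_000, 10.0, 0.125):,.2f} for mostly waiting")
```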

The value-based pricing here hits the sweet spot for functions that a) run long enough per request to amortize the request cost, and b) make enough use of the provisioned resources, while they run, to justify paying for them that long.

Prioritizing nimble scaling, combined with instance-per-request and per-request billing, does set up a potential footgun for our customers. Don’t DDoS yourself.

We’re counting on the product as a whole to add enough value to keep less price-sensitive customers coming back, even far from the sweet spot.

Ramifications for APIs

The public runtime API to a Lambda function is the Invoke REST API, which accepts a POST request specifying the function name and request “payload”, and requires a signature with appropriate AWS credentials. This conforms to Amazon’s monolithic, internally-mandated API structure, but it’s practically unusable outside the API-wrangling comfort of the AWS SDK.
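
In practice, that means even a “hello world” invocation goes through SigV4 signing and an SDK; from Python with boto3, for example (the function name and payload are made up):

```python
# Calling the Invoke API the blessed way: boto3 does the SigV4 signing and
# endpoint wrangling for you. The function name and payload are made up.
import json
import boto3

client = boto3.client("lambda", region_name="us-east-1")

response = client.invoke(
    FunctionName="optimize-sandwich-photo",
    InvocationType="RequestResponse",  # synchronous invoke
    Payload=json.dumps({"bucket": "sandwich-photos", "key": "blt.jpg"}),
)
print(json.loads(response["Payload"].read()))
```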

A cottage industry has sprung up around frameworks just to help you hook Lambda up to the web. Amazon built one of them into CloudFormation. Problem: too much YAML. Solution: more YAML.

The way out is embarrassingly simple: the runtime API can just pass HTTP requests directly to the function instance. Most of what “API gateways” do can be built into HTTP proxy layers. For the common case of web applications, an HTTP-based API eliminates a layer of indirection and plugs in nicely with the mature ecosystem of web utilities.

Ramifications for Resilience

Lambda’s execution environment sets strict limits:

  • on function initialization (10 seconds)
  • on Invoke duration (default 3 seconds; limit originally 60 seconds, later increased to 5 and then 15 minutes), and
  • zero guarantees around idle-function lifecycle (a function instance could get shut down any time it’s not handling a request, and will be shut down once every 14 hours).

This tightly-scoped lifecycle is great for the platform provider. It helps workloads quickly migrate away from overloaded or unhealthy instances, and makes it easy to shuffle functions around during server maintenance and upgrades without impacting services. And what’s good for the platform is probably good for most customers, too!

But it’s not ideal for apps

  • with expensive or time-consuming initialization steps
  • or that depend heavily on dynamic local caches for performance
  • or when you’re just not sure how long a response might take.

One option is for the platform to try to keep servers up and running forever, but sometimes you just have to reboot servers to patch stuff. Another is live migration: sending a snapshot of the running VM over the network to a new server with as little downtime as possible. Google Compute Engine supports live migration for its instances and uses the feature to seamlessly conduct maintenance on its servers every few weeks.

Despite the simple runtime interface, Lambda functions run in a full Linux runtime environment that lets you run your own x86 executables on the platform, which gives your application all of POSIX to play with.

If your apps can make do with less, a “language sandbox” can offer some isolation without the overhead of virtualization. Google App Engine adopted this with tuned language runtimes that disabled networking and filesystem writes by disabling or customizing Python modules and restricting Java class usage. Cloudflare Workers take a similar approach with the V8 runtime, wrapping JavaScript code in ‘isolates’ and offering a restricted set of runtime APIs modeled loosely after the browser’s JavaScript APIs.

WebAssembly extends the language-sandbox approach with a virtual instruction set architecture, either embedded within V8 isolates or run by a dedicated server-side runtime like wasmtime.

Fastly built its Compute@Edge product around WebAssembly/WASI. However, WASI is still young and evolving quickly. On the server side, WASM’s overhead doesn’t pay its freight: there’s as much as a 50% performance gap between WASM and native code, which makes virtualization look cheap by comparison.

How did I do?

I just designed a shameless knockoff of Lambda, the most popular specimen of the most serverless of serverless services: a fleeting scrap of compute you can will into being, that scales freely (not in the monetary sense) and fades into oblivion when it’s no longer needed.

This article contains no small degree of bias! There’s also no small degree of appreciation for the craft that goes on behind the curtain at AWS and other purveyors of “serverless” services.