Graceful VM exits, some dials

Author

Name: Michael Dwan
@michaeldwan

Fly.io transforms containers into swarms of fast-booting VMs and runs them close to users. You can now delay VM shutdowns by up to 24 hours to let overly attached clients finish their work.

Fly apps are typically fast to boot, and it’s relatively easy to boot new VMs. We start them up, do some health checks, and then add them to our load balancer and DNS service discovery. But what comes up must go down. We shut VMs down for any number of reasons – new deploys, or scaling, or for maintenance on underlying hardware.

By default, we send a SIGINT to tell a VM it’s time to go away. Then we wait 5 seconds and, if the VM is still running, we forcefully terminate it. This works fine most of the time, especially for application servers, but some work takes longer to clean up. A live video streaming service may have users (often teenagers) connected for hours at a time. Or a database server might have in-flight transactions to commit before terminating.
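On the application side, making use of that window means catching the signal and winding down on purpose. Here is a minimal Python sketch of one way to do it. The names `handle_shutdown`, `install_handler`, and `serve` are made up for illustration, and the job loop stands in for whatever your app actually does.

```python
import signal
import threading

# Set once the platform has asked this process to shut down.
shutdown_requested = threading.Event()


def handle_shutdown(signum, frame):
    # Keep the handler tiny: record the request and let the main loop clean up.
    shutdown_requested.set()


def install_handler(signum=signal.SIGINT):
    # SIGINT is the default signal described above.
    # signal.signal() must be called from the main thread.
    signal.signal(signum, handle_shutdown)


def serve(jobs):
    """Run jobs in order; once shutdown is requested, start no new ones.

    Returns the results of the jobs that ran.
    """
    results = []
    for job in jobs:
        if shutdown_requested.is_set():
            break
        results.append(job())
    return results
```

Call `install_handler()` once at startup. Whatever cleanup remains after `serve` returns has to fit inside the time left before the forced termination.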

Keeping processes alive longer is a boring, simple way to solve these kinds of problems. So you can now tell us to keep VMs for up to 24 hours after we send a shutdown signal. And you can also specify what signal we send (because signal handling is wildly inconsistent). Just add these options to your fly.toml configuration file:

kill_timeout: Number of seconds to wait before killing a VM. Shared CPU VMs allow up to 5 minutes and dedicated CPU VMs allow up to 24 hours. The default is 5 seconds.
kill_signal: defaults to SIGINT. Also accepts SIGTERM, SIGQUIT, SIGUSR1, SIGUSR2, SIGKILL, or SIGSTOP.
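
If you want to sanity check these values before a deploy, the limits above are easy to encode. This is a hypothetical helper, not part of any Fly tooling. The `"shared"` and `"dedicated"` labels and the function name are assumptions made for the sketch; the numbers come straight from the option descriptions.

```python
# Signals accepted by kill_signal, per the option description above.
ALLOWED_SIGNALS = {
    "SIGINT", "SIGTERM", "SIGQUIT", "SIGUSR1", "SIGUSR2", "SIGKILL", "SIGSTOP",
}

# Maximum kill_timeout in seconds for each CPU kind (labels are illustrative).
MAX_KILL_TIMEOUT = {
    "shared": 5 * 60,            # 5 minutes
    "dedicated": 24 * 60 * 60,   # 24 hours
}

DEFAULT_KILL_TIMEOUT = 5
DEFAULT_KILL_SIGNAL = "SIGINT"


def check_shutdown_settings(cpu_kind,
                            kill_timeout=DEFAULT_KILL_TIMEOUT,
                            kill_signal=DEFAULT_KILL_SIGNAL):
    """Return a list of problems with the settings; an empty list means OK."""
    if cpu_kind not in MAX_KILL_TIMEOUT:
        raise ValueError(f"unknown cpu_kind: {cpu_kind!r}")
    problems = []
    limit = MAX_KILL_TIMEOUT[cpu_kind]
    if kill_timeout > limit:
        problems.append(
            f"kill_timeout {kill_timeout}s exceeds the {limit}s limit "
            f"for {cpu_kind} CPU VMs"
        )
    if kill_signal not in ALLOWED_SIGNALS:
        problems.append(f"kill_signal {kill_signal!r} is not accepted")
    return problems
```

For example, `check_shutdown_settings("shared", 3600)` reports one problem, because an hour is longer than the 5 minutes a shared CPU VM allows.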

Run apps with long lived connections

Launch your Docker apps on Fly and we’ll keep them alive while your users finish what they’re doing.

Example: drain long lived TCP connections

HAProxy is an open source project for load balancing TCP and HTTP connections. Most HTTP requests are fast, but you might also run HAProxy to handle large user uploads or load balance across databases.

By default, a SIGINT causes an HAProxy server to immediately close all connections and shut down, aka “hard stop”. If you’d rather cleanly drain connections instead of serving errors, you can use the “soft stop” mode and specify a long kill timeout.

    # stop accepting new connections while existing connections drain
    kill_signal = "SIGUSR1"
    # allow 1 hour for all connections to finish before killing the server
    kill_timeout = 3600
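
If you are writing your own server rather than running HAProxy, the same soft stop idea is simple to sketch: refuse new connections, then wait for the active ones to finish or for the deadline to pass. The Python below is an illustration under that assumption, not HAProxy's implementation, and the `Drainer` class and its method names are hypothetical.

```python
import threading


class Drainer:
    """Tracks active connections and supports a soft stop with a deadline."""

    def __init__(self):
        self._lock = threading.Lock()
        # The condition shares the lock, so state changes and waits stay in sync.
        self._idle = threading.Condition(self._lock)
        self._active = 0
        self._accepting = True

    def try_accept(self):
        """Register a new connection; returns False once draining has begun."""
        with self._lock:
            if not self._accepting:
                return False
            self._active += 1
            return True

    def release(self):
        """Mark one connection as finished."""
        with self._idle:
            self._active -= 1
            if self._active == 0:
                self._idle.notify_all()

    def soft_stop(self, timeout):
        """Stop accepting, then wait up to `timeout` seconds for a full drain.

        Returns True if every connection finished in time, False otherwise.
        """
        with self._idle:
            self._accepting = False
            return self._idle.wait_for(lambda: self._active == 0, timeout=timeout)
```

A signal handler for your chosen `kill_signal` would trigger `soft_stop` with a timeout a little shorter than `kill_timeout`, so the process can exit on its own before it is killed.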

  

Example: gracefully shut down a database server

Postgres responds to the SIGINT signal (our default) by immediately aborting open transactions and closing all connections. This is called the “fast shutdown” mode; it discards uncommitted work and causes application errors. Instead, you can now use the “smart shutdown” mode by sending SIGTERM and giving Postgres five minutes to let sessions finish and commit their transactions.

    # stop accepting new connections while existing sessions complete
    kill_signal = "SIGTERM"
    # allow 5 minutes to cleanly shut down
    kill_timeout = 300

  

Shared infrastructure and long lived connections

Modern cloud infrastructure forces a lot of application compromises, especially when you’re sharing infrastructure. Most cloud function and container hosting products sit behind layers of shared services, each needing frequent releases to keep them humming along. Releases are disruptive, especially for software that proxies user connections to arbitrary containers.

Providers simplify their lives by limiting what customer containers can do – they might only serve HTTP, for example, or have to implement a custom event handler. If an app can only speak HTTP and has to complete every request within 30 seconds, it’s very simple to roll out new proxy releases.

This is silly, but we’re not immune. We run a global load balancer service in front of Fly apps. When you use it for HTTP or TCP connections, our releases can disrupt in-flight connections. We do as much as we can to minimize the impact and drain connections over a period of minutes when necessary.

Some apps need us to get the heck out of the way. We’ve built our plumbing specifically to allow this. You can opt out of our HTTP and TLS handlers, for example. And if you run a UDP service, our load balancing is entirely stateless. And we have experimental stateless TCP load balancing! If you have an app that needs to keep connections alive as long as possible, let us know and we’ll help you try it out.

Do you want to know more? Or have an idea? We’ve got a community forum just for you.
