Seamless Deployments on Fly.io
Zero-downtime deploys aren’t magic. They’re just health checks that work. Fly.io apps run on individual VMs, and we won’t start routing traffic to a new one until it proves it’s alive. Here’s how to get seamless deploys without breaking things or relying on hope as a strategy.
Your deploy is only as good as your health checks
Fly.io deployments look like magic when health checks are working. New Machines boot. Old ones go down. Users never notice. But behind the scenes, the Fly Proxy is watching your Machines and only sending them traffic if they pass your health checks.
Health checks are how you tell the platform that your app is alive. If they’re missing or broken, you’re flying in the dark.
In your fly.toml, you’ll usually define health checks under http_service.checks (or the alternative services.http_checks if you’re not using the http_service shortcut) for HTTP-based apps. These checks are simple: make a request to a path like /, expect a 200 OK response, and fail if you get a redirect or timeout. If your app redirects HTTP to HTTPS, add a header like X-Forwarded-Proto = "https" to avoid false failures.
The most common health check failure modes are easy to trip over:
- getting 301/302s instead of 200 responses
- hitting a path that loads slowly under cold start conditions
- forgetting to delay checks with a grace_period after boot
A typical HTTP health check might look like this:
[[http_service.checks]]
interval = "15s"
timeout = "2s"
grace_period = "10s"
method = "GET"
path = "/healthz"
headers = { X-Forwarded-Proto = "https" }
For apps that aren’t serving HTTP, or where you want lighter checks, TCP health checks exist too. These just attempt to open a socket to the port you’ve configured; if the connection is accepted, the Machine is healthy.
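A minimal sketch, assuming your app is defined with a [[services]] block rather than the http_service shortcut; the port is illustrative:
[[services]]
internal_port = 8080   # illustrative port
protocol = "tcp"

[[services.tcp_checks]]
interval = "15s"
timeout = "2s"
grace_period = "5s"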
Top-level health checks—defined in the [checks] section of your fly.toml—are for observability, not traffic control. They don’t influence routing; if a Machine fails one, we won’t pull it out of rotation. That makes them ideal for internal monitoring and alerting, especially for background workers or other non-public-facing services where you still want to know if something’s gone sideways, but don’t need Fly to reroute traffic automatically.
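A sketch of one such check; the check name (status), port, and path here are placeholders for your own:
[checks]
[checks.status]
type = "http"
port = 8080
interval = "15s"
timeout = "10s"
grace_period = "30s"
method = "GET"
path = "/healthz"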
Deployments get another layer: machine_checks. These aren’t about routing traffic; they’re about catching bad deploys before they go live. A machine_check boots a throwaway Machine, sets FLY_TEST_MACHINE_IP, and lets you poke it however you want. You can run integration tests, send simulated traffic, check database access, or validate config. Fail the check, and the deploy halts. This is perfect for canary testing new releases.
Example:
[[services.machine_checks]]
command = ["/bin/bash", "-c", "curl -f http://$FLY_TEST_MACHINE_IP:8080/healthz"]
interval = "30s"
timeout = "5s"
Note: machine_checks run in a temporary Machine with the new image before it’s deployed anywhere else. They’re isolated, so they won’t affect your running app—but they can save you from pushing broken code into production.
If a Machine fails a health check, Fly Proxy pulls it out of the rotation. The Machine keeps running and logging, and you can still debug it, but it’s invisible to your users.
Deployment strategies: choose your own trade-offs
With working health checks and at least two Machines, you can avoid downtime. The platform won’t kill a healthy Machine until a new one is up and running.
Fly.io supports a few deployment strategies in the [deploy] section of fly.toml (an example block follows the list):
- Rolling (default): Replace Machines one at a time. Use max_unavailable to control how many go down at once (e.g., 1 or 25%).
- Immediate: Burn the ships. All Machines update at once, no health checks, hope for the best.
- Canary: Start with one Machine. If it’s healthy, continue with rolling. This can’t be used with attached volumes.
- Bluegreen: Boot new Machines alongside old ones. Only switch traffic after all new Machines pass checks. Fastest, safest, but also can’t use attached volumes.
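For example, a rolling deploy that takes down at most one Machine at a time might look like this:
[deploy]
strategy = "rolling"
max_unavailable = 1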
You can also define a release_command to run before any Machines update. It gets a fresh Machine and your new image, but no volume. Use it for migrations or other one-off prep work.
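A sketch, assuming a Rails-style app where bin/rails db:migrate is the migration to run (swap in your framework’s equivalent):
[deploy]
release_command = "bin/rails db:migrate"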
Every deploy also includes a smoke check: Fly watches Machines for ~10 seconds after they start. If they crash repeatedly, the deploy fails.
Don’t forget to set wait_timeout if your image is big or startup is slow. It’s easy to hit timeouts before Machines even start.
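A sketch that stretches the wait, assuming wait_timeout takes a duration string; the ten-minute value is arbitrary:
[deploy]
wait_timeout = "10m"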
Zero-downtime is a shared responsibility
The Fly.io platform will keep your Machines healthy and traffic flowing. But there’s one thing it can’t fix: application incompatibility during deploys.
Say you’re changing your database schema. You run the migration in a release_command, then your Machines update one by one. If the old app code can’t handle the new schema, users will see errors while old Machines are still running. Same goes for bluegreen deployments: the new Machines are healthy, but traffic doesn’t switch until all of them are.
Fixing this is on you. You need to design schema and app changes that can coexist briefly. That usually means:
- App code that works with both the old and new database schemas
- Migrations that add columns or tables but don’t drop or rename until later
Every framework has guidance for this. Read it. Follow it. Then health checks, deployment strategies, and the platform can do their jobs.
Deploying without downtime isn’t automatic. But with a little help from your health checks and some care in your app logic, it’s well within reach.