Machine Suspend and Resume

Machine suspend lets you pause a running Fly Machine and save its complete state, including memory, to persistent storage. When resumed, the machine picks up exactly where it left off, without rebooting the OS or restarting your app. As a result, startup can take a few hundred milliseconds instead of multiple seconds.

You can think of suspend as what a laptop does when you close the lid, except your “laptop” is a microVM running in, say, dfw or fra or syd.

How it works

Suspend uses Firecracker snapshots to capture the entire VM state: CPU registers, memory contents, open file handles. When you start a suspended machine, Fly restores from this snapshot instead of cold booting.

Typical performance:

  • Resume from suspend: a few hundred ms
  • Cold start: ~2+ seconds for common apps
  • TCP connections may survive if the remote side keeps them open

Using Suspend

Manually

# Suspend a machine
fly machine suspend <machine-id>

# Check status (running, suspending, suspended, etc.)
fly machine status <machine-id>

# Resume from snapshot
fly machine start <machine-id>

# Force a cold start (discard snapshot)
fly machine stop <machine-id>
fly machine start <machine-id>

Automatically via Fly Proxy

Configure in fly.toml:

[http_service]
  auto_stop_machines = "suspend"  # or "stop"
  auto_start_machines = true

  [http_service.concurrency]
    type = "requests"
    soft_limit = 25

With this configuration, Fly Proxy checks every few minutes for machines that are idle, suspends them, and resumes them when new requests arrive.

Machines API

# Suspend
POST /v1/apps/{app_name}/machines/{machine_id}/suspend

# Wait for suspension to complete
GET /v1/apps/{app_name}/machines/{machine_id}/wait?state=suspended

# Resume (standard start endpoint)
POST /v1/apps/{app_name}/machines/{machine_id}/start
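
The three calls above can be chained from a script. The following is a minimal sketch, not an official client: it assumes the public Machines API base URL `https://api.machines.dev` and a bearer token for authentication, and the helper names `build_request` and `suspend_then_resume` are made up for illustration.

```python
import urllib.request

# Assumption: the public Machines API base URL. Adjust it if you reach the API another way.
API_BASE = "https://api.machines.dev"


def build_request(method, app_name, machine_id, action, token, query=""):
    """Build an authenticated request for one machine action (suspend, wait, start)."""
    url = f"{API_BASE}/v1/apps/{app_name}/machines/{machine_id}/{action}{query}"
    return urllib.request.Request(
        url,
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )


def suspend_then_resume(app_name, machine_id, token, send=urllib.request.urlopen):
    """Suspend a machine, wait until it reports `suspended`, then start it again."""
    steps = [
        ("POST", "suspend", ""),
        ("GET", "wait", "?state=suspended"),
        ("POST", "start", ""),
    ]
    for method, action, query in steps:
        request = build_request(method, app_name, machine_id, action, token, query)
        # urlopen raises urllib.error.HTTPError on 4xx/5xx, which stops the sequence.
        with send(request) as response:
            response.read()
```

To try it, call `suspend_then_resume("my-app", "<machine-id>", token)` with a token read from wherever you keep it (an environment variable such as `FLY_API_TOKEN` is one option). The `send` parameter exists only so the sequence can be exercised without touching the network.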

Requirements

A machine can use suspend if it has:

  • ≤ 2 GB memory (suspend is discouraged for larger sizes because suspending takes longer)
  • No swap configured
  • No schedule configured
  • No GPU configured
  • Been updated since June 20, 2024 20:00 UTC

If you have an older machine, or you’re not sure when it was last updated, you can bring it up to date with:

fly machine update <machine-id> --yes

This updates the machine in place to the latest supported configuration for suspend, without changing your app code or image.
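
If you manage many machines, it can help to encode the requirements list as a check. The sketch below is illustrative only: `MachineInfo` and its field names are hypothetical and do not mirror any Fly.io API response, so map them from whatever machine data you already have.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Cutoff from the requirements list: machines last updated before this need an update.
SUSPEND_CUTOFF = datetime(2024, 6, 20, 20, 0, tzinfo=timezone.utc)
MAX_MEMORY_MB = 2048  # 2 GB


@dataclass
class MachineInfo:
    """Hypothetical summary of a machine; field names are illustrative only."""
    memory_mb: int
    swap_size_mb: int = 0
    schedule: Optional[str] = None
    gpu_kind: Optional[str] = None
    updated_at: datetime = SUSPEND_CUTOFF


def suspend_blockers(machine: MachineInfo) -> List[str]:
    """Return the requirements a machine fails (an empty list means it qualifies)."""
    blockers = []
    if machine.memory_mb > MAX_MEMORY_MB:
        blockers.append("more than 2 GB of memory")
    if machine.swap_size_mb > 0:
        blockers.append("swap configured")
    if machine.schedule is not None:
        blockers.append("schedule configured")
    if machine.gpu_kind is not None:
        blockers.append("GPU configured")
    if machine.updated_at < SUSPEND_CUTOFF:
        blockers.append("not updated since June 20, 2024 20:00 UTC")
    return blockers
```

A machine that returns only the last blocker can be fixed with the `fly machine update` command shown above. The other blockers require a configuration change.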


Limitations and considerations

  • Suspend is not currently recommended for large machine memory sizes (> 2 GB)
  • Suspending many machines at once is not recommended
  • Some logs may be lost after resume
  • Unlike stop, suspend does not reset the machine’s rootfs
  • On resume, the clock can lag a few seconds until NTP syncs

Always design for both resume and cold start paths.


Snapshot behavior with suspend

Snapshots are tied to the exact code and state of the machine they were taken from. If you deploy new code, the old snapshot can’t be resumed safely and will be discarded.

Snapshots aren’t guaranteed to persist. Cold starts may happen if:

  • You deploy a new version of your app. Deployments rebuild the machine image, which invalidates the old snapshot: a snapshot is a literal memory dump of the old process, so resuming it on top of new code or dependencies would be unsafe and unpredictable.
  • The machine is migrated to a different host
  • The snapshot file is lost or corrupted, for example through hardware failure or space reclamation
  • We perform system maintenance or updates

Handling Network Connections After Resume

On resume, the machine thinks its network connections are still live. External systems (databases, APIs) may disagree.

Common symptoms:

  • ECONNRESET
  • “Connection closed”
  • Timeouts on first request
  • Database pool errors

Fix: Reconnect on failure.

Example (Python + DB):

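# `db` stands in for your database client; import OperationalError from
# your driver (for example, psycopg2.OperationalError).
try:
    result = db.execute(query)
except (ConnectionError, OperationalError):
    # The connection probably died while the machine was suspended:
    # reconnect once and retry the query.
    db.reconnect()
    result = db.execute(query)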

Tips:

  • Use connection pools with disconnect handling (SQLAlchemy's pooling documentation on dealing with disconnects is a good reference)
  • Shorten connection timeouts to fail fast
  • Use retry/backoff for HTTP clients
  • Test after long suspensions
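
The retry/backoff tip can be sketched with the standard library alone. This is one possible shape, not a prescribed implementation: `call_with_retry` is a hypothetical helper, and the default exception types, attempt count, and delays are assumptions to tune for your client.

```python
import time


def call_with_retry(operation, retryable=(ConnectionError, TimeoutError),
                    attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying connection-style errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait base_delay, then twice that, then four times that, and so on.
            sleep(base_delay * 2 ** attempt)
```

Wrap the first call that touches a long-lived connection after resume, for example `call_with_retry(lambda: client.get(url))` where `client` is your own HTTP client. `ConnectionResetError` is a subclass of `ConnectionError`, so ECONNRESET raised by the standard library is covered by the defaults. Third-party clients often raise their own exception types, which you would pass in through `retryable`.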

Billing

Suspended machines cost the same as stopped machines: storage only. There are no CPU/RAM charges.


Monitoring & Debugging

fly machine status <machine-id>

States:

  • running
  • suspending
  • suspended
  • starting (resume or cold start)
  • stopped

If machines cold start unexpectedly:

  • Check that the machine still meets the suspend requirements
  • Confirm that no deployment or host migration occurred
  • Check logs for suspend/resume events

Test cold start:

fly machine stop <machine-id>
fly machine start <machine-id>

Availability

Suspend works in all Fly.io regions as of July 2024.

