---
title: Machine Suspend and Resume
layout: docs
nav: firecracker
author: kcmartin
date: 2025-08-15
---

Machine Suspend and Resume

**Machine suspend** lets you pause a running Fly Machine and save its complete state, including memory, to persistent storage. When resumed, the machine picks up exactly where it left off, without rebooting the OS or restarting your app. That can make startup take just **hundreds of milliseconds** instead of multiple seconds.

You can think of suspend as what a laptop does when you close the lid, except your “laptop” is a microVM running in, say, `dfw` or `fra` or `syd`.

## How it works

Suspend uses [Firecracker snapshots](https://firecracker-microvm.github.io/) to capture the entire VM state: CPU registers, memory contents, open file handles. When you start a suspended machine, Fly restores from this snapshot instead of cold booting.

**Typical performance:**

- Resume from suspend: a few hundred ms
- Cold start: ~2+ seconds for common apps
- TCP connections may survive if the remote side keeps them open
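
To see how these numbers compare for your own app, one option is to time the first request after the machine has been idle. The sketch below is only an illustration: `APP_URL` is a hypothetical placeholder, and the measured time also includes network latency and your app's own response time, so treat it as a rough signal rather than a precise resume measurement.

```python
import time
import urllib.request

APP_URL = "https://example.fly.dev/"  # hypothetical placeholder, replace with your app's URL


def fetch(url: str) -> int:
    """Send one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.status


def time_call(fn, *args) -> float:
    """Return how long fn(*args) takes, in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start
```

Calling `time_call(fetch, APP_URL)` twice in a row gives a simple comparison: the first call may include a resume or a cold start, while the second hits a machine that is already running.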

---

## Using Suspend

### Manually

```bash
# Suspend a machine
fly machine suspend <machine-id>

# Check status (running, suspending, suspended, etc.)
fly machine status <machine-id>

# Resume from snapshot
fly machine start <machine-id>

# Force a cold start (discard snapshot)
fly machine stop <machine-id>
fly machine start <machine-id>
```

### Automatically via Fly Proxy

Configure in `fly.toml`:

```toml
[http_service]
  auto_stop_machines = "suspend"  # or "stop"
  auto_start_machines = true

  [[http_service.concurrency]]
    type = "requests"
    soft_limit = 25
```

The proxy will automatically suspend machines during low traffic, checking for idle periods every few minutes, and resume them when requests arrive.

### Machines API

```
# Suspend
POST /v1/apps/{app_name}/machines/{machine_id}/suspend

# Wait for suspension to complete
GET /v1/apps/{app_name}/machines/{machine_id}/wait?state=suspended

# Resume (standard start endpoint)
POST /v1/apps/{app_name}/machines/{machine_id}/start
```

Generally, you need an API token to use the Machines API. But if you’re just suspending _your own_ machine (the one your code is running on), you can skip the token and hit the `/.fly/api` Unix socket directly:

```bash
curl --unix-socket /.fly/api -X POST \
  http://flaps/v1/apps/$FLY_APP_NAME/machines/$FLY_MACHINE_ID/suspend
```
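
For the token-based path, the suspend and wait calls above could be chained from a script roughly like this. This is a minimal sketch using only the Python standard library. It assumes the public API host is `https://api.machines.dev` and that the token is sent as a bearer token; neither is stated on this page, so check both against the Machines API docs. The helper names (`build_request`, `send`, `suspend_and_wait`) are made up for illustration.

```python
import urllib.request

API_BASE = "https://api.machines.dev"  # assumption: public Machines API host


def build_request(method: str, app: str, machine_id: str, action: str, token: str) -> urllib.request.Request:
    """Build an authenticated request for one of the endpoints listed above."""
    url = f"{API_BASE}/v1/apps/{app}/machines/{machine_id}/{action}"
    # Assumption: the API token is passed as a bearer token.
    return urllib.request.Request(url, method=method, headers={"Authorization": f"Bearer {token}"})


def send(req: urllib.request.Request) -> int:
    """Send the request and return the HTTP status (urlopen raises on 4xx/5xx)."""
    with urllib.request.urlopen(req, timeout=90) as resp:
        return resp.status


def suspend_and_wait(app: str, machine_id: str, token: str, sender=send) -> None:
    """Ask for a suspend, then block until the machine reports `suspended`."""
    sender(build_request("POST", app, machine_id, "suspend", token))
    sender(build_request("GET", app, machine_id, "wait?state=suspended", token))
```

The `sender` parameter exists only so the flow can be exercised without network access; in normal use you would call `suspend_and_wait(app, machine_id, token)` and let it use `send`.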

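---

## Requirements

A machine can use suspend if it has:

- **≤ 2 GB** memory (for larger memory sizes, suspend is discouraged due to increased suspend times)
- **No** [**swap**](https://fly.io/docs/reference/configuration/#swap_size_mb-option) **configured**
- **No** [**schedule**](https://fly.io/docs/machines/flyctl/fly-machine-run/#start-a-machine-on-a-schedule) **configured**
- **No GPU configured**
- Been updated since **June 20, 2024 20:00 UTC**

If you have an older machine, or you’re not sure when it was last updated, you can bring it up to date with:

```bash
fly machine update <machine-id> --yes
```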

This updates the machine in place to the latest supported configuration for suspend, without changing your app code or image.
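
If you manage many machines, the checklist above could be encoded as a small pre-flight check. The sketch below is illustrative only: `MachineInfo` and its fields are hypothetical names rather than part of any Fly API, and it treats 2 GB as 2048 MB.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Cutoff from the requirements list: June 20, 2024 20:00 UTC.
SUSPEND_CUTOFF = datetime(2024, 6, 20, 20, 0, tzinfo=timezone.utc)


@dataclass
class MachineInfo:
    """Hypothetical summary of a machine's config; fill it in from your own tooling."""
    memory_mb: int
    swap_mb: int
    has_schedule: bool
    has_gpu: bool
    updated_at: datetime  # must be timezone-aware


def suspend_blockers(m: MachineInfo) -> list[str]:
    """Return the reasons a machine does not meet the suspend checklist (empty if none)."""
    blockers = []
    if m.memory_mb > 2048:
        blockers.append("more than 2 GB of memory")
    if m.swap_mb > 0:
        blockers.append("swap is configured")
    if m.has_schedule:
        blockers.append("a schedule is configured")
    if m.has_gpu:
        blockers.append("a GPU is configured")
    if m.updated_at < SUSPEND_CUTOFF:
        blockers.append("not updated since June 20, 2024 20:00 UTC")
    return blockers
```

Returning a list of reasons instead of a boolean makes it easier to report which requirement needs attention for each machine.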

---

## Limitations and considerations

- Suspend is not currently recommended for large machine memory sizes (> 2 GB)
- Suspending many machines at once is not recommended
- Some logs may be lost after resume
- Unlike stop, suspend **does not** reset the machine’s `rootfs`
- On resume, the clock can lag a few seconds until NTP syncs

<div class="callout">
Always design for both resume and cold start paths.
</div>

---

## Snapshot behavior with suspend

<div class="warning icon">
Snapshots are tied to the exact code and state of the machine they were taken from. If you deploy new code, the old snapshot can’t be resumed safely and will be discarded.
</div>

**Snapshots aren’t guaranteed to persist.** Cold starts may happen if:

- **You deploy a new version of your app**: deployments rebuild the machine image, which invalidates the old snapshot. Since a snapshot is a literal memory dump of the _old_ process, resuming it after you’ve swapped in new code or dependencies would be unsafe and unpredictable.
- The machine is migrated to a different host
- The snapshot file is lost or corrupted: hardware failures, space reclamation, or corruption can cause snapshots to be deleted
- We perform system maintenance or updates

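---

## Handling Network Connections After Resume

On resume, the machine thinks its network connections are still live. External systems (databases, APIs) may disagree.

Common symptoms:

- `ECONNRESET`
- "Connection closed"
- Timeouts on first request
- Database pool errors

**Fix:** Reconnect on failure.

Example (Python + DB):

```python
# `db`, `query`, and `OperationalError` are placeholders for your database
# client, your statement, and the client's own disconnect error type.
try:
    result = db.execute(query)
except (ConnectionError, OperationalError):
    # The connection probably went stale while the machine was suspended.
    db.reconnect()
    result = db.execute(query)  # retry once on the fresh connection
```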

Tips:

- Use connection pools with disconnect handling (see this excellent [SQLAlchemy guide](https://docs.sqlalchemy.org/en/20/core/pooling.html#dealing-with-disconnects))
- Shorten connection timeouts to fail fast
- Use retry/backoff for HTTP clients
- Test after long suspensions
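
The retry/backoff tip could look something like the following. This is a minimal sketch of exponential backoff with jitter, not a recommendation of specific values: the function name, defaults, and the choice of exceptions to retry on are all assumptions to adapt to your client library.

```python
import random
import time


def call_with_backoff(fn, retries=4, base_delay=0.1, max_delay=2.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying on connection-style errors with exponential backoff.

    ConnectionError covers ConnectionResetError (ECONNRESET) in Python.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the original error
            # Double the delay ceiling each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Sleep a random amount up to the ceiling so clients don't retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

Wrapping the first outbound call after a resume, for example `call_with_backoff(lambda: client.get(url))` with a hypothetical `client`, lets a stale connection fail once and recover without surfacing an error to the user.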

---

## Billing

Suspended machines cost the same as stopped machines: storage only. There are no CPU/RAM charges.

---

## Monitoring & Debugging

```bash
fly machine status <machine-id>
```

States:

- `running`
- `suspending`
- `suspended`
- `starting` (resume or cold start)
- `stopped`

If machines cold start unexpectedly:

- Check requirements
- Confirm no migrations or deployments occurred
- Check logs for suspend/resume events

Test cold start:

```bash
fly machine stop <machine-id>
fly machine start <machine-id>
```

---

## Availability

Suspend works in **all Fly.io regions** as of July 2024.

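---

**Related reading:**

- [Autostop & Autostart](/docs/launch/autostop-autostart/)
- [Fly Proxy Config](/docs/reference/fly-proxy-autostop-autostart/)
- [Scaling Machines](/docs/apps/scale-count/)
- [Machines API](https://docs.machines.dev/)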
