Metrics on Fly.io

The Fly.io platform includes a fully-managed metrics solution to help you easily monitor your Fly apps. It includes the following components:

Prometheus on Fly.io: Managed Prometheus-compatible time series storage
Dashboards: Managed Grafana with detailed visualizations of all built-in metrics
Built-in Metrics: Metrics automatically sent from every Fly app you deploy
Custom Metrics: Expose additional metrics from Fly apps for further customization

Prometheus on Fly.io

Prometheus is a popular open source monitoring system used to store and query metrics efficiently, with a stable HTTP querying API compatible with a range of systems.

Prometheus on Fly.io is a fully-managed service based on VictoriaMetrics. It supports most common Prometheus querying API endpoints.

Note that the remote read (/api/v1/read) remote storage integration is not supported.

MetricsQL

Prometheus queries are typically based on the PromQL query language. Prometheus on Fly.io queries use VictoriaMetrics MetricsQL, a backwards-compatible query language that fixes user experience issues and adds useful features and functions on top of PromQL.

Key features:

Better rate() and increase() functions that just work. No need for irate workarounds or appending Grafana’s magical $__rate_interval selector to every query. In fact, you can even omit the square brackets entirely and MetricsQL will do the right thing.
Many more label manipulation functions such as drop_common_labels, label_set, etc.
topk_avg, which returns the top k time series averaged across the entire series range (not just individual points), plus the sum of all remaining series in an “other” label. Useful for giving a small, filtered view across a potentially large number of series.

Querying

Queries can be sent to the following endpoint:

https://api.fly.io/prometheus/<org-slug>/

You’ll need to authenticate with a Fly Access Token sent in the Authorization HTTP request header (e.g., Authorization: FlyV1 <TOKEN>; the Authentication section below explains when the scheme is Bearer and when it is FlyV1), and you may only query series scoped to your organizations.

Manually

Find your Organization slug

List your organizations, find the org slug and set it as a local variable.

flyctl orgs list
ORG_SLUG=[org-slug]

Get an access token

TOKEN=$(flyctl auth token)

Test it out!

curl https://api.fly.io/prometheus/$ORG_SLUG/api/v1/query \
  --data-urlencode 'query=sum(increase(fly_edge_http_responses_count)) by (app, status)' \
  -H "Authorization: FlyV1 $TOKEN"
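For scripted access, the same request can be assembled in Python. This is a minimal sketch using only the standard library; it mirrors the curl call above (form-encoded query, FlyV1 authorization header). The function names build_query_request and run_query are hypothetical, and the header scheme is copied from the curl example rather than verified for every token type.

```python
import json
import urllib.parse
import urllib.request

# Base endpoint taken from the Querying section above.
BASE_URL = "https://api.fly.io/prometheus"


def build_query_request(org_slug: str, token: str, query: str) -> urllib.request.Request:
    """Build (but do not send) an instant-query request for one organization."""
    url = f"{BASE_URL}/{urllib.parse.quote(org_slug)}/api/v1/query"
    # Form-encode the query, as curl does with --data-urlencode.
    body = urllib.parse.urlencode({"query": query}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Authorization": f"FlyV1 {token}"},
        method="POST",
    )


def run_query(org_slug: str, token: str, query: str) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_query_request(org_slug, token, query)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling run_query(ORG_SLUG, TOKEN, "sum(increase(fly_edge_http_responses_count)) by (app, status)") with the values gathered in the steps above should return the same JSON document that curl prints.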

  

Dashboards

For more advanced metrics monitoring, you can use dashboards to organize and visualize complex Prometheus queries.

The Metrics tab on the Fly.io Dashboard provides an overview of your Fly apps using the built-in metrics stored in Prometheus.

Managed Grafana

Grafana is a popular open source data visualization web application that allows you to compose queries against data sources into dynamic, reusable dashboards.

We provide a managed Grafana instance at fly-metrics.net, preconfigured with your Prometheus data source and detailed dashboards covering the full set of built-in metrics.

You can also use the Explore panel to run ad-hoc queries against the preconfigured Prometheus datasource, or create/import additional dashboards for further customization or to visualize custom metrics.

Switch between your Fly.io Organizations by clicking the “Switch organization” link beneath the user icon in the lower-left of the screen.

External or self-hosted Grafana

You can also configure your Prometheus endpoint with an existing Grafana installation, or host one on Fly.io. Either way, you can set it up like this:

1. Add a Prometheus data source (Settings -> Data Sources -> Add data source -> Prometheus)
2. Fill the form with the following:
   - HTTP -> URL: https://api.fly.io/prometheus/<org-slug>/
   - Custom HTTP Headers -> + Add Header:
     - Header: Authorization, Value: Bearer <token>

You’re all set.

We publish our Fly.io Dashboards to Grafana.com for use with external Grafana instances. To install, just import the dashboard using the listed IDs. If you’d like to contribute changes to the dashboards, we have created a repository for them.

Built-in metrics

Fly apps automatically publish a number of built-in metrics.

Metric types are all Gauges unless otherwise marked.

Metrics with names ending in _count are all Counters.

Histogram metrics with a base name of <name> expose multiple series:

<name>_bucket{le}
<name>_sum
<name>_count
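To illustrate how the _bucket series can be read, here is a minimal sketch that estimates a quantile from cumulative bucket counts by interpolating linearly inside the matching bucket, which is the general approach PromQL's histogram_quantile takes. The helper name estimate_quantile is hypothetical and the bucket bounds and counts in the usage note are made-up sample data, not real Fly.io output.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (le, cumulative_count) pairs.

    `buckets` must be sorted by `le` and end with the +Inf bucket,
    matching the shape of the <name>_bucket{le} series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Cannot interpolate into +Inf; fall back to the last finite bound.
                return prev_le
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le
```

For example, with buckets [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)], the estimated 0.95 quantile falls halfway through the 0.5 to 1.0 bucket, at about 0.75.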

Standard Labels

All published series include the following labels:

app: App name
region: Fly.io Region
host: 4-character host ID (lowercase hexadecimal)
instance: App instance ID (for all series except fly_edge_ and fly_volume_)

If your app exposes custom metrics that use any of these label names, the values of those labels will be overwritten.

Proxy series

Any app using a TCP-based handler (HTTP, TLS or straight TCP) publishes edge and app proxy metrics:

Labels:

proxy_id: “blue” or “green” (flips when the proxy is restarted/updated)

Edge - `fly_edge_`

fly_edge_http_responses_count{status}
fly_edge_http_response_time_seconds{status} (Histogram)
fly_edge_tcp_connects_count
fly_edge_tcp_disconnects_count
fly_edge_data_out (Counter, bytes)
fly_edge_data_in (Counter, bytes)
fly_edge_tls_handshake_errors{servername} (Counter)
fly_edge_tls_handshake_time_seconds{version} (Histogram)

App - `fly_app_`

fly_app_concurrency
fly_app_http_responses_count{status}
fly_app_http_response_time_seconds{status} (Histogram)
fly_app_connect_time_seconds (Histogram)
fly_app_tcp_connects_count
fly_app_tcp_disconnects_count

Instance series - `fly_instance_`

Derived from the /proc file system of your app VMs.

fly_instance_up = 1 shows the VM is reporting correctly.

Instance memory - `fly_instance_memory_`

Derived from /proc/meminfo. All units are in bytes.

fly_instance_memory_mem_total
fly_instance_memory_mem_free
fly_instance_memory_mem_available
fly_instance_memory_buffers
fly_instance_memory_cached
fly_instance_memory_swap_cached
fly_instance_memory_active
fly_instance_memory_inactive
fly_instance_memory_swap_total
fly_instance_memory_swap_free
fly_instance_memory_dirty
fly_instance_memory_writeback
fly_instance_memory_slab
fly_instance_memory_shmem
fly_instance_memory_vmalloc_total
fly_instance_memory_vmalloc_used
fly_instance_memory_vmalloc_chunk

Instance Load and CPU

load_average is derived from /proc/loadavg (getloadavg). It’s a “system load average” measuring the number of processes in the system run queue, with samples representing averages over 1, 5, and 15 minutes.
cpu is derived from /proc/stat, and counts the amount of time each CPU (cpu_id) has spent performing different kinds of work (mode, which may be one of user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice). The time unit is ‘clock ticks’ of centiseconds (0.01 seconds).

The following CPU metrics are related to CPU Performance:

cpu_baseline is the baseline quota in number of CPUs, calculated from the CPU type and number of vCPUs.
cpu_balance is the accrued CPU burst balance in clock ticks (centiseconds).
cpu_throttle is derived from the throttled_time field of the cgroup cpu.stat, and counts the amount of time the CPU was throttled after exhausting its quota, in ‘clock ticks’ (centiseconds).

fly_instance_load_average{minutes}
fly_instance_cpu{cpu_id, mode} (Counter, centiseconds)
fly_instance_cpu_baseline (CPUs)
fly_instance_cpu_balance (centiseconds)
fly_instance_cpu_throttle (Counter, centiseconds)
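Because fly_instance_cpu is a counter of centiseconds per mode, utilization has to be derived from the difference between two samples. The sketch below shows one way to do that arithmetic for a single cpu_id. The function name and the sample dictionaries are hypothetical, and treating idle and iowait as "not busy" is an assumption rather than a Fly.io definition.

```python
# Modes skipped when summing: on Linux, guest time is also counted
# inside user/nice, so adding it again would double count.
GUEST_MODES = ("guest", "guest_nice")


def cpu_busy_fraction(before: dict, after: dict, idle_modes=("idle", "iowait")) -> float:
    """Fraction of elapsed CPU time spent busy between two samples.

    `before` and `after` map mode -> cumulative centiseconds for one cpu_id.
    """
    deltas = {
        mode: after[mode] - before.get(mode, 0)
        for mode in after
        if mode not in GUEST_MODES
    }
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed, or the counters were reset
    idle = sum(deltas.get(mode, 0) for mode in idle_modes)
    return (total - idle) / total


def centiseconds_to_seconds(ticks: float) -> float:
    """Convert a centisecond counter value to seconds."""
    return ticks / 100.0
```

The same subtraction applies to fly_instance_cpu_throttle: the difference between two samples, divided by 100, gives the seconds spent throttled in that window.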

Instance Disks - `fly_instance_disk_`

Counters derived from fields 1-11 of /proc/diskstats. The unit for time_ series is milliseconds, and the unit for sectors_ is 512-byte sectors.

Labels:

device: Published for the ephemeral VM root disk (vdb) and any mounted Volume (vdc).

fly_instance_disk_reads_completed
fly_instance_disk_reads_merged
fly_instance_disk_sectors_read
fly_instance_disk_time_reading
fly_instance_disk_writes_completed
fly_instance_disk_writes_merged
fly_instance_disk_sectors_written
fly_instance_disk_time_writing
fly_instance_disk_io_in_progress
fly_instance_disk_time_io
fly_instance_disk_time_io_weighted
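Since these series are cumulative counters, most useful figures come from the difference between two samples. As a rough sketch of the unit conversions described above (512-byte sectors, millisecond time_ series), the hypothetical helpers below compute read throughput and average read latency; the numbers in the usage note are invented.

```python
SECTOR_BYTES = 512  # sectors_ series are counted in 512-byte sectors


def read_throughput_bytes_per_sec(
    sectors_before: int, sectors_after: int, interval_seconds: float
) -> float:
    """Bytes read per second between two fly_instance_disk_sectors_read samples."""
    return (sectors_after - sectors_before) * SECTOR_BYTES / interval_seconds


def avg_read_latency_ms(
    time_before: int, time_after: int, reads_before: int, reads_after: int
) -> float:
    """Average milliseconds per completed read between two samples.

    Uses fly_instance_disk_time_reading (ms) and
    fly_instance_disk_reads_completed (count).
    """
    reads = reads_after - reads_before
    if reads <= 0:
        return 0.0  # no reads completed in the window
    return (time_after - time_before) / reads
```

For instance, 2,000 additional sectors read over a 10-second window works out to 102,400 bytes per second.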

Instance Networking - `fly_instance_net_`

Counters derived from /proc/net/dev.

Labels:

device: interface name, either eth0 or dummy0 (dummy0 can be ignored).

fly_instance_net_recv_bytes
fly_instance_net_recv_packets
fly_instance_net_recv_errs
fly_instance_net_recv_drop
fly_instance_net_recv_fifo
fly_instance_net_recv_frame
fly_instance_net_recv_compressed
fly_instance_net_recv_multicast
fly_instance_net_sent_bytes
fly_instance_net_sent_packets
fly_instance_net_sent_errs
fly_instance_net_sent_drop
fly_instance_net_sent_fifo
fly_instance_net_sent_colls
fly_instance_net_sent_carrier
fly_instance_net_sent_compressed

Instance File Descriptors - `fly_instance_filefd_`

The number of allocated file descriptors and the maximum number that may be allocated, derived from /proc/sys/fs/file-nr.

fly_instance_filefd_allocated
fly_instance_filefd_maximum

Instance Filesystem - `fly_instance_filesystem_`

Filesystem metrics derived from VFS File System Information.

Labels:

mount: mount point name: / and, if using Volumes, the destination name configured in fly.toml.

fly_instance_filesystem_blocks
fly_instance_filesystem_block_size
fly_instance_filesystem_blocks_free
fly_instance_filesystem_blocks_avail
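These four series are block counts plus a block size, so byte totals have to be computed. The following sketch shows the arithmetic, assuming the usual statvfs-style meaning of the fields (blocks_free counts all free blocks, blocks_avail counts those available to unprivileged users). The helper name filesystem_usage is hypothetical.

```python
def filesystem_usage(
    blocks: int, block_size: int, blocks_free: int, blocks_avail: int
) -> dict:
    """Convert block counts into byte totals and a df-style usage percentage."""
    used_blocks = blocks - blocks_free
    usable_blocks = used_blocks + blocks_avail  # excludes reserved blocks
    used_pct = 100.0 * used_blocks / usable_blocks if usable_blocks else 0.0
    return {
        "total_bytes": blocks * block_size,
        "used_bytes": used_blocks * block_size,
        "avail_bytes": blocks_avail * block_size,
        "used_pct": used_pct,
    }
```

With 1,000 blocks of 4,096 bytes, 300 free and 250 available, this reports 2,867,200 bytes used, or roughly 73.7% of the usable space.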

Volumes - `fly_volume_`

Labels:

id: Volume ID

If you’re using Volumes for any of your organization’s apps, you’ll be able to query these series, derived from the LSize and Data% of the volume’s thin LV.

fly_volume_size_bytes
fly_volume_used_pct (0-100)

Postgres - `pg_`

If you have a Postgres database hosted on Fly.io, you’ll automatically get the following series, published via postgres_exporter:

pg_stat_activity_count
pg_stat_activity_max_tx_duration
pg_stat_archiver_archived_count
pg_stat_archiver_failed_count
pg_stat_bgwriter_buffers_alloc
pg_stat_bgwriter_buffers_backend_fsync
pg_stat_bgwriter_buffers_backend
pg_stat_bgwriter_buffers_checkpoint
pg_stat_bgwriter_buffers_clean
pg_stat_bgwriter_checkpoint_sync_time
pg_stat_bgwriter_checkpoint_write_time
pg_stat_bgwriter_checkpoints_req
pg_stat_bgwriter_checkpoints_timed
pg_stat_bgwriter_maxwritten_clean
pg_stat_bgwriter_stats_reset
pg_stat_database_blk_read_time
pg_stat_database_blk_write_time
pg_stat_database_blks_hit
pg_stat_database_blks_read
pg_stat_database_conflicts_confl_bufferpin
pg_stat_database_conflicts_confl_deadlock
pg_stat_database_conflicts_confl_lock
pg_stat_database_conflicts_confl_snapshot
pg_stat_database_conflicts_confl_tablespace
pg_stat_database_conflicts
pg_stat_database_deadlocks
pg_stat_database_numbackends
pg_stat_database_stats_reset
pg_stat_database_tup_deleted
pg_stat_database_tup_fetched
pg_stat_database_tup_inserted
pg_stat_database_tup_returned
pg_stat_database_tup_updated
pg_stat_database_xact_commit
pg_stat_database_xact_rollback
pg_stat_replication_pg_current_wal_lsn_bytes
pg_stat_replication_pg_wal_lsn_diff
pg_stat_replication_reply_time
pg_replication_lag
pg_database_size_bytes

Custom Metrics

For further customization beyond built-in metrics, Fly apps can expose a metrics endpoint that we’ll automatically scrape every 15 seconds, sending the results to Prometheus.

Configuration

Add a [metrics] section to your application’s fly.toml:

[metrics]
port = 9091
path = "/metrics" # default for most prometheus exporters

  

If your app uses multiple processes, you can add multiple [[metrics]] sections, each with its own set of processes:

[[metrics]]
port = 9394
path = "/metrics"
processes = ["web"]

[[metrics]]
port = 9113
path = "/metrics"
processes = ["proxy"]

  

Instrumentation

Instrument your app and expose your metrics on 0.0.0.0.

There are many supported client libraries as well as off-the-shelf exporters able to return Prometheus-formatted metrics.
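If you just want to see what such an endpoint involves before choosing a client library, here is a minimal standard-library sketch that serves one counter in the Prometheus text format on 0.0.0.0. The port and path match the [metrics] example above; the metric name myapp_jobs_processed_total and the JOBS_PROCESSED variable are hypothetical, and a real app would normally use a proper client library instead.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

JOBS_PROCESSED = 0  # hypothetical counter your app would increment


def render_metrics(jobs_processed: int) -> str:
    """Render one counter in the Prometheus text exposition format."""
    return (
        "# HELP myapp_jobs_processed_total Jobs processed since start.\n"
        "# TYPE myapp_jobs_processed_total counter\n"
        f"myapp_jobs_processed_total {jobs_processed}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(JOBS_PROCESSED).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9091) -> None:
    """Listen on all interfaces so the scraper can reach the endpoint."""
    ThreadingHTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() from a background thread in your app keeps the endpoint available while the main process does its work.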

Authentication

You can authenticate to the Prometheus API in a few different ways, depending on the level of access you want your token to have.

Fly Access Token

As in the earlier example, a full access token can be generated with flyctl auth token and then passed as a bearer token in the Authorization header. The header looks like:

Authorization: Bearer THE_TOKEN

Fly org-restricted or read-only token

This kind of token or “macaroon” can be scoped to a single organization or configured to only allow read operations, which can be safer than using a full-blown read-write token that grants access to all organizations under your account.

Generating tokens

Create an org-restricted token:

fly token create org -o THE_ORGANIZATION

Create a read-only org-restricted token:

fly token create readonly

These tokens look like this once generated: FlyV1 fm2_lJPECAAAAAAAAC7txBAzYI6PRWhHLT...(a lot of base64-encoded text).

Using tokens

To use one of these tokens in an HTTP header, the “FlyV1” identifier replaces the “Bearer” token identifier. So it would look like this:

Authorization: FlyV1 fm2_lJPECAAAAAAAAC7txBAzYI6PRWhHLT...(a lot of base64-encoded text)
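The two header forms above can be selected programmatically. The sketch below is one hedged way to do it, keyed only on whether the token string already carries the "FlyV1 " prefix shown in the generated-token example; authorization_header is a hypothetical helper, and tokens without that prefix are assumed to use the Bearer form.

```python
def authorization_header(token: str) -> dict:
    """Pick the Authorization header form based on the token's printed prefix."""
    token = token.strip()  # drop the trailing newline from CLI output
    if token.startswith("FlyV1 "):
        # Generated tokens are printed with the scheme included.
        return {"Authorization": token}
    # Assumption: anything else is sent as a standard bearer token.
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary can be passed directly as the headers argument of most HTTP clients.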
