Metrics on Fly.io

The Fly.io platform includes a fully-managed metrics solution to help you easily monitor your Fly apps. It includes the following components:

  • Prometheus on Fly.io: Managed, Prometheus-compatible time series storage
  • Dashboards: Managed Grafana with detailed visualizations of all built-in metrics
  • Built-in Metrics: Metrics automatically sent from every Fly app you deploy
  • Custom Metrics: Expose additional metrics from Fly apps for further customization

Prometheus on Fly.io

Prometheus is a popular open source monitoring system used to store and query metrics efficiently, with a stable HTTP querying API compatible with a range of systems.

Prometheus on Fly.io is a fully-managed service based on VictoriaMetrics. It supports most of the common Prometheus querying API endpoints, such as /api/v1/query.

Note that the remote read (/api/v1/read) remote storage integration is not supported.

MetricsQL

Prometheus queries are typically based on the PromQL query language. Prometheus on Fly.io queries use VictoriaMetrics MetricsQL, a backwards-compatible query language that fixes user experience issues and adds useful features and functions on top of PromQL.

Key features:

  • Better rate() and increase() functions that just work. No need for irate workarounds or appending Grafana’s magical $__rate_interval selector to every query. In fact, you can even omit the square brackets entirely and MetricsQL will do the right thing.
  • Many more label manipulation functions such as drop_common_labels, label_set, etc.
  • topk_avg, which returns the top k time series averaged across the entire series range (not just individual points), plus the sum of all remaining series in an “other” label. Useful for giving a small, filtered view across a potentially large number of series.

Querying

Queries can be sent to the following endpoint:

https://api.fly.io/prometheus/<org-slug>/

You’ll need to authenticate with a Fly Access Token sent in the standard Bearer Token format (e.g., an HTTP request header Authorization: Bearer <TOKEN>), and you may only query series scoped to your organizations.

Manually

Find your Organization slug

List your organizations, find the org slug and set it as a local variable.

flyctl orgs list
ORG_SLUG="<org-slug>"  # replace <org-slug> with a slug from the list above

Get an access token

TOKEN=$(flyctl auth token)

Test it out!

curl https://api.fly.io/prometheus/$ORG_SLUG/api/v1/query \
  --data-urlencode 'query=sum(increase(fly_edge_http_responses_count)) by (app, status)' \
  -H "Authorization: Bearer $TOKEN"
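The same query can also be sent from a script. The sketch below is a rough Python equivalent of the curl call above, using only the standard library. The helper names (build_query_request, parse_vector, run_query) are made up for illustration and are not part of any Fly.io SDK. The parsing step assumes the endpoint returns the standard Prometheus instant-query JSON shape.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.fly.io/prometheus"  # endpoint from the Querying section


def build_query_request(org_slug, token, query):
    """Build a POST request for an instant query (hypothetical helper)."""
    url = f"{API_BASE}/{org_slug}/api/v1/query"
    # Form-encode the query, like curl's --data-urlencode.
    body = urllib.parse.urlencode({"query": query}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Authorization": f"Bearer {token}"},
    )


def parse_vector(payload):
    """Turn an instant-vector JSON response into (labels, value) pairs.

    Assumes the standard Prometheus response layout:
    {"status": "success", "data": {"result": [{"metric": {...}, "value": [ts, "v"]}]}}
    """
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise RuntimeError(f"query failed: {doc.get('error', 'unknown error')}")
    return [(s["metric"], float(s["value"][1])) for s in doc["data"]["result"]]


def run_query(org_slug, token, query):
    """Send the query and return parsed results (performs a network call)."""
    req = build_query_request(org_slug, token, query)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_vector(resp.read())
```

With the values from the steps above, a call would look like run_query(ORG_SLUG, TOKEN, "sum(increase(fly_edge_http_responses_count)) by (app, status)"), returning one (labels, value) pair per app and status.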

Dashboards

For more advanced metrics monitoring, you can use dashboards to organize and visualize complex Prometheus queries.

The Metrics tab on the Fly.io Dashboard provides an overview of your Fly apps using the built-in metrics stored in Prometheus.

Managed Grafana

Grafana is a popular open source data visualization web application that lets you compose queries against data sources into dynamic, reusable dashboards.

We provide a managed Grafana instance at fly-metrics.net, preconfigured with your Prometheus data source and detailed dashboards covering the full set of built-in metrics.

You can also use the Explore panel to run ad-hoc queries against the preconfigured Prometheus data source, or create or import additional dashboards for further customization or to visualize custom metrics.

Switch between your Fly.io Organizations by clicking the “Switch organization” link beneath the user icon in the lower-left of the screen.

External or self-hosted Grafana

You can also connect an existing Grafana installation, or one you host on Fly.io, to your Prometheus endpoint. Either way, set it up as follows:

  1. Add a Prometheus data source (Settings -> Data Sources -> Add data source -> Prometheus)
  2. Fill the form with the following:
    • HTTP -> URL: https://api.fly.io/prometheus/<org-slug>/
    • Custom HTTP Headers -> + Add Header:
      • Header: Authorization, Value: Bearer <token>

You’re all set.

We publish our Fly.io Dashboards to Grafana.com for use with external Grafana instances. To install, just import the dashboard using the listed IDs. If you’d like to contribute changes to the dashboards, we have created a repository for them.

Built-in metrics

Fly apps automatically publish a number of built-in metrics.

Metric types are all Gauges unless otherwise marked.

Metrics with names ending in _count are all Counters.

Histogram metrics with a base name of <name> expose multiple series:

  • <name>_bucket{le}
  • <name>_sum
  • <name>_count
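To illustrate how these three series fit together, the sketch below estimates a quantile and a mean from one histogram's samples. It interpolates linearly inside the matching bucket, which is the same general idea behind the histogram_quantile function in PromQL, but it is simplified and meant only as an illustration. It assumes the le buckets hold cumulative counts (the usual Prometheus convention). The function names are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: list of (le_upper_bound, cumulative_count) pairs, including
    the +Inf bucket, as exposed by <name>_bucket{le}.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assume observations start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Cannot interpolate into +Inf or an empty bucket.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def histogram_mean(sum_value, count_value):
    """Mean observation from <name>_sum and <name>_count."""
    return sum_value / count_value if count_value else math.nan
```

For example, with made-up buckets [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)], histogram_quantile(0.95, buckets) lands in the 0.5 to 1.0 second bucket, at roughly 0.81 seconds.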

Standard Labels

All published series include the following labels:

  • app: App name
  • region: Fly.io Region
  • host: 4-character host ID (lowercase hexadecimal)
  • instance: App instance ID (for all series except fly_edge_ and fly_volume_)

If your app exposes custom metrics that set any of these labels, the values your app sets will be overwritten.

Proxy series

Any app using a TCP-based handler (HTTP, TLS or straight TCP) publishes edge and app proxy metrics:

Labels:

  • proxy_id: “blue” or “green” (flips when the proxy is restarted/updated)

Edge - fly_edge_

fly_edge_http_responses_count{status}
fly_edge_http_response_time_seconds{status} (Histogram)
fly_edge_tcp_connects_count
fly_edge_tcp_disconnects_count
fly_edge_data_out (Counter, bytes)
fly_edge_data_in (Counter, bytes)
fly_edge_tls_handshake_errors{servername} (Counter)
fly_edge_tls_handshake_time_seconds{version} (Histogram)

App - fly_app_

fly_app_concurrency
fly_app_http_responses_count{status}
fly_app_http_response_time_seconds{status} (Histogram)
fly_app_connect_time_seconds (Histogram)
fly_app_tcp_connects_count
fly_app_tcp_disconnects_count

Instance series - fly_instance_

Derived from the /proc filesystem of your app VMs.

fly_instance_up = 1 shows the VM is reporting correctly.

Instance memory - fly_instance_memory_

Derived from /proc/meminfo. All units are in bytes.

fly_instance_memory_mem_total
fly_instance_memory_mem_free
fly_instance_memory_mem_available
fly_instance_memory_buffers
fly_instance_memory_cached
fly_instance_memory_swap_cached
fly_instance_memory_active
fly_instance_memory_inactive
fly_instance_memory_swap_total
fly_instance_memory_swap_free
fly_instance_memory_dirty
fly_instance_memory_writeback
fly_instance_memory_slab
fly_instance_memory_shmem
fly_instance_memory_vmalloc_total
fly_instance_memory_vmalloc_used
fly_instance_memory_vmalloc_chunk

Instance Load and CPU

  • load_average is derived from /proc/loadavg (getloadavg). It’s a “system load average” measuring the number of processes in the system run queue, with samples representing averages over 1, 5, and 15 minutes.

  • cpu is derived from /proc/stat, and counts the amount of time each CPU (cpu_id) has spent performing different kinds of work (mode, which may be one of user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice).
    The time unit is ‘clock ticks’, where each tick is one centisecond (0.01 seconds).

fly_instance_load_average{minutes}
fly_instance_cpu{cpu_id, mode} (Counter, centiseconds)
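Because fly_instance_cpu is a cumulative counter per mode, CPU utilization comes from the difference between two samples. The sketch below shows that arithmetic for a single cpu_id. The function names and sample layout are hypothetical. Treating iowait as idle time is an assumption you may want to change. The guest modes are skipped because /proc/stat already includes guest time in user and nice.

```python
# Modes that /proc/stat already folds into user and nice; adding them
# again would double count.
DOUBLE_COUNTED = {"guest", "guest_nice"}

# Assumption: time spent waiting on I/O counts as "not busy".
IDLE_MODES = {"idle", "iowait"}


def centiseconds_to_seconds(ticks):
    """Convert a fly_instance_cpu value (centiseconds) to seconds."""
    return ticks / 100.0


def cpu_busy_fraction(before, after):
    """Fraction of time one CPU was busy between two samples.

    before/after: dicts mapping mode -> cumulative centiseconds,
    taken from fly_instance_cpu{cpu_id, mode} at two points in time.
    """
    deltas = {
        mode: after[mode] - before.get(mode, 0)
        for mode in after
        if mode not in DOUBLE_COUNTED
    }
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed, or the counters were reset
    idle = sum(v for mode, v in deltas.items() if mode in IDLE_MODES)
    return 1.0 - idle / total
```

In a dashboard you would normally let rate() do this over the query window. The function only makes the underlying calculation explicit.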

Instance Disks - fly_instance_disk_

Counters derived from fields 1-11 of /proc/diskstats. The unit for time_ series is milliseconds, and the unit for sectors_ is 512-byte sectors.

Labels:

  • device: Published for the ephemeral VM root disk (vdb) and any mounted Volume (vdc).

fly_instance_disk_reads_completed
fly_instance_disk_reads_merged
fly_instance_disk_sectors_read
fly_instance_disk_time_reading
fly_instance_disk_writes_completed
fly_instance_disk_writes_merged
fly_instance_disk_sectors_written
fly_instance_disk_time_writing
fly_instance_disk_io_in_progress
fly_instance_disk_time_io
fly_instance_disk_time_io_weighted

Instance Networking - fly_instance_net_

Counters derived from /proc/net/dev.

Labels:

  • device: interface name, either eth0 or dummy0 (dummy0 can be ignored).

fly_instance_net_recv_bytes
fly_instance_net_recv_packets
fly_instance_net_recv_errs
fly_instance_net_recv_drop
fly_instance_net_recv_fifo
fly_instance_net_recv_frame
fly_instance_net_recv_compressed
fly_instance_net_recv_multicast
fly_instance_net_sent_bytes
fly_instance_net_sent_packets
fly_instance_net_sent_errs
fly_instance_net_sent_drop
fly_instance_net_sent_fifo
fly_instance_net_sent_colls
fly_instance_net_sent_carrier
fly_instance_net_sent_compressed
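These counters only ever grow until they restart from zero, for example when a VM reboots, so throughput is computed from the increase between samples. The sketch below shows one simple way to do that by hand for a series such as fly_instance_net_recv_bytes. The function names and the (timestamp, value) sample layout are hypothetical. In queries, rate() and increase() handle this for you.

```python
def counter_increase(values):
    """Total increase across ordered counter samples, tolerating resets."""
    total = 0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # The counter restarted from zero; count what it has
            # accumulated since the restart.
            total += cur
    return total


def per_second_rate(samples):
    """Average per-second rate from ordered (timestamp_seconds, value) pairs."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase([value for _, value in samples]) / elapsed
```

For instance, the made-up samples [(0, 1000), (15, 4000), (30, 500), (45, 2000)] contain one reset. The increase is 3000 + 500 + 1500 = 5000 bytes over 45 seconds.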

Instance File Descriptors - fly_instance_filefd_

The number of allocated file descriptors and the maximum number that can be allocated, derived from /proc/sys/fs/file-nr.

fly_instance_filefd_allocated
fly_instance_filefd_maximum

Instance Filesystem - fly_instance_filesystem_

Filesystem metrics derived from VFS File System Information.

Labels:

  • mount: mount point name. This is / and, if you use Volumes, the destination name set in fly.toml.

fly_instance_filesystem_blocks
fly_instance_filesystem_block_size
fly_instance_filesystem_blocks_free
fly_instance_filesystem_blocks_avail
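These four series are enough to work out disk usage in bytes. The sketch below shows the arithmetic, assuming every block count is expressed in units of fly_instance_filesystem_block_size. The function name is hypothetical. The percentage follows the df-style convention of used / (used + available), since blocks_free can include space reserved for the superuser.

```python
def filesystem_usage(blocks, block_size, blocks_free, blocks_avail):
    """Derive byte totals and a usage percentage for one mount.

    Arguments map to the fly_instance_filesystem_* series of the same
    name (assumed to share block_size as their unit).
    """
    total_bytes = blocks * block_size
    used_bytes = (blocks - blocks_free) * block_size
    avail_bytes = blocks_avail * block_size  # usable by unprivileged processes
    denominator = used_bytes + avail_bytes
    used_pct = 100.0 * used_bytes / denominator if denominator else 0.0
    return {
        "total_bytes": total_bytes,
        "used_bytes": used_bytes,
        "avail_bytes": avail_bytes,
        "used_pct": used_pct,
    }
```

The same expression can be written directly as a query over the four series, which is usually the better choice for dashboards and alerts.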

Volumes - fly_volume_

Labels:

  • id: Volume ID

If you’re using Volumes for any of your organization’s apps, you’ll be able to query these series, derived from the LSize and Data% of the volume’s thin LV.

fly_volume_size_bytes
fly_volume_used_pct (0-100)

Postgres - pg_

If you have a Postgres database hosted on Fly.io, you’ll automatically get the following series, published via postgres_exporter:

pg_stat_activity_count
pg_stat_activity_max_tx_duration
pg_stat_archiver_archived_count
pg_stat_archiver_failed_count
pg_stat_bgwriter_buffers_alloc
pg_stat_bgwriter_buffers_backend_fsync
pg_stat_bgwriter_buffers_backend
pg_stat_bgwriter_buffers_checkpoint
pg_stat_bgwriter_buffers_clean
pg_stat_bgwriter_checkpoint_sync_time
pg_stat_bgwriter_checkpoint_write_time
pg_stat_bgwriter_checkpoints_req
pg_stat_bgwriter_checkpoints_timed
pg_stat_bgwriter_maxwritten_clean
pg_stat_bgwriter_stats_reset
pg_stat_database_blk_read_time
pg_stat_database_blk_write_time
pg_stat_database_blks_hit
pg_stat_database_blks_read
pg_stat_database_conflicts_confl_bufferpin
pg_stat_database_conflicts_confl_deadlock
pg_stat_database_conflicts_confl_lock
pg_stat_database_conflicts_confl_snapshot
pg_stat_database_conflicts_confl_tablespace
pg_stat_database_conflicts
pg_stat_database_deadlocks
pg_stat_database_numbackends
pg_stat_database_stats_reset
pg_stat_database_tup_deleted
pg_stat_database_tup_fetched
pg_stat_database_tup_inserted
pg_stat_database_tup_returned
pg_stat_database_tup_updated
pg_stat_database_xact_commit
pg_stat_database_xact_rollback
pg_stat_replication_pg_current_wal_lsn_bytes
pg_stat_replication_pg_wal_lsn_diff
pg_stat_replication_reply_time
pg_replication_lag
pg_database_size_bytes

Custom Metrics

For further customization beyond built-in metrics, Fly apps can expose a metrics endpoint. We’ll automatically scrape it every 15 seconds and send the results to Prometheus.

Configuration

Add a [metrics] section to your application’s fly.toml:

[metrics]
port = 9091
path = "/metrics" # default for most prometheus exporters

If your app uses multiple processes, you can add multiple [[metrics]] sections, each with its own set of processes:

[[metrics]]
port = 9394
path = "/metrics"
processes = ["web"]

[[metrics]]
port = 9113
path = "/metrics"
processes = ["proxy"]

Instrumentation

Instrument your app and expose your metrics on 0.0.0.0, using the port and path you configured in fly.toml.

There are many supported client libraries as well as off-the-shelf exporters able to return Prometheus-formatted metrics.
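As a minimal sketch of what such an endpoint involves, the example below serves one counter in the Prometheus text format using only the Python standard library. The metric name myapp_jobs_processed_total, the COUNTERS dict, and the helper names are hypothetical. Port 9091 and the /metrics path simply match the [metrics] example above. A real app would more likely use one of the client libraries.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-process counters; a real app would update these as it works.
COUNTERS = {"myapp_jobs_processed_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only answer on the path configured in fly.toml.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9091):
    """Listen on 0.0.0.0 so the scraper can reach the endpoint."""
    ThreadingHTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() blocks, so an app would typically run it in a background thread alongside its main work and increment COUNTERS as events happen.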