2025-03-15

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact.)

  • March 6: Multitenant Consul Outage For PG-Flex (11:00): We run two different kinds of Consul cluster at Fly.io. The one you’ll commonly read about on this infra-log is our gigantic main cluster, which we use to manage configuration and customer workload state; the other kind is the “multitenant clusters” that back automated Postgres for our customers. Automated Postgres relies on multitenant Consul to designate leader Postgres servers in leader/replica deployments. There are two generations of automated Postgres at Fly.io: one based on Stolon, which has extensive dependencies on Consul, and the other based on Flex, which needs Consul only when the cluster is deployed. The multitenant Consul cluster we operate in IAD had a server reboot, which caused a leader election, which (because Raft leader election is a cosmic horror trope) wedged. Bouncing the Consul cluster resolved the issue within about 10 minutes; it affected a very small subset of Postgres users.