May 19: Fly dashboard outage from broken GraphQL API deploy

May 19: Fly dashboard outage from broken GraphQL API deploy (23:13UTC)

A deploy of Fly’s GraphQL API service (which backs much of flyctl and parts of the dashboard) reduced the number of healthy instances enough that the HAProxy layer in front of it began timing out its /status health checks and returning fast 503s, which showed up as intermittent dashboard/API failures and elevated deploy failure rates. We recovered by removing the broken instances and cloning a known-good Machine to restore capacity until HAProxy backends were stable and green again.

Afterward we found that a local fly deploy of the GraphQL API service could produce a broken image because it skipped a CI-only step that fetches a supporting binary (resulting in an empty placeholder being copied into the image), and we merged a change to prevent that failure mode.