NRT Machines API errors thrashed Managed Postgres

# April 10: NRT Machines API errors thrashed Managed Postgres (18:37 UTC)

Managed Postgres clusters in NRT became intermittently unavailable when the regional Machines API began timing out and occasionally returning truncated 502 responses. Those failures caused Kubernetes (via our virtual-kubelet) and the Postgres operator to repeatedly reschedule and recreate Machines. The resulting feedback loop produced extra operator and pgbouncer Machines and left some pods stuck “not initialized,” which broke routing for affected clusters for several minutes at a time. We stabilized the region by shifting load off unhealthy workers, bringing additional NRT capacity online, and restarting the affected control components, after which cluster health checks returned to normal.