April 17: Vault outage broke TLS certificate lookups (13:04 UTC)

A failure in our certificate store (Vault) caused fly-proxy to time out or fail while resolving some TLS certificates, leading to intermittent TLS handshake errors for affected apps. The same issue also affected a subset of MPG clusters.

The issue started after a migration left the Vault cluster in a bad state: the Raft leader was stopped before it had transferred leadership to a different node. Because Vault (like other Raft-based services such as Consul) needs to load the entirety of its database into memory at process start, a cold boot of the leader would have taken hours, so we restored service by rebuilding the cluster instead.
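As a rough illustration of the ordering that went wrong here, the sketch below stops a node only after it has given up Raft leadership. It is not our migration tooling: `stop_node_safely` and the `is_leader`, `step_down`, and `stop` callables are hypothetical names introduced for this example. For Vault, `step_down` could plausibly wrap `vault operator step-down` and `is_leader` could read the `is_self` field from the `sys/leader` endpoint, but that wiring is an assumption and is left out.

```python
import time
from typing import Callable


def stop_node_safely(
    node: str,
    is_leader: Callable[[str], bool],   # hypothetical: True while `node` is Raft leader
    step_down: Callable[[str], None],   # hypothetical: ask `node` to give up leadership
    stop: Callable[[str], None],        # hypothetical: actually stop the process
    timeout_s: float = 30.0,
    poll_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Stop `node`, but only once it no longer holds Raft leadership."""
    if is_leader(node):
        step_down(node)
        deadline = time.monotonic() + timeout_s
        # Wait until the node reports that it is no longer the leader.
        while is_leader(node):
            if time.monotonic() >= deadline:
                # Refuse to stop a node that is still the leader.
                raise TimeoutError(f"{node} is still leader after {timeout_s}s")
            sleep(poll_s)
    stop(node)
```

The important property is that the stop step is unreachable while the node still reports itself as leader, so a failed handoff aborts the migration step instead of leaving the cluster without a warm leader.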

This is not the first time Vault has caused fly-proxy outages. Longer term, we have plans to migrate to something more resilient to this specific failure mode.