2024-12-07

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).

  • December 5: Vault Server Outage (16:45EST): An infrastructure deploy run was kicked off, in order to add a new server to our Hashicorp Consul cluster; because of a stale script in our infrastructure tooling, that run had the effect of rolling back a certificate update we’d done a month earlier, which broke our Vault cluster (which has a cross-dependency on Consul, an artifact of our old Nomad+Consul+Vault platform). Fortunately, several months back we deployed Pet Sematary, our in-house replacement for Vault, in a configuration that has PetSem kick in when Vault is unavailable (and vice/versa), so this likely had no customer impact over the 40 minutes it took to investigate and resolve (those 40 minutes also being an artifact of the reduced staffing needed because of the PetSem mitigation).

  • December 6: Fly Machines API Outage (08:50EST): An engineer deployed a change to our machines API code that had the effect of breaking the Fly Machines create API once deployed in production. The error was discovered immediately through alerts, and rolled back within 10 minutes (it should have been 5, but a CI issue disrupted the first attempted deploy).