March 2: Petsem got overwhelmed (21:15 UTC)

A slow database query in Petsem caused secret lookups to time out, which prevented some machines from starting during the incident. Our leading theory pointed to a machine on one of our hosts that was repeatedly trying to initialize a volume with a missing block device, triggering a large number of secret lookups. We recreated the missing block device and also flipped some feature flags to reduce load on Petsem. Fortunately, one of these actions worked and Petsem recovered, marking the end of the public incident.

This failure mode was triggered because this app had a lot of volume encryption keys stored in Petsem. Even though it only had a handful of existing volumes, we don’t delete encryption keys when volumes are deleted, as we still need them to decrypt volume snapshots. Usually the impact of a slow read query is limited, as we have multiple read replicas and can easily scale them up. However, changes to routing (as part of the regionalization project) had caused most of North America (including SJC, where the problematic volume was located) to route directly to the write primary instead of a nearby read replica (which saw almost no load throughout the incident). This meant that Petsem was unable to accept not only writes but also reads.
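
As a rough illustration of the routing behavior we expected, here is a minimal sketch in which writes go to the primary and reads go to a replica, preferring one in the client's region. This is not Petsem's actual routing code. The `Endpoint` type, the `pick_endpoint` function, and the `iad` and `ams` regions are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    # Hypothetical description of a database endpoint.
    name: str
    region: str
    is_primary: bool = False


def pick_endpoint(op, client_region, endpoints):
    """Send writes to the primary; send reads to a replica, preferring the client's region."""
    primary = next(e for e in endpoints if e.is_primary)
    if op == "write":
        return primary
    replicas = [e for e in endpoints if not e.is_primary]
    if not replicas:
        # No replica available: reads fall back to the primary.
        return primary
    local = [e for e in replicas if e.region == client_region]
    # Prefer a same-region replica, otherwise use any replica.
    return local[0] if local else replicas[0]


# Assumed topology for illustration only.
ENDPOINTS = [
    Endpoint("primary", "iad", is_primary=True),
    Endpoint("replica-sjc", "sjc"),
    Endpoint("replica-ams", "ams"),
]
```

With this policy, a read from SJC lands on the SJC replica, so a slow read query there would not compete with writes on the primary. During the incident, reads from most of North America were effectively taking the write path instead.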

In response to this incident, we’re working to migrate Petsem to a new database schema that supports the indexes needed to make these read queries consistently fast.