# May 5: Petsem primary host lost networking (IAD) (13:13 UTC)
A worker host running the primary instance of Petsem, our secrets storage service, lost network connectivity after a NIC driver stall. We restored service about 30 minutes after the start of impact by reloading the host’s NIC driver.
During the incident, requests to set or update secrets and to create apps failed globally. Some other platform functionality was also affected because an internal Redis instance used for rate limiting ran on the same host. However, existing apps and machines kept running, and existing secrets remained accessible from our read replicas.