May 27: App creation timeouts from petsem-certs disk full

May 27: App creation timeouts from petsem-certs disk full (03:43 UTC)

New app creation requests (including flyctl apps create) failed with 504 timeouts because they couldn’t create certificates in petsem-certs, the new certificate store we’re in the process of provisioning and migrating to from Vault. Although petsem-certs is not yet operational, we are already writing certificates to it, and those writes (disappointingly) filled its disk. Existing apps and Machines were unaffected; only app creation timed out. We restored app creation by expanding the storage.

Because we’re dual-writing to both petsem-certs and Vault, and the Fly Proxy reads from Vault, we didn’t expect the loss of petsem-certs to have any impact, and it hadn’t yet been hooked up to our monitoring. We also had excessive retries on requests to the store, which made the issue present as a timeout rather than an outright failure, so we’ve tuned the retry behavior as well.
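
To illustrate the retry point, here is a minimal sketch of a write wrapper with a bounded retry budget, so that a store with a full disk surfaces as a prompt, explicit error instead of a request that hangs until an upstream 504. This is not our actual implementation: the function name write_with_retry_budget, the StoreUnavailable exception, and the default attempt count, delay, and deadline are all hypothetical and chosen only for illustration.

```python
import time


class StoreUnavailable(Exception):
    """Raised when a write gives up, so callers see a clear failure."""


def write_with_retry_budget(write, max_attempts=3, base_delay=0.1,
                            deadline=2.0, clock=time.monotonic,
                            sleep=time.sleep):
    """Call write() with capped exponential backoff (hypothetical helper).

    Retrying stops when max_attempts is reached or when the next backoff
    wait would exceed the overall deadline (in seconds), whichever is first.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    start = clock()
    last_error = None
    attempts_made = 0

    for attempt in range(max_attempts):
        attempts_made = attempt + 1
        try:
            return write()
        except OSError as exc:  # e.g. ENOSPC, "No space left on device"
            last_error = exc

        # Backoff doubles each time: base_delay, 2x, 4x, ...
        delay = base_delay * (2 ** attempt)
        out_of_attempts = attempt == max_attempts - 1
        past_deadline = (clock() - start) + delay > deadline
        if out_of_attempts or past_deadline:
            break
        sleep(delay)

    # Fail explicitly instead of retrying until the caller times out.
    raise StoreUnavailable(
        f"write failed after {attempts_made} attempt(s)"
    ) from last_error
```

With a budget like this, a caller can map StoreUnavailable to an immediate error response. The specific numbers are placeholders; the point is only that the total time spent retrying should stay well below the request timeout of whatever sits in front of the service.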