March 19: Metrics outage

March 19: Metrics outage (06:26 UTC)

Metrics graphs were impacted when the storage backing our metrics pipeline filled up, which stalled ingestion. Typically the missing data is backfilled once a metrics outage like this recovers, but a configuration mistake this time meant we briefly accepted metrics without forwarding them on, resulting in roughly an hour of permanently lost metrics. We still have some open investigations on this one, such as why the storage filled up (it shouldn’t have) and why our alarms didn’t fire (they should have).