Trailing 2 Weeks Incidents
(Larger boxes are longer incidents; darker boxes are higher-impact.)
April 22: Network Maintenance in SCL (13:00EST): As with last week, this incident was a placeholder for planned upstream network maintenance; no ops work here to discuss. Last week's bulletin incorrectly stated that QRO was Santiago; it's Querétaro, Mexico. We regret the error. It's worth knowing that LATAM data centers are generally higher-maintenance for us than those outside LATAM.

April 23: Dashboard Interface Disruption (13:30EST): Synthetic monitoring alerts on 503s and extreme latency on our central API services, particularly in ui-ex, our Elixir API and frontend server. Additional capacity is added as a best-practice mitigation, which helps; some pathological queries are identified and resolved or debounced; the incident, which made our web frontend sporadically nonresponsive, was mitigated within 30 minutes.
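One general way to damp the load from repeated pathological queries is to coalesce concurrent identical requests so that only one actually hits the backend. Here's a minimal sketch of that technique using Go's golang.org/x/sync/singleflight; the expensiveQuery and getCached helpers and the cache key are hypothetical, and this isn't the actual ui-ex mitigation (ui-ex is Elixir), just an illustration of the shape of the fix.

```go
// Sketch: collapse concurrent duplicate expensive queries into one backend call.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// expensiveQuery is a stand-in for a pathological dashboard query.
func expensiveQuery(key string) (string, error) {
	return "result for " + key, nil
}

// getCached coalesces concurrent identical requests: while one call for a key
// is in flight, other callers for the same key wait and share its result.
func getCached(key string) (string, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		r, err := expensiveQuery(key)
		return r, err
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r, _ := getCached("org-123/apps") // hypothetical key
			fmt.Println(r)
		}()
	}
	wg.Wait()
}
```

Whatever the language, the shape of the fix is the same: make the repeated query cheap, or make it run once.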
April 24: API Capacity Issues (04:00EST): Support flags errors from our API servers and frontend, which is confirmed by our infra ops team. The problem is localized to our prod ui-ex cluster (staging is working fine, other API resources are working fine), and capacity is increased, resolving the issue.
April 26: Sensu Outage (12:30EST): This is an internal incident for our infra team. One of the tools we use to monitor our fleet is Sensu, which coordinates health checks on all our running services (like our flyd orchestrator, our fly-proxy Anycast router, and a cast of dozens of local services). We run Sensu in a geographically distributed configuration, backed by etcd as a state store. Sensu workers lose contact with etcd and freak out; it turns out that our cluster has Raft-ed itself into a state where paths between components run over Cogent. Adding Sensu workers and forcing a new etcd leader election resolves the problem. Resolved within 90 minutes, no customer impact (though there would eventually be customer impact if we lost Sensu for too long).
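For reference, forcing an etcd leader election boils down to asking the current leader to hand leadership to another voting member, which starts a new term. The sketch below does this with Go's go.etcd.io/etcd/client/v3 against made-up endpoint names; the actual remediation may well have been etcdctl move-leader or internal tooling, so treat this as an illustration of the operation, not our runbook.

```go
// Sketch: move etcd leadership to another member, assuming the Go client
// (go.etcd.io/etcd/client/v3) and placeholder internal endpoint names.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// Placeholder endpoints for a geographically distributed etcd cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-a.internal:2379", "http://etcd-b.internal:2379", "http://etcd-c.internal:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Ask any member for cluster status; the response names the current leader.
	status, err := cli.Status(ctx, cli.Endpoints()[0])
	if err != nil {
		log.Fatal(err)
	}
	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Find the leader's client URLs (etcd only accepts MoveLeader on the
	// leader itself) and pick any other voting member as the transferee.
	var leaderEndpoints []string
	var transferee uint64
	for _, m := range members.Members {
		if m.ID == status.Leader {
			leaderEndpoints = m.ClientURLs
		} else if transferee == 0 && !m.IsLearner {
			transferee = m.ID
		}
	}

	leaderCli, err := clientv3.New(clientv3.Config{Endpoints: leaderEndpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer leaderCli.Close()

	// Hand leadership to the chosen member, triggering a new election/term.
	if _, err := leaderCli.MoveLeader(ctx, transferee); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("moved etcd leadership to member %x\n", transferee)
}
```

The point of the step during the incident was presumably to land leadership on a member whose paths to the Sensu workers didn't cross the problematic Cogent transit.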