Trailing 2 Weeks Incidents
(Larger boxes are longer incidents; darker boxes are higher-impact.)
April 22: Network Maintenance in SCL (13:00EST): As with last week, this incident was a placeholder for planned upstream network maintenance; no ops work here to discuss. Last week's bulletin incorrectly stated that QRO was Santiago; it's Querétaro, Mexico. We regret the error. It's worth knowing that LATAM data centers are generally higher-maintenance for us than those outside LATAM.

April 23: Dashboard Interface Disruption (13:30EST): Synthetic monitoring alerts on 503s and extreme latency on our central API services, particularly in ui-ex, our Elixir API and frontend server. Additional capacity is added as a best-practice mitigation, which helps; some pathological queries are identified and resolved or debounced; the incident, which made our web frontend sporadically nonresponsive, was mitigated within 30 minutes.
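One general way to damp the load from repeated pathological queries is to coalesce concurrent identical requests so that only one actually hits the backend. Here's a minimal sketch of that technique using Go's golang.org/x/sync/singleflight; the expensiveQuery and getCached helpers and the cache key are hypothetical, and this isn't the actual ui-ex mitigation (ui-ex is Elixir), just an illustration of the shape of the fix.

```go
// Sketch: collapse concurrent duplicate expensive queries into one backend call.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// expensiveQuery is a stand-in for a pathological dashboard query.
func expensiveQuery(key string) (string, error) {
	return "result for " + key, nil
}

// getCached coalesces concurrent identical requests: while one call for a key
// is in flight, other callers for the same key wait and share its result.
func getCached(key string) (string, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		r, err := expensiveQuery(key)
		return r, err
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r, _ := getCached("org-123/apps") // hypothetical key
			fmt.Println(r)
		}()
	}
	wg.Wait()
}
```

Whatever the language, the shape of the fix is the same: make the repeated query cheap, or make it run once.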
April 24: API Capacity Issues (04:00EST): Support flags errors from our API servers and frontend, which is confirmed by our infra ops team. The problem is localized to our prod ui-ex cluster (staging is working fine, other API resources are working fine), and capacity is increased, resolving the issue.
April 26: Sensu Outage (12:30EST): This is an internal incident for our infra team. One of the tools we use to monitor our fleet is Sensu, which coordinates health checks on all our running services (like our flyd orchestrator, our fly-proxy Anycast router, and a cast of dozens of local services). We run Sensu in a geographically distributed configuration, backed by etcd as a state store. Sensu workers lose contact with etcd and freak out; it turns out that our cluster has Raft-ed itself into a state where paths between components run over Cogent. Adding Sensu workers and forcing a new etcd leader election resolves the problem. Resolved within 90 minutes, no customer impact (though there would eventually be customer impact if we lost Sensu for too long).
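For reference, forcing an etcd leader election boils down to asking the current leader to hand leadership to another voting member, which starts a new term. The sketch below does this with Go's go.etcd.io/etcd/client/v3 against made-up endpoint names; the actual remediation may well have been etcdctl move-leader or internal tooling, so treat this as an illustration of the operation, not our runbook.

```go
// Sketch: move etcd leadership to another member, assuming the Go client
// (go.etcd.io/etcd/client/v3) and placeholder internal endpoint names.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// Placeholder endpoints for a geographically distributed etcd cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-a.internal:2379", "http://etcd-b.internal:2379", "http://etcd-c.internal:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Ask any member for cluster status; the response names the current leader.
	status, err := cli.Status(ctx, cli.Endpoints()[0])
	if err != nil {
		log.Fatal(err)
	}
	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Find the leader's client URLs (etcd only accepts MoveLeader on the
	// leader itself) and pick any other voting member as the transferee.
	var leaderEndpoints []string
	var transferee uint64
	for _, m := range members.Members {
		if m.ID == status.Leader {
			leaderEndpoints = m.ClientURLs
		} else if transferee == 0 && !m.IsLearner {
			transferee = m.ID
		}
	}

	leaderCli, err := clientv3.New(clientv3.Config{Endpoints: leaderEndpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer leaderCli.Close()

	// Hand leadership to the chosen member, triggering a new election/term.
	if _, err := leaderCli.MoveLeader(ctx, transferee); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("moved etcd leadership to member %x\n", transferee)
}
```

The point of the step during the incident was presumably to land leadership on a member whose paths to the Sensu workers didn't cross the problematic Cogent transit.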