2025-03-22

A note on incidents: incidents are internal events for our infrastructure team. Incidents often correspond to degraded service on our platform, but not always. This log aims for 100% fidelity to internal incidents, and is a superset both of our status page events and of customer-impacting events on the platform. It includes events reported to subsets of customers on their personal status pages, as well as events without any status page impact.

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer incidents; darker boxes are higher-impact.)

  • March 17: Capacity Issues In IAD, AMS, FRA (11:30EST): An unexpected spike in usage (it’s been a busy month) drove us to our scheduling limits in 3 regions. Past those limits, adding new Fly Machines to the worker servers in a region risks harming the performance of existing Fly Machines, so, when these limits are hit, we stop allowing new deployments to those regions (there’s a simplified sketch of this check after the list). Generally, we operate two kinds of worker servers: long-term stable capacity and “burst” capacity; burst hardware is more expensive, but available on demand. We resolved this issue within about 3 hours by adding a substantial number of new burst worker servers. Capacity planning this month has been complicated not just by an increase in usage, but also by work we’re doing to diversify both our long-term and burst colo providers; usually, we’re several steps ahead on capacity.

  • March 18: Registry Performance Issues In Europe (11:30EST): A fun one. We experienced elevated latency and an elevated error rate for container registry operations in FRA, WAW, and AMS. The problems stayed below our alerting thresholds, so our response was delayed; the acute phase of this incident was prompted by elevated customer complaints about deploy issues in these regions. We managed to tie the incident to an upstream network incident: Cogent experienced a failure in its transatlantic links. Because our registry is backed by AWS S3 in Transfer Acceleration mode, registry transactions are DNS-load-balanced to CloudFront endpoints. For whatever reason, during the Cogent outage, DNS load balancing was sending European traffic to a KCMO endpoint half the time, which sent traffic over the busted Cogent link. We resolved the incident by cordoning off the European registry endpoints (a rough sketch of that cordon pattern follows below), but there’s a long-term fix here that involves registering our IPv6 addresses to ensure Europe isn’t randomly routed to KCMO, Arthur Bryants notwithstanding.
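
For the March 17 entry, here’s a minimal sketch of the kind of per-region admission check described above, written in Go. The type names, slot counts, and the 90% threshold are all made up for illustration; this isn’t our actual scheduler code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical per-region capacity snapshot; field names are illustrative,
// not our actual scheduler's types.
type regionCapacity struct {
	Region     string
	UsedSlots  int // Fly Machines currently placed on this region's workers
	TotalSlots int // slots across long-term stable and burst worker servers
}

// Past this utilization, placing more Machines risks degrading existing ones,
// so the region stops accepting new deployments. The 90% figure is made up.
const schedulingLimit = 0.90

var errRegionAtCapacity = errors.New("region at scheduling limit; new deployments paused")

func admitNewMachine(rc regionCapacity) error {
	utilization := float64(rc.UsedSlots) / float64(rc.TotalSlots)
	if utilization >= schedulingLimit {
		return fmt.Errorf("%s: %w", rc.Region, errRegionAtCapacity)
	}
	return nil
}

// Adding burst worker servers raises TotalSlots, which is roughly how the
// March 17 incident was resolved for IAD, AMS, and FRA.
func addBurstWorkers(rc regionCapacity, slotsPerWorker, workers int) regionCapacity {
	rc.TotalSlots += slotsPerWorker * workers
	return rc
}

func main() {
	iad := regionCapacity{Region: "iad", UsedSlots: 950, TotalSlots: 1000}
	fmt.Println(admitNewMachine(iad)) // region at scheduling limit

	iad = addBurstWorkers(iad, 50, 4) // bring 4 burst workers online
	fmt.Println(admitNewMachine(iad)) // <nil>: deployments resume
}
```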
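
And for the March 18 entry, a rough sketch of the cordon pattern: probe registry endpoints, take unhealthy ones out of rotation, and route deploy traffic to whatever’s left. The hostnames, timeout, and probe path here are hypothetical, not our real registry setup.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// Hypothetical endpoint list; these hostnames are illustrative only.
var registryEndpoints = []string{
	"https://registry-fra.example.internal",
	"https://registry-ams.example.internal",
	"https://registry-waw.example.internal",
	"https://registry-iad.example.internal",
}

// cordon tracks endpoints taken out of rotation, either by an operator or by
// a failing probe.
type cordon struct {
	mu  sync.Mutex
	bad map[string]bool
}

func (c *cordon) set(endpoint string, down bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.bad == nil {
		c.bad = map[string]bool{}
	}
	c.bad[endpoint] = down
}

// healthy returns the endpoints deploys are still allowed to use.
func (c *cordon) healthy() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var out []string
	for _, e := range registryEndpoints {
		if !c.bad[e] {
			out = append(out, e)
		}
	}
	return out
}

// probe cordons an endpoint if a cheap registry request errors, times out, or
// returns a 5xx: a stand-in for noticing the degraded path during the outage.
func probe(ctx context.Context, c *cordon, endpoint string) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint+"/v2/", nil)
	if err != nil {
		c.set(endpoint, true)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		c.set(endpoint, true)
		return
	}
	defer resp.Body.Close()
	c.set(endpoint, resp.StatusCode >= 500)
}

func main() {
	var c cordon
	for _, e := range registryEndpoints {
		probe(context.Background(), &c, e)
	}
	fmt.Println("routing registry traffic to:", c.healthy())
}
```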