2024-07-20

Trailing 2 Weeks Incidents

[Figure: a diagram of two weeks of incidents]

(Larger boxes are longer incidents, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents.)

  • July 14: CPU Steal On A Subset of IAD (01:00EST): A customer turned up a large, multi-machine, high performance CPU reservation, which impacted shared CPU Machines on the servers their own Machines ran on. We provisioned additional IAD capacity within about an hour and rebalanced workloads.

  • July 15: Storage System Failure in ORD (15:00EST): A newly-provisioned physical server in ORD had an incompatible RAID10 configuration, which has historically caused LVM2 to get stuck. This tripped alerts; the problem was diagnosed within a few minutes and resolved by draining the server, which was later reprovisioned.

  • July 16: Upstream Network Connectivity Between IAD and ORD (06:00EST): The on-call team raised an incident when our product team couldn’t get a working deploy for our Elixir API backend. After a great deal of time pointlessly debugging whether a kernel update had resulted in incompatible WireGuard implementations (it’s never kernel WireGuard), we tracked the issue down to a routing loop in ORD that involved Cogent (iykyk). Resolved by our upstream providers at around 08:00EST.

  • July 16: Elevated Registry Errors (17:00EST): Upstream network problems at Cogent persisted throughout the day. Our container registry is a distributed application running in multiple regions with an object storage backend; it began generating persistent 500 errors, which would have corresponded to transient deployment failures for someone running flyctl. The interim solution was to kill the registry Fly Machines that were throwing errors. Sometimes it’s good to be distributed: we’re not default-free routed, so we have limited control over our transit, but we do have application-layer control over the routing of the portions of our stack that are fully distributed.

  • July 18: Elevated DNS Errors (10:30EST): A team member called an incident after getting lots of metrics errors for DNS services. These weren’t customer-visible issues; they were internal DNS, and were caused by maintenance by an upstream in Melbourne. No impact, but we document everything here.

  • July 18: Memory Capacity Limited in SJC (16:00EST): A customer suddenly burst to several hundred high-memory performance instances in SJC. We were able to satisfy the request, but it left us out of capacity for additional deployments in the region. We provisioned additional capacity, they naturally scaled down their workload without us doing anything, and we worked on the long-term resource limit policy we’re going to use to address these issues in the future (we can handle almost literally any load customers want to generate, but for very large allocations in some regions we’re going to want some notice).

This Week In Infra Engineering

One thing you’re starting to see now is that Fly Machine migration and host draining are ironed out enough to be a casual solution to problems that would not have been casually resolved a year ago; “bring up new capacity, move the noisy workloads there” is a no-escalation runbook for us now. See the last 10 infra-logs for some of the effort that it took us to get to this point.

Akshit shipped a new egress billing model, which applies only to new customers for now. Under the new scheme, we segregate egress bandwidth by region in invoices, and private networking (between Fly Machines in different regions) is now cheaper. Our product team shipped a new billing system last month, and billing improvements are likely to be a continuing theme of our work.

Andres continued improving our internal synthetics monitoring systems.

Ben fixed several Fly Machine migration bugs: migration RPCs were breaking the configuration for static assets (we can serve static file HTTP directly off our proxies without bouncing HTTP requests off your Fly Machines, if you ask us to); we had a coordination bug in one of the FSMs our orchestrator flyd uses to migrate volumes; and high-availability Postgres cluster migrations were made less tricky (we do these by hand currently, for reasons we’re blogging about this week).

Matt shipped alert-critic, a chat-ops service that monitors our busiest alerting channels and tracks first-responder satisfaction with those alerts, in order to generate reports that spotlight problematic alerts that are either poorly reviewed or that don’t end up needing responses at all.
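
To make the idea of an alert-quality report concrete, here is a minimal sketch of how responder feedback might be aggregated. It is not alert-critic's actual implementation: the `Feedback` record, the 1-to-5 rating scale, the `problematic_alerts` function, and both thresholds are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Feedback:
    """One first-responder review of one alert firing (hypothetical schema)."""
    alert: str             # alert name
    rating: int            # responder satisfaction, assumed 1 (bad) to 5 (good)
    needed_response: bool  # did the alert actually require action?


def problematic_alerts(feedback, min_rating=3.0, min_action_rate=0.2):
    """Return alerts that are poorly reviewed or rarely need a response.

    Both thresholds are illustrative defaults, not values from the bulletin.
    """
    by_alert = defaultdict(list)
    for item in feedback:
        by_alert[item.alert].append(item)

    report = []
    for name, items in by_alert.items():
        avg_rating = sum(i.rating for i in items) / len(items)
        action_rate = sum(i.needed_response for i in items) / len(items)
        reasons = []
        if avg_rating < min_rating:
            reasons.append("poorly reviewed")
        if action_rate < min_action_rate:
            reasons.append("rarely needs response")
        if reasons:
            report.append((name, avg_rating, action_rate, reasons))

    # Worst-reviewed alerts first.
    return sorted(report, key=lambda row: row[1])


sample = [
    Feedback("disk-full", 5, True),
    Feedback("disk-full", 4, True),
    Feedback("cpu-steal", 2, False),
    Feedback("cpu-steal", 1, False),
    Feedback("dns-metrics", 4, False),
    Feedback("dns-metrics", 4, False),
]
report = problematic_alerts(sample)
for name, avg_rating, action_rate, reasons in report:
    print(f"{name}: rating={avg_rating:.1f} action_rate={action_rate:.0%} {reasons}")
```

Tracking the two signals separately keeps an alert that responders like, but that never needs action, distinguishable from one that is simply noisy.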

Peter generated network telemetry data to inform a fleetwide rollout of fly-proxy fallback routing, which routes requests through our overlay network during periods of network instability, at the application layer, automatically. This was deployed in Singapore last week, and is deployed more widely this week.
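
The general shape of application-layer fallback routing can be sketched as follows. This is an illustration only, not how fly-proxy works: the `PathHealth` class, the `choose_route` function, the sliding-window size, and the failure-rate threshold are all hypothetical.

```python
from collections import deque


class PathHealth:
    """Sliding window of recent request outcomes for one network path."""

    def __init__(self, window=20):
        self.results = deque(maxlen=window)

    def record(self, ok):
        self.results.append(bool(ok))

    def failure_rate(self):
        if not self.results:
            return 0.0  # no data yet: assume the path is healthy
        return self.results.count(False) / len(self.results)


def choose_route(direct, relays, max_failure_rate=0.3):
    """Prefer the direct path; fall back to the healthiest relay region.

    `relays` maps a relay region name to its PathHealth. The threshold is an
    illustrative default.
    """
    if direct.failure_rate() <= max_failure_rate:
        return "direct"
    rates = {name: health.failure_rate() for name, health in relays.items()}
    healthy = {name: rate for name, rate in rates.items() if rate <= max_failure_rate}
    if not healthy:
        return "direct"  # nothing better available; keep trying the direct path
    return min(healthy, key=healthy.get)


# Direct path is failing; one relay is clean, the other is flaky.
direct = PathHealth()
for _ in range(10):
    direct.record(False)

relays = {"sin": PathHealth(), "nrt": PathHealth()}
for i in range(10):
    relays["sin"].record(True)
    relays["nrt"].record(i % 2 == 0)

print(choose_route(direct, relays))
```

Telemetry of the kind described above is what would inform real values for the window and threshold. The numbers here are placeholders.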

Tom overhauled our alerting layer for internal server health check alerts; we have hundreds of these, and they currently route directly from our health check system to our on-call system (and thus, ultimately, PagerDuty). We’ve scaled past our alert system’s ability to reliably alert (for very ambitious values of “reliably”). The new alert system routes through Vector, like the rest of our logs and telemetry, and fires alerts from Grafana; both these systems are used for customer workloads and were built to scale, unlike our internal server health check system.