2024-08-17

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer incidents, darker boxes are higher-impact ones; see last week’s bulletin for details about the top row of incidents.)

  • August 13: Edge Capacity Saturated in DFW (04:00EST): Synthetic alerts pointed the on-call team at capacity issues in the DFW region, which may have been due to a routing configuration that effectively took some of our DFW edge capacity offline. We mitigated the problem within minutes, and added additional long-term edge capacity to the region.

  • August 15: Network Latency Spike In CDG (09:00EST): A customer deployment of a very high-traffic public API saw a spike in requests that overwhelmed their deployment; at the same time, they had hard request limits (the threshold past which our load balancer won’t forward traffic to a loaded instance) effectively disabled. The combination of the under-scaled service and the overridden load balancing set up a small request storm that degraded network performance regionally for about 30 minutes.

  • August 15: Regional Orchestrator Outage In EWR (09:30EST): The first of three related incidents. Internal monitoring alerted us to slow and timing-out Fly Machines API operations in EWR. Investigation turned up a stuck instance of Corrosion, our internal state-sharing system. Most services in our orchestration system are customers of Corrosion, and queries to it were effectively hanging, which caused cascading failures in the region. Corrosion in EWR was stuck with a very full queue of writes to apply and a very large WAL. Bouncing Corrosion and arranging the truncation of the WAL (see the SQLite sketch after this list) mitigated the problem, which was acute in EWR for about an hour.

  • August 15: Orchestrator Out-Of-Memory Crashes (14:30EST): The second and most central of three related issues. In flyd, the core of our orchestrator, metrics had begun showing that very old Fly Machines, with long histories recorded in flyd’s BoltDB store, could see slow queries, because flyd needed to iterate through their entire history to load their current state. A PR was merged to track the current state directly, in order to speed those queries up (the BoltDB sketch after this list shows the general shape of that change). Over the course of several months, the extra per-Machine metadata this change tracked bloated both flyd’s BoltDB database and the in-memory state flyd keeps (specifically: the process of cloning and merging current state over time caused some keys in that metadata to balloon with duplicated data). On a small number of busy worker servers (roughly 10), this problem became acute and resulted in OOM crashes. A fix was rolled out over the course of several hours (this involved both a flyd code change and some brain surgery to manage the embiggened databases).

  • August 15: Elixir API Server Deployment Failure (16:30EST): The third of three related incidents. A failed deployment and rollback of ui-ex, our Elixir API server, left that service in an inconsistent state with respect to our load balancer: old, defunct instances remained installed in our routing table, and newly online instances remained cordoned off. Two things went wrong here: the deployment failure itself was organic (a developer did something wrong), but it was amplified by orchestrator instability. During the acute phase of this incident, lasting around 30 minutes, API clients and UI users may have seen sporadic HTTP 502 errors; a careful manual redeploy fully resolved the issue in about 45 minutes. (Full disclosure: this incident merges two tracked incidents over the same time period, both of which were handling this same issue.)
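For the curious: Corrosion keeps its state in SQLite, and “truncating the WAL” is an ordinary SQLite operation. Below is a minimal, generic Go sketch of that operation, using the mattn/go-sqlite3 driver and a made-up database path; it is not Corrosion’s actual remediation tooling, just the shape of the pragma involved.

```go
// Generic illustration of checkpointing and truncating a SQLite WAL.
// The database path is hypothetical, not Corrosion's real on-disk layout.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "/var/lib/corrosion/state.db") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// wal_checkpoint(TRUNCATE) flushes every WAL frame back into the main
	// database file and truncates the WAL to zero bytes; it reports busy=1
	// if readers or writers prevented a complete checkpoint.
	var busy, walFrames, checkpointed int
	err = db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE);").Scan(&busy, &walFrames, &checkpointed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("busy=%d wal_frames=%d checkpointed=%d\n", busy, walFrames, checkpointed)
}
```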
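And a minimal sketch of the flyd access pattern behind the OOM incident: deriving a Machine’s current state by folding over its entire event history, versus reading a separately maintained current-state key. Bucket names, key layout, and the fold function are all invented for illustration; flyd’s real BoltDB schema is more involved than this.

```go
// Sketch of "replay the whole history" vs. "read a tracked current state"
// against a BoltDB (bbolt) store. All names here are hypothetical.
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// Slow path: fold over the Machine's full event history to derive its current
// state. Cost grows with the length of the history.
func currentStateFromHistory(db *bolt.DB, machineID string) ([]byte, error) {
	var state []byte
	err := db.View(func(tx *bolt.Tx) error {
		events := tx.Bucket([]byte("machine_events/" + machineID)) // hypothetical per-Machine bucket
		if events == nil {
			return fmt.Errorf("no history for %s", machineID)
		}
		c := events.Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			state = applyEvent(state, v) // fold every event, oldest to newest
		}
		return nil
	})
	return state, err
}

// Fast path: read a separately maintained current-state key. This is the shape
// of the optimization; the bug was that the value stored here accumulated
// duplicated metadata each time it was cloned and merged, bloating both the
// database and flyd's in-memory copy of it.
func currentStateCached(db *bolt.DB, machineID string) ([]byte, error) {
	var state []byte
	err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("machine_current")) // hypothetical bucket
		if b == nil {
			return fmt.Errorf("no current-state bucket")
		}
		// bbolt values are only valid inside the transaction, so copy.
		state = append([]byte(nil), b.Get([]byte(machineID))...)
		return nil
	})
	return state, err
}

// applyEvent is a stand-in for "merge this event into the current state".
func applyEvent(prev, event []byte) []byte {
	return append([]byte(nil), event...)
}

func main() {
	db, err := bolt.Open("flyd.db", 0o600, nil) // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if s, err := currentStateCached(db, "machine-123"); err == nil {
		fmt.Printf("current state: %d bytes\n", len(s))
	}
}
```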

August 15th was spicy; with the exception of the Elixir API issue, the problems were contained regionally or to a small minority of specific hosts, but it was an infra-intensive day.

This Week In Infra Engineering

Several infra people were out on vacation this week, and the rest of the team did interesting work that deserves a showcase, but we’re giving this week in infra engineering over to Peter; everybody else will get their due next week.

One of the features we offer to applications hosted on Fly.io is a dense set of Prometheus performance metrics, along with Grafana dashboards to render them. One of the things our users can monitor with these metrics is TLS handshake latency. Why do we expose TLS latency metrics? We don’t know. It was easy to do.

Anyways, a consequence of exposing TLS latency metrics is that customers can, if they’re watchful, notice spikes in TLS latency. And so someone did, and reported to us anomalously slow TLS handshakes originating from AWS, which set off several weeks of engineering effort to isolate the problem.

Two things complicated the analysis. First, while we could identify slower-than-usual latencies, we couldn’t (yet) isolate the extreme-latency cases. Second, packet-level analytics didn’t show any weird intra-handshake packet latency originating “from us”: the TCP three-way handshake completed quickly, our TLS ServerHello messages rapidly followed the ClientHello, and so on.

What we did notice were specific clients that appeared to incur large penalties on “their side” of the TLS handshake: delays of up to half a second between the TCP ACK for our ServerHello and the ChangeCipherSpec message that would move the handshake forward. Reverse DNS for all of these events traced back to AWS. This was interesting evidence of something, but it didn’t dispose of the question of whether there were events causing our own Anycast proxies to lag.

To better catch and distinguish these cases, what we needed was a more fine-grained breakdown of the handshake. A global tcpdump would produce a huge pcap file on each of our edges within seconds, and even if we could do that, it would be painful to pick the connections with slow TLS handshakes out of those files.

Instead, Peter ended up (1) making fly-proxy log all slow handshakes, and (2) writing a tool called petertron4000 (real infra-log heads may remember its progenitor, petertron3000) that temporarily saves all packets in a fixed-size ring buffer, ingests the slow-handshake log events from systemd-journald, and writes durable logs of the flows corresponding to those slow handshakes.
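petertron4000 itself isn’t public, but here’s a heavily simplified sketch of a tool with that shape, using Go and gopacket: buffer recent packets in a fixed-size ring, read slow-handshake events (here just client ip:port lines on stdin, e.g. piped from journalctl, rather than talking to systemd-journald directly), and dump the buffered packets for the matching flow. The interface name, flow keying, and event format are all assumptions.

```go
// A heavily simplified, illustrative sketch of a petertron4000-shaped tool.
// Assumptions: the capture interface is "eth0", slow-handshake events arrive
// as one "clientIP:clientPort" per line on stdin, and "dumping a flow" just
// means printing one-line packet summaries.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"sync"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

const ringSize = 1 << 16 // keep the most recent 65k packet summaries

type packetRec struct {
	flow string // client side of the connection, "ip:port"
	info string // one-line summary of the packet
}

type ring struct {
	mu   sync.Mutex
	buf  [ringSize]packetRec
	next int
}

func (r *ring) add(rec packetRec) {
	r.mu.Lock()
	r.buf[r.next%ringSize] = rec
	r.next++
	r.mu.Unlock()
}

// dump prints every buffered packet belonging to the given client flow.
func (r *ring) dump(flow string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, rec := range r.buf {
		if rec.flow == flow {
			fmt.Println(rec.info)
		}
	}
}

func main() {
	handle, err := pcap.OpenLive("eth0", 128, false, pcap.BlockForever) // snap only headers
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()
	if err := handle.SetBPFFilter("tcp port 443"); err != nil {
		log.Fatal(err)
	}

	r := &ring{}

	// Capture loop: summarize each TLS-port packet and stash it in the ring.
	go func() {
		src := gopacket.NewPacketSource(handle, handle.LinkType())
		for pkt := range src.Packets() {
			ip4, ok1 := pkt.NetworkLayer().(*layers.IPv4)
			tcp, ok2 := pkt.TransportLayer().(*layers.TCP)
			if !ok1 || !ok2 {
				continue
			}
			// Key the flow by the non-443 (client) side of the connection.
			flow := fmt.Sprintf("%s:%d", ip4.SrcIP, tcp.SrcPort)
			if tcp.SrcPort == 443 {
				flow = fmt.Sprintf("%s:%d", ip4.DstIP, tcp.DstPort)
			}
			ts := pkt.Metadata().Timestamp.Format("15:04:05.000000")
			r.add(packetRec{
				flow: flow,
				info: fmt.Sprintf("%s %s:%d -> %s:%d len=%d", ts, ip4.SrcIP, tcp.SrcPort, ip4.DstIP, tcp.DstPort, len(tcp.Payload)),
			})
		}
	}()

	// Event loop: each stdin line names a slow-handshake client flow to dump.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		fmt.Printf("--- slow handshake from %s ---\n", sc.Text())
		r.dump(sc.Text())
	}
}
```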

With petertron4000 in hand, he was able to spot check a random selection of edges. The outcome:

  • Several latency issues that we were able to trace to bad upstream routing.

  • Recurrences of the issue described above, where clients are lagging, on “their side”, in the middle of the handshake, overwhelmingly sourced from AWS.

Peter was able to reproduce the problem directly on EC2: running 8 concurrent curl requests on a t2.medium reliably triggers it. Accompanying the TLS lag: high CPU usage on the instance. Further corroborating the client-side theory: we can get the same thing to happen curl-ing Google.
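If you want to watch the same effect yourself, something like this rough Go equivalent of the “8 concurrent curls” reproduction will do it: fire concurrent HTTPS requests and time each TLS handshake with net/http/httptrace. The target URL and request counts are placeholders; the idea is that on an overloaded instance you’d see the handshake times stretch out along with CPU usage.

```go
// Rough equivalent of the "8 concurrent curls" reproduction: fire N concurrent
// HTTPS requests and report each one's TLS handshake time. The target URL is a
// placeholder.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptrace"
	"sync"
	"time"
)

// One shared client; keep-alives disabled so every request does a fresh TLS handshake.
var client = &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}

// timedGet performs one GET and returns the TLS handshake duration.
func timedGet(url string) (time.Duration, error) {
	var start time.Time
	var handshake time.Duration
	trace := &httptrace.ClientTrace{
		TLSHandshakeStart: func() { start = time.Now() },
		TLSHandshakeDone:  func(tls.ConnectionState, error) { handshake = time.Since(start) },
	}
	ctx := httptrace.WithClientTrace(context.Background(), trace)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
	return handshake, nil
}

func main() {
	const concurrency = 8
	url := "https://example.fly.dev/" // placeholder target

	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for j := 0; j < 20; j++ {
				d, err := timedGet(url)
				if err != nil {
					log.Printf("worker %d: %v", worker, err)
					continue
				}
				fmt.Printf("worker %d: TLS handshake took %v\n", worker, d)
			}
		}(i)
	}
	wg.Wait()
}
```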

This was an investigation set in motion by an ugly-looking graph in the metrics we expose, not by any “organically observed” performance issue; a customer saw a spike and wanted to know what it was. We’re pretty sure we know now: during those spikes, they were getting requests from overloaded EC2 machines, which slow-played the TLS handshake and made our metrics look gross. :shaking-fist-emoji:.

That said: by dragnetting slow TLS handshakes, we still uncovered a bunch of pessimal routing issues, and we’re on a quest this summer to shake all of those out. The big question we’re working on now: what’s the best way to roll something like the petertron4000 (the petertron4500) so that it can run semipermanently and alert for us? Fun engineering problem! We definitely don’t want to be doing high-volume low-level packet capture work indefinitely on all our edges, right? Stay tuned.