West coast edge proxies overloaded

# May 28: West coast edge proxies overloaded (21:08 UTC)

This incident requires some background which will become important later:

fly-proxy: a Rust-based, userspace L4 / L7 load balancer
Corrosion: our distributed service discovery / state propagation system
Airtime: fly-proxy’s built-in dynamic defense against sudden traffic spikes. It was introduced late last year / early this year; before that, we had no way to prevent one app from monopolizing bandwidth on a host. Earlier this year, we spent some time tuning Airtime’s parameters so that it triggers near the bandwidth saturation point of each of our edge servers.
Lazy loader: in the long before-times, fly-proxy ingested almost all of the data in Corrosion into its process memory through Corrosion’s subscription API. That stopped scaling a long time ago, so we switched to a lazy-loading model in which only the entries required for active requests are loaded.
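To make the lazy-loading idea concrete, here is a minimal sketch of a load-on-first-use catalog. It is not fly-proxy's actual code: `AppState`, `LazyCatalog`, and the loader closure are hypothetical names, and the closure stands in for whatever query the real proxy issues against Corrosion's data.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a service-discovery record (not the real schema).
#[derive(Clone, Debug, PartialEq)]
pub struct AppState {
    pub app: String,
    pub backends: Vec<String>,
}

/// Loads an entry on first use and serves it from memory afterwards.
pub struct LazyCatalog<F>
where
    F: FnMut(&str) -> Option<AppState>,
{
    cache: HashMap<String, AppState>,
    load: F,          // assumption: in a real system this would be a database query
    pub loads: usize, // number of times the backing store was consulted
}

impl<F> LazyCatalog<F>
where
    F: FnMut(&str) -> Option<AppState>,
{
    pub fn new(load: F) -> Self {
        Self { cache: HashMap::new(), load, loads: 0 }
    }

    /// Returns the cached entry, consulting the backing store only on a miss.
    pub fn get(&mut self, app: &str) -> Option<&AppState> {
        if !self.cache.contains_key(app) {
            self.loads += 1;
            let state = (self.load)(app)?;
            self.cache.insert(app.to_string(), state);
        }
        self.cache.get(app)
    }
}

/// Two requests for the same app cause one load; an unknown app causes another.
pub fn lazy_demo() -> (usize, usize) {
    let mut catalog = LazyCatalog::new(|app: &str| {
        if app == "example-app" {
            Some(AppState {
                app: app.to_string(),
                backends: vec!["10.0.0.1:8080".to_string()],
            })
        } else {
            None
        }
    });
    let _ = catalog.get("example-app");
    let _ = catalog.get("example-app");
    let loads_after_repeat = catalog.loads;
    let _ = catalog.get("missing-app");
    (loads_after_repeat, catalog.loads)
}
```

The trade-off this sketch shows is that memory use tracks active traffic rather than total state, but every cache miss puts a load on the request path, so a slow backing store translates directly into request latency.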

The incident started when we noticed flappiness in our US west coast regions, initially mostly in SJC. Our logs and metrics showed high lazy loader latency, on the order of 500 ms to several seconds, which meant that many new requests had to wait that long or longer to be served. On the other hand, the proxy’s CPU usage was not especially high, and neither was the inbound connection rate. We’ve seen this kind of issue before: it usually indicates inefficient SQLite queries, certain apps with excessively large state stored in Corrosion, or general host performance issues. At this point, we happened to spot one app with extremely large state in Corrosion and quickly “concluded” that it must be contributing to the issue, so we put in a temporary mitigation and deployed the proxy in SJC.

It momentarily seemed to improve the situation, but latency quickly shot through the roof again once the new proxy processes warmed up. We began to question whether the cause was inefficient SQLite queries, which we ruled out, or lock contention caused simply by the increased connection rates that came with our recent growth. This is also when we noticed Airtime reporting increased bandwidth in SJC, though it was still below what we had previously concluded was the ceiling a single edge server could handle. In any case, our edge capacity in SJC was also underprovisioned because a couple of servers were out of production, so we decided to first shift Anycast traffic to LAX and see whether it handled the load better.

Again, this seemed to help at first, but after a while LAX started struggling as well (side note: at certain points we also attempted to shift traffic out of the west coast entirely, which is why edges in other regions may have been momentarily affected). We finally decided to lower Airtime’s bandwidth limit, even though we were fairly sure our edges could take the level of traffic seen throughout this incident. That did bring softirq CPU usage and host load average down, but the proxy was still struggling with slow lazy loader queries. We bounced the proxy, which seemed to clear up the lazy loader issues as well. This marked the end of the first acute phase of the incident.

It would have been nice if this had been the actual end of the incident. It was not, primarily because of two other issues:

Airtime, the system we use to limit the impact of traffic spikes, works entirely within a single process and does not propagate its knowledge outside that process. This would not have been a problem (we initiated a hard kill of all pending-shutdown proxy processes when we bounced them), if not for:
Due to a bug in how our proxy deployment script interacts with systemd, we had left multiple instances of the proxy running indefinitely on some of the affected nodes (TL;DR: systemctl kill does not actually transition a unit to a stopped state; combined with Restart=always, it simply causes the process to restart).

Together, these two issues mean that any limit we set in Airtime could, at any point, be effectively doubled if some heavy connections landed on a different proxy instance, which caused the same issue to recur after the initial phase was resolved. It is also worth noting that needing to bounce proxy processes after tuning Airtime itself contributed to the problem, and it revealed issues with queuing behavior around the lazy loader. Specifically, it seems possible to end up with effectively infinite queues waiting on the SQLite connections when the lazy loader itself is slow (for example, because softirq is contending with userspace for CPU under high load), and these queues do not drain unless the process is bounced (which, in turn, exposed the other issues that caused the incident to recur).
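The doubling effect can be illustrated with a small model. This is only a sketch under simplified assumptions, not Airtime's real design: `ProcessLimiter` and `host_admitted_bps` are hypothetical names, bandwidth is an abstract integer, and each connection is treated as a fixed-rate flow.

```rust
/// Hypothetical limiter that only knows about traffic in its own process.
pub struct ProcessLimiter {
    limit_bps: u64,
    used_bps: u64,
}

impl ProcessLimiter {
    pub fn new(limit_bps: u64) -> Self {
        Self { limit_bps, used_bps: 0 }
    }

    /// Admits a flow only if this process alone would stay within its limit.
    pub fn try_admit(&mut self, bps: u64) -> bool {
        if self.used_bps + bps <= self.limit_bps {
            self.used_bps += bps;
            true
        } else {
            false
        }
    }

    pub fn used(&self) -> u64 {
        self.used_bps
    }
}

/// Total bandwidth admitted on one host when `instances` independent processes
/// each enforce the same limit and each receive `conns_per_instance` flows.
pub fn host_admitted_bps(
    instances: usize,
    limit_bps: u64,
    conn_bps: u64,
    conns_per_instance: usize,
) -> u64 {
    let mut total = 0;
    for _ in 0..instances {
        // Each instance starts from zero: it cannot see its siblings' usage.
        let mut limiter = ProcessLimiter::new(limit_bps);
        for _ in 0..conns_per_instance {
            limiter.try_admit(conn_bps);
        }
        total += limiter.used();
    }
    total
}
```

With a limit of 1000 units and twenty 100-unit flows per instance, one instance admits 1000 units in total, while two instances on the same host admit 2000, even though the host's capacity has not changed.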

In summary, this incident was caused by a combination of factors:

Our edge capacity is underprovisioned in some regions; it has not caught up with the recent growth in our user base.
Airtime’s tuning no longer matches reality, either due to a shift in traffic patterns or other non-bandwidth scaling issues in the proxy.
A bug caused multiple active proxy instances to coexist without code to handle shared state.
The lazy loader exhibits runaway queuing behavior at high load.
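As a contrast to the runaway queuing in the last item, the sketch below shows one common way to bound a work queue: reject new work when the queue is full, and discard entries that have already waited too long. It is an illustration only, not the fix that was shipped; `BoundedQueue`, its capacity, and its wait budget are hypothetical, and time is passed in explicitly as milliseconds to keep the example deterministic.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
pub enum Enqueue {
    Accepted,
    Shed,
}

/// A queue with a hard capacity and a maximum time an item may wait.
pub struct BoundedQueue<T> {
    items: VecDeque<(u64, T)>, // (enqueue time in ms, payload)
    capacity: usize,
    max_wait_ms: u64,
}

impl<T> BoundedQueue<T> {
    pub fn new(capacity: usize, max_wait_ms: u64) -> Self {
        Self { items: VecDeque::new(), capacity, max_wait_ms }
    }

    /// Sheds new work when full instead of letting the queue grow without bound.
    pub fn push(&mut self, now_ms: u64, item: T) -> Enqueue {
        if self.items.len() >= self.capacity {
            return Enqueue::Shed;
        }
        self.items.push_back((now_ms, item));
        Enqueue::Accepted
    }

    /// Returns the oldest item still within its wait budget, dropping stale ones.
    pub fn pop(&mut self, now_ms: u64) -> Option<T> {
        while let Some((enqueued_ms, item)) = self.items.pop_front() {
            if now_ms.saturating_sub(enqueued_ms) <= self.max_wait_ms {
                return Some(item);
            }
            // Stale entry: it has waited past the budget, so it is discarded
            // rather than served late.
        }
        None
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}
```

Under these assumptions the worst-case wait is capped by `max_wait_ms` and the backlog by `capacity`, so a slow consumer produces fast rejections instead of a queue that only clears when the process is restarted.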

We’re working hard to address every one of these issues. To start, we are going to provision significantly more edge capacity in the coming weeks and months. We have fixed the bug that caused multiple proxy instances to coexist, and changed Airtime so that, for now, it applies a much stricter limit when it is not the expected active proxy instance. We have fixed load-shedding behavior in the lazy loader so that there is a more reasonable upper bound on request latency. Other work is currently under way:

We believe the proxy runs into lazy loader-related performance issues much earlier now than it used to because the single coarse-grained lock on the proxy’s in-memory state no longer scales well as we grow. We have observed high queuing delays not in SQLite queries, but simply in inserting data into the in-memory service catalog. We’re planning to shard the catalog and move to finer-grained locking, with help from testing tools such as Antithesis to ensure the migration does not cause more outages.
We are going to rework Airtime so that it reacts to overall system load rather than just the proxy’s own load. This will hopefully serve as a backstop when we somehow end up with multiple proxy processes running, or when any non-proxy processes on the same host consume any of the bandwidth headroom.
We’re looking into better monitoring for when the proxy is not running in its expected configuration.
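For readers unfamiliar with the sharding idea mentioned in the first item above, here is a minimal sketch of a catalog split into independently locked shards. It is a generic illustration using only the standard library, not the planned fly-proxy design; `ShardedCatalog`, the shard count, and the key format are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use std::thread;

/// A map split into shards, each guarded by its own lock.
pub struct ShardedCatalog<V> {
    shards: Vec<Mutex<HashMap<String, V>>>,
}

impl<V: Clone> ShardedCatalog<V> {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    /// Picks the shard for a key by hashing it.
    fn shard_for(&self, key: &str) -> &Mutex<HashMap<String, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let idx = (hasher.finish() as usize) % self.shards.len();
        &self.shards[idx]
    }

    /// Locks only the shard that owns `key`, not the whole catalog.
    pub fn insert(&self, key: &str, value: V) {
        self.shard_for(key).lock().unwrap().insert(key.to_string(), value);
    }

    pub fn get(&self, key: &str) -> Option<V> {
        self.shard_for(key).lock().unwrap().get(key).cloned()
    }

    pub fn len(&self) -> usize {
        self.shards.iter().map(|s| s.lock().unwrap().len()).sum()
    }
}

/// Four threads insert 100 distinct keys each; returns the final entry count.
pub fn sharded_demo() -> usize {
    let catalog = Arc::new(ShardedCatalog::new(8));
    let handles: Vec<_> = (0..4)
        .map(|t| {
            let c = Arc::clone(&catalog);
            thread::spawn(move || {
                for i in 0..100 {
                    c.insert(&format!("app-{}-{}", t, i), i);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    catalog.len()
}
```

Writers that touch different shards no longer wait on each other, which is the property a single coarse-grained lock cannot provide. The cost is that operations spanning the whole catalog, such as `len` here, must visit every shard and do not see a single consistent snapshot.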