# April 12: High edge CPU usage resulting in high latency in ORD (18:43 UTC)
ORD edge nodes became CPU-saturated, which made traffic entering through the ORD region intermittently slow and, in some cases, time out. Profiling on the affected edges showed fly-proxy spending an unexpectedly large amount of time in pthread_mutex_{lock,unlock} calls. This is weird, because fly-proxy itself does not, in fact, use pthread mutexes – it uses locks from the parking_lot crate, which builds directly on futex system calls. Eventually, we traced the likely cause of this lock contention to SQLite, which fly-proxy uses to read Corrosion’s local database directly when loading the app metadata it needs for routing. We reduced the number of SQLite connections fly-proxy opens on the ORD edges, which immediately dropped CPU usage and brought lookup latencies back to normal.
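To illustrate why capping the connection count helps, here is a minimal sketch of a fixed-size pool, using only the Rust standard library. It is not fly-proxy's actual code: `Conn`, `Pool`, `with_conn`, and `peak_concurrency` are hypothetical names, and `Conn` is a stand-in for a real SQLite connection. The point is only that a cap of `cap` connections means at most `cap` threads can be inside the database layer at once, which bounds contention on any lock they share.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a SQLite connection.
pub struct Conn {
    pub id: usize,
}

/// A fixed-size pool: exactly `cap` connections are created up front.
pub struct Pool {
    idle: Mutex<Vec<Conn>>,
    available: Condvar,
}

impl Pool {
    pub fn new(cap: usize) -> Self {
        let conns = (0..cap).map(|id| Conn { id }).collect();
        Pool { idle: Mutex::new(conns), available: Condvar::new() }
    }

    /// Runs `f` with exclusive use of one connection, blocking while none is idle.
    /// (If `f` panics the connection is not returned; a real pool would use a guard.)
    pub fn with_conn<R>(&self, f: impl FnOnce(&mut Conn) -> R) -> R {
        let guard = self.idle.lock().unwrap();
        // Sleep until at least one connection is idle.
        let mut idle = self.available.wait_while(guard, |v| v.is_empty()).unwrap();
        let mut conn = idle.pop().expect("non-empty after wait_while");
        drop(idle); // release the pool lock while the connection is in use
        let out = f(&mut conn);
        self.idle.lock().unwrap().push(conn);
        self.available.notify_one();
        out
    }
}

/// Starts `workers` threads against a pool of `cap` connections and returns
/// the highest number of threads observed holding a connection at once.
pub fn peak_concurrency(cap: usize, workers: usize) -> usize {
    let pool = Pool::new(cap);
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| {
                pool.with_conn(|_conn| {
                    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    thread::sleep(Duration::from_millis(5)); // simulated query
                    in_flight.fetch_sub(1, Ordering::SeqCst);
                })
            });
        }
    });
    peak.load(Ordering::SeqCst)
}
```

Calling `peak_concurrency(2, 8)` never reports more than 2, however many workers are started. The trade-off is that the remaining workers wait for a connection instead of contending inside the database layer.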
We are, however, still unsure exactly which lock in SQLite caused the contention. Initially, we suspected the per-connection lock SQLite uses to prevent concurrent access, but our Rust-side code (based on rusqlite) explicitly marks connections as !Sync, so they are never shared between threads in the first place. Our current hypothesis is that the contention comes from rusqlite’s use of the compile-time flag SQLITE_ENABLE_MEMORY_MANAGEMENT, which puts a mutex on SQLite’s per-process page cache. However, we have not been able to definitively confirm this: we have no stack traces through SQLite from the incident, and we have not managed to reproduce the issue at all. We have enabled more instrumentation in our code, which will hopefully give us more complete stack trace profiles should this happen again.
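The hypothesis can be pictured with a toy model, again using only the Rust standard library. It is a sketch of the reasoning, not of SQLite's internals: `Conn`, `read_page`, and `shared_lock_acquisitions` are hypothetical names. Each thread owns its own connection, which is `Send` but `!Sync` because of the `Cell` field, so no connection is ever shared. Every read still goes through a single process-wide mutex, which is where contention would show up.

```rust
use std::cell::Cell;
use std::sync::Mutex;
use std::thread;

/// Hypothetical per-thread connection. The `Cell` field makes this type
/// `Send` but `!Sync`, so a `&Conn` cannot be shared across threads.
pub struct Conn {
    queries: Cell<u64>,
}

impl Conn {
    /// Does per-connection bookkeeping without any lock, then touches the
    /// shared "page cache" counter, which requires taking the global mutex.
    pub fn read_page(&self, page_cache: &Mutex<u64>) {
        self.queries.set(self.queries.get() + 1);
        *page_cache.lock().unwrap() += 1;
    }
}

/// Runs `threads` workers, each with a private connection doing `reads`
/// page reads, and returns how many times the shared mutex was acquired.
pub fn shared_lock_acquisitions(threads: usize, reads: u64) -> u64 {
    let page_cache = Mutex::new(0u64);
    thread::scope(|s| {
        for _ in 0..threads {
            let conn = Conn { queries: Cell::new(0) };
            let page_cache = &page_cache;
            // The connection is moved into its thread, never shared.
            s.spawn(move || {
                for _ in 0..reads {
                    conn.read_page(page_cache);
                }
                assert_eq!(conn.queries.get(), reads);
            });
        }
    });
    page_cache.into_inner().unwrap()
}
```

With 4 threads doing 1000 reads each, the shared mutex is taken 4000 times even though no connection is ever touched by two threads. In this model the number of acquisitions grows with the number of active connections, which is consistent with fewer connections lowering CPU usage, but the model does not confirm which SQLite lock was actually involved.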