April 6: Web Sidekiq backlog from stuck usage jobs (19:32 UTC)

This was a weird one – the Sidekiq instance that powers background jobs for our GraphQL API and dashboard repeatedly got stuck processing long-running billing and usage sync jobs, leading to large job backlogs and delayed processing. We first tried to mitigate by scaling up and restarting stuck workers, which worked for a while, but eventually everything ground to a halt again. After chasing down a few false leads, we tracked the problem down to two interacting causes:

(1) When a Sidekiq worker uses too much memory, Sidekiq sends the process SIGUSR2, which stops it from accepting new work and waits for any in-flight jobs to complete before exiting. (2) However, the database connections used by those billing / usage jobs sometimes got stuck without any proper timeout. When a worker holding such a connection was sent SIGUSR2, it entered a state where it neither exited nor made any further progress; eventually we were left with no workers that could process jobs at all.
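
For illustration, here's a minimal sketch of the database-side guardrail we were missing, assuming a Rails app on PostgreSQL; the worker name, the 30-second value, and the `sync_usage_for` helper are all hypothetical, not our actual code:

```ruby
class UsageSyncWorker
  include Sidekiq::Worker

  def perform(account_id)
    # Cap how long any single query may run; PostgreSQL aborts anything
    # slower, so a wedged query can no longer pin this thread forever.
    ActiveRecord::Base.connection.execute("SET statement_timeout = '30s'")
    sync_usage_for(account_id) # hypothetical helper standing in for the real sync
  ensure
    # Reset so the pooled connection doesn't leak the setting to other jobs.
    ActiveRecord::Base.connection.execute("RESET statement_timeout")
  end
end
```

With a statement timeout in place, a hung query fails loudly instead of silently holding its connection and worker thread through a SIGUSR2 shutdown.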

We hardened our setup by adding proper timeouts to both Sidekiq worker shutdown and the database connection pool. We also moved billing / usage jobs onto their own queue so they can no longer block other work, as sketched below.
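
Here's a sketch of that queue split, again with illustrative names: the `billing` queue would be served by a dedicated Sidekiq process, started with something like `sidekiq -q billing -t 30`, where `-t` bounds how long shutdown waits for in-flight jobs before forcing an exit.

```ruby
class BillingSyncWorker
  include Sidekiq::Worker
  # Billing / usage syncs run on their own queue, served by a separate
  # Sidekiq process, so a backlog here can't starve the default queue
  # behind the API and dashboard.
  sidekiq_options queue: "billing"

  def perform(account_id)
    # long-running billing / usage sync, now isolated from other traffic
  end
end
```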