2024-08-31

Trailing 2 Weeks Incidents

A diagram of two weeks of incidents

(Larger boxes are longer incidents, darker boxes are higher-impact ones; see last week’s bulletin for details about the top row of incidents.)

  • August 26: Fly Machines Lease Acquisition Failures (05:30 EST): We began observing elevated HTTP 500 errors for the lease endpoint in flyd, our orchestrator kernel, on certain hosts. Leases are used to take exclusive control of a particular Fly Machine in order to update it; leasing is a basic component of doing a deployment. Our initial investigation turned up a particular Fly Application that appeared to be absolutely hammering the flyd lease endpoint. That turned out to be a customer CI job that was (reasonably!) rapidly responding to a lease-timeout deployment failure by re-queueing jobs. After an exhaustive investigation, we determined that the flyd BoltDB databases on three impacted hosts had found their way to a pathological state similar to the August 15 issue described previously in this log. Rebuilding the databases on the impacted hosts resolved the problem. Customers on the small number of impacted hosts would have seen sporadic deployment timeouts during the 3-4 hours this investigation took. That’s longer than we’re comfortable with; we’ve added substantial telemetry for this particular problem.

  • August 27: Background Job Starvation In API Server (10:30 EST): In a previous episode of this log we discussed the July 12 incident in which a Redis/Sidekiq interaction locked up all our background job processing, causing a 5-minute incident in which deploys failed. In response to that issue, we ported our API server to managed Redis. During the port we ran into problems (all us, not them) that delayed background jobs and kept our billing pages from updating for approximately an hour; we rolled back the change. Minimal customer impact.
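
For readers who haven't run into leases before, the idea in the August 26 entry (exclusive, time-limited control of one Machine) can be pictured with a toy model. This is only an illustrative sketch, not flyd's implementation: the `LeaseTable` class, its method names, and the in-memory dictionary are all hypothetical, and the real lease endpoint is an HTTP API backed by a database.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Lease:
    holder: str        # who currently has exclusive control
    expires_at: float  # deadline on the table's clock


class LeaseTable:
    """Hypothetical in-memory lease table (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Lease] = {}

    def acquire(self, machine_id: str, holder: str, ttl: float) -> bool:
        """Grant the lease if it is free, expired, or already held by us."""
        now = self._clock()
        current = self._leases.get(machine_id)
        if current and current.expires_at > now and current.holder != holder:
            return False  # someone else still has exclusive control
        self._leases[machine_id] = Lease(holder, now + ttl)
        return True

    def release(self, machine_id: str, holder: str) -> bool:
        """Release the lease, but only if the caller is the current holder."""
        current = self._leases.get(machine_id)
        if current is None or current.holder != holder:
            return False
        del self._leases[machine_id]
        return True
```

In this model a deploy would call `acquire("machine-id", "deploy-a", ttl=30.0)` before updating a Machine and `release` afterwards; a second deploy asking for the same Machine in the meantime gets `False` back, which is roughly the situation a client sees as a lease failure or timeout.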

This Week In Infra Engineering

This was a holiday week during which we experienced a significant reliability incident that has the infra-log busily copyediting an in-progress postmortem, so the infra-log is giving itself (and the team) a break. More of the continuing adventures of Peter and Somtochi in the weeks to come. Thanks!