Trailing 2 Weeks Incidents
(Larger boxes are longer incidents, darker boxes are higher-impact; see last week’s bulletin for details about the top row of incidents).
July 29: GraphQL API Unavailable (04:00EST): This is a pretty big deal, because deploys depend on the GQL API. The GQL API is a (legacy) Rails app. It depends on a Golang service, tksidecar, to talk to our Macaroon validation server. tksidecar is built from a separate repo, and our CI system pulls it into the container build for the API. Somehow, we managed to build a tksidecar that was truncated; somehow, the SHA256 checksum for the build didn’t prevent this from getting deployed. The resulting build brought up a GQL API server that HTTP 503’d any useful request. Our API was unavailable for about 90 minutes while we performed build and CI surgery to roll back the change. Numerous process improvements ensued (there’s a sketch of the kind of checksum gate we mean below, after the July 31 writeup).

July 31: Elevated GraphQL API Errors (19:30EST): Metrics indicated elevated errors from our Machines API, which we traced to a callback to our Rails GraphQL server. Like other Rails apps of the same vintage, our GQL API relies on Sidekiq to process background jobs; those jobs include code that records new deploys in Corrosion, our fleetwide state-sharing system. We stopped seeing reliable Corrosion updates (thus causing the Machines API errors). A restart and rescale patched up the problem 20 minutes later; several hours of investigation uncovered that a new set of billing jobs were driving the Fly Machines powering Sidekiq jobs to out-of-memory death.
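Relating to the July 29 incident: here’s a minimal sketch, in Go, of the kind of checksum gate that should have refused to ship a truncated tksidecar. The artifact path and expected digest are placeholders, and the real CI step looks different; this only illustrates the shape of the check.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// Placeholders: the real CI step, artifact name, and source of the
// expected digest all differ; this only illustrates the gate.
const (
	artifactPath = "tksidecar"
	expectedHex  = "0000000000000000000000000000000000000000000000000000000000000000"
)

func main() {
	f, err := os.Open(artifactPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	// Hash the binary we actually pulled into the container build.
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		fmt.Fprintln(os.Stderr, "hash:", err)
		os.Exit(1)
	}
	got := hex.EncodeToString(h.Sum(nil))

	// Fail the build hard on mismatch, so a truncated artifact can't
	// make it into a deployable image.
	if got != expectedHex {
		fmt.Fprintf(os.Stderr, "checksum mismatch: got %s want %s\n", got, expectedHex)
		os.Exit(1)
	}
	fmt.Println("checksum ok")
}
```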
August 2: GraphQL API Unavailable (13:30EST): Similar outcome as July 29, but shorter duration. This time, we deployed an update with a “benign Postgres migration” (there is no such thing as a benign migration); all it did was add a single column. Unbeknownst to the deployer, we run recurring business analytics queries against that Postgres database that take upwards of 30 minutes to complete. This is ordinarily not that big of a deal; the Postgres server is beefy and the analytics queries don’t block writes. Unfortunately, the “benign Postgres migration”, like any DDL change, takes an exclusive lock on the table it touches, and that lock does conflict with the long-running analytics query (and queues every later query behind it). The result: a hanging GraphQL API server. We reverted this change and restored the API server to good function within 15 minutes, and moved these analytics queries to an OLAP database.
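The standard mitigation for this failure mode is to give DDL a short lock_timeout, so a migration that can’t get its lock fails fast and can be retried, instead of parking behind a 30-minute analytics query while every API request queues up behind it. A minimal sketch in Go, assuming the lib/pq driver; the connection string, table, and column names are invented for illustration.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres driver; any database/sql driver works
)

func main() {
	// Hypothetical connection string.
	db, err := sql.Open("postgres", "postgres://localhost/app_db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	// SET LOCAL scopes the timeout to this transaction: if the ALTER
	// can't acquire its exclusive lock within 5 seconds, it errors out
	// instead of hanging (and blocking every query queued behind it).
	if _, err := tx.Exec(`SET LOCAL lock_timeout = '5s'`); err != nil {
		tx.Rollback()
		log.Fatal(err)
	}
	if _, err := tx.Exec(`ALTER TABLE deployments ADD COLUMN region text`); err != nil {
		tx.Rollback()
		log.Fatalf("migration aborted rather than hanging: %v", err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```

The important property is that the migration aborts quickly rather than hanging the whole API server behind the lock queue.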
This Week In Infra Engineering
Akshit rolled out opt-in granular bandwidth billing. The new bandwidth billing scheme saves most of our customers money (especially if you make good use of our private networks, for instance by running highly utilized Postgres clusters), but, because it can end up costing a bit extra for users that don’t use private networks and are deploying in expensive regions (most notably India), it’s opt-in for existing customers. This involved working through bugs with our upstream billing partner; Akshit has our sympathies. Akshit was also part of the response to the July 29 GQL outage, which meant they spent a chunk of this week reworking parts of our GQL server CI/CD system.
Steve got cross-connects deployed between Fly.io and Oracle Cloud, in order to accelerate object storage for our partners; objects stored off-network should no longer traverse the public Internet.
Andres and Matt improved synthetic monitoring (we built a new synthetic monitoring system a few weeks back), notably by creating and deploying new reference apps for us to measure. We have improved visibility into behavior we weren’t directly alerting on, like obviously-broken routing (think Asia->Europe->Asia). Synthetics surfaced some fly-proxy bugs, which got fixed. We flirted with making synthetics a customer-visible feature and decided we hadn’t worked out the privacy issues yet.
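For flavor, here’s a toy Go probe of the shape a synthetic routing check might take: time a request to a reference app and flag anything wildly over the regional latency budget. The URL and budget are hypothetical; the real system runs from many regions and compares against expected region-pair round trips.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Hypothetical reference app and latency budget.
const (
	target      = "https://reference-app.example.com/healthz"
	maxExpected = 150 * time.Millisecond // e.g. an intra-Asia budget
)

func main() {
	start := time.Now()
	resp, err := http.Get(target)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	resp.Body.Close()
	elapsed := time.Since(start)

	// Latency far above the regional budget hints that traffic is
	// taking an obviously wrong path (think Asia->Europe->Asia).
	if elapsed > maxExpected {
		fmt.Printf("suspicious routing: %s took %v (budget %v)\n", target, elapsed, maxExpected)
	} else {
		fmt.Printf("ok: %v\n", elapsed)
	}
}
```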
Ben, Dusty, Steve and John continued migrating workloads from old servers to newly provisioned ones; this involved building out more migration tooling, fixing bugs in migration tooling, and wrestling with particularly persnickety physical servers in Asia. We are asked to relay the following: “Servers go out. Servers come in. We are thus ever trapped in samsara”. Ok then.
Peter rolled out fallback routing fleetwide. He writes it up better than I do, as usual. In addition to metrics-based fallback routing, we now have rule-based routing that takes known backbone topology issues into account. Peter also resolved the LVM2 metadata issues we talked about a few weeks ago, and is deep into debugging (very) sporadic TLS handshake time delays.
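A toy illustration of the rule-based part (not the actual fly-proxy implementation; the regions and rules here are invented): static avoid-rules for known backbone problems get applied before the usual metrics-based health choice.

```go
package main

import "fmt"

// A candidate next hop plus its current health from metrics-based checks.
type route struct {
	via     string
	healthy bool
}

// Known backbone topology issues expressed as static rules:
// (source region, via) pairs to skip regardless of current metrics.
// These pairs are made up for illustration.
var avoid = map[[2]string]bool{
	{"bom", "fra"}: true,
}

func pick(src string, candidates []route) (string, error) {
	for _, r := range candidates {
		if avoid[[2]string{src, r.via}] {
			continue // rule-based: never use a known-bad path
		}
		if r.healthy {
			return r.via, nil // metrics-based: first healthy remaining path wins
		}
	}
	return "", fmt.Errorf("no usable route from %s", src)
}

func main() {
	candidates := []route{{via: "fra", healthy: true}, {via: "sin", healthy: true}}
	via, err := pick("bom", candidates)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("routing via", via) // picks "sin" because bom->fra is ruled out
}
```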
Kaz updated and simplified the public status page, which now does a better job of answering the most important question (is the problem my app, or something going wrong at Fly.io?).