Trailing 2 Weeks Incidents
(Larger boxes are longer incidents, darker boxes are higher-impact ones; see last week’s bulletin for details about the top row of incidents).
May 31: Billing Issue With Upstash Redis (8:00 EST): We’re in the middle of a transition from our old billing system to a new one based on Metronome. Billing is a gnarly problem. On Friday morning, someone called an incident after noticing a bug in which a small number of Upstash Redis customers might have been double billed for something. We refunded the affected customers. This was an issue we detected internally, with no lasting customer impact, but technically we called an incident for it, and by the rules of this page we have to log it.
May 31: Network Filtering Breaks Flycast (13:00 EST): As part of a project we’re running to do automatic authenticated connections between Fly Machines, our ProdSec team rolled out an nftables change. It was tested in our dev region, but had an unexpected interaction with our deployment tooling (something about the order in which tables are dropped and rebuilt). The net effect was that the fleetwide deployment broke Flycast. Diagnosis and remediation took about 30 minutes.
This Week In Infra Engineering
Short week. Couple people out sick.
Kaz worked on getting Fly Machine creation success rates onto our status page, which you should see soon. The two most important things you can know about the Fly Machines API: “create” and “start” are two different operations (“start” is the fast one; you can pre-“create” a bunch of stopped machines and start them whenever you need them), and “create” can fail; for instance, you can ask for more resources than are available in the region you target. Read more about that here. We (well, Kaz, but we agree with him) want the success rates for this operation to be visible to customers.
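Here’s a rough Go sketch of that create/start distinction against the public Machines API. The app name, region, image, and resource sizes are placeholders, and the request fields (including skip_launch) are from the Machines API docs as we remember them, so double-check them before copying this anywhere:

```go
// Sketch: pre-"create" a stopped Fly Machine, then "start" it on demand.
// Assumes the public Machines API at api.machines.dev and a FLY_API_TOKEN
// env var; "my-app", "ord", and the image are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

const apiBase = "https://api.machines.dev/v1"

// doJSON is a tiny helper: send a JSON request, decode a JSON response.
func doJSON(method, url string, body any) (map[string]any, error) {
	var buf bytes.Buffer
	if body != nil {
		if err := json.NewEncoder(&buf).Encode(body); err != nil {
			return nil, err
		}
	}
	req, err := http.NewRequest(method, url, &buf)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("FLY_API_TOKEN"))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	raw, _ := io.ReadAll(resp.Body)
	if resp.StatusCode >= 300 {
		// This is where "create" failures (e.g. not enough capacity in the
		// target region) surface; callers should expect and handle them.
		return nil, fmt.Errorf("%s %s: %s: %s", method, url, resp.Status, raw)
	}
	out := map[string]any{}
	_ = json.Unmarshal(raw, &out) // best effort; some responses are empty
	return out, nil
}

func main() {
	app := "my-app" // placeholder app name

	// "create": the slow, fallible operation. We ask for specific resources in
	// a specific region; skip_launch asks for a Machine that exists but isn't
	// running yet.
	machine, err := doJSON("POST", apiBase+"/apps/"+app+"/machines", map[string]any{
		"region":      "ord",
		"skip_launch": true,
		"config": map[string]any{
			"image": "registry-1.docker.io/library/nginx:latest",
			"guest": map[string]any{"cpu_kind": "shared", "cpus": 1, "memory_mb": 256},
		},
	})
	if err != nil {
		panic(err)
	}
	id, _ := machine["id"].(string)

	// "start": the fast operation. Pre-created Machines can be started
	// whenever you need the capacity.
	if _, err := doJSON("POST", apiBase+"/apps/"+app+"/machines/"+id+"/start", nil); err != nil {
		panic(err)
	}
	fmt.Println("started machine", id)
}
```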
Dusty and Simon spent the week heads-down on Postgres cluster migration. Read last week’s bulletin for more on that. We’re getting somewhere, but we’re not done until we can push a button and safely clear all the Machines off a physical server without having to worry too much about it.
Will won his next boss battle with NATS. We’ve successfully upgraded the whole fleet to current NATS (recall: the last attempt drove a terabit-scale message storm), on a custom branch with some of his fixes from last week. Our NATS metrics are down by as much as 90% across the board (a good thing), and problems we’d been having with connection stability after network outages (inevitable at our scale!) seem to have resolved. Will’s writing a Fresh Produce release about this, and we won’t steal any more of his thunder here.
Matt spent the week making log monitoring more resilient. “Logs” here means “the platform feature we offer that ships logs off physical servers and out to customers, using NATS”. Some background on what Matt’s building on: we run a Machine on every physical server in our fleet, the “debug app”, and it checks various things and freaks out and generates alerts when things go wrong. One more thing “debug” does now is track our server inventory and make sure we’re getting NATS logs from every server on it. In other words, another constantly-running, all-points end-to-end test of log shipping, from the vantage point of our customers.
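To make the idea concrete, here’s a minimal Go sketch of that kind of liveness check, assuming a hypothetical logs.&lt;hostname&gt; subject layout and a hard-coded inventory; the real debug app is more involved than this:

```go
// Sketch of an end-to-end log-shipping check: subscribe to log subjects on
// NATS, remember when we last heard from each host, and flag any host in the
// inventory that has gone quiet. Subject layout and inventory are made up
// for illustration.
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	inventory := []string{"edge-ord-1", "worker-ord-7"} // placeholder server inventory

	var mu sync.Mutex
	lastSeen := map[string]time.Time{}

	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	// Every log line a server ships passes through here; record the sender.
	if _, err := nc.Subscribe("logs.>", func(m *nats.Msg) {
		host := strings.TrimPrefix(m.Subject, "logs.")
		mu.Lock()
		lastSeen[host] = time.Now()
		mu.Unlock()
	}); err != nil {
		panic(err)
	}

	// Periodically compare what we've heard against the inventory and
	// "freak out" (here: just print) about anything that's gone silent.
	for range time.Tick(time.Minute) {
		mu.Lock()
		for _, host := range inventory {
			if t, ok := lastSeen[host]; !ok || time.Since(t) > 5*time.Minute {
				fmt.Printf("ALERT: no recent logs from %s (last seen %v)\n", host, t)
			}
		}
		mu.Unlock()
	}
}
```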
Tom is doing topological work on Corrosion. As we keep saying, we have “edge” servers and “worker” servers; the “edges” are much, much smaller than the “workers”, and we don’t want to tax them too much, so they can just do their thing terminating TLS and routing traffic. But that routing function depends on Corrosion, our gossip-based state tracking system, and Corrosion is expensive. One answer, which Tom is pursuing, is for (most) edges not to run Corrosion at all, and instead act as remote clients of Corrosion instances running on other machines.
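For a sense of what “remote client” means in practice, here’s a hypothetical Go sketch: an edge that, instead of consulting a local Corrosion replica, asks a Corrosion node elsewhere over HTTP with a short timeout. The endpoint path, query, table name, and response shape are all invented for illustration; they are not Corrosion’s actual API.

```go
// Illustrative only: an edge doing routing lookups against a remote Corrosion
// node instead of a local replica. Endpoint, schema, and response format are
// placeholders, not Corrosion's real interface.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// remoteCorrosion points an edge at a Corrosion instance running elsewhere
// (e.g. on a worker) rather than a replica on the edge itself.
type remoteCorrosion struct {
	baseURL string
	client  *http.Client
}

// lookupInstances asks the remote node where an app's instances live.
// The SQL and the /v1/queries path are assumptions for the sake of the sketch.
func (r *remoteCorrosion) lookupInstances(app string) ([]string, error) {
	q, _ := json.Marshal(fmt.Sprintf("SELECT address FROM instances WHERE app = '%s'", app))
	resp, err := r.client.Post(r.baseURL+"/v1/queries", "application/json", bytes.NewReader(q))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var addrs []string
	if err := json.NewDecoder(resp.Body).Decode(&addrs); err != nil {
		return nil, err
	}
	return addrs, nil
}

func main() {
	edge := &remoteCorrosion{
		baseURL: "http://corrosion.worker.internal:8080", // placeholder address
		client:  &http.Client{Timeout: 2 * time.Second},  // edges should fail fast
	}
	addrs, err := edge.lookupInstances("example-app")
	if err != nil {
		panic(err)
	}
	fmt.Println("route candidates:", addrs)
}
```

The trade-off is the obvious one: the edge gives up local state (and the memory and gossip traffic that come with it) in exchange for a network round trip on lookups, which is why the timeout matters.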
Dave (and Matt and Will and Simon) did a bunch of hiring work, including revamping our challenges and updating our internal processes for reviewing them. We should be much more responsive to infra candidates (we already were within tolerances, but we’re raising the bar for ourselves).