<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
  <title>Infra Log</title>
  <subtitle>The home for Infra Engineering content on Fly.io.</subtitle>
  <id>https://fly.io/infra-log/</id>
  <link href="https://fly.io/infra-log/"/>
  <link href="https://fly.io/infra-log/" rel="self"/>
  <updated>2026-06-11T00:00:00+00:00</updated>
  <author>
    <name>Fly</name>
  </author>
  <entry>
    <title>Stale 6PN mappings wreaking havoc</title>
    <link rel="alternate" href="https://fly.io/infra-log/stale-6pn-mappings/"/>
    <id>https://fly.io/infra-log/stale-6pn-mappings/</id>
    <published>2026-06-11T00:00:00+00:00</published>
    <updated>2026-06-11T22:32:21+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;This is another case where, as we were working towards improving the platform, we ended up with multiple ways of doing one thing, some of which are considered legacy and should eventually be removed, but the removal was never completed. An unexpected interaction between the old and new systems then wreaked havoc.&lt;/p&gt;

&lt;p&gt;In this case, the system in question is &lt;a href='https://fly.io/docs/networking/private-networking/' title=''&gt;6PN&lt;/a&gt;, our private network powered by WireGuard that connects all of your Machines. When this system was designed, each Machine&amp;rsquo;s private 6PN address was bound to the host where it was created. This made routing simple to implement, but it started to cause issues when we migrated Machines between hosts. The reason is that some apps depended on a static 6PN address per Machine: even our own legacy unmanaged Postgres offering depended on it, despite the fact that these addresses were never meant to be stable.&lt;/p&gt;

&lt;p&gt;At some point, we finally decided that this was not sustainable and that we should instead meet the expectation of the majority of apps: that 6PN addresses stay stable. The first iteration of this work was a simple DNAT, where Machines still got new 6PN addresses, but an eBPF program rewrote packets targeting a Machine&amp;rsquo;s old 6PN address(es) to the new one. The price we paid was that, because the 6PN address technically still changed, we needed to keep track of &lt;em&gt;every single 6PN address&lt;/em&gt; a Machine had ever had. This was all stored in Corrosion, which bloated its storage, not to mention the map we needed to synchronize into the eBPF program.&lt;/p&gt;

&lt;p&gt;This was changed roughly a year ago. Instead of keeping track of all old 6PN addresses, we simply made it so that a Machine is not assigned a new 6PN address when it is migrated to a new host. All Machines created after this change retain their initial 6PN address after migration. Of course, this alone would break routing, because routing depended on a fixed per-host 6PN prefix. Some sort of NAT is still needed, but now we only need to keep track of a Machine&amp;rsquo;s current host and its initial 6PN address.&lt;/p&gt;

&lt;p&gt;&amp;hellip;which brings us to today. We have two types of &amp;ldquo;stable 6PN&amp;rdquo; Machines: some created before the change above, and some after. The intention was that when an old-style 6PN Machine gets migrated, it becomes a new-style stable 6PN Machine with all the new-style plumbing. We&amp;rsquo;d delete unneeded Corrosion entries in this case and slowly drain them away as Machines are moved around. At some point this year, we realized that the Corrosion subscriptions used for old-style 6PN DNAT were creating a lot of load on Corrosion. As a result, we shipped a change to apply 6PN DNAT rules only once, when the service responsible for them starts, since we did not expect any new Machines to be created with old-style 6PN anymore. However, there was an oversight: in some cases, the existence of old-style DNAT rules actually &lt;em&gt;overrides&lt;/em&gt; new-style stable 6PN&amp;rsquo;s rewriting logic. So, when a Machine gets migrated to use new-style stable 6PN, it is possible that some peers might still be rewriting its address to a host-specific address that no longer exists.&lt;/p&gt;

&lt;p&gt;This exact scenario started happening first for our multi-tenant Consul clusters (used for unmanaged Postgres and LiteFS), and then for some customer Machines as a spike of rebalancing migrations happened for various reasons. We spent a considerable amount of time triaging the issue because it was an unexpected failure mode. We did not expect that old-style stable 6PN would interact with new-style stable 6PN in this way, especially not several weeks after the last round of changes was deployed.&lt;/p&gt;

&lt;p&gt;We mitigated this problem by adding code to delete old-style 6PN DNAT entries when new-style stable 6PN rules are set up. This, unfortunately, briefly caused another bug where the daemon responsible for this became too slow to catch up with Corrosion (we need Corrosion to decide whether a Machine has been migrated and thus needs rules for stable 6PN), which caused issues with Managed Postgres in LAX for a little while. This was then fixed up by making the cleanup code opportunistic and non-blocking for the main processing path.&lt;/p&gt;
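
&lt;p&gt;As a rough illustration of the override and its mitigation, here is a toy model in Python. It is only a sketch under stated assumptions: the real rewriting happens in an eBPF program fed from Corrosion, and the &lt;code&gt;Rewriter&lt;/code&gt; class, its two dictionaries, the &lt;code&gt;purge_old&lt;/code&gt; flag and the address labels are all hypothetical names invented for this example.&lt;/p&gt;

&lt;pre&gt;
```python
# Toy model only: every name here is hypothetical, not the real data path.

class Rewriter:
    def __init__(self):
        self.old_dnat = {}  # old-style: address to a host-specific address
        self.stable = {}    # new-style: stable address to the current host

    def install_stable(self, addr, host, purge_old=True):
        """Set up a new-style rule; optionally drop the stale old-style one."""
        self.stable[addr] = host
        if purge_old:
            # The mitigation described above: remove the old-style entry
            # at the moment the new-style rule is installed.
            self.old_dnat.pop(addr, None)

    def route(self, addr):
        # Assumption for this sketch: old-style entries are consulted first,
        # which is how a leftover entry can shadow the new-style rule.
        if addr in self.old_dnat:
            return ("dnat", self.old_dnat[addr])
        if addr in self.stable:
            return ("stable", self.stable[addr])
        return ("direct", addr)


peer = Rewriter()
peer.old_dnat["m1"] = "m1-on-host-a"  # left over from before the migration

peer.install_stable("m1", "host-b", purge_old=False)
shadowed = peer.route("m1")  # still points at the address that is gone

peer.install_stable("m1", "host-b")
fixed = peer.route("m1")     # now follows the new-style rule
```
&lt;/pre&gt;

&lt;p&gt;In this model the stale entry wins until it is purged, which mirrors the failure described above. Doing the purge inline is the simplest form; the follow-up fix described above instead made the cleanup opportunistic and non-blocking.&lt;/p&gt;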

&lt;p&gt;We see a few directions as the next steps to preventing this from happening again:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Old-style 6PN DNAT mappings should really not exist anymore. We need to migrate all the remaining Machines that still use them to new-style stable 6PN addresses.
&lt;/li&gt;&lt;li&gt;The reason why the Corrosion subscription and its processing code became slow was partially due to the query&amp;rsquo;s inefficiency; we&amp;rsquo;re working on addressing that too.
&lt;/li&gt;&lt;li&gt;We need a way to gracefully recover from such an event; the daemon should not just miss updates.
&lt;/li&gt;&lt;li&gt;Finally, we should be able to &amp;ldquo;fill in&amp;rdquo; missing stable 6PN NAT rules &lt;em&gt;even if&lt;/em&gt; the Corrosion subscription happened to miss some updates. The subscription can still be used for updates, but not as the single point of failure.
&lt;/li&gt;&lt;/ol&gt;
</content>
  </entry>
  <entry>
    <title>Deploys blocked by billing error</title>
    <link rel="alternate" href="https://fly.io/infra-log/deploys-billing-error/"/>
    <id>https://fly.io/infra-log/deploys-billing-error/</id>
    <published>2026-06-06T00:00:00+00:00</published>
    <updated>2026-06-11T22:29:26+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;For a few hours, deploys for some organizations were failing with a &amp;ldquo;We require your billing information&amp;rdquo; error, despite having just added payment methods or credits to their organizations.
This was due to a mis-ordered deployment of a new Corrosion schema.&lt;/p&gt;

&lt;p&gt;For some context: organization information is managed by our central GraphQL API backed by a local database in &lt;code&gt;iad&lt;/code&gt;; when an organization is updated, for instance when the billing information is updated, the GraphQL API pushes the changes to the global Corrosion cluster so it can be read by the Machines API. When new information needs to be stored in Corrosion, we need to deploy two changes: a global change to the Corrosion (sqlite) schema, and a change to the GraphQL API to push the new data to the global cluster.&lt;/p&gt;

&lt;p&gt;Earlier in the day, we had prepared a change to push some new organization data to Corrosion. This is usually a safe change; this time, however, the GraphQL API was deployed before the global schema was updated. As a result, organization updates failed to propagate to Corrosion, so the Machines API did not know about the updated billing status of organizations.
To resolve this incident, we quickly reverted the change to the GraphQL API and backfilled the missing data in Corrosion.&lt;/p&gt;

&lt;p&gt;We are looking into ways to alert on repeated sync failures, as well as failing GraphQL API deployments if the Corrosion schema is out of date.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>West coast edge proxies overloaded</title>
    <link rel="alternate" href="https://fly.io/infra-log/edge-proxies-overload/"/>
    <id>https://fly.io/infra-log/edge-proxies-overload/</id>
    <published>2026-06-04T00:00:00+00:00</published>
    <updated>2026-06-11T22:29:26+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;This incident requires some background which will become important later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;fly-proxy&lt;/code&gt;: a Rust-based, userspace L4 / L7 load balancer
&lt;/li&gt;&lt;li&gt;Corrosion: our distributed service discovery / state propagation system
&lt;/li&gt;&lt;li&gt;Airtime: &lt;code&gt;fly-proxy&lt;/code&gt;&amp;rsquo;s inbuilt dynamic defense against sudden traffic spikes; this was put in late last year / earlier this year, before which we had no way to prevent one app from monopolizing bandwidth on a host. Earlier in the year, we spent some time tuning Airtime&amp;rsquo;s parameters so that it triggers near the bandwidth saturation point of each of our edge servers.
&lt;/li&gt;&lt;li&gt;Lazy loader: in the long before-times, &lt;code&gt;fly-proxy&lt;/code&gt; used to ingest almost all data in Corrosion into its process memory, through Corrosion&amp;rsquo;s subscription API. That proved to not scale well a long time ago, and we &lt;a href='https://fly.io/blog/corrosion/' title=''&gt;switched to&lt;/a&gt; a lazy-loading model where only entries required for active requests are loaded.
&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;The incident started with us noticing flappiness in our US west coast regions, primarily in SJC at the beginning. Our logs and metrics indicated that the lazy loader latency was high, on the order of 500 ms to several seconds. This meant that many new requests needed to wait that long or even longer to be served. On the other hand, the proxy&amp;rsquo;s CPU usage was not especially high, and neither was the inbound connection rate. We&amp;rsquo;ve seen this kind of issue before: it is usually indicative of inefficient sqlite queries, certain apps with excessively large state stored in Corrosion, or general host performance issues. At this point, we happened to spot one app with extremely large state in Corrosion, and quickly &amp;ldquo;concluded&amp;rdquo; that it must be contributing to the issue, so we put in a temporary mitigation and deployed the proxy in SJC.&lt;/p&gt;

&lt;p&gt;It momentarily seemed to improve the situation, but latency quickly shot through the roof again after the new proxy processes warmed up. We began to wonder whether the cause was inefficient sqlite queries, which we ruled out, or lock contention resulting simply from the increased connection rates that came with our recent growth. This is also the point where we noticed Airtime reporting increased bandwidth in SJC, but it was below what we had previously concluded was the ceiling of what a single edge server could handle. In either case, our edge capacity in SJC was also underprovisioned due to a couple of servers being out of production, so we decided to first shift Anycast traffic to LAX and see if it handled the load better.&lt;/p&gt;

&lt;p&gt;Again, initially it seemed to help, but after a while LAX started struggling as well (side note: at certain points we also attempted to shift traffic out of the west coast entirely, which was why edges in other regions may have been momentarily affected). We finally decided to adjust down the bandwidth limit of Airtime, even though we were pretty sure our edges could take the level of traffic seen throughout this incident. It did bring &lt;code&gt;softirq&lt;/code&gt; CPU usage and host load average down, but the proxy was still struggling with slow lazy loader queries. We bounced the proxy, which seemed to clear up the lazy loader issues as well. This marks the end of the first acute phase of this incident.&lt;/p&gt;

&lt;p&gt;It would have been nice if this had been the actual end of the incident. It was not, primarily due to two other issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Airtime, the system we used to limit impact of traffic spikes, works entirely within one single process and does not propagate its knowledge outside. This would not have been a problem (we initiated a hard-kill of all pending-shutdown proxy processes when we bounced them), if not for:
&lt;/li&gt;&lt;li&gt;Due to a bug with how our proxy deployment script interacts with &lt;code&gt;systemd&lt;/code&gt;, we had somehow left multiple instances of the proxy running indefinitely on some of the affected nodes (TL;DR: &lt;code&gt;systemctl kill&lt;/code&gt; does not actually transition a unit to a stopped state; combined with &lt;code&gt;Restart=always&lt;/code&gt; it simply causes the process to restart).
&lt;/li&gt;&lt;/ol&gt;

&lt;p&gt;The combination of these two means that any limit we set in Airtime could, at any point, become effectively doubled if some heavy connections landed on a different proxy instance, causing the same issue to repeat after the initial phase was resolved. It is also worth noting that needing to bounce proxy processes after tuning Airtime was itself a contributing factor, and it revealed issues with queuing behavior around the lazy loader. Specifically, it seems possible to end up with effectively infinite queues waiting on the sqlite connections when the lazy loader itself is slow (due to &lt;code&gt;softirq&lt;/code&gt; contending with userspace for CPU under high load, for example), and these queues will not drain unless the process itself is bounced (which, in turn, exposed the other issues that caused the incident to recur).&lt;/p&gt;

&lt;p&gt;In summary, this incident was caused by a combination of factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Our edge capacity is underprovisioned in some regions; they have not caught up with our recent growth in user base.
&lt;/li&gt;&lt;li&gt;Airtime&amp;rsquo;s tuning no longer matches reality, either due to a shift in traffic patterns or other non-bandwidth scaling issues in the proxy.
&lt;/li&gt;&lt;li&gt;A bug caused multiple &lt;em&gt;active&lt;/em&gt; proxy instances to coexist without code to handle shared state.
&lt;/li&gt;&lt;li&gt;The lazy loader exhibits runaway queuing behavior at high load.
&lt;/li&gt;&lt;/ol&gt;

&lt;p&gt;We&amp;rsquo;re working hard to address each and every one of these issues. As a start, we are going to provision significantly more edge capacity in the coming weeks/months. We have addressed the bug that caused multiple proxy instances to coexist, and changed Airtime so that, for now, it applies a much stricter limit when it is not the expected &lt;em&gt;active&lt;/em&gt; proxy instance. We have fixed load-shedding behavior in the lazy loader so that there is a more reasonable upper bound on the maximum latency of serving requests. Other work is currently under way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We believe that the proxy runs into lazy loader-related performance issues much earlier now than before because our single coarse-grained lock on the proxy&amp;rsquo;s in-memory state no longer scales well as we grow. We have observed high queuing delays not in sqlite queries, but simply in trying to insert data into the in-memory service catalog. We&amp;rsquo;re planning to shard the catalog and move to finer-grained locking, assisted by testing tools such as Antithesis to ensure that this migration does not cause more outages.
&lt;/li&gt;&lt;li&gt;We are going to rework Airtime so that it reacts better to overall system load instead of just the proxy. This will hopefully serve as a backstop when we somehow end up with multiple proxy processes running, or when any non-proxy processes on the same host consume any of the bandwidth headroom.
&lt;/li&gt;&lt;li&gt;We&amp;rsquo;re looking into better monitoring for when the proxy is not under its expected configuration.
&lt;/li&gt;&lt;/ol&gt;
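
&lt;p&gt;To make the load-shedding idea concrete, here is a minimal sketch in Python of a loader that bounds how many requests may wait for a lookup slot and rejects the rest immediately. It is an illustration under assumptions, not the code in &lt;code&gt;fly-proxy&lt;/code&gt; (which is written in Rust): &lt;code&gt;SheddingLoader&lt;/code&gt;, &lt;code&gt;Overloaded&lt;/code&gt;, &lt;code&gt;max_concurrent&lt;/code&gt; and &lt;code&gt;max_waiting&lt;/code&gt; are hypothetical names, and the slow lookup is simulated.&lt;/p&gt;

&lt;pre&gt;
```python
import asyncio


class Overloaded(Exception):
    """Raised instead of queuing when too many callers are already waiting."""


class SheddingLoader:
    # Hypothetical sketch: bounds both concurrent lookups and waiting callers.
    def __init__(self, load, max_concurrent, max_waiting):
        self.load = load  # async callable standing in for the sqlite lookup
        self.slots = asyncio.Semaphore(max_concurrent)
        self.max_waiting = max_waiting
        self.waiting = 0

    async def get(self, key):
        # Shed load up front if all slots are busy and the wait queue is full.
        if self.slots.locked() and self.waiting == self.max_waiting:
            raise Overloaded(key)
        self.waiting += 1
        try:
            await self.slots.acquire()
        finally:
            self.waiting -= 1
        try:
            return await self.load(key)
        finally:
            self.slots.release()


async def demo():
    gate = asyncio.Event()

    async def slow_load(key):
        await gate.wait()  # simulates a lookup that is stuck for a while
        return key.upper()

    loader = SheddingLoader(slow_load, max_concurrent=1, max_waiting=1)
    first = asyncio.create_task(loader.get("a"))   # takes the only slot
    second = asyncio.create_task(loader.get("b"))  # becomes the one waiter
    await asyncio.sleep(0)  # let both tasks run up to their first await
    try:
        await loader.get("c")
        shed = False
    except Overloaded:
        shed = True  # the third caller is rejected instead of queued
    gate.set()
    return shed, await first, await second


result = asyncio.run(demo())
```
&lt;/pre&gt;

&lt;p&gt;Because the wait queue has a fixed size, the worst-case wait is bounded by the queue length times the lookup time, rather than growing without limit while lookups are slow.&lt;/p&gt;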
</content>
  </entry>
  <entry>
    <title>DNS cache was broken for CNAME'd domains</title>
    <link rel="alternate" href="https://fly.io/infra-log/dns-cache-cname/"/>
    <id>https://fly.io/infra-log/dns-cache-cname/</id>
    <published>2026-06-04T00:00:00+00:00</published>
    <updated>2026-06-04T10:07:10+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Some customers saw persistent DNS resolution failures for certain external hostnames that only cleared when we restarted &lt;code&gt;corro-dns&lt;/code&gt;, our recursive DNS resolver. It turns out that the domains they were trying to resolve had intermittent failures upstream. The weird thing is that by itself should &lt;em&gt;not&lt;/em&gt; cause &lt;em&gt;persistent&lt;/em&gt; problems: even though &lt;code&gt;corro-dns&lt;/code&gt; does cache DNS responses, it only caches failures for a very brief moment and will retry pretty quickly if one resolution failed. The cache should eventually be populated with a valid response, and if more upstream errors happen, &lt;code&gt;corro-dns&lt;/code&gt; is allowed to serve an expired cache in that case.&lt;/p&gt;

&lt;p&gt;It turns out that this cache logic failed to take into account cases where a domain A is &lt;code&gt;CNAME&lt;/code&gt;&amp;rsquo;d onto domain B, and &lt;em&gt;only&lt;/em&gt; domain B failed to resolve. In that case, &lt;code&gt;corro-dns&lt;/code&gt; ended up with a cached &lt;code&gt;CNAME&lt;/code&gt; entry for &lt;code&gt;A -&amp;gt; B&lt;/code&gt;, but without any corresponding entry for B. A subsequent request for domain A will hit the cache for the &lt;code&gt;CNAME&lt;/code&gt;, but &lt;code&gt;corro-dns&lt;/code&gt; will not spawn a new query for domain B since it thinks we&amp;rsquo;ve already hit the cache. It then returns only the &lt;code&gt;CNAME&lt;/code&gt; record to the client, and most clients will not spawn another query either and will just report to the user that no &lt;code&gt;A&lt;/code&gt; or &lt;code&gt;AAAA&lt;/code&gt; records were returned. This situation will not clear itself until the TTL of the &lt;code&gt;CNAME&lt;/code&gt; record expires, which in this case was very long.&lt;/p&gt;

&lt;p&gt;We mitigated this issue for now by skipping the cache when &lt;em&gt;any&lt;/em&gt; unexpected failure happens while resolving a domain. The root cause, however, is that &lt;code&gt;corro-dns&lt;/code&gt; caches full DNS responses and not individual DNS records, and does not &amp;ldquo;fill in&amp;rdquo; additional records when only a &lt;code&gt;CNAME&lt;/code&gt; can be cached. Our plan is to refactor this layer of caching to prevent similar bugs in the future.&lt;/p&gt;
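
&lt;p&gt;For illustration, here is a minimal sketch in Python of what per-record caching with &amp;ldquo;fill in&amp;rdquo; could look like. It is not the &lt;code&gt;corro-dns&lt;/code&gt; implementation: &lt;code&gt;RecordCache&lt;/code&gt;, the &lt;code&gt;upstream&lt;/code&gt; callable and the example hostnames are hypothetical, and TTL handling and negative caching are left out to keep the sketch short.&lt;/p&gt;

&lt;pre&gt;
```python
# Hypothetical per-record cache; names are illustrative, not corro-dns internals.

class RecordCache:
    def __init__(self, upstream):
        # upstream(name) returns ("CNAME", target), ("A", [addresses]),
        # or None when the upstream lookup fails.
        self.upstream = upstream
        self.records = {}

    def resolve(self, name, max_hops=8):
        """Return (cname_chain, addresses); addresses is None on failure."""
        chain = []
        for _ in range(max_hops):
            record = self.records.get(name)
            if record is None:
                # A miss for THIS name triggers a query, even when an earlier
                # CNAME in the chain was served from the cache.
                record = self.upstream(name)
                if record is None:
                    return chain, None  # failures are not cached here
                self.records[name] = record
            kind, value = record
            if kind == "A":
                return chain, value
            chain.append((name, value))
            name = value
        return chain, None


# Simulated upstream: the CNAME target fails once, then recovers.
failures_left = {"b.example": 1}
zone = {
    "a.example": ("CNAME", "b.example"),
    "b.example": ("A", ["192.0.2.1"]),
}

def flaky_upstream(name):
    if failures_left.get(name, 0) != 0:
        failures_left[name] -= 1
        return None
    return zone[name]

cache = RecordCache(flaky_upstream)
first = cache.resolve("a.example")   # CNAME cached, target lookup fails
second = cache.resolve("a.example")  # cached CNAME, target is retried
```
&lt;/pre&gt;

&lt;p&gt;The point of the sketch is that the second request still queries the missing target, so the resolver is not stuck returning only the &lt;code&gt;CNAME&lt;/code&gt; until its TTL expires.&lt;/p&gt;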
</content>
  </entry>
  <entry>
    <title>App creation timeouts from petsem-certs disk full</title>
    <link rel="alternate" href="https://fly.io/infra-log/petsem-certs-diskfull/"/>
    <id>https://fly.io/infra-log/petsem-certs-diskfull/</id>
    <published>2026-06-03T00:00:00+00:00</published>
    <updated>2026-06-04T02:50:31+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;New app creation requests (including &lt;code&gt;flyctl apps create&lt;/code&gt;) were failing with 504 timeouts because they weren&amp;rsquo;t able to create certificates in &lt;code&gt;petsem-certs&lt;/code&gt;, our new certificate store (which we&amp;rsquo;re in the process of provisioning and migrating to from Vault). While &lt;code&gt;petsem-certs&lt;/code&gt; is not yet operational, we &lt;em&gt;are&lt;/em&gt; writing certificates to the store, which (disappointingly) caused it to run out of disk space. Existing apps and Machines were unaffected - only app creation timed out. We restored app creation by expanding the storage.&lt;/p&gt;

&lt;p&gt;Since we&amp;rsquo;re dual-writing to both &lt;code&gt;petsem-certs&lt;/code&gt; and Vault, and the Fly Proxy is reading from Vault, we didn&amp;rsquo;t expect the loss of &lt;code&gt;petsem-certs&lt;/code&gt; to have any impact and it hadn&amp;rsquo;t yet been hooked up to our monitoring. We also had excessive retries on requests to the store, which caused the issue to present as a timeout rather than a failure, so we&amp;rsquo;ve tuned that as well.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>SYD egress IP networking broken on new workers</title>
    <link rel="alternate" href="https://fly.io/infra-log/egress-ip-syd/"/>
    <id>https://fly.io/infra-log/egress-ip-syd/</id>
    <published>2026-05-28T00:00:00+00:00</published>
    <updated>2026-06-04T02:50:31+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Some newly provisioned hosts in our Sydney (SYD) region failed to be configured properly for &lt;a href='https://fly.io/docs/networking/egress-ips/' title=''&gt;egress IP&lt;/a&gt; connectivity. As a result, a number of Machines using egress IPs in the region were unable to access the network.
During the incident, we immediately migrated the affected Machines to known-good hosts.&lt;/p&gt;

&lt;p&gt;Recently, we moved configuration for some infra components (including the VXLAN interface backing egress IPs) to a new, more scalable system. The rollout appeared to be successful, but an interaction with a legacy deployment method caused the configuration service to not be restarted correctly - so VXLAN worked on existing hosts, but would not be provisioned on new hosts.&lt;/p&gt;

&lt;p&gt;Our egress IP monitoring was set up in a world where egress IPs were &lt;em&gt;machine-scoped&lt;/em&gt; rather than &lt;em&gt;app-scoped&lt;/em&gt; (see &lt;a href='https://community.fly.io/t/migrating-from-machine-scoped-to-app-scoped-egress-ips/276770' title=''&gt;this forum post&lt;/a&gt; for more context). As such, only a couple of monitoring Machines were set up in each region, and not every host was being monitored, as that would have required one IP address for every host. After this incident, we ported egress IP monitoring to app-scoped IPs with a Machine running on every host.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Fly dashboard outage from broken GraphQL API deploy</title>
    <link rel="alternate" href="https://fly.io/infra-log/graphql-deploy-outage/"/>
    <id>https://fly.io/infra-log/graphql-deploy-outage/</id>
    <published>2026-05-28T00:00:00+00:00</published>
    <updated>2026-06-04T02:50:31+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;A deploy of Fly’s GraphQL API service (which backs much of &lt;code&gt;flyctl&lt;/code&gt; and parts of the dashboard) reduced the number of healthy instances enough that the HAProxy layer in front of it began timing out its &lt;code&gt;/status&lt;/code&gt; health checks and returning fast 503s, which showed up as intermittent dashboard/API failures and elevated deploy failure rates.
We recovered by removing the broken instances and cloning a known-good Machine to restore capacity until HAProxy backends were stable and green again.&lt;/p&gt;

&lt;p&gt;Afterward we found that a local &lt;code&gt;fly deploy&lt;/code&gt; of the GraphQL API service could produce a broken image because it skipped a CI-only step that fetches a supporting binary (resulting in an empty placeholder being copied into the image), and we merged a change to prevent that failure mode.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>IAD observability disrupted by NATS misconfiguration</title>
    <link rel="alternate" href="https://fly.io/infra-log/nats-misconfig-iad/"/>
    <id>https://fly.io/infra-log/nats-misconfig-iad/</id>
    <published>2026-05-27T00:00:00+00:00</published>
    <updated>2026-06-04T02:50:31+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;A server being decommissioned began advertising a bad NATS config (specifically, an empty connection URL), which caused the logs/metrics exporters of various hosts in the IAD region to crash.
During the incident, many Machines in IAD had missing metrics, and some customers may have also seen gaps or delays in log delivery.
We mitigated by removing the decommissioned server from the NATS cluster and restarting the affected metrics exporters across IAD to restore normal telemetry flow, and are discussing options to remove the reliance on NATS from the logs/metrics pipelines.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Proxy and Corrosion in SIN weren’t on the same page</title>
    <link rel="alternate" href="https://fly.io/infra-log/proxy-corrosion-sin/"/>
    <id>https://fly.io/infra-log/proxy-corrosion-sin/</id>
    <published>2026-05-27T00:00:00+00:00</published>
    <updated>2026-06-04T02:50:31+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;During a rollout of the Fly Proxy, a new Corrosion query started throwing errors on a subset of hosts in Singapore. This query relied on a new column in our Corrosion schema, which had been rolled out globally the day prior. It turns out these hosts had received the new schema but hadn’t successfully reloaded it.&lt;/p&gt;

&lt;p&gt;Once the new proxy came up, it failed to load apps from Corrosion and couldn’t serve any traffic. This made machines on these hosts unavailable, and caused a wave of Managed Postgres (MPG) healthcheck failures in the region.&lt;/p&gt;

&lt;p&gt;During the incident this was fixed by forcing a reload of the Corrosion schema on these hosts, after which traffic returned to normal and all MPG cluster alerts resolved.&lt;/p&gt;

&lt;p&gt;We made two changes to prevent this happening in the future. First, we didn’t notice this during the schema rollout as Corrosion didn’t return an error for a failed reload. Corrosion now &lt;a href='https://github.com/superfly/corrosion/pull/481' title=''&gt;returns an error code&lt;/a&gt; when this happens, so we can revisit those hosts after a rollout. Second, this is the sort of thing we &lt;em&gt;should&lt;/em&gt; catch in the proxy’s bluegreen deployment. This error wasn’t hit until after the proxy marked itself healthy, though, so it had already taken over as the primary. Now the proxy prepares all SQL queries against Corrosion during its startup sequence, so the new proxy won’t successfully come up if any of these fail.&lt;/p&gt;
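
&lt;p&gt;The startup check can be sketched as follows, using Python and its built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in. This is an illustration under assumptions rather than the proxy code: the &lt;code&gt;apps&lt;/code&gt; table, its columns, the query names and &lt;code&gt;check_queries&lt;/code&gt; are hypothetical, and &lt;code&gt;EXPLAIN&lt;/code&gt; is used here to compile each statement without running it, since this module has no separate prepare call.&lt;/p&gt;

&lt;pre&gt;
```python
import sqlite3

# Hypothetical queries; the second one needs a column added by a newer schema.
QUERIES = {
    "app_by_name": "SELECT id, name FROM apps WHERE name = ?",
    "app_region": "SELECT id, region FROM apps WHERE id = ?",
}


def check_queries(conn, queries):
    """Compile every query against the live schema; return failures by name."""
    failures = {}
    for label, sql in queries.items():
        # Placeholder values for the positional parameters (simple cases only).
        params = (None,) * sql.count("?")
        try:
            # EXPLAIN makes sqlite compile the statement without executing it,
            # so a missing table or column surfaces as an error right away.
            conn.execute("EXPLAIN " + sql, params).fetchall()
        except sqlite3.Error as err:
            failures[label] = str(err)
    return failures


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (id INTEGER PRIMARY KEY, name TEXT)")
stale = check_queries(conn, QUERIES)  # schema not reloaded yet

conn.execute("ALTER TABLE apps ADD COLUMN region TEXT")
fresh = check_queries(conn, QUERIES)  # schema now has the new column
```
&lt;/pre&gt;

&lt;p&gt;A process that refuses to report healthy while the failure map is non-empty would fail its deployment on a host with a stale schema, instead of taking over and then failing on the first real request.&lt;/p&gt;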
</content>
  </entry>
  <entry>
    <title>FRA managed Postgres control-plane outage</title>
    <link rel="alternate" href="https://fly.io/infra-log/postgres-control-plane-outage/"/>
    <id>https://fly.io/infra-log/postgres-control-plane-outage/</id>
    <published>2026-05-25T00:00:00+00:00</published>
    <updated>2026-06-04T02:50:31+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Managed Postgres clusters in the FRA region became intermittently unreachable after the regional Kubernetes control plane (FKS) got overloaded/stuck and the Kubernetes API began timing out. Because Patroni uses Kubernetes for coordination in this setup, those API failures prevented clusters from reliably determining primary/replica state, causing widespread connection failures and flapping health. We recovered by reducing resource pressure, defragmenting the affected etcd instances, restoring the control plane’s ability to reconcile, and then repairing clusters one-by-one.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Bad TLS cert update broke Consul</title>
    <link rel="alternate" href="https://fly.io/infra-log/tls-cert-consul/"/>
    <id>https://fly.io/infra-log/tls-cert-consul/</id>
    <published>2026-05-22T00:00:00+00:00</published>
    <updated>2026-06-04T02:50:31+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;A configuration automation run accidentally overwrote Consul TLS certificates with invalid ones, which caused Consul lookups to fail across parts of the fleet. We have spent a lot of time in the past couple of years to remove Consul as a key dependency, and as such most aspects of our platform were not directly impacted: all running Machines were unaffected and &lt;code&gt;fly-proxy&lt;/code&gt; routing remained functional. The impact was concentrated on the few control-plane operations that still rely on Consul: mainly &lt;code&gt;fly ssh console&lt;/code&gt; and OIDC tokens. We restored the correct certificates and restarted Consul agents to pick up the fixed TLS configuration, after which errors and alerts subsided.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Usage ingestion blocked by stuck Oban jobs</title>
    <link rel="alternate" href="https://fly.io/infra-log/oban-jobs-stuck/"/>
    <id>https://fly.io/infra-log/oban-jobs-stuck/</id>
    <published>2026-05-19T00:00:00+00:00</published>
    <updated>2026-06-04T02:50:31+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Usage ingestion fell behind and then stopped making progress when several background jobs hung indefinitely, eventually consuming all available worker concurrency for the ingestion queue. This was triggered by a bug in an Elixir &lt;code&gt;decimal&lt;/code&gt; dependency where converting certain values (like &lt;code&gt;0.0&lt;/code&gt;) to an integer could loop forever, causing specific volume-usage receipt processing jobs to never finish. We fixed this by updating the dependency to a version containing the upstream bugfix, after which ingestion resumed and the backlog drained; once the queue cleared, the delayed ~18 hours of usage data was backfilled and reflected normally.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Kernel upgrade caused machine `stdout` to become wedged by Cloud Hypervisor</title>
    <link rel="alternate" href="https://fly.io/infra-log/cloud-hypervisor-stdout/"/>
    <id>https://fly.io/infra-log/cloud-hypervisor-stdout/</id>
    <published>2026-05-18T00:00:00+00:00</published>
    <updated>2026-06-04T02:50:31+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;This is probably one of the more interesting / confusing / complicated bugs we have had since the revival of Infra Log. It began on this day with us receiving reports about an outage of the Upstash Redis extension, reflected on &lt;a href='https://status.upstash.com/incidents/zjy39rtm1skv' title=''&gt;their status page&lt;/a&gt; as well. Upstash Redis, when used as a Fly extension, runs on Fly Machines, just like other customers. The only difference is that, for various reasons, we run their Machines using Cloud Hypervisor rather than Firecracker. This has never caused problems before, and initially we were pretty certain this is an issue on Upstash side. As we worked with them to investigate, though, we got something confusing: the report that these Machines are stuck writing to &lt;code&gt;stdout&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;stdout&lt;/code&gt; of a process running inside the Machine is a pipe, one end acting as &lt;code&gt;stdout&lt;/code&gt; and the other end connected to &lt;code&gt;init&lt;/code&gt;, which &lt;code&gt;splice()&lt;/code&gt;s the pipe into a &lt;code&gt;vsock&lt;/code&gt; connected to the hypervisor. A process on the host, called &lt;code&gt;firefly&lt;/code&gt;, collects these logs from &lt;code&gt;vsock&lt;/code&gt;s. For the &lt;code&gt;stdout&lt;/code&gt; (write) end of the pipe to be stuck, one of these steps must have gone wrong: either &lt;code&gt;firefly&lt;/code&gt; on the host is failing to collect logs fast enough, or &lt;code&gt;init&lt;/code&gt; inside the Machine is failing to &lt;code&gt;splice()&lt;/code&gt; them into the &lt;code&gt;vsock&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We were quickly able to rule out an issue with &lt;code&gt;firefly&lt;/code&gt;. This means something went wrong when copying logs into the &lt;code&gt;vsock&lt;/code&gt;, but that is extremely weird: the &lt;code&gt;init&lt;/code&gt; process does use Rust async with Tokio, but at the end of the day, it is just repeatedly calling &lt;code&gt;splice()&lt;/code&gt; to move pages around. We suspected all sorts of wakeup issues on the Rust side, because the way &lt;code&gt;splice()&lt;/code&gt; works does not play nicely with Rust async and we had to use another pair of internal pipes as a workaround (we&amp;rsquo;ll not get into the details here, but the TL;DR is one cannot tell which side of the &lt;code&gt;splice()&lt;/code&gt; caused an &lt;code&gt;EWOULDBLOCK&lt;/code&gt; or &lt;code&gt;EAGAIN&lt;/code&gt;; see something like &lt;a href='https://github.com/hanyu-dev/tokio-splice2/blob/fc47199fffde8946b0acf867d1fa0b2222267a34/src/context.rs#L320' title=''&gt;tokio-splice2&lt;/a&gt; if you are curious).&lt;/p&gt;

&lt;p&gt;That was not what was happening. The Rust side was behaving just fine (* for some definition of &amp;ldquo;fine&amp;rdquo;, but no weird wakeup issues). Instead, the &lt;code&gt;vsock&lt;/code&gt; itself seemed to be getting stuck. The only recent change in this path was a batch of kernel upgrades made in response to vulnerabilities. We found &lt;a href='https://github.com/cloud-hypervisor/cloud-hypervisor/issues/7672' title=''&gt;this issue&lt;/a&gt; on Cloud Hypervisor&amp;rsquo;s GitHub repo, which points to a kernel change incompatible with older Cloud Hypervisor builds. The change was also backported to the LTS kernels we had just upgraded to, but we had not picked up the corresponding Cloud Hypervisor fixes. Because the majority of Machines do not use Cloud Hypervisor (except GPU machines, and specific organizations), this did not show up in our testing. And because the change only affects &lt;em&gt;large&lt;/em&gt; transfers, it only really triggers when a large write is performed on &lt;code&gt;stdout&lt;/code&gt;, and by extension, the &lt;code&gt;vsock&lt;/code&gt;. That is also a gap in our testing: not many apps emit logs as often as some of our biggest customers.&lt;/p&gt;

&lt;p&gt;We ended up resolving the incident by upgrading Cloud Hypervisor to a version with the fixes. To mitigate similar issues in the future, we also introduced safeguards in &lt;code&gt;init&lt;/code&gt; that will not allow a vsock or host-side issue to block the &lt;code&gt;stdout&lt;/code&gt; pipe indefinitely. Unfortunately, the &lt;code&gt;splice()&lt;/code&gt; + Rust async issue combined with the lack of a way to determine a Linux pipe&amp;rsquo;s remaining capacity reliably means we had to resort to a simple timeout on the &lt;code&gt;vsock&lt;/code&gt; write side for this, which will still introduce &lt;em&gt;some&lt;/em&gt; latency to &lt;code&gt;stdout&lt;/code&gt; when the vsock side is wedged. It will not be an infinite delay, though, and it will also emit a corresponding warning for visibility when a timeout happens.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Machines API hitting failed hosts in SIN</title>
    <link rel="alternate" href="https://fly.io/infra-log/machines-api-sin/"/>
    <id>https://fly.io/infra-log/machines-api-sin/</id>
    <published>2026-05-14T00:00:00+00:00</published>
    <updated>2026-05-15T19:45:00+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Some ListMachines API calls returned 500s, primarily for apps/orgs that had Machines in the &lt;code&gt;sin&lt;/code&gt; region. This was caused by these API queries hitting failed hosts in SIN. For some context, when an organization has Machines located in regions far away from the one that handled the Machines API request, we replay / forward that request to their respective regions for more up-to-date information. At the time of the incident, one host in SIN appeared up (accepting TCP connections) but responded to every connection attempt with a reset. &lt;code&gt;fly-proxy&lt;/code&gt;, our load-balancer component, had an independent bug that prevented it from treating these requests as retryable. Cordoning that host mitigated the incident, and the &lt;code&gt;fly-proxy&lt;/code&gt; bug has also been fixed since.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Petsem primary host lost networking (IAD)</title>
    <link rel="alternate" href="https://fly.io/infra-log/primary-host-networking-iad/"/>
    <id>https://fly.io/infra-log/primary-host-networking-iad/</id>
    <published>2026-05-12T00:00:00+00:00</published>
    <updated>2026-05-12T20:57:49+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;A worker host running the primary instance of Petsem, our secrets storage service, lost network connectivity after a NIC driver stall. We restored service about 30 minutes after the start of impact by reloading the host’s NIC driver.&lt;/p&gt;

&lt;p&gt;During the incident, requests to set/update secrets and create apps failed globally. Some other platform functionality was also affected because an internal Redis used for rate limiting was on the same host. However, existing apps/machines continued running, and existing secrets continued being accessible from our read replicas.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Machines API bug caused `fly deploy` to create duplicate Machines</title>
    <link rel="alternate" href="https://fly.io/infra-log/machines-api-duplicates/"/>
    <id>https://fly.io/infra-log/machines-api-duplicates/</id>
    <published>2026-05-07T00:00:00+00:00</published>
    <updated>2026-05-12T20:57:49+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;On April 28th, some &lt;code&gt;fly deploy&lt;/code&gt; runs incorrectly received an empty Machines list from the Machines API for apps that had already been deployed. When that happened, &lt;code&gt;flyctl&lt;/code&gt; created new Machines instead of updating the existing ones, resulting in duplicates.&lt;/p&gt;

&lt;p&gt;This was tracked down to new affinity behavior in &lt;code&gt;flaps&lt;/code&gt; (our Machines API service). This will likely make its way to a &lt;a href='https://community.fly.io/c/fresh-produce/27' title=''&gt;Fresh Produce&lt;/a&gt; near you sometime soon, but the abstract is that after some operations, API requests are replayed to the same &lt;code&gt;flaps&lt;/code&gt; instance for a brief duration (while state propagates through Corrosion, our distributed database).&lt;/p&gt;

&lt;p&gt;Another place &lt;code&gt;flaps&lt;/code&gt; uses &lt;code&gt;fly-replay&lt;/code&gt; is when fanning out to list Machines from multiple regions, where it used the replay itself as a signal to strictly return Machines local to its region. When this received a new &lt;em&gt;affinity&lt;/em&gt; replay, it returned the response for a fanout replay instead (that is, it did &lt;em&gt;not&lt;/em&gt; list Machines from other regions). So, if all your Machines were in &lt;code&gt;yyz&lt;/code&gt;, but you had affinity with &lt;code&gt;flaps&lt;/code&gt; in &lt;code&gt;ord&lt;/code&gt;, &lt;code&gt;flyctl&lt;/code&gt; would be given an empty list of Machines. &lt;/p&gt;

&lt;p&gt;At 00:28 UTC on April 29th, we mitigated the issue by disabling app affinity in the API. Since then the bug has been fixed, but duplicate Machines created during the incident will persist until removed manually. Customers who ran &lt;code&gt;fly deploy&lt;/code&gt; on &lt;code&gt;flyctl&lt;/code&gt; v0.4.41 or v0.4.42 between April 28th and April 29th can run &lt;code&gt;fly scale show&lt;/code&gt; to review their apps&amp;rsquo; Machine counts. If more Machines appear than are wanted, they can be removed individually with &lt;code&gt;fly machines destroy &amp;lt;id&amp;gt; --force&lt;/code&gt;.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>MPG provisioning failures from revoked org token</title>
    <link rel="alternate" href="https://fly.io/infra-log/mpg-provisioning-token/"/>
    <id>https://fly.io/infra-log/mpg-provisioning-token/</id>
    <published>2026-05-04T00:00:00+00:00</published>
    <updated>2026-05-12T20:57:49+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;A revoked token caused new Managed Postgres clusters to fail when bootstrapping the default database and user during a 6-hour window. We mitigated the problem by rotating the token and redeploying the secret, and added alerts to detect these types of failures more quickly in the future.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>GitHub integration management callbacks returned 500s on Fly.io secondary nodes</title>
    <link rel="alternate" href="https://fly.io/infra-log/github-callbacks-500s/"/>
    <id>https://fly.io/infra-log/github-callbacks-500s/</id>
    <published>2026-05-04T00:00:00+00:00</published>
    <updated>2026-05-12T20:57:49+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Some dashboard flows for adding/updating GitHub integrations intermittently returned 500 errors (notably the GitHub app callback endpoint after opening the &amp;ldquo;Manage GitHub Integration&amp;rdquo; link) when those requests were served by secondary instances (no database connections, used for static routes). Existing GitHub integrations and deployment activity weren’t affected, but users trying to manage integrations could hit errors (sometimes succeeding after a refresh). We deployed a fix to ensure these callback paths are handled correctly regardless of which instance receives the request, and followed up by patching a few related endpoints found via Sentry.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Extension provider polling overloaded Postgres</title>
    <link rel="alternate" href="https://fly.io/infra-log/provider-polling-postgres/"/>
    <id>https://fly.io/infra-log/provider-polling-postgres/</id>
    <published>2026-04-30T00:00:00+00:00</published>
    <updated>2026-05-12T20:57:49+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;An extension provider increased how often they polled one of our private API endpoints. The endpoint ran an expensive Postgres query, and at the higher rate it saturated CPU on the database backing our dashboard and GraphQL API. This caused intermittent 500s on the dashboard and GraphQL API endpoints for about 40 minutes. The provider reverted the polling frequency change and traffic dropped back to normal.&lt;/p&gt;

&lt;p&gt;The query was checking whether an organization had a registered extension with the provider, but it was scanning far more rows than it needed to. We rewrote it to short-circuit on the first match. We are also adding rate limiting on this endpoint to stop a similar spike from saturating the database again.&lt;/p&gt;
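
&lt;p&gt;One common way to get that short-circuit behaviour is an &lt;code&gt;EXISTS&lt;/code&gt; subquery, which lets the database stop at the first matching row. The sketch below is illustrative only: it uses SQLite from Python rather than Postgres, and the &lt;code&gt;extensions&lt;/code&gt; table, its columns, and the &lt;code&gt;has_extension&lt;/code&gt; helper are made-up names, not the real schema or query.&lt;/p&gt;

```python
import sqlite3


def has_extension(conn, org_id, provider):
    """Return True if the org has at least one extension with the provider."""
    # EXISTS can stop at the first matching row instead of scanning them all.
    row = conn.execute(
        "SELECT EXISTS("
        "SELECT 1 FROM extensions WHERE org_id = ? AND provider = ?)",
        (org_id, provider),
    ).fetchone()
    return bool(row[0])


def demo():
    # Hypothetical schema and data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE extensions (org_id INTEGER, provider TEXT)")
    conn.executemany(
        "INSERT INTO extensions VALUES (?, ?)",
        [(1, "acme"), (1, "acme"), (2, "other")],
    )
    # Org 1 has matching rows, org 3 has none.
    return has_extension(conn, 1, "acme"), has_extension(conn, 3, "acme")
```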
</content>
  </entry>
  <entry>
    <title>Duplicate Wireguard Mesh IPs Wreaking Havoc</title>
    <link rel="alternate" href="https://fly.io/infra-log/wireguard-ip-duplicates/"/>
    <id>https://fly.io/infra-log/wireguard-ip-duplicates/</id>
    <published>2026-04-27T00:00:00+00:00</published>
    <updated>2026-05-12T20:57:49+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Some background: at Fly.io, we run a fleet of bare metal servers hosting your workloads, be it Machines or Sprites, all connected over a Wireguard mesh. When we provision new servers, something has to set up Wireguard such that it is reachable by the rest of the fleet. This is done by something we call &lt;code&gt;flywire&lt;/code&gt;. It generates a Wireguard public / private key pair, sends the public part to our Consul cluster to be read by other nodes, and picks an IP in a private /8 range.&lt;/p&gt;

&lt;p&gt;You have probably read about our last incident where some nodes just lost connectivity over this Wireguard mesh. This incident began as what looked like a recurrence of that one: some nodes not being able to talk to others. Only this time, resetting the &lt;code&gt;wg0&lt;/code&gt; interface did &lt;em&gt;not&lt;/em&gt; do anything to fix the issue. On one of the affected edge servers, we also noticed that NATS (used to propagate app load information, logs, etc.) was using an abnormally high amount of CPU. This actually gave us a clue, since its logs kept complaining that some of its peers did not report the expected regions (they should have been in &lt;code&gt;sin&lt;/code&gt; but reported as &lt;code&gt;fra&lt;/code&gt;, for example).&lt;/p&gt;

&lt;p&gt;We went to check on those nodes in &lt;code&gt;fra&lt;/code&gt; as well. Turns out, they had the same IPs as the problematic nodes in &lt;code&gt;sin&lt;/code&gt;! In fact, after a quick sweep of our entire fleet using a script, we found a couple more pairs / triples of servers with this exact same problem. They were all provisioned recently, and we were also lucky that many of them were not yet set up to accept new Machines. Duplicate IPs are problematic, because other nodes may end up selecting one but not the other as the &amp;ldquo;active&amp;rdquo; peer, causing partial connectivity. Most of our platform components also assume that Wireguard IPs are unique. We quickly took all of them out of production to investigate.&lt;/p&gt;

&lt;p&gt;It turned out that there were two bugs in the provisioning process that caused this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When generating new Wireguard key pairs and IPs, we acquire a Consul lock on the respective resources, but the lock is only held while generating the IP. We do check for duplicate IPs at this stage, but by the time we write the IP into Consul, the lock has already been released. Any parallel writer can therefore pick the same IP in between: a classic TOCTOU race.
&lt;/li&gt;&lt;li&gt;In some cases, though, several nodes all got the first IP available in the &lt;code&gt;/8&lt;/code&gt; range. That is too unlikely to be explained away by pure chance. Rather, the bug here is that our code that generates the next IP by checking Consul ignored errors returned by Consul and just defaulted to the first IP in that case.
&lt;/li&gt;&lt;/ol&gt;
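
&lt;p&gt;A minimal sketch of how both bugs can be avoided, written in Python and not taken from the actual provisioning code. Everything in it is a stand-in: &lt;code&gt;FakeKV&lt;/code&gt; plays the role of Consul, a &lt;code&gt;threading.Lock&lt;/code&gt; plays the role of the Consul lock, and &lt;code&gt;10.0.0.0/8&lt;/code&gt; is just an assumed example range. The two points it illustrates are that the duplicate check and the write happen under the same lock, and that a failed read raises an error rather than falling back to the first address.&lt;/p&gt;

```python
import ipaddress
import threading


class StoreError(Exception):
    """Raised when the (hypothetical) key-value store cannot be read."""


class FakeKV:
    """In-memory stand-in for Consul: a lock plus a map of IP to node name."""

    def __init__(self):
        self.lock = threading.Lock()
        self.claimed = {}
        self.healthy = True

    def list_claimed(self):
        if not self.healthy:
            raise StoreError("store unavailable")
        return set(self.claimed)


def allocate_ip(kv, node, network="10.0.0.0/8"):
    """Pick the first free address and record it while still holding the lock."""
    with kv.lock:
        # A read error propagates to the caller; there is no fallback address.
        taken = kv.list_claimed()
        for ip in ipaddress.ip_network(network).hosts():
            if str(ip) not in taken:
                # Check and write share one critical section, closing the TOCTOU gap.
                kv.claimed[str(ip)] = node
                return str(ip)
    raise RuntimeError("address range exhausted")
```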

&lt;p&gt;As an immediate measure, we have reset the &lt;code&gt;wg0&lt;/code&gt; IP addresses of all these servers and added an alert when we detect duplicates. We are also going to fix the two bugs in our provisioning script to avoid this in the future.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Vault outage broke TLS certificate lookups</title>
    <link rel="alternate" href="https://fly.io/infra-log/vault-tls-outage/"/>
    <id>https://fly.io/infra-log/vault-tls-outage/</id>
    <published>2026-04-24T00:00:00+00:00</published>
    <updated>2026-04-24T13:36:28+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;A failure in our certificate store (Vault) caused fly-proxy to time out or fail while resolving some TLS certificates, leading to intermittent TLS handshake errors for affected apps. The same issue also affected a subset of MPG clusters.&lt;/p&gt;

&lt;p&gt;The issue started after a migration left the Vault cluster in a bad state, where the Raft leader was stopped before transferring leadership to a different node. As Vault (like other Raft-based services such as Consul) needs to load the entirety of the database into memory at process start, a cold boot of the leader would have taken hours, so we restored service by rebuilding the cluster.&lt;/p&gt;

&lt;p&gt;This is not the first time Vault has caused fly-proxy outages. Longer term, we have plans to migrate to something more resilient to this specific failure mode.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>WireGuard wg0 one-way host connectivity</title>
    <link rel="alternate" href="https://fly.io/infra-log/wireguard-one-way/"/>
    <id>https://fly.io/infra-log/wireguard-one-way/</id>
    <published>2026-04-21T00:00:00+00:00</published>
    <updated>2026-04-21T13:49:57+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Over the past few weeks, we observed individual pairs of hosts fail to send traffic over our global WireGuard mesh. Specifically, the tunnel between the hosts would appear up and handshaking correctly, but packets would only flow in one direction (or sometimes none at all). Neither WireGuard configuration nor firewall rules were able to explain this behaviour.
This caused a few issues: &lt;code&gt;fly-proxy&lt;/code&gt; on affected hosts wouldn&amp;rsquo;t be able to talk to each other, breaking load balancing in some cases; a more severe problem is with static egress IPs, since the return path depends on edge nodes being able to forward packets back to workers &amp;ndash; if one edge node happens to lose connectivity with a worker in this way, some packets might be silently dropped depending on which node upstream flow hashing decides to forward the packets to.&lt;/p&gt;

&lt;p&gt;Eventually, we tracked this issue down to a regression in the 5.15 stable kernel tree. We attempted to resolve this problem by removing and re-adding the peer, but that caused Netlink in the kernel to hang, as described in &lt;a href='https://lore.kernel.org/lkml/CALrw=nGoSW=M-SApcvkP4cfYwWRj=z7WonKi6fEksWjMZTs81A@mail.gmail.com/' title=''&gt;this LKML thread&lt;/a&gt;. Fortunately, we later realized that even though resetting one single peer would hang, restarting the entire WireGuard interface (by downing the interface and re-initializing it) does not. This causes much less disruption to customer workloads on affected hosts, and we quickly fixed up all that we could find.&lt;/p&gt;

&lt;p&gt;To close out the incident, we added an alert for any WireGuard peers stuck in this way, and scheduled an upgrade to a later kernel version.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>High edge CPU usage resulting in high latency in ORD</title>
    <link rel="alternate" href="https://fly.io/infra-log/edge-cpu-ord/"/>
    <id>https://fly.io/infra-log/edge-cpu-ord/</id>
    <published>2026-04-19T00:00:00+00:00</published>
    <updated>2026-04-21T13:49:57+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;ORD edge nodes became CPU-saturated, which made traffic entering through the ORD region intermittently slow (and in some cases time out). Profiling on the affected edges showed fly-proxy spending an unexpectedly large amount of time in &lt;code&gt;pthread_mutex_{lock,unlock}&lt;/code&gt; calls. This is weird, because fly-proxy itself does not, in fact, use pthread mutexes &amp;ndash; it uses locks from the parking_lot crate, which is based on futex system calls directly. Eventually, we traced the likely cause of this lock contention to SQLite, which is used to directly access Corrosion&amp;rsquo;s local database to load app metadata used for routing. We reduced the number of SQLite connections fly-proxy opens on the ORD edges, which immediately dropped CPU usage and brought lookup latencies back to normal.&lt;/p&gt;

&lt;p&gt;We are, however, still unsure exactly which lock in SQLite caused the contention: initially, we suspected the per-connection lock SQLite uses to prevent concurrent access, but our Rust-side code (based on rusqlite) has explicitly marked connections as &lt;code&gt;!Sync&lt;/code&gt; and therefore they are never shared between threads in the first place. Our current hypothesis is that this is due to rusqlite&amp;rsquo;s use of the flag &lt;code&gt;SQLITE_ENABLE_MEMORY_MANAGEMENT&lt;/code&gt;, which puts a mutex on SQLite&amp;rsquo;s per-process page cache. However, we are still unable to definitively confirm that this is the case, because we lack stack traces through SQLite from the incident and have not managed to reproduce the issue at all. We have enabled more instrumentation in our code, which will hopefully give us more complete stack trace profiles should this happen again.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>NRT Machines API errors thrashed Managed Postgres</title>
    <link rel="alternate" href="https://fly.io/infra-log/machines-api-nrt/"/>
    <id>https://fly.io/infra-log/machines-api-nrt/</id>
    <published>2026-04-17T00:00:00+00:00</published>
    <updated>2026-04-21T13:49:57+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Managed Postgres clusters in NRT intermittently went unavailable when the regional Machines API began timing out and occasionally returning truncated 502 responses, which caused Kubernetes (via our virtual-kubelet) and the Postgres operator to repeatedly reschedule and recreate Machines. That feedback loop produced extra operator/pgbouncer Machines and kept some pods stuck “not initialized,” breaking routing for affected clusters for several minutes at a time. We stabilized the region by shifting load off unhealthy workers, bringing additional NRT capacity online, and restarting the affected control components, after which cluster health checks returned to normal.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>SYD host I/O saturation</title>
    <link rel="alternate" href="https://fly.io/infra-log/host-io-saturation/"/>
    <id>https://fly.io/infra-log/host-io-saturation/</id>
    <published>2026-04-15T00:00:00+00:00</published>
    <updated>2026-04-21T13:49:57+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Some workers in our SYD region became saturated on disk I/O, which in turn caused several Managed Postgres clusters to go unhealthy (with some temporarily offline) and led to slower/less reliable machine operations on affected hosts. This was mainly caused by a large number of machines suspending at once, and a lack of concurrency limits / queuing on this operation. We addressed this by limiting how much I/O is allocated to writing machines&amp;rsquo; memory snapshots to disk on suspend, and adding limits to the number of suspending machines allowed at once.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Web Sidekiq backlog from stuck usage jobs</title>
    <link rel="alternate" href="https://fly.io/infra-log/sidekiq-backlog-usage/"/>
    <id>https://fly.io/infra-log/sidekiq-backlog-usage/</id>
    <published>2026-04-14T00:00:00+00:00</published>
    <updated>2026-04-21T13:49:57+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;This was a weird one &amp;ndash; our Sidekiq instance, the one that powers background jobs of our GraphQL API and dashboard, repeatedly got stuck processing long-running billing and usage sync jobs. That led to large job backlogs and delayed processing. We first attempted to mitigate by scaling up and restarting stuck workers, which worked for a while, but eventually everything ground to a halt again. After chasing down a few false leads, we eventually tracked this down to two major causes:&lt;/p&gt;

&lt;p&gt;(1) when Sidekiq workers use too much memory, Sidekiq sends SIGUSR2 to kill the process, which makes the process stop accepting new work and wait for any existing work to complete before exiting;
(2) however, our database connections for those billing / usage jobs sometimes got stuck without any proper timeout. When their worker processes are killed with SIGUSR2, they get into a state where they neither exit nor make any further progress; eventually, we are left with no workers that can process jobs.&lt;/p&gt;

&lt;p&gt;We hardened our setup by adding proper timeouts to both Sidekiq worker shutdown and the database connection pool. We also separated billing / usage jobs into their own queue to avoid blocking other tasks that need to be processed.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Private networking outage in Sydney (again)</title>
    <link rel="alternate" href="https://fly.io/infra-log/private-networking-syd/"/>
    <id>https://fly.io/infra-log/private-networking-syd/</id>
    <published>2026-04-14T00:00:00+00:00</published>
    <updated>2026-04-21T13:49:57+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Private networking between our Sydney region and other regions started failing again as an upstream provider was filtering UDP traffic again. This was resolved promptly by moving traffic away from the affected provider.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Sprite creation errors in SJC and AMS</title>
    <link rel="alternate" href="https://fly.io/infra-log/sprite-errors-sjc-ams/"/>
    <id>https://fly.io/infra-log/sprite-errors-sjc-ams/</id>
    <published>2026-04-07T00:00:00+00:00</published>
    <updated>2026-04-08T19:33:38+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Sprites are provisioned from pools, which are our regional buffers for spinning up the underlying apps and machines. This was a brief incident where the pool monitors, responsible for replenishing these pools, weren&amp;rsquo;t running in a couple of regions. Once the monitors were up again, pools refilled and sprite creation returned to normal. Ideally we catch this before the pools are empty, so we&amp;rsquo;re adjusting our alerting to make sure that&amp;rsquo;s the case.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Sprites API errors</title>
    <link rel="alternate" href="https://fly.io/infra-log/sprites-api-errors/"/>
    <id>https://fly.io/infra-log/sprites-api-errors/</id>
    <published>2026-04-03T00:00:00+00:00</published>
    <updated>2026-04-03T22:42:11+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Some Sprites API endpoints saw elevated errors for a subset of organizations when the per-org database was resuming from an idle state. Only requests needing fresh sprite data were affected, whereas already-running Sprites continued to work normally, including new connections to them. Our orchestrator normally catches anything like this and tries again, but this was a novel failure mode it didn&amp;rsquo;t recognize as retryable.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>IAD CPU crunch</title>
    <link rel="alternate" href="https://fly.io/infra-log/cpu-crunch-iad/"/>
    <id>https://fly.io/infra-log/cpu-crunch-iad/</id>
    <published>2026-04-03T00:00:00+00:00</published>
    <updated>2026-04-03T22:31:48+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;IAD had a spike of machine start failures due to increased CPU usage across machines in the region. We had some dormant hosts in IAD sitting aside for boring things like system upgrades, so we brought some of these back online to fill the gap while spinning other resources up. &lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>ORD machine creates bogged down</title>
    <link rel="alternate" href="https://fly.io/infra-log/machine-creates-ord/"/>
    <id>https://fly.io/infra-log/machine-creates-ord/</id>
    <published>2026-04-02T00:00:00+00:00</published>
    <updated>2026-04-02T15:27:29+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;A solid chunk of machine creation requests in ORD were timing out. We tracked it down to one ORD server that had become a very frequent placement target but was taking extremely long to complete &lt;code&gt;flyd&lt;/code&gt; machine create operations. In poking at it, we noticed this host&amp;rsquo;s bolt store on disk was &lt;em&gt;huge&lt;/em&gt;, which bogged down &lt;code&gt;flyd&lt;/code&gt; enough that &lt;code&gt;flaps&lt;/code&gt;, our Machines API frontend, timed out the request before it completed. This was mitigated by pulling the host as a placement option, and then compacting its event store before putting it back in the pool.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>FRA region outage</title>
    <link rel="alternate" href="https://fly.io/infra-log/fra-outage/"/>
    <id>https://fly.io/infra-log/fra-outage/</id>
    <published>2026-04-02T00:00:00+00:00</published>
    <updated>2026-04-02T12:36:47+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Both redundant fibre links to our primary provider&amp;rsquo;s rack in FRA dropped offline. This caused most apps hosted in the FRA region to become inaccessible.
The links came back up after about 40 minutes, but some Managed Postgres clusters needed additional time to catch up.&lt;/p&gt;

&lt;p&gt;Honestly, we (and our provider) are unsure why this happened; it was likely human error of some sort. We don&amp;rsquo;t expect this to happen again.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>GraphQL timeouts</title>
    <link rel="alternate" href="https://fly.io/infra-log/graphql-timeouts/"/>
    <id>https://fly.io/infra-log/graphql-timeouts/</id>
    <published>2026-03-31T00:00:00+00:00</published>
    <updated>2026-04-02T12:36:47+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;We called an incident due to elevated errors through our GraphQL API, where requests were timing out when talking to our primary database. All errors were coming from a single machine in IAD, where the physical host for that machine seemed to be misbehaving and couldn&amp;rsquo;t hold a connection, so it earned itself a reboot. The host and its machines were behaving once it came back up, and everything seems fine, but this is a server we&amp;rsquo;re going to squint at with suspicion if it misbehaves in the future.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Errors viewing logs in Grafana</title>
    <link rel="alternate" href="https://fly.io/infra-log/grafana-log-errors/"/>
    <id>https://fly.io/infra-log/grafana-log-errors/</id>
    <published>2026-03-30T00:00:00+00:00</published>
    <updated>2026-04-02T12:36:47+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;The logs panel on &lt;code&gt;fly-metrics.net&lt;/code&gt; started throwing up an error due to an internal mTLS certificate expiring. This prevented customers from viewing logs in Grafana only, and both &lt;code&gt;fly logs&lt;/code&gt; and the dashboard log viewer continued to work. We fixed this, then briefly broke it again, then fixed it.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>DFW capacity, again</title>
    <link rel="alternate" href="https://fly.io/infra-log/dfw-capacity/"/>
    <id>https://fly.io/infra-log/dfw-capacity/</id>
    <published>2026-03-27T00:00:00+00:00</published>
    <updated>2026-04-02T12:36:47+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;The capacity we provisioned in DFW just two days earlier was slurped up in short order, so we were back to machines on some hosts being unable to start due to resource gates. With the popularity of the region, we found our placement would concentrate new machines onto the few best hosts at any given moment, which would itself create new bursts of start failures. In response to this incident we once again brought new hosts online, alongside improving our placement logic to better handle this case.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Metrics outage</title>
    <link rel="alternate" href="https://fly.io/infra-log/metrics-outage/"/>
    <id>https://fly.io/infra-log/metrics-outage/</id>
    <published>2026-03-26T00:00:00+00:00</published>
    <updated>2026-04-02T12:36:47+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Metrics graphs were impacted when the storage backing our metrics pipeline filled up, which stalled ingestion. Typically a metrics outage like this will backfill the missing data on recovery, but a configuration mistake this time meant we briefly accepted metrics without forwarding them on, resulting in roughly an hour of permanently lost metrics. We still have some open investigations on this one, such as why the storage filled up (it shouldn&amp;rsquo;t have), and why our alarms didn&amp;rsquo;t fire (they should have).&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>SJC disappeared (briefly)</title>
    <link rel="alternate" href="https://fly.io/infra-log/sjc-disappearance/"/>
    <id>https://fly.io/infra-log/sjc-disappearance/</id>
    <published>2026-03-25T00:00:00+00:00</published>
    <updated>2026-04-02T12:36:47+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;A firmware bug on a switch upstream of us caused traffic to our &lt;code&gt;sjc&lt;/code&gt; servers to be dropped. For a moment, most instances in &lt;code&gt;sjc&lt;/code&gt; were unreachable: apps saw connection failures, and some Managed Postgres clusters were knocked into unhealthy states. The upstream issue was fixed within a few minutes, so this was largely a monitoring incident on our end.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>DFW capacity errors</title>
    <link rel="alternate" href="https://fly.io/infra-log/capacity-errors-dfw/"/>
    <id>https://fly.io/infra-log/capacity-errors-dfw/</id>
    <published>2026-03-25T00:00:00+00:00</published>
    <updated>2026-04-02T12:36:47+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Another region hit its capacity ceiling after higher-than-normal growth. We were able to spin up more physical hosts over the following hours, during which machine start errors in DFW settled back down.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>A bunch of wedged Sprites</title>
    <link rel="alternate" href="https://fly.io/infra-log/wedged-sprites/"/>
    <id>https://fly.io/infra-log/wedged-sprites/</id>
    <published>2026-03-24T00:00:00+00:00</published>
    <updated>2026-04-02T12:36:47+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;A large number of Sprites became difficult or impossible to wake after migrations left them in a &lt;code&gt;failed&lt;/code&gt; state due to capacity limits in their region. This showed up as 502/503 responses from Sprites that should have been started on demand. From the perspective of the Fly Proxy this was strictly correct: &lt;code&gt;failed&lt;/code&gt; &lt;em&gt;should&lt;/em&gt; be a terminal state when a machine fails to launch. This incident showed that the state was also being set, mistakenly, for machines that failed to start after being migrated, which is very much not the same thing. Our quick fix was rolling out a proxy change that tries to start &lt;code&gt;failed&lt;/code&gt; machines anyway, and once that put out the fire we focused on cleaning up our migrations to handle capacity issues better across the board.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Tight capacity in ORD and SIN</title>
    <link rel="alternate" href="https://fly.io/infra-log/capacity-ord-sin/"/>
    <id>https://fly.io/infra-log/capacity-ord-sin/</id>
    <published>2026-03-23T00:00:00+00:00</published>
    <updated>2026-04-02T12:36:47+00:00</updated>
    <media:thumbnail url="https://fly.io/static/images/default-post-thumbnail.webp"/>
    <content type="html">&lt;p&gt;Demand in many of our regions is growing &lt;em&gt;a lot&lt;/em&gt; this year. So much so that our ORD and SIN regions hit capacity faster than we could bring new hardware online. The main impact from a constrained region is that existing machines may fail to start if the physical host they&amp;rsquo;re on is above our safety margin for resource usage. Deploys are also affected when no valid host can be found to place a new machine on. In this case, the incident was largely resolved by provisioning new ORD hardware and rebalancing some workloads across the region.&lt;/p&gt;
</content>
  </entry>
</feed>
