# May 19: Proxy and Corrosion in SIN weren’t on the same page (11:13 UTC)
During a rollout of the Fly Proxy, a new Corrosion query started throwing errors on a subset of hosts in Singapore. This query relied on a new column in our Corrosion schema, which had been rolled out globally the day prior. It turns out these hosts had received the new schema but hadn’t successfully reloaded it.
Once the new proxy came up, it failed to load apps from Corrosion and couldn’t serve any traffic. This made machines on these hosts unavailable, and caused a wave of Managed Postgres (MPG) healthcheck failures in the region.
We fixed this during the incident by forcing a reload of the Corrosion schema on these hosts, after which traffic returned to normal and all MPG cluster alerts resolved.
We made two changes to prevent this from happening again. First, we didn’t notice the problem during the schema rollout because Corrosion didn’t return an error for a failed reload. Corrosion now returns an error code when a reload fails, so we can revisit those hosts after a rollout. Second, this is the sort of thing we should catch in the proxy’s blue-green deployment. The error wasn’t hit until after the proxy marked itself healthy, though, so it had already taken over as the primary. Now the proxy prepares all of its SQL queries against Corrosion during its startup sequence, so the new proxy won’t come up if any of them fail.
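To make the second change concrete, here is a minimal sketch of a startup check that compiles every query against the local schema before the process reports itself healthy. It is not the proxy's actual code: it uses Python's standard `sqlite3` module as a stand-in for Corrosion, and the `apps` table, the `new_setting` column, the query names, and the `failed_queries` and `start_proxy` helpers are all hypothetical.

```python
import sqlite3

# Hypothetical queries; the table and column names are invented for this sketch.
STARTUP_QUERIES = {
    "load_apps": "SELECT id, name FROM apps",
    "load_app_by_name": "SELECT id, name, new_setting FROM apps WHERE name = ?",
}


def failed_queries(conn, queries):
    """Compile each query against the live schema; return {name: error} for failures."""
    failures = {}
    for name, sql in queries.items():
        # Naive placeholder count: fine here because no query contains a literal "?".
        params = (None,) * sql.count("?")
        try:
            # EXPLAIN makes SQLite compile the statement without running the query,
            # so a missing table or column surfaces here rather than at first use.
            conn.execute("EXPLAIN " + sql, params).fetchall()
        except sqlite3.Error as exc:
            failures[name] = str(exc)
    return failures


def start_proxy(conn):
    """Refuse to report healthy unless every startup query compiles."""
    failures = failed_queries(conn, STARTUP_QUERIES)
    if failures:
        raise RuntimeError(f"refusing to start, queries failed to prepare: {failures}")
    return "healthy"


if __name__ == "__main__":
    # Stale schema: the new column is missing, as on the affected hosts.
    stale = sqlite3.connect(":memory:")
    stale.execute("CREATE TABLE apps (id INTEGER PRIMARY KEY, name TEXT)")
    print(failed_queries(stale, STARTUP_QUERIES))
```

With a check like this in the startup path, a host running a stale schema fails before the health check passes, so a blue-green deployment would keep the old process as primary instead of handing over traffic.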