May 20: SYD egress IP networking broken on new workers (04:46 UTC)

Some newly provisioned hosts in our Sydney (SYD) region were not configured correctly for egress IP connectivity. As a result, a number of Machines using egress IPs in the region were unable to access the network. During the incident, we immediately migrated the affected Machines to known-good hosts.

Recently, we moved configuration for some infra components (including the VXLAN interface backing egress IPs) to a new, more scalable system. The rollout appeared to be successful, but an interaction with a legacy deployment method prevented the configuration service from being restarted correctly. As a result, VXLAN kept working on existing hosts but was not provisioned on new hosts.

Our egress IP monitoring was set up when egress IPs were machine-scoped rather than app-scoped (see this forum post for more context). Under that model, monitoring every host would have required one IP address per host, so only a couple of monitoring Machines were set up in each region and not every host was monitored. After this incident, we ported egress IP monitoring to app-scoped IPs, with a monitoring Machine running on every host.
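
The coverage gap described above can be illustrated with a minimal sketch. This is not the actual monitoring code. The host names, Machine IDs, and the `unmonitored_hosts` helper are all hypothetical, and the sketch assumes only that we know which host each monitoring Machine runs on.

```python
def unmonitored_hosts(hosts, monitors):
    """Return the sorted list of hosts with no monitoring Machine on them.

    hosts:    iterable of host IDs in a region (hypothetical names)
    monitors: iterable of (machine_id, host_id) pairs, one per monitoring Machine
    """
    covered = {host_id for _machine_id, host_id in monitors}
    return sorted(set(hosts) - covered)


# Hypothetical region with four hosts; the last two are newly provisioned.
hosts = ["syd-host-1", "syd-host-2", "syd-host-3", "syd-host-4"]

# Old layout: a couple of monitoring Machines per region.
per_region_monitors = [("mon-a", "syd-host-1"), ("mon-b", "syd-host-2")]

# New layout: one monitoring Machine on every host.
per_host_monitors = [(f"mon-{i}", host) for i, host in enumerate(hosts)]

print(unmonitored_hosts(hosts, per_region_monitors))
print(unmonitored_hosts(hosts, per_host_monitors))
```

With a couple of monitors per region, a misconfigured new host can fall outside the monitored set entirely. With one monitor per host, the uncovered set is empty by construction.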