May 30: Deploys blocked by billing error (02:18 UTC)

For a few hours, deploys for some organizations failed with a “We require your billing information” error, even though those organizations had just added payment methods or credits. The cause was a mis-ordered deployment of a new Corrosion schema.

For some context: organization information is managed by our central GraphQL API, backed by a local database in iad. When an organization is updated, for instance when its billing information changes, the GraphQL API pushes the changes to the global Corrosion cluster so they can be read by the Machines API. When new information needs to be stored in Corrosion, we need to deploy two changes: a global change to the Corrosion (SQLite) schema, and a change to the GraphQL API to push the new data to the global cluster.

Earlier in the day, we had prepared a change to push some new organization data to Corrosion. This is usually a safe change, but this time the GraphQL API was deployed before the global schema had been updated. As a result, every organization update failed to propagate to Corrosion, so the Machines API never learned about the updated billing status of those organizations. To resolve the incident, we quickly reverted the change to the GraphQL API and backfilled the missing data in Corrosion.
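To make the failure mode concrete, here is a minimal sketch using a plain in-memory SQLite database as a stand-in for Corrosion. The `organizations` table, the `billing_ok` and `plan` columns, and the `push_org_update` function are all hypothetical names chosen for illustration; they are not the real schema or sync code. The point it illustrates is that a write naming a column the schema does not have yet fails as a whole, so the billing field carried in the same write is lost too.

```python
import sqlite3


def push_org_update(conn, org_id, billing_ok, plan):
    """Hypothetical sync write that carries a new column (`plan`).

    Returns True if the write was stored, False if SQLite rejected it.
    """
    try:
        conn.execute(
            "INSERT OR REPLACE INTO organizations (id, billing_ok, plan) "
            "VALUES (?, ?, ?)",
            (org_id, billing_ok, plan),
        )
        conn.commit()
        return True
    except sqlite3.OperationalError:
        # Raised when the table has no column named `plan`.
        return False


def simulate_incident():
    conn = sqlite3.connect(":memory:")
    # Old schema: the new `plan` column has not been rolled out yet.
    conn.execute(
        "CREATE TABLE organizations ("
        "id INTEGER PRIMARY KEY, billing_ok INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO organizations (id, billing_ok) VALUES (1, 0)")
    conn.commit()

    query = "SELECT billing_ok FROM organizations WHERE id = 1"

    # The writer ships first: the whole update is rejected.
    write_before = push_org_update(conn, 1, 1, "pro")
    seen_by_reader = conn.execute(query).fetchone()[0]

    # The schema change lands, then the missing update is backfilled.
    conn.execute("ALTER TABLE organizations ADD COLUMN plan TEXT")
    write_after = push_org_update(conn, 1, 1, "pro")
    after_backfill = conn.execute(query).fetchone()[0]

    return {
        "write_before_schema": write_before,
        "billing_ok_seen_by_reader": seen_by_reader,
        "write_after_schema": write_after,
        "billing_ok_after_backfill": after_backfill,
    }
```

In this sketch the reader keeps seeing the stale `billing_ok` value until the schema change lands and the update is replayed, which mirrors the revert-and-backfill sequence described above.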

We are looking into ways to alert on repeated sync failures, and to make GraphQL API deployments fail if the Corrosion schema is out of date.
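As a rough illustration of what those two safeguards could look like, the sketch below again uses plain SQLite in place of Corrosion. Everything named here is an assumption made for the example: `find_schema_gaps`, `SyncFailureAlarm`, the table and column names, and the threshold are hypothetical, not our actual tooling.

```python
import sqlite3


def find_schema_gaps(conn, expected):
    """Return {table: [missing columns]} for columns the writer needs.

    `expected` maps table names to the columns the new writer will use.
    Table names are interpolated into the PRAGMA, so they must come from
    trusted configuration, never from user input.
    """
    gaps = {}
    for table, columns in expected.items():
        # PRAGMA table_info yields one row per column; index 1 is its name.
        # An unknown table yields no rows, so every column counts as missing.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        present = {row[1] for row in rows}
        missing = sorted(set(columns) - present)
        if missing:
            gaps[table] = missing
    return gaps


class SyncFailureAlarm:
    """Signals once `threshold` sync attempts in a row have failed."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.consecutive = 0

    def record(self, ok):
        """Record one sync attempt; return True if an alert should fire."""
        if ok:
            self.consecutive = 0
            return False
        self.consecutive += 1
        return self.consecutive >= self.threshold
```

A deploy script could call `find_schema_gaps` against the live schema and exit non-zero when the result is non-empty, which would have stopped the writer from shipping ahead of the schema. The alarm counts consecutive failures rather than total failures so that an occasional transient error does not page anyone.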