March 17: A bunch of wedged Sprites (11:30 UTC)

A large number of Sprites became difficult or impossible to wake after migrations left them in a failed state due to capacity limits in their region. This showed up as 502/503 responses from Sprites that should have been started on demand. From the perspective of the Fly Proxy, this behavior was strictly correct: failed is meant to be a terminal state for a machine that fails to launch, so there is normally no point in trying to start one. The incident showed that this state was also being set, mistakenly, on machines that failed to start after being migrated, which is very much not the same thing. Our quick fix was to roll out a proxy change that tries to start failed machines anyway. Once that put out the fire, we focused on cleaning up our migrations to handle capacity issues better across the board.
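To make the before and after concrete, here is a minimal sketch of the start-on-demand decision described above. It is an illustration only, not the Fly Proxy's actual code: the `MachineState` enum, the `should_attempt_start` function, and the `retry_failed` flag are all hypothetical names invented for this example.

```python
from enum import Enum


class MachineState(Enum):
    # Hypothetical, simplified set of states for illustration.
    STOPPED = "stopped"
    STARTED = "started"
    FAILED = "failed"


def should_attempt_start(state: MachineState, retry_failed: bool) -> bool:
    """Decide whether an incoming request should trigger a start attempt.

    retry_failed=False models the original behavior (failed is terminal).
    retry_failed=True models the quick fix (try failed machines anyway).
    """
    if state is MachineState.STARTED:
        return False  # already running, nothing to start
    if state is MachineState.FAILED:
        return retry_failed  # terminal before the fix, retried after it
    return True  # stopped machines are started on demand


if __name__ == "__main__":
    for retry in (False, True):
        decision = should_attempt_start(MachineState.FAILED, retry_failed=retry)
        print(f"retry_failed={retry}: attempt start on failed machine? {decision}")
```

In this sketch, a machine wrongly marked failed after a migration is never woken while `retry_failed` is off, which corresponds to the 502/503 responses seen during the incident. Turning the flag on only works around the mislabeled state; the longer-term fix in the migration path is what stops machines from landing in that state for capacity reasons.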