March 26: ORD machine creates bogged down

March 26: ORD machine creates bogged down (15:18 UTC)

A solid chunk of machine creation requests in ORD were timing out. We tracked it down to a single ORD server that had become a very frequent placement target but was taking extremely long to complete flyd machine create operations. Poking at it, we noticed that this host’s bolt store on disk was huge, which bogged flyd down enough that flaps, our Machines API frontend, timed out requests before they completed. We mitigated this by pulling the host out of the placement pool, compacting its event store, and then putting it back.