# April 8: SYD host I/O saturation (23:14 UTC)
Some workers in our SYD region became saturated on disk I/O. This caused several managed Postgres clusters to become unhealthy (some were temporarily offline) and made machine operations on the affected hosts slower and less reliable. The main cause was a large number of machines suspending at once, combined with a lack of concurrency limits or queuing on the suspend operation. We addressed this by limiting how much I/O is allocated to writing machines’ memory snapshots to disk on suspend, and by limiting the number of machines allowed to suspend at once.
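
To illustrate the second mitigation, here is a minimal sketch of a concurrency limit with queuing on suspend. It is not our actual implementation: `suspend_all`, `fake_write_snapshot`, `run_demo`, and the limit of 4 are hypothetical names and values chosen for the example, and the snapshot write is simulated with a short sleep. The idea is that suspend requests beyond the limit wait for a free slot rather than all writing to disk at the same moment.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def suspend_all(
    machine_ids: Iterable[str],
    write_snapshot: Callable[[str], Awaitable[None]],
    max_concurrent: int,
) -> int:
    """Suspend every machine, never running more than max_concurrent at once.

    Returns the peak number of suspends that were in flight together.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")

    # Created inside the running coroutine so it binds to the active event loop.
    slots = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def suspend_one(machine_id: str) -> None:
        nonlocal in_flight, peak
        # Requests beyond the limit queue here instead of hitting the disk.
        async with slots:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await write_snapshot(machine_id)
            finally:
                in_flight -= 1

    await asyncio.gather(*(suspend_one(m) for m in machine_ids))
    return peak


async def fake_write_snapshot(machine_id: str) -> None:
    # Hypothetical stand-in for writing a memory snapshot to disk.
    await asyncio.sleep(0.01)


def run_demo(machine_count: int = 20, max_concurrent: int = 4) -> int:
    # Simulate many machines asking to suspend at the same moment.
    ids = [f"machine-{i}" for i in range(machine_count)]
    return asyncio.run(suspend_all(ids, fake_write_snapshot, max_concurrent))
```

Calling `run_demo(20, 4)` simulates 20 simultaneous suspend requests while allowing only 4 snapshot writes to overlap; the remaining requests wait in the semaphore's queue. A cap like this bounds the number of writers but not how fast each one writes, which is why a separate limit on I/O for snapshot writes is also useful.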