Kernel upgrade caused machine `stdout` to become wedged by Cloud Hypervisor

#May 11: Kernel upgrade caused machine `stdout` to become wedged by Cloud Hypervisor (14:30UTC)

This is probably one of the more interesting / confusing / complicated bugs we have had since the revival of Infra Log. It began on this day with us receiving reports about an outage of the Upstash Redis extension, reflected on their status page as well. Upstash Redis, when used as a Fly extension, runs on Fly Machines, just like other customers. The only difference is that, for various reasons, we run their Machines using Cloud Hypervisor rather than Firecracker. This has never caused problems before, and initially we were pretty certain this is an issue on Upstash side. As we worked with them to investigate, though, we got something confusing: the report that these Machines are stuck writing to stdout!

The stdout of a process running inside the machine is a pipe, one end acting as stdout and the other end connected to init, where it splice()s the pipe into a vsock connected to the hypervisor. A process on the host, called firefly, collects these logs from vsocks. For the stdout (write) end of the pipe to be stuck, one of these steps must have gone wrong: either firefly on host is failing to collect logs fast enough, or init inside the Machine is failing to splice() them into the vsock.

We were quickly able to rule out an issue with firefly. This means something went wrong when copying logs into the vsock, but that is extremely weird: the init process does use Rust async with Tokio, but at the end of the day, it is just repeatedly calling splice() to move pages around. We suspected all sorts of wakeup issues on the Rust side, because the way splice() works does not play nicely with Rust async and we had to use another pair of internal pipes as a workaround (we’ll not get into the details here, but the TL;DR is one cannot tell which side of the splice() caused an EWOULDBLOCK or EAGAIN; see something like tokio-splice2 if you are curious).

That was not what was happening. The Rust side was behaving just fine (* for some definition of “fine”, but no weird wakeup issues). Instead, the vsock itself seemed to be getting stuck. The only change(s) recently in this path was a batch of kernel upgrades due to recent vulnerabilities. We found this issue on Cloud Hypervisor’s GitHub repo, which points to a kernel change incompatible with older Cloud Hypervisor builds. The change was also backported to LTS kernels, which we upgraded recently, but without the corresponding Cloud Hypervisor fixes. Because the majority of Machines do not use Cloud Hypervisor (except GPU machines, and specific organizations), this did not show up in our testing. And because the change only affects large transfers, it only really triggers when a large write is performed on stdout, and by extension, the vsock. That is also a gap in our testing: not many apps emit logs as often as some of our biggest customers.

We ended up resolving the incident by upgrading Cloud Hypervisor to a version with the fixes. To mitigate similar issues in the future, we also introduced safeguards in init that will not allow a vsock or host-side issue to block the stdout pipe indefinitely. Unfortunately, the splice() + Rust async issue combined with the lack of a way to determine a Linux pipe’s remaining capacity reliably means we had to resort to a simple timeout on the vsock write side for this, which will still introduce some latency to stdout when the vsock side is wedged. It will not be an infinite delay, though, and it will also emit a corresponding warning for visibility when a timeout happens.