May 28: DNS cache was broken for CNAME'd domains (09:59 UTC)

Some customers saw persistent DNS resolution failures for certain external hostnames, which only cleared when we restarted corro-dns, our recursive DNS resolver. It turned out that the domains they were trying to resolve had intermittent failures upstream. Oddly, that by itself should not cause persistent problems: although corro-dns caches DNS responses, it caches failures only very briefly and retries quickly after a failed resolution. The cache should eventually be populated with a valid response, and if more upstream errors happen after that, corro-dns is allowed to serve the expired cache entry.
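As a rough illustration of the intended behavior described above (short-lived failure caching plus serving an expired entry when upstream is down), here is a minimal Python sketch. It is not corro-dns code: `StaleServingCache`, `UpstreamError`, `NEGATIVE_TTL`, and the shape of the `upstream` callable are all hypothetical names chosen for the example.

```python
import time

NEGATIVE_TTL = 1.0  # hypothetical: how long a failure is remembered, in seconds


class UpstreamError(Exception):
    """Raised by the (hypothetical) upstream lookup when resolution fails."""


class StaleServingCache:
    def __init__(self, upstream, clock=time.monotonic):
        self.upstream = upstream   # callable: name -> (answer, ttl_seconds)
        self.clock = clock
        self.entries = {}          # name -> (answer or None, expires_at)
        self.last_good = {}        # name -> last successful answer, kept past expiry

    def resolve(self, name):
        now = self.clock()
        cached = self.entries.get(name)
        if cached is not None and cached[1] > now:
            answer = cached[0]
            # A fresh negative entry (None) suppresses retries only briefly.
            return answer if answer is not None else self._stale_or_raise(name)
        try:
            answer, ttl = self.upstream(name)
        except UpstreamError:
            # Remember the failure for a short time, then fall back to stale data.
            self.entries[name] = (None, now + NEGATIVE_TTL)
            return self._stale_or_raise(name)
        self.entries[name] = (answer, now + ttl)
        self.last_good[name] = answer
        return answer

    def _stale_or_raise(self, name):
        if name in self.last_good:
            return self.last_good[name]  # serve the expired answer
        raise UpstreamError(name)
```

With a cache like this, an intermittent upstream failure costs at most `NEGATIVE_TTL` seconds of suppressed retries, which is why the persistent failures were surprising.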

It turns out that this cache logic failed to account for cases where domain A is CNAME'd onto domain B and only domain B fails to resolve. In that case, corro-dns ended up with a cached CNAME entry for A -> B but no corresponding entry for B. A subsequent request for domain A hits the cache for the CNAME, but corro-dns does not spawn a new query for domain B, since it considers the request already answered from cache. It then returns only the CNAME record to the client. Most clients will not spawn another query either, and just report to the user that no A or AAAA records were returned. This situation does not clear itself until the TTL of the CNAME record expires, which in this case was very long.
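The failure mode can be reproduced with a deliberately simplified model of a response-level cache. This is a sketch under assumptions, not the real corro-dns logic: `ResponseCache`, `Record`, `UpstreamError`, and the one-record-per-name `upstream` callable are hypothetical, and TTL handling is left out because the CNAME's TTL was long enough not to matter here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    rtype: str   # "CNAME" or "A" in this toy model
    value: str   # target name for a CNAME, address for an A record


class UpstreamError(Exception):
    """Raised by the (hypothetical) upstream lookup when resolution fails."""


class ResponseCache:
    """Caches the whole response per queried name, as in the flawed design."""

    def __init__(self, upstream):
        self.upstream = upstream  # callable: name -> Record, may raise UpstreamError
        self.cache = {}           # queried name -> list of Records

    def resolve(self, name):
        if name in self.cache:
            # Cache hit: the stored response is returned as-is, even if it
            # contains only a CNAME and no address records.
            return self.cache[name]
        records = []
        current = name
        try:
            while True:
                rec = self.upstream(current)
                records.append(rec)
                if rec.rtype != "CNAME":
                    break
                current = rec.value
        except UpstreamError:
            if not records:
                raise
            # Bug being modeled: the partial (CNAME-only) response is cached.
        self.cache[name] = records
        return records
```

If the lookup for B fails once while A's CNAME resolves, every later `resolve("A")` returns the CNAME-only response from cache and never asks upstream about B again, even after B has recovered.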

We mitigated this issue for now by skipping the cache when any unexpected failure happens while resolving a domain. The root cause, however, is that corro-dns caches full DNS responses rather than individual DNS records, and does not “fill in” additional records when only a CNAME can be cached. Our plan is to refactor this layer of caching to prevent similar bugs in the future.
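To make the difference between response-level and record-level caching concrete, here is one possible shape of a per-record cache that fills in the missing part of a CNAME chain. It is only a sketch of the general idea, not the planned corro-dns refactor: `RecordCache`, `Record`, `UpstreamError`, and `max_depth` are hypothetical, and TTLs are again omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    rtype: str   # "CNAME" or "A" in this toy model
    value: str   # target name for a CNAME, address for an A record


class UpstreamError(Exception):
    """Raised by the (hypothetical) upstream lookup when resolution fails."""


class RecordCache:
    """Caches individual records and walks the CNAME chain on every request."""

    def __init__(self, upstream):
        self.upstream = upstream  # callable: name -> Record, may raise UpstreamError
        self.records = {}         # owner name -> Record

    def resolve(self, name, max_depth=8):
        chain = []
        current = name
        for _ in range(max_depth):
            rec = self.records.get(current)
            if rec is None:
                # Only the missing link is queried; a failure propagates to the
                # caller instead of being hidden behind a cached CNAME.
                rec = self.upstream(current)
                self.records[current] = rec
            chain.append(rec)
            if rec.rtype != "CNAME":
                return chain
            current = rec.value
        raise RuntimeError("CNAME chain too long")
```

In this model a failed lookup for B leaves the A -> B CNAME cached, but the next request for A retries B rather than returning a CNAME-only answer.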