May 19: IAD observability disrupted by NATS misconfiguration
#May 19: IAD observability disrupted by NATS misconfiguration (19:11UTC)
A server being decommissioned began advertising a bad NATS config (specifically, an empty connection URL), which caused the logs/metrics exporters of various hosts in the IAD region to crash. During the incident, many Machines in IAD had missing metrics, and some customers may have also seen gaps or delays in log delivery. We mitigated by removing the decommissioned server from the NATS cluster and restarting the affected metrics exporters across IAD to restore normal telemetry flow, and are discussing options to remove the reliance on NATS from the logs/metrics pipelines.