Several Microsoft services, including Teams, Office 365, Xbox Live, OneDrive and Azure, went down last week for many people in the U.S. Shortly after the outage, Microsoft explained that a DNS-related issue was the cause. Over the weekend, the company released a root cause analysis report that sheds light on what went wrong (via ZDNet).
The report states:
Root Cause: Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches. As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service.
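The retry feedback loop described in the report can be illustrated with a toy model. The sketch below is not Microsoft's system: the function name `simulate`, the capacity and load figures, and the retry limit are all hypothetical, chosen only to show how retries from unanswered clients can inflate the load on an already saturated service.

```python
def simulate(base_load, capacity, ticks, max_retries=3):
    """Return the offered load per tick when unanswered queries are retried.

    Toy model (an assumption, not Azure DNS): every tick brings `base_load`
    new queries, the service answers at most `capacity` queries, and each
    unanswered query is retried on the next tick up to `max_retries` times.
    """
    # pending[i] = queries that have already been retried i times
    pending = [0.0] * (max_retries + 1)
    history = []
    for _ in range(ticks):
        pending[0] += base_load
        offered = sum(pending)
        history.append(offered)
        # Fraction of the offered queries the service manages to answer
        served_fraction = min(1.0, capacity / offered) if offered else 1.0
        next_pending = [0.0] * (max_retries + 1)
        for attempt, count in enumerate(pending):
            failed = count * (1 - served_fraction)
            if attempt < max_retries:
                # The client retries on the next tick
                next_pending[attempt + 1] = failed
            # Otherwise the client gives up and the query is dropped
        pending = next_pending
    return history


if __name__ == "__main__":
    # Enough capacity: the load stays at the base level
    print(simulate(base_load=100, capacity=200, ticks=5))
    # Reduced capacity (for example, less effective caching): retries pile up
    print(simulate(base_load=100, capacity=80, ticks=10))
```

In this model, the load never exceeds the base level while capacity is sufficient. Once capacity drops below the incoming load, retries push the offered load well above the original traffic, which mirrors the report's point that legitimate retry traffic added to the overload.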
Microsoft explains that its DNS services recovered automatically 39 minutes after the outage started. According to the company, the “recovery time exceeded [its] design goal.”
Going forward, Microsoft plans to repair the code defect that caused the outage and to improve its automatic detection and mitigation of such incidents.
The outage that occurred on April 2, 2021, was much shorter and less severe than the one on March 15, 2021. That earlier outage was caused by an error involving the rotation of keys for Azure.