Incident Review – Microsoft Office 365 Outage
The internet spans many domains, but when it comes to the productivity suite that organizations depend on, Microsoft Office 365 is certainly one of the biggest contenders. Shortly after the recent CenturyLink/Lumen outage, we witnessed another major outage, this time affecting Microsoft O365. This might well be considered a bad month for the internet, as many everyday consumer services, including Reddit, Pinterest, and Google services, have been impacted. Today, Microsoft O365 services went down because of recently implemented changes (as per Microsoft 365’s status page). The changes were soon rolled back, but the downtime had already had a global impact.
Limited control over how changes are implemented can result in performance issues that eventually lead to outages. At around 21:25 UTC on 28th September, Microsoft Azure Active Directory (Azure AD), Microsoft's built-in solution for identity and access management, began to experience issues that locked many users out of Azure AD. As a result, they could not connect to any application governed by these services. Essentially, this meant that applications such as the Azure Portal, Teams, SharePoint, Outlook, Dynamics 365, and Power Platform were inaccessible to millions of users.
Catchpoint detected the issue at 21:15 UTC on 9/28/2020. Users noticed it when trying to log back in to Microsoft applications. Microsoft claimed that existing Office 365 sessions were still working and urged users not to close them. Many users disputed this, however, reporting that their current sessions were also affected, contrary to Microsoft's most recent update.
Fig 2: MS Teams login impacted
Multiple Microsoft services were down globally, returning a 503 error code with the message ‘Instance is overloaded’ (a 503 means the server is currently unable to handle requests because of temporary overloading or maintenance).
Fig 3: The issue had a global impact
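As a rough illustration of how a synthetic check might separate the 503 responses described above from other kinds of failure, here is a minimal Python sketch. The function names and the sample status codes are hypothetical; they are not taken from Catchpoint's or Microsoft's tooling.

```python
from collections import Counter
from http import HTTPStatus


def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse health bucket."""
    if status_code == HTTPStatus.SERVICE_UNAVAILABLE:  # 503
        return "unavailable"
    if 500 <= status_code <= 599:
        return "server_error"
    if 400 <= status_code <= 499:
        return "client_error"
    return "ok"


def summarize(status_codes):
    """Count how many probe results fall into each bucket."""
    return Counter(classify_status(code) for code in status_codes)


# Hypothetical probe results from several locations, not real measurements.
sample = [200, 503, 503, 500, 401, 503]
summary = summarize(sample)
print(summary)
```

Keeping 503 in its own bucket makes it easier to tell an overloaded or unavailable backend apart from, say, an authentication failure on the client side.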
Although Microsoft continued analyzing and trying to mitigate these problems, as mentioned in their tweets, thousands of users from all over the world continued reporting outages that prevented them from sending IMs and emails or accessing their online workspaces.
At 00:00 UTC on 9/29/2020, we saw services begin to recover, though not to normal levels, as Microsoft rolled back a few of the changes that had been implemented and had originally caused the server overload/unavailability. The rollback did not fully restore the services.
In a later tweet, at 00:48 UTC, Microsoft said it was rerouting traffic to its backup infrastructure to improve the end-user experience.
Fig 4: Outage Scatterplot
As highlighted in the snippet above, we can observe the traffic routing before, during, and after the issue. The primary flow goes through ‘login.windows.net’. After the rerouting to different infrastructure, we see traffic going to ‘ccs.login.microsoftonline.com’, which seems to have helped mitigate the issue.
Fig 5: Change in request routing
In Fig 5, we can see the redirection in real time as requests are routed to a new set of IPs, 52.xxx.xxx.xxx, instead of the original set, 20.xxx.xxx.xxx/40.xxx.xxx.xxx. This was done to mitigate the impact of the outage. However, it does not guarantee a good end-user experience, because enterprises use cloud-based security services, SD-WAN, and other WAN optimization services to improve and secure the employee experience. The traffic rerouting would therefore have a ripple effect on those vendors as well.
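One simple way to notice this kind of shift is to compare the leading octets of the addresses seen before and after the change. The sketch below is only an illustration: the addresses are placeholders chosen to match the masked ranges in the text, and the helper names are hypothetical.

```python
import ipaddress


def first_octets(addresses):
    """Return the set of leading octets seen in a list of IPv4 addresses."""
    return {int(ipaddress.IPv4Address(addr)) >> 24 for addr in addresses}


def new_leading_octets(before, after):
    """Leading octets that appear only after the routing change."""
    return first_octets(after) - first_octets(before)


# Placeholder addresses: the real ones are masked in the text as
# 20.xxx.xxx.xxx / 40.xxx.xxx.xxx (before) and 52.xxx.xxx.xxx (after).
before = ["20.0.0.1", "40.0.0.1", "40.0.0.2"]
after = ["52.0.0.1", "52.0.0.2"]

shifted = new_leading_octets(before, after)
print(sorted(shifted))
```

A non-empty result is only a hint that requests are landing on different infrastructure; a real check would compare full prefixes or the provider's published address ranges.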
Summary
Organizations, whether large or small, need to ensure they are monitoring every single service. Downtime is inevitable even for major providers of cloud services, and every time it happens, it costs corporations worldwide millions in lost business, productivity, and service reliability.
Public clouds are here to stay and play a vital role in how organizations run, but putting our workloads into Azure or Google does not make us immune to downtime. An outage, be it micro or major, could be tied to a single microservice or to the failure of the infrastructure as a whole.
Upholding SLAs is crucial to these service providers. Users cannot assume service levels are guaranteed just because the vendor says so. Outages like these result in SLA breaches, and without data to support your claim, you may not be eligible for penalties.
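To make the SLA point concrete, the sketch below turns measured downtime into an availability figure and compares it with a target. The 99.9% target and the 180 minutes of downtime are illustrative assumptions only, not figures from this incident or from any particular contract.

```python
def availability(total_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the period during which the service was available."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes


def breaches_sla(total_minutes: float, downtime_minutes: float, target: float) -> bool:
    """True when measured availability falls below the target."""
    return availability(total_minutes, downtime_minutes) < target


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes
ASSUMED_TARGET = 0.999             # illustrative target, not a quoted SLA
ASSUMED_DOWNTIME = 180             # hypothetical downtime in minutes

# Downtime budget implied by the assumed target over a 30-day period.
allowed_downtime = MINUTES_IN_30_DAYS * (1 - ASSUMED_TARGET)
measured = availability(MINUTES_IN_30_DAYS, ASSUMED_DOWNTIME)
breached = breaches_sla(MINUTES_IN_30_DAYS, ASSUMED_DOWNTIME, ASSUMED_TARGET)

print(f"allowed downtime: {allowed_downtime:.1f} min")
print(f"measured availability: {measured:.4%}, breached: {breached}")
```

The key input is the downtime figure itself, which is why independently collected monitoring data matters when making a claim.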
As part of end-to-end incident management, proactive monitoring significantly decreases MTTD. To help teams process incidents quickly and smoothly, Catchpoint provides a comprehensive view of key asset data, indicators, historical data, and more. Strategic proactive monitoring increases the effectiveness of Ops, SRE, and SOC/NOC teams by capturing and correlating asset and metric data from multiple sources and pointing to the microservice, or any other moving part of the delivery chain, that is at fault, which helps shrink the MTTR window. Take a look at Catchpoint’s New Normal Recommendations here.
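For readers who want to track these two metrics themselves, here is a minimal sketch of how MTTD and MTTR could be computed from incident timestamps. It assumes MTTD is measured from the start of an incident to its detection and MTTR from detection to resolution; teams define these differently, and the incidents below are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mttd(incidents) -> timedelta:
    """Mean time from incident start to detection."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents) -> timedelta:
    """Mean time from detection to resolution (one common definition)."""
    seconds = [(i.resolved - i.detected).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Hypothetical incidents, not data from the outage described in this post.
incidents = [
    Incident(datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 10, 10), datetime(2020, 1, 1, 11, 10)),
    Incident(datetime(2020, 1, 2, 14, 0), datetime(2020, 1, 2, 14, 20), datetime(2020, 1, 2, 14, 50)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```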
This blog was co-written by Wasil Banday and Sheikh Mursaleen