Incident Review – The Third AWS Outage in December: When it Rains, it Pours
The following is an analysis of the Amazon Web Services (AWS) incident on 12/22/2021.
When it comes to major AWS outages, three times is certainly not the charm. For the third time in three weeks, the public cloud giant reported an outage, this time due to a power outage “within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region,” according to the AWS status page.
Here at Catchpoint, we first observed issues at 7:11 AM ET, 24 minutes ahead of the AWS announcement.
Once again, the outage triggered an unfortunate cascade of collateral damage. Drilling down further into the AWS outage, we observed issues for AWS customers such as Slack, Udemy, Twilio, Okta, Imgur, and Jobvite, and even the New York court system website.
The power outage itself was relatively brief, with Amazon reporting that power had been restored at 9:51 AM ET. However, full recovery efforts persisted well beyond that. The company reported at 12:28 PM ET that it had “restored underlying connectivity to the majority of the remaining” systems. The failures were mainly concentrated around Boston, New York, Philadelphia, and Toronto.
Long Road to Recovery
Ancillary effects proved vexingly persistent, as some AWS users continued to experience problems long after power was restored. Slack's status page, for example, did not mark the issue as fully resolved until 8:26 PM ET.
Of particular note was the outage suffered by Twilio, the cloud communications platform as a service with deep ties to AWS. The company first noted the issue at 7:17 AM ET:
“Our monitoring systems have detected a potential issue with multiple services. Our engineering team has been alerted and is actively investigating. We will update as soon as we have more information.”
From there, a string of status updates followed, but the incident was not fully resolved until roughly 17 hours later, at 12:38 AM ET.
As the chart below illustrates, the AWS outage caused complete downtime for some customers, while others saw intermittent outages throughout the period. Symptoms varied across the impacted sites depending on how each system was architected on top of AWS and on its reliance on CDNs in front of AWS. Some sites and applications also took much longer to recover fully, again because of how they were architected and what recovery required; Jobvite, for instance, took 24 hours to fully resolve its issues.
The impact lasted for about two to three hours before recovery began for most of the sites. Some sites experienced issues connecting to the AWS servers. As observed in the screenshot below, taken during the outage, the client was trying to establish a connection with a server and eventually failed after 30 seconds.
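To make that failure mode concrete, here is a minimal sketch of the kind of synthetic check that surfaces it: a TCP connection attempt that gives up after 30 seconds. The hostname is hypothetical and stands in for any affected endpoint; this is an illustration in Python, not the actual monitoring tooling used here.

```python
import socket

# Hypothetical endpoint, used only for illustration; the affected hosts
# in this incident are not named.
HOST = "app.example.com"
PORT = 443
TIMEOUT_SECONDS = 30  # mirrors the ~30-second connection failure observed


def check_tcp_connect(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (socket.timeout, OSError) as exc:
        # Timeouts and refused/unreachable connections both show up here.
        print(f"Connection to {host}:{port} failed: {exc}")
        return False


if __name__ == "__main__":
    check_tcp_connect(HOST, PORT, TIMEOUT_SECONDS)
```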
Other sites, such as Udemy, returned bad gateway or gateway timeout errors. The screenshot below shows that the server returned a 502 response code for the request. A 502 Bad Gateway error means that the web server you connected to is acting as a proxy, relaying information from another server, but it received a bad or invalid response from that upstream server.
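A simple way to see the distinction is a check that separates gateway errors (502/504, where the proxy answered but its upstream did not) from other failures. The sketch below uses a placeholder URL and Python's standard library only; it is illustrative, not the tooling used in this investigation.

```python
from urllib import request, error

# Placeholder URL for illustration only; not an actual affected endpoint.
URL = "https://www.example.com/"
GATEWAY_ERRORS = {502, 504}  # Bad Gateway / Gateway Timeout


def classify_response(url: str) -> str:
    """Label a URL check as healthy, a gateway (upstream) error, or another failure."""
    try:
        with request.urlopen(url, timeout=10) as resp:
            return f"healthy ({resp.getcode()})"
    except error.HTTPError as exc:
        if exc.code in GATEWAY_ERRORS:
            # The proxy or load balancer responded, but its upstream did not
            # return a valid response in time.
            return f"gateway error ({exc.code})"
        return f"HTTP error ({exc.code})"
    except error.URLError as exc:
        # No HTTP response at all, e.g. DNS failure or connection timeout.
        return f"connection failure ({exc.reason})"


if __name__ == "__main__":
    print(classify_response(URL))
```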
The Digital Domino Effect
Looking back at the AWS outages of the past three weeks, the underlying causes all differ. However, the incidents clearly illustrate the severe downstream effect that problems at one company can have on other online services. Indeed, Catchpoint detected an outage at one of our clients' SaaS applications, a digital illustration of the domino effect in action.
The bottom line? Ensuring availability and business continuity for your company is not a solo endeavor. When issues originating with partners, customers, and third-party providers can bring down your systems, it is time to build a collaborative strategy designed to support your extended digital infrastructure. For that, comprehensive observability is crucial.
Want to learn more?
Ready to learn more about best practices to prevent, prepare for, and respond to an outage?