Incident Review – The Third AWS Outage in December: When It Rains, It Pours

Published December 23, 2021

The following is an analysis of the Amazon Web Services (AWS) incident on 12/22/2021.

When it comes to major AWS outages, three times is certainly not the charm. For the third time in three weeks, the public cloud giant reported an outage, this time due to a power outage “within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region,” according to the AWS status page.

Here at Catchpoint, we first observed issues at 7:11 AM ET, 24 minutes ahead of the AWS announcement.

An image of statistics of issues observed 24 minutes ahead of the AWS announcement

Once again, the outage triggered an unfortunate cascade of collateral damage. Drilling further into the AWS outage, we observed issues for AWS customers such as Slack, Udemy, Twilio, Okta, Imgur, and Jobvite, and even the NY Court system website.

The power outage itself was relatively brief, with Amazon reporting that power had been restored at 9:51 AM ET. However, full recovery efforts persisted well beyond that. The company reported at 12:28 PM ET that it had “restored underlying connectivity to the majority of the remaining” systems. The failures were mainly concentrated around Boston, New York, Philadelphia, and Toronto.

Long Road to Recovery

Ancillary effects proved vexingly persistent, as some AWS users continued to experience problems stemming from the outage. Slack’s status page, for example, did not mark the issue as fully resolved until 8:26 PM ET.

Of particular note was the outage suffered by Twilio, the cloud communications platform as a service with deep ties to AWS. The company first noted the issue at 7:17 AM ET:

“Our monitoring systems have detected a potential issue with multiple services. Our engineering team has been alerted and is actively investigating. We will update as soon as we have more information.”

From there, a string of status updates appeared, but the incident was not fully resolved until roughly 17 hours later, at 12:38 AM ET.

As the chart below illustrates, the AWS outage caused complete downtime for some customers, while others saw intermittent outages throughout the whole period. The symptoms varied across the impacted sites depending on how each system is architected on top of AWS and whether a CDN sits in front of it. For the same reasons, some of the impacted sites and applications took far longer to recover completely, depending on what recovery required; in one case, Jobvite took 24 hours to fully resolve its issues.

An image of an availability graph recording the outage time for customers
Availability graph (Catchpoint)

The downtime lasted for about two to three hours before recovery began for most of the sites. Some sites experienced issues connecting to the AWS servers. As observed in the screenshot below (taken during the outage), the client tried to establish a connection with a server, and the attempt eventually failed after 30 seconds.

Screenshot of a client trying to establish a connection with a server that failed after 30 seconds
Waterfall indicating connection timeout (Catchpoint)
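
As a rough sketch of what such a connection-level failure looks like from the client side (this is illustrative only, not Catchpoint’s tooling; the hostname, port, and 30-second timeout below are assumptions), a minimal synthetic check can bound how long it waits for the TCP handshake and report a timeout separately from an HTTP-level error:

```python
import socket
import time

# Hypothetical endpoint, used only for illustration; substitute the host you monitor.
HOST = "origin.example.com"
PORT = 443
CONNECT_TIMEOUT_S = 30  # mirrors the ~30-second connection failure seen in the waterfall


def check_tcp_connect(host: str, port: int, timeout: float) -> dict:
    """Attempt a TCP connection and report whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return {"ok": True, "seconds": round(time.monotonic() - start, 2)}
    except OSError as exc:  # socket.timeout is a subclass of OSError
        # Failing here means the TCP handshake never completed, which is the
        # symptom observed during the outage, rather than a bad HTTP response.
        return {"ok": False, "seconds": round(time.monotonic() - start, 2), "error": repr(exc)}


if __name__ == "__main__":
    print(check_tcp_connect(HOST, PORT, CONNECT_TIMEOUT_S))
```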

Other sites, such as Udemy, returned bad gateway or gateway timeout errors. The screenshot below shows the server returning a 502 response code for the request. A 502 Bad Gateway error means that the web server you connected to is acting as a proxy, relaying information from another server, but it received a bad or invalid response from that upstream server.

Screenshot showing that the server returned a 502 error response code
Waterfall indicating 502 response code and high wait time (Catchpoint)
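
To make the distinction concrete, the short sketch below (again purely illustrative; the URL and timeout are assumptions) separates gateway-layer failures such as 502 and 504, where the proxy answered but its upstream did not respond properly, from connection-level failures like the timeout above, where the client never reached a server at all:

```python
import urllib.error
import urllib.request

# Hypothetical URL, used only for illustration; substitute the page you monitor.
URL = "https://www.example.com/"

GATEWAY_ERRORS = {502: "Bad Gateway", 504: "Gateway Timeout"}


def classify_fetch(url: str, timeout: float = 30.0) -> str:
    """Fetch a URL and label gateway-layer errors separately from other failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK ({resp.status})"
    except urllib.error.HTTPError as exc:
        if exc.code in GATEWAY_ERRORS:
            # The proxy or CDN responded, but the server behind it (for example,
            # an origin in the affected Availability Zone) returned a bad or late response.
            return f"Gateway error: {exc.code} {GATEWAY_ERRORS[exc.code]}"
        return f"HTTP error from the responding server: {exc.code}"
    except urllib.error.URLError as exc:
        # DNS or connection-level failure: no HTTP response was received at all.
        return f"Connection-level failure: {exc.reason}"


if __name__ == "__main__":
    print(classify_fetch(URL))
```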

The Digital Domino Effect

Looking back at the AWS outages of the past three weeks, the underlying cause was different each time. However, the incidents all clearly illustrate the severe downstream effect that problems at one company can have on online services. Indeed, Catchpoint detected an outage at one of our clients’ SaaS applications, a digital illustration of the domino effect in action.

The bottom line? Ensuring availability and business continuity for your company is not a solo endeavor. When issues originating with partners, customers, and third-party providers can bring down your systems, it is time to build a collaborative strategy designed to support your extended digital infrastructure. For that, comprehensive observability is crucial.

Want to learn more?

Ready to learn more best practices to prevent, prepare for, and respond to an outage?

Download “2021 Internet Outages: A compendium of the year’s mischiefs and miseries – with a dose of actionable insights.”
