Internet Outages Timeline
Dive into high-profile Internet disruptions. Discover their impact, root causes, and essential lessons to ensure the resilience of your Internet Stack.
2024
October
Mashery
What Happened?
On October 1, 2024, TIBCO Mashery, an enterprise API management platform leveraged by some of the world’s most recognizable brands, experienced a significant outage. At around 7:10 AM ET, users began encountering SSL connection errors. Internet Sonar revealed that the root cause wasn’t an SSL failure but a DNS misconfiguration affecting access to key services.
Takeaways
The Mashery outage reveals a crucial lesson: SSL errors can be just the tip of the iceberg. The real issue often lies deeper; in this case, a DNS misconfiguration. If DNS isn’t properly configured or monitored, the entire system can fail, and what looks like a simple SSL error can spiral into a much bigger problem. To truly safeguard against the fragility of the Internet, you need full visibility into every layer of the Internet Stack, from DNS to SSL and beyond.
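If you want to see past a surface-level SSL error yourself, one lightweight approach is to probe each layer independently, so a DNS failure is never mistaken for a certificate problem. Here is a minimal sketch using only the Python standard library; the hostname is a placeholder and the checks are deliberately far simpler than what a full IPM platform does.

```python
import socket
import ssl

HOST = "www.example.com"  # placeholder target, not Mashery's real hostname
PORT = 443

def check_layers(host: str, port: int = 443) -> str:
    # Layer 1: DNS resolution. If this fails, the problem sits below SSL.
    try:
        addrs = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return f"DNS failure: {exc}"
    ip = addrs[0][4][0]

    # Layer 2: TCP connect to the resolved address.
    try:
        sock = socket.create_connection((ip, port), timeout=5)
    except OSError as exc:
        return f"TCP failure to {ip}: {exc}"

    # Layer 3: TLS handshake. Only a failure here is truly an "SSL error".
    try:
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return f"OK: resolved {ip}, negotiated {tls.version()}"
    except ssl.SSLError as exc:
        return f"TLS failure: {exc}"
    finally:
        sock.close()

print(check_layers(HOST, PORT))
```

Running the same probe from several networks quickly tells you whether a misconfiguration is global or limited to certain resolvers or regions.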
September
Reliance Jio
What Happened?
On September 17, 2024, Reliance Jio suffered a major network outage affecting customers across multiple regions of India as well as users abroad. The outage was first noticed when users began encountering connection timeouts while attempting to access both the AJIO and Jio websites. The outage was resolved around 05:42 EDT.
Takeaways
Gaining full visibility across the entire Internet Stack, including external dependencies like CDNs, DNS, and ISPs, is critical for businesses. Proactive monitoring is essential for early detection of issues such as packet loss and latency, helping companies mitigate risks before they escalate into major outages.
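To make the latency point concrete, here is a rough sketch of a proactive check that measures TCP connect time and flags samples above a threshold. The target host, sample count, and 500 ms threshold are illustrative assumptions, not values taken from this incident.

```python
import socket
import statistics
import time

HOST = "www.example.com"  # illustrative target, not a Jio endpoint
PORT = 443
SAMPLES = 5
THRESHOLD_MS = 500        # example alert threshold

def connect_time_ms(host, port, timeout=5.0):
    """Return TCP connect latency in milliseconds, or None on failure/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None  # a timeout here is the "connection timeout" users saw

samples = [connect_time_ms(HOST, PORT) for _ in range(SAMPLES)]
failures = samples.count(None)
latencies = [s for s in samples if s is not None]

if failures:
    print(f"{failures}/{SAMPLES} connects failed or timed out")
if latencies:
    median_ms = statistics.median(latencies)
    print(f"median connect time: {median_ms:.1f} ms")
    if median_ms > THRESHOLD_MS:
        print("latency above threshold: investigate before it becomes an outage")
```

In practice, checks like this run continuously from vantage points close to your users, not once from a single machine.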
August
ServiceNow
What Happened?
On August 15, at 14:15 ET, ServiceNow experienced a significant outage lasting 2 hours and 3 minutes. Catchpoint's Internet Sonar detected the disruption through elevated response and connection timeout errors across major geographic locations. The disruption, caused by instability in connectivity with upstream provider Zayo (AS 6461), impacted ServiceNow's core services and client integrations. The outage resulted in intermittent service availability, with users facing high connection times and frequent timeouts.
Takeaways
A proactive approach to BGP monitoring is crucial to prevent extended outages. ServiceNow's quick response to reroute traffic is a good example of how effective incident management and holding vendors accountable can make all the difference in keeping things running and keeping your users happy.
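On the BGP side, even a simple scheduled check against public routing data can surface upstream instability early. The sketch below queries RIPEstat's public announced-prefixes endpoint for AS 6461 (Zayo) and compares the result to a saved snapshot; the endpoint, response shape, and snapshot logic are assumptions about that public API, not a description of how Catchpoint monitors BGP.

```python
import json
import urllib.request

# AS 6461 (Zayo) is the upstream provider named in this incident.
ASN = "AS6461"
URL = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

with urllib.request.urlopen(URL, timeout=10) as resp:
    payload = json.load(resp)

# Assumed response shape: data.prefixes is a list of {"prefix": ...} entries.
prefixes = {entry["prefix"] for entry in payload["data"]["prefixes"]}
print(f"{ASN} currently announces {len(prefixes)} prefixes")

# Naive alerting: compare against a previously saved snapshot and flag churn.
# EXPECTED is a placeholder; in practice you would load yesterday's snapshot.
EXPECTED = set(prefixes)
missing = EXPECTED - prefixes
if missing:
    print(f"WARNING: {len(missing)} expected prefixes are no longer announced")
```

A real setup would also watch route visibility from multiple collectors and alert on sudden withdrawals or path changes, not just prefix counts.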
AWS
What Happened?
On August 14, between 8:00 and 8:25 UTC, AWS experienced a micro-outage affecting services like S3, EC2, CloudFront, and Lambda. Catchpoint's Internet Sonar detected connection timeouts across multiple regions, particularly in locations routing through CenturyLink AS209 and Lumen AS3356. This disruption, though not reflected on AWS’s status page, significantly impacted these regions' access to AWS services.
Takeaways
Status pages aren't always reliable indicators of service health. If you’re only relying on cloud-based monitoring tools, you’re in trouble if their cloud goes down. It’s good practice to diversify your monitoring strategy and have a fallback plan to ensure Internet resilience. Clear communication will also help you maintain trust with your users.
July
Disney+
What Happened?
On July 31, at 20:12 EDT, Disney Plus experienced a brief outage lasting 38 minutes. Catchpoint detected 502 Bad Gateway errors from multiple nodes, an issue that was confirmed through both automated tests and manual browsing. The disruption was resolved by 20:50 EDT.
Takeaways
This incident shows why it's so important to monitor your services from multiple vantage points to quickly detect and verify outages. Even short-lived disruptions can impact user experience, making continuous monitoring and rapid response essential.
Alaska Airlines
What Happened?
On July 23, from 14:35 to 14:52, Alaska Airlines’ website (www.alaskaair.com) returned 404 Not Found errors, rendering the site inaccessible for roughly 17 minutes. Catchpoint detected the issue, confirming the failures across multiple tests. Response headers indicated the problem stemmed from configuration errors, as evidenced by the 404 responses and subsequent cache misses.
Microsoft Outlook
What Happened?
Starting at 21:23 EDT on July 23, Microsoft Outlook experienced intermittent failures across multiple regions. Users encountered various errors, including 404 Not Found, 400 Bad Request, and 503 Service Unavailable, when trying to access https://www.outlook.com/ and https://outlook.live.com/owa/. Catchpoint’s Internet Sonar detected the issue through multiple tests, while Microsoft’s official status page did not report any outages at the time.
Takeaways
Another example of how intermittent issues, which can pose the most persistent threat to observability, may not be reflected on official status pages. Given the high cost of Internet disruptions, even a brief delay in addressing these issues can be extraordinarily expensive. And if you’re waiting for your provider to tell you when something’s wrong, that delay could be even longer.
Azure
What Happened?
On July 18, starting at 18:36 EDT, Azure’s US Central region experienced a major service outage lasting until 22:17 EDT. Initially, 502 Bad Gateway errors were reported, followed by 503 Service Unavailable errors. This outage impacted numerous businesses reliant on Azure Functions, as well as Microsoft 365 services like SharePoint Online, OneDrive, and Teams, which saw significant disruptions.
Takeaways
This incident occurred within 24 hours of a separate CrowdStrike outage, leading to confusion in the media as both issues were reported simultaneously. Companies that relied solely on Azure without multi-region or multi-cloud strategies were significantly impacted, particularly those using eCommerce APIs. Catchpoint’s Internet Sonar detected the outage early and helped isolate the issue, confirming that it wasn’t related to network problems, saving time on unnecessary troubleshooting.
CrowdStrike
What Happened?
On July 19, a massive global outage disrupted critical services worldwide, affecting systems running Microsoft Windows. The outage, caused by a faulty automatic software update from cybersecurity firm CrowdStrike, knocked Windows PCs and servers offline, forcing them into a recovery boot loop. This unprecedented outage impacted daily life on a global scale, grounding airlines, taking emergency services offline, and halting operations for major banks and enterprises.
Takeaways
The CrowdStrike outage is a wake-up call for how fragile our digital world really is. Everything we do relies on these systems, and when they fail, the ripple effects are huge. This incident shows just how important it is to be prepared. Know your dependencies, test updates like your business depends on it (because it does), and have a plan for when things go wrong. Don’t just assume everything will work—make sure it will. And remember, resilience isn’t just about your tech; it’s about your team too. Keep them trained, keep them ready, and make sure they know what to do when the unexpected happens.
June
May
Bing
What Happened?
On May 23, starting at 01:39 EDT, Bing experienced an outage with multiple 5xx errors affecting users globally. The issue was detected by Catchpoint’s Internet Sonar and confirmed through manual checks. The outage disrupted access to Bing’s homepage, impacting user experience across various regions.
Takeaways
This incident shows the value of having robust monitoring in place. Quick detection and confirmation are crucial for minimizing the impact of such outages.
Google
What Happened?
On May 1, starting at 10:40 Eastern, Google services experienced a 34-minute outage across multiple regions, with users encountering 502 Bad Gateway errors. The issue affected accessibility in locations including Australia, Canada, and the UK. Internet Sonar detected the incident and the outage was also confirmed via manual checks.
April
X (Twitter)
What Happened?
On April 29, starting at 03:29 EDT, X (formerly known as Twitter) experienced an outage where users encountered high wait times when trying to access the base URL 'twitter.com.' The issue was detected by Internet Sonar, with failures reported from multiple locations. Manual checks also confirmed the outage. Additionally, during this time, connection timeouts were observed for DFS and Walmart tests due to failed requests to Twitter’s analytics service, further impacting both platforms.
March
ChatGPT
What Happened?
On April 30, starting at 03:00 EST, ChatGPT’s APIs experienced intermittent failures due to HTTP 502 (Bad Gateway) and HTTP 503 (Service Unavailable) errors. Micro-outages were observed at various intervals, including 03:00-03:05 EST, 03:49-03:54 EST, and 03:58-03:59 EST. These disruptions were detected by Catchpoint’s Internet Sonar and confirmed through further investigation.
Takeaways
Even brief micro-outages can affect services and user experience. Early detection is key to minimizing impact.
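Catching windows this short usually comes down to polling frequently and recording every run of consecutive failures. Here is a minimal sketch of that idea against a placeholder URL; the interval, check count, and window logic are illustrative assumptions, not how Internet Sonar works internally.

```python
import time
import urllib.error
import urllib.request

URL = "https://www.example.com/"  # placeholder, not the OpenAI API
INTERVAL_S = 5                    # short interval for the demo; real checks might run every 30-60 s
CHECKS = 10

def status_of(url):
    """Return the HTTP status code; 0 means a network-level failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code           # 502 / 503 responses land here
    except urllib.error.URLError:
        return 0

outage_start = None
for i in range(CHECKS):
    code = status_of(URL)
    failing = code >= 500 or code == 0
    now = time.strftime("%H:%M:%S")
    if failing and outage_start is None:
        outage_start = now        # a micro-outage window opens
    elif not failing and outage_start is not None:
        print(f"micro-outage from {outage_start} until {now}")
        outage_start = None
    if i < CHECKS - 1:
        time.sleep(INTERVAL_S)

if outage_start is not None:
    print(f"micro-outage still open at end of run (started {outage_start})")
```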
February
ChatGPT
What Happened?
On February 25, 2024, at 23:29 EST, OpenAI’s ChatGPT API began experiencing intermittent failures. The primary issues were HTTP 502 Bad Gateway and HTTP 503 Service Unavailable errors when accessing the endpoint https://api.openai.com/v1/models. The outage was confirmed manually, and Catchpoint’s Internet Sonar dashboard identified the disruption across multiple regions, including North America, Latin America, Europe, the Middle East, Africa, and the Asia Pacific. The issues persisted into the next day, with 89 cities reporting errors during the outage.
Takeaways
As with many API-related outages, relying on real-time monitoring is essential to quickly mitigating user impact and ensuring service reliability across diverse geographies.
January
Microsoft Teams
What Happened?
On January 26, Microsoft Teams experienced a global service disruption affecting key functions like login, messaging, and calling. Initial reports indicated 503 Service Unavailable errors, with the issue captured by Autodesk synthetic tests. Microsoft later identified the root cause as networking issues impacting part of the Teams service. The failover process initially helped restore service for some regions, but the Americas continued to experience prolonged outages.
Takeaways
Failover processes can quickly resolve many service issues, but this outage showed the importance of ongoing optimization for full recovery across all regions. It also highlighted the value of monitoring from the user’s perspective. During the disruption, Teams appeared partially available, leading some users to believe the issue was on their end.
2023
December
Box
What Happened?
On December 15, from 6:00 AM to 9:11 AM Pacific Time, Box experienced a significant outage that affected key services, including the All Files tool, Box API, and user logins. The outage disrupted uploading and downloading features, leaving users unable to share files or access their accounts. Early detection through proactive Internet Performance Monitoring (IPM) helped Box mitigate the outage’s impact, with IPM triggering alerts as early as 04:37 AM PST, well before the outage became widespread.
Takeaways
Early detection and quick response are key to minimizing downtime, reducing financial losses, and protecting brand reputation. This incident emphasizes the value of a mature Internet Performance Monitoring strategy, setting the right thresholds to avoid false positives, and ensuring teams can quickly identify root causes to keep systems resilient.
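On the thresholds point, a common pattern is to require several failures within a sliding window before paging anyone, so a single blip does not wake the on-call engineer. Below is a minimal sketch of that idea; the 3-of-5 rule and the simulated results are made-up examples, not settings from this incident.

```python
from collections import deque

WINDOW = 5              # look at the last 5 checks
FAILURES_TO_ALERT = 3   # alert when 3 of them failed

recent = deque(maxlen=WINDOW)

def record_check(ok):
    """Record one check result; return True when an alert should fire."""
    recent.append(ok)
    failures = sum(1 for result in recent if not result)
    return failures >= FAILURES_TO_ALERT

# Simulated results: a single blip, then a real outage developing.
for i, ok in enumerate([True, False, True, True, False, False, False]):
    if record_check(ok):
        print(f"check {i}: ALERT - {FAILURES_TO_ALERT} of last {WINDOW} checks failed")
    else:
        print(f"check {i}: no alert")
```

The same idea extends to requiring failures from multiple vantage points before alerting, which filters out problems local to a single test location.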
Adobe
What Happened?
Starting at 8:00 AM EST on December 8 and lasting until 1:45 AM EST on December 9, Adobe’s Experience Cloud suffered a major outage, affecting multiple services like Data Collection, Data Processing, and Reporting Applications. The outage, which lasted nearly 18 hours, disrupted operations for Adobe’s extensive customer base, impacting businesses worldwide. Catchpoint's Internet Sonar was the first tool to detect the issue, identifying failures in Adobe Tag Manager and other services well before Adobe updated its status page.
Takeaways
Yet another reminder of the fragility of the Internet and another catch for Internet Sonar, which was essential for early detection and rapid response, helping to pinpoint the source of the problem and minimize downtime. The outage also highlights the importance of proactive monitoring and preparedness, as well as the potential financial and reputational costs of service disruptions.
November
October
September
Salesforce
What Happened?
On September 20, Salesforce experienced a major service disruption affecting multiple services, including Commerce Cloud, MuleSoft, Tableau, Marketing Cloud, and others. The outage lasted over four hours, preventing a subset of Salesforce’s customers from logging in or accessing critical services. The root cause was a policy change meant to enhance security, which unintentionally blocked access to essential resources, causing system failures. Catchpoint detected the issue at 9:15 AM EST, nearly an hour and a half before Salesforce officially acknowledged the problem at 10:51 AM EST.
Takeaways
Catchpoint’s IPM helped identify the issue well before Salesforce's team detected it, potentially saving valuable time and minimizing disruption. For businesses heavily reliant on cloud services, having an IPM strategy that prioritizes real-time data and rapid root-cause identification is crucial to maintaining internet resilience and avoiding costly downtime.
August
July
June
Microsoft Teams
What Happened?
On 28 June 2023, the web version of Microsoft Teams (https://teams.microsoft.com) became inaccessible globally. Users encountered the message "Operation failed with unexpected error" when attempting to access Teams via any browser. Catchpoint detected the issue at 6:51 AM Eastern, with internal tests showing HTTP 500 response errors. The issue was confirmed manually, though no updates were available on Microsoft’s official status page at the time.
May
April
March
February
January
Microsoft
What Happened?
On January 25, 2023, at 07:08 UTC/02:08 EST, Microsoft experienced a global outage that disrupted multiple services, including Microsoft 365 (Teams, Outlook, SharePoint Online), Azure, and games like HALO. The outage lasted around five hours. The root cause was traced to a wide-area networking (WAN) routing change. A single router IP address update led to packet forwarding issues across Microsoft's entire WAN, causing widespread disruptions. Microsoft rolled back the change, but the incident caused significant impact globally, especially for users in regions where the outage occurred during working hours.
Takeaways
A single routing change can ripple across an entire global WAN. Validating network changes rigorously before rollout, and monitoring network paths and BGP from outside your own infrastructure, shortens the time it takes to recognize the blast radius and roll back. For businesses heavily reliant on cloud services, an IPM strategy that prioritizes real-time data and rapid root-cause identification remains crucial to maintaining Internet resilience and avoiding costly downtime.
2022
December
Amazon
What Happened?
Starting at 12:51 ET on December 5, 2022, Catchpoint detected intermittent failures related to Amazon’s Search function. The intermittent failures persisted until December 7, amounting to roughly 22 hours of impact and affecting around 20% of users worldwide on both desktop and mobile platforms. Impacted users were unable to search for products, receiving an error message instead. Catchpoint identified that the root cause was an HTTP 503 error returned by Amazon CloudFront, affecting search functionality during the outage.
Takeaways
Partial outages, even when affecting a small portion of users, can still have serious consequences. Relying solely on traditional monitoring methods like logs and traces can lead to delayed detection, especially with intermittent issues. Being able to pinpoint the specific layer of the Internet Stack responsible for the issue helps engineers troubleshoot and resolve problems faster.
November
October
September
August
July
Rogers Communications
What Happened?
On July 8, 2022, Rogers Communications experienced a major outage that impacted most of Canada for nearly two days, disrupting internet and mobile services. A code update error took down the core network at around 4 AM, affecting both wired and wireless services. The outage disrupted essential services, including 911 calls, businesses, government services, and payment systems like Interac. Some services were restored after 15 hours, but others remained down for up to four days. The incident impacted millions of Canadians, sparking widespread frustration and highlighting the risks of relying heavily on a single telecom provider.
Takeaways
Test thoroughly before deploying network changes and ensure redundancies are in place and effective. Rogers thought they had redundancies, but they failed to work when needed most. Fast detection and resolution are critical. Rogers' slow response led to significant financial losses, reputational damage, and a potential class-action lawsuit.
June
May
April
March
February
Slack
What Happened?
On February 22, 2022, at 9:09 AM ET, Slack began experiencing issues, primarily impacting users' ability to fetch conversations and messages. While users could log in, key functionalities were down, leading to widespread disruption. The issue persisted intermittently, affecting productivity for many businesses relying on Slack for communication. Catchpoint tests confirmed errors at the API level, pointing to issues with Slack’s backend services, not the network.
Takeaways
Early detection and real-time visibility into service performance is critical. Being able to quickly diagnose an issue and notify users before the flood of support tickets arrives can significantly reduce downtime and frustration. Monitoring from the user’s perspective is crucial, as it helps detect problems faster and more accurately than waiting for official service updates.
January
2021
December
Amazon Web Services (AWS)
What Happened?
In December 2021, AWS experienced three significant outages:
1. December 7, 2021: An extended outage originating in the US-EAST-1 region disrupted major services such as Amazon, Disney+, Alexa, and Venmo, as well as critical apps used by Amazon’s warehouse and delivery workers during the busy holiday season. The root cause was a network device impairment.
2. December 15, 2021: Lasting about an hour, this outage in the US-WEST-2 and US-WEST-1 regions impacted services like DoorDash, PlayStation Network, and Zoom. The issue was caused by network congestion between parts of the AWS Backbone and external Internet Service Providers (ISPs).
3. December 22, 2021: A power outage in the US-EAST-1 region caused brief disruptions for services such as Slack, Udemy, and Twilio. While the initial outage was short, some services experienced lingering effects for up to 17 hours.
Takeaways
Don’t depend on monitoring within the same environment. Many companies hosting their observability tools on AWS faced monitoring issues during the outages. It’s essential to have failover systems hosted outside the environment being monitored to ensure visibility during incidents.
November
Google Cloud
What Happened?
On November 16, 2021, Google Cloud suffered an outage beginning at 12:39 PM ET, which knocked several major websites offline, including Home Depot, Spotify, and Etsy. Users encountered a Google 404 error page. This outage affected a variety of Google Cloud services such as Google Cloud Networking, Cloud Functions, App Engine, and Firebase. Google’s root cause analysis pointed to a latent bug in a network configuration service, triggered during a routine leader election change. While services were partially restored by 1:10 PM ET, the full recovery took almost two hours.
Takeaways
Monitor your services from outside your infrastructure to stay ahead of problems before customers notice. Tracking your service level agreements (SLAs) and mean time to recovery (MTTR) allows you to measure the efficiency of your teams and providers in resolving incidents.
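Measuring MTTR itself is simple arithmetic: the mean of (resolved minus detected) across incidents. The sketch below computes it from hand-entered timestamps; the incident rows are illustrative examples rather than Google’s actual data.

```python
from datetime import datetime, timedelta

# Illustrative incident log: (detected, resolved) pairs in UTC.
INCIDENTS = [
    ("2021-11-16 17:39", "2021-11-16 19:30"),
    ("2021-11-20 02:10", "2021-11-20 02:55"),
    ("2021-11-28 11:05", "2021-11-28 11:22"),
]

FMT = "%Y-%m-%d %H:%M"

durations = [
    datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    for start, end in INCIDENTS
]

mttr = sum(durations, timedelta()) / len(durations)
print(f"incidents: {len(durations)}")
print(f"MTTR: {mttr}")  # 0:57:40 for the sample rows above
```

Tracking the same numbers per provider makes SLA conversations concrete: you can show exactly how long each vendor took to detect and resolve an issue, measured from your users’ perspective rather than from the vendor’s status page.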