2018 outages and lessons learned

Published January 14, 2019

Kameerath Kareem

Application performance can only be maintained with constant monitoring and optimization efforts. Even with the growing emphasis on implementing a comprehensive monitoring strategy, performance issues and outages are common in the digital world. While some are a result of ineffective optimization, others are intentional and malicious, like a cyber-attack.

Catchpoint aims to make monitoring more effective through the prevention and speedy mitigation of performance issues. We help our customers deliver a seamless end-user experience, and as part of this effort, we identified several major performance issues last year. This blog discusses these outages.

Google Ads Impact Customer Experience

Ads and other third-party services are the usual suspects behind performance degradation. The Google Ads service was disrupted on 13th March 2018, slowing down hundreds of websites. Catchpoint noticed increased load times and availability drops across several domains.

By analyzing the incoming data, we quickly identified what was going wrong. The domains “doubleclick.net” and “google.adservices.com” experienced high latency, which delayed the document complete time of websites that had the Google Ad services tag configured to initiate at the start of the page load process.
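
To make this kind of attribution concrete, here is a minimal sketch that groups a page's request timings by host and ranks the hosts by total response time. The record layout, the `slowest_hosts` helper, and the sample numbers are hypothetical and only illustrate the idea; they are not Catchpoint's data or API.

```python
from collections import defaultdict
from urllib.parse import urlparse


def slowest_hosts(requests, top=3):
    """Sum response time per host and return the slowest hosts first.

    `requests` is a list of (url, response_ms) tuples (an assumed layout).
    """
    totals = defaultdict(float)
    for url, response_ms in requests:
        host = urlparse(url).hostname or "unknown"
        totals[host] += response_ms
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Made-up timings, for illustration only.
sample = [
    ("https://www.example.com/", 180.0),
    ("https://www.example.com/app.js", 90.0),
    ("https://ad.doubleclick.net/tag.js", 4200.0),
    ("https://cdn.example.net/style.css", 60.0),
]

for host, total_ms in slowest_hosts(sample):
    print(f"{host}: {total_ms:.0f} ms")
```

A host that dominates this ranking and is requested early in the page load is a likely candidate for delaying document complete.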

Google had the following notice on their service status page confirming that it was, in fact, an issue with their Ad services, specifically Doubleclick.

Read the detailed analysis of the issue here and here.

Slow DNS Performance Impacts Online Payment Gateway

A fast, seamless end-user experience relies on multiple factors such as webpage structure, third-party integrations, CDNs, DNS providers, and even ISPs. Latency introduced by any of these components can degrade application performance.

Catchpoint identified an interesting micro-outage in April 2018. The incident was an example of the impact ISPs have on DNS resolution time. In this case, DNS queries were timing out when using a popular carrier, resulting in high DNS resolution times. The issue was limited to an Indian ISP and impacted the response times of an online payment gateway.

The chart above plots DNS resolution time for two different domains from multiple locations. Both domains were impacted.

We looked at the response times across major ISPs to identify the root cause of the issue.

The data in the chart above indicates DNS resolution failures across the Airtel and Reliance ISPs. See our detailed blog post about this micro-outage and the different Catchpoint test monitors that helped identify the root cause.
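
The per-ISP comparison can be sketched as a simple grouping step: count how many DNS lookups from each ISP hit the timeout. The `timeout_rate_by_isp` helper, the 2000 ms timeout, the placeholder ISP names, and the sample values below are all assumptions made for illustration, not the measurements from this incident.

```python
from collections import defaultdict


def timeout_rate_by_isp(samples, timeout_ms=2000):
    """Return {isp: fraction of DNS lookups at or above timeout_ms}.

    `samples` is a list of (isp, dns_ms) tuples (an assumed layout).
    """
    counts = defaultdict(lambda: [0, 0])  # isp -> [timeouts, total]
    for isp, dns_ms in samples:
        counts[isp][1] += 1
        if dns_ms >= timeout_ms:
            counts[isp][0] += 1
    return {isp: timeouts / total for isp, (timeouts, total) in counts.items()}


# Made-up measurements; "isp_a" and "isp_b" are placeholder names.
samples = [
    ("isp_a", 35), ("isp_a", 2000), ("isp_a", 2000), ("isp_a", 40),
    ("isp_b", 30), ("isp_b", 28),
]

print(timeout_rate_by_isp(samples))
```

If one ISP shows a high timeout rate while others stay near zero for the same domains, the problem is more likely in that ISP's path to the DNS servers than in the DNS provider itself.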

Public DNS Resolver Outage

As public DNS resolvers have gained popularity over the last few years, it has become necessary to track their performance. Google’s public DNS resolvers experienced downtime on 30th May. The DNS outage left many sites inaccessible.

Data from Catchpoint confirmed that querying Google’s public DNS IP address 8.8.8.8 resulted in timeouts, as shown in the scatterplot above. The issue happened between 12:30 and 13:50 ET. We did a complete analysis of the outage here.
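
For readers who want to try a similar check, here is a rough sketch of a DNS probe that uses only the Python standard library. It builds a minimal A-record query in the standard DNS wire format and reports either the response time or a timeout. The `build_query` and `probe` names and the two-second timeout are our own assumptions; the sketch does not validate the response and is not how Catchpoint's monitors are implemented.

```python
import socket
import struct
import time


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in hostname.rstrip(".").split(".")
    )
    # Question: name, terminating zero byte, QTYPE=A (1), QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def probe(resolver_ip, hostname, timeout_s=2.0):
    """Return the response time in ms, or None if the query timed out."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        started = time.monotonic()
        try:
            sock.sendto(build_query(hostname), (resolver_ip, 53))
            sock.recvfrom(512)  # any reply counts; content is not checked
        except socket.timeout:
            return None
        return (time.monotonic() - started) * 1000.0
```

Calling `probe("8.8.8.8", "example.com")` at regular intervals and plotting the results, with `None` marked as a timeout, gives a simplified version of the scatterplot described above.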

BGP Leak Impacts Google

Google services took a hit on November 12th, when an ISP from Nigeria announced an incorrect routing path. The issue lasted for less than an hour but impacted the performance of several websites.

Sites that used the Google Ajax library slowed considerably.

We discussed how the incident unfolded in our blog here.

CloudFront Network Outage

A major peering issue between AT&T and CloudFront impacted websites that relied on the CDN provider. Catchpoint detected an unusual spike in response times on sites using CloudFront.

Traceroute data helped identify the root cause. The issue was specific to requests going through the Telia network. You can read the complete analysis here.
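
One way to narrow down a culprit from traceroute data is to look for networks that appear on every slow path and on none of the fast ones. The sketch below assumes each trace has already been reduced to a response time and a list of network names; the `suspect_networks` helper, the 1000 ms threshold, and the placeholder names are hypothetical.

```python
def suspect_networks(traces, slow_ms=1000):
    """Return networks seen on every slow path and on no fast path.

    `traces` is a list of (response_ms, [network names along the path]).
    """
    slow = [set(path) for ms, path in traces if ms >= slow_ms]
    fast = [set(path) for ms, path in traces if ms < slow_ms]
    if not slow:
        return set()
    common_to_slow = set.intersection(*slow)
    seen_on_fast = set().union(*fast)
    return common_to_slow - seen_on_fast


# Made-up traces with placeholder network names.
traces = [
    (2400, ["isp_a", "transit_x", "cdn_edge"]),
    (2100, ["isp_b", "transit_x", "cdn_edge"]),
    (180, ["isp_c", "transit_y", "cdn_edge"]),
]

print(suspect_networks(traces))
```

In this made-up data the CDN edge appears on both slow and fast paths, so it is ruled out, and only the shared transit network remains as a suspect.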

Black Friday Outages

Every year we monitor the performance of big retailers during the Holiday sale season. But this year, Catchpoint introduced the Black Friday Assurance program to help customers prepare for the sale and to ensure there were no major performance bottlenecks during the event. Third-party tags continued to cause performance degradation and were one of the main reasons for Black Friday outages.

The chart above illustrates the impact a third-party tag (Coremetrics) had on the response times of some major retailers during Black Friday. We also did a full round-up of the issues we detected; read all about it here and here.

CenturyLink Outage

The year ended with another micro-outage that was not widely reported. On December 27th, CenturyLink customers were unable to access the internet as the ISP was hit with a major outage. The impact was not limited to CenturyLink customers; it also affected traffic routed through or peered with the CenturyLink network.

The heat map above indicates the impacted regions. CenturyLink acknowledged the issue in a tweet and informed customers that their customer service portal was also experiencing issues. Read our analysis of the issue here.

Lessons Learned

Application performance relies on multiple factors, and troubleshooting performance issues can become complicated if you don’t have the right data. There are two important steps to implement when trying to deliver optimal application performance.

First, the application must follow best practices:

During production, focus on page structure, third-party tags, and integrations.
During deployment, focus on the infrastructure used and the third-party services like CDN and DNS providers.

Second, have the right monitoring strategy in place:

Set up the right test types that focus on important performance metrics.
Cut down false positives by configuring only relevant alert types.
Monitor every part of the application delivery chain.
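
As one example of cutting down false positives, a common approach is to alert only when several monitoring locations each report consecutive failures, so that a single failed run from one location does not page anyone. The rule below is a generic sketch with hypothetical names and thresholds; it is not a description of Catchpoint's alerting configuration.

```python
def should_alert(results, min_locations=2, min_consecutive=2):
    """Alert only if enough locations each show a run of recent failures.

    `results` maps location -> list of booleans (True = test passed),
    ordered oldest first.
    """
    failing = 0
    for runs in results.values():
        recent = runs[-min_consecutive:]
        if len(recent) == min_consecutive and not any(recent):
            failing += 1
    return failing >= min_locations


# Made-up results; location names are placeholders.
one_blip = {"nyc": [True, False, False], "lon": [True, True, False]}
real_issue = {"nyc": [False, False], "lon": [True, False, False], "sgp": [True, True]}

print(should_alert(one_blip), should_alert(real_issue))
```

Raising `min_locations` or `min_consecutive` trades slower detection for fewer false alarms, so the thresholds should be tuned against how quickly an outage needs to be caught.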

If you handle these two aspects of your application, you will significantly improve your MTTR (Mean Time to Resolve), and detecting and troubleshooting issues becomes easier.