The performance cost of micro-outages
In the digital world, application performance degradation and downtime are not rare occurrences. The impact such incidents have on end-user experience varies: the application may become slow and frustrating to use, or it could crash and block user transactions entirely. The severity of the issue and the MTTR (Mean Time to Resolve) directly affect end-user experience.
When the titans of information technology encounter performance challenges, the repercussions touch almost every online application. Such outages are widely reported, followed by a slew of write-ups analyzing the issue: what caused it, what fixed it, and what could prevent it. Performance analysts are on the lookout for such big outages, but we often overlook blips in performance. These blips, or “micro-outages,” are intermittent performance issues that usually go unnoticed.
In this blog, we discuss outages that usually don’t have a global impact but still hurt businesses and SLAs.
What is a micro-outage?
Micro-outages typically last less than an hour, and although they introduce significant latency into the application delivery chain, each one seems to have little effect on end-user experience in isolation. Such outages are caused by peering issues, BGP errors, hardware failures, or network capacity problems, and outages that fall into any of these categories are easy to troubleshoot and resolve.
Because these are temporary glitches, their impact on end-user experience and business is not given due importance. Catchpoint helped identify several micro-outage incidents, and the performance data we captured shows the real impact they have.
Incident 1: Google BGP routing error
Google faced connectivity issues on November 12th. The issue, which started around 16:30 EST, impacted multiple Google services including its cloud services, APIs, and load balancers.
The charts above illustrate the sudden performance degradation across multiple Google services. Google identified and resolved the issue within 30 minutes.
The incident was resolved quickly, but the performance of hundreds of websites took a blow, which inevitably impacted end-user experience and businesses. You can read the detailed analysis of the incident here.
Incident 2: CloudFront network issue
CloudFront faced peering issues on November 14th. The East Coast issue began at 11:30 CET. Content served by CloudFront was inaccessible for AT&T and AT&T Wireless users.
This was a routing issue that affected AT&T traffic routed through Telia: requests suffered 100% packet loss once they hit the Telia network. Read the detailed issue analysis here.
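You can verify this kind of path-specific failure from an affected vantage point with a simple reachability probe. The sketch below is a rough stand-in that measures reachability with TCP connection attempts from the Python standard library (true ICMP packet-loss measurement needs raw sockets or the system ping); the target host, port, and probe counts are illustrative assumptions, not details from the incident.

```python
import socket
import time

def connect_success_rate(host: str, port: int = 443, attempts: int = 10,
                         timeout: float = 2.0) -> float:
    """Return the fraction of TCP connection attempts that succeed.

    A sustained 0.0 from one vantage point, while probes from other
    networks succeed, is the signature of the path-specific loss
    described above.
    """
    successes = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                successes += 1
        except OSError:
            pass  # timeout or connection refused counts as a failed probe
        time.sleep(0.5)  # pace the probes
    return successes / attempts

if __name__ == "__main__":
    # example.com is a placeholder target, not the affected distribution.
    rate = connect_success_rate("example.com")
    print(f"connect success rate: {rate:.0%}")
```

Running the same probe from multiple networks is what separates a path-specific issue like this one from a global outage.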
This issue impacted only a certain set of network users. Nevertheless, it degraded end-user experience, which always hurts a business.
Incident 3: Public DNS resolver issue
On May 30th, at around 12:54 ET, Google’s public DNS resolvers suddenly became unresponsive, and DNS resolution requests timed out. The chart below plots the failed requests during the outage.
Requests attempting to reach the destination IP address 8.8.8.8 ended in packet loss. The outage was not global, affecting only the USA, Brazil, India, and Singapore.
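Resolver timeouts like these are straightforward to reproduce and monitor. Below is a minimal sketch using the third-party dnspython library (`pip install dnspython`, version 2.x for the `resolve()` call); the query name and probe counts are illustrative assumptions.

```python
import dns.exception
import dns.resolver  # third-party: pip install dnspython

def probe_resolver(resolver_ip: str, qname: str = "example.com",
                   tries: int = 5, timeout: float = 2.0) -> int:
    """Send `tries` A-record queries to one resolver; return how many time out."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [resolver_ip]
    timeouts = 0
    for _ in range(tries):
        try:
            resolver.resolve(qname, "A", lifetime=timeout)
        except dns.exception.Timeout:
            timeouts += 1
    return timeouts

if __name__ == "__main__":
    failed = probe_resolver("8.8.8.8")
    print(f"{failed} of 5 queries to 8.8.8.8 timed out")
```

Scheduling a probe like this from several geographies is how you distinguish a regional resolver outage, like this one, from a local network problem.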
Is the performance impact significant?
We discussed three different instances here:
- BGP routing error
- Network peering issue
- DNS resolver issue
These outages provide interesting networking insights. None of the issues were in the spotlight, but each had a significant impact.
Another important consideration is SLA impact. These outages go unnoticed most of the time, and so do any SLA breaches they cause. Monitoring an application for major outages is different from monitoring intermittent performance anomalies. Setting up the right kind of alerts, as sketched below, will help capture micro-outages, and the data can validate SLA breaches.
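One way to catch blips that a daily-average alert would smooth over is to compare each sample against a short rolling baseline and require a few consecutive breaches before firing. This is a minimal sketch of that idea, not Catchpoint’s alerting logic; the window size, threshold factor, and run length are assumptions to tune against your own data.

```python
from collections import deque

def detect_micro_outages(samples, window=30, factor=3.0, min_run=3):
    """Flag short-lived anomalies in a response-time series.

    Alerts when `min_run` consecutive samples exceed `factor` times the
    rolling mean of the previous `window` normal samples. Sensitive
    enough to catch a 10-minute blip, but not a single noisy point.
    Returns the start indices of detected anomaly runs.
    """
    baseline = deque(maxlen=window)
    starts, run_start = [], None
    for i, value in enumerate(samples):
        warmed_up = len(baseline) == window
        mean = sum(baseline) / len(baseline) if baseline else 0.0
        if warmed_up and value > factor * mean:
            if run_start is None:
                run_start = i
            if i - run_start + 1 == min_run:
                starts.append(run_start)
        else:
            run_start = None
            baseline.append(value)  # learn only from normal samples

    return starts

# One-minute samples: steady ~200 ms with an 8-minute spike to ~1500 ms
series = [200] * 60 + [1500] * 8 + [200] * 60
print(detect_micro_outages(series))  # -> [60]
```

Requiring a short run of consecutive breaches filters out single noisy samples while still firing within minutes of a genuine micro-outage.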
Black Friday and holiday traffic issues
During the Black Friday Assurance program, Catchpoint identified multiple instances of micro-outages that had an immediate and direct impact on business.
Black Friday is not only about sales and discounts; it’s also the time of year that pushes the performance limits of retail websites. We hear a lot about website outages. Barring the inevitable website crashes, what usually go unnoticed are the intermittent glitches specific to certain regions, networks, or user bases.
For example, a customer’s website had an issue with the “add to cart” function during Black Friday. It lasted less than 30 minutes, but it meant users were unable to place orders on the busiest day of the year in e-commerce: a giant missed opportunity for the business.
We identified another micro-outage during Boxing Day. The site’s performance dropped significantly during the sale: page response time spiked, causing a proportional dip in revenue, as shown in the chart below.
When users are hunting for the best deals and ready to spend money, such glitches translate to big losses.
In the examples above, the chances of an SLA breach are high because the application relies on multiple third-party services, including the CDN provider, API integrations, and other third-party tags. It’s only through proactive monitoring that you can ensure service vendors uphold their SLAs.
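The arithmetic behind this is unforgiving. The sketch below works through it, assuming an illustrative 99.9% monthly availability target over a 30-day month; your actual SLA terms will differ.

```python
def availability(total_minutes: int, downtime_minutes: float) -> float:
    """Availability as a percentage over a measurement period."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

# A 30-day month has 43,200 minutes; a 99.9% SLA tolerates only 43.2
# of them as downtime. Two 30-minute micro-outages already breach it.
MONTH_MINUTES = 30 * 24 * 60
for outages in (1, 2):
    pct = availability(MONTH_MINUTES, 30 * outages)
    status = "OK" if pct >= 99.9 else "BREACH"
    print(f"{outages} x 30-min outage -> {pct:.3f}% (vs 99.9% SLA: {status})")
```

In other words, a couple of “blips” no one escalated can consume an entire month’s error budget, which is exactly why the monitoring data matters when validating a vendor’s SLA.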
Conclusion
Performance monitoring is a constant process, and businesses must review every anomaly. Intermittent micro-outages could hint at systemic issues in the application that IT teams need to address. SLA breaches are not exclusive to complete outages; micro-outages can cause them too. The data proves that every outage has repercussions, whether it is partial or complete, lasts ten minutes or ten hours, and regardless of the locations it impacts.