Four pillars of DEM: reachability, availability, performance, & reliability
Monitoring has evolved greatly as a practice over the last few decades, as advancements in digital technology have changed the way the world interacts with websites, services, and applications.
In particular, things have drastically changed during the last decade with the growth and adoption of the following technologies:
- Cloud Computing
- Edge Computing (DNS, CDN, Load Balancing, SD-WAN, WAF, etc.)
- Client-side and server-side rendering
- APIs that enable deep integrations with third-party services
- Continuous Integration, Delivery, and Deployment
All of these advancements have happened with the end user as the focal point, and it is hard to find an enterprise today that has not adopted at least several of them.
In this blog, we will discuss how monitoring methodologies have changed from an end user’s point of view.
There are four major pillars of monitoring:
- Reachability
- Availability
- Performance
- Reliability
Monitoring Reachability
Reachability simply means, “Are we able to reach point B from point A?” In the monitoring context it means, “Are end users able to reach the application over the network? If not, what is standing in their way?”
In the traditional IT world, monitoring reachability was simple. Since everything was hosted under one roof, enterprises just had to worry about end users being able to reach their datacenters.
However, a number of developments have made things exponentially more complex:
1. Enterprises now use one or multiple Managed DNS providers.
2. Enterprises now use cloud-based Global Server Load Balancers.
3. Enterprises now use cloud-based Firewall Services.
4. Enterprises now use one or more Content Delivery Networks.
5. Enterprises now use one or more Cloud Computing Providers, or run hybrid deployments.
6. Finally, enterprises now have deep integrations with third parties whose own digital architectures look like points 1 through 5.
No longer concerned with just a datacenter or two, companies now have to ensure that each of these services is reachable.
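To make that concrete, here is a minimal sketch of what a reachability probe boils down to: resolve each hostname, then try to open a TCP connection. The hostnames are hypothetical, and a real monitoring platform would also run traceroutes and BGP checks from many vantage points.

```python
import socket

# Hypothetical endpoints across the delivery chain: the hostnames
# end users must be able to reach for the application to work.
ENDPOINTS = [
    ("www.example-shop.com", 443),     # CDN edge
    ("origin.example-shop.com", 443),  # origin behind the load balancer
]

def check_reachability(host: str, port: int, timeout: float = 5.0) -> str:
    """Resolve the hostname, then attempt a TCP connection."""
    try:
        ip = socket.gethostbyname(host)  # DNS reachability
    except socket.gaierror as exc:
        return f"{host}: DNS resolution failed ({exc})"
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return f"{host} ({ip}): reachable on port {port}"
    except OSError as exc:  # connection refused, timeout, routing problem
        return f"{host} ({ip}): unreachable ({exc})"

for host, port in ENDPOINTS:
    print(check_reachability(host, port))
```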
Example 1
Let’s take the recent Zoom outage. Zoom’s DNS is managed by AWS, and they leverage CloudFront as their CDN.
Zoom was down because AWS was not reachable.
Example 2
Below is another example where end users were unable to reach an ecommerce website. This site is hosted on Salesforce Demandware, which leverages Cloudflare CDN.
The website was experiencing DNS and connection timeout errors due to an issue with Cloudflare. The CDN provider was not reachable because of a BGP route leak involving Verizon, which we have discussed in a detailed blog and webinar.
This example illustrates the complex enterprise architecture discussed in the six points highlighted above.
Reachability issues are mostly caused by ISP peering problems, BGP route leaks, or BGP flapping: in short, network instability. In the examples above, it took us under a minute to get to the root cause of the issue.
Monitoring Availability
In the monitoring context, availability is all about, “Is the end user able to access the application?” The reachability examples we discussed are often treated as availability issues. While there is certainly overlap between the two (availability can be impacted by reachability issues), there are also scenarios where end users can reach the endpoint, yet the service is down.
Enterprises have traditionally focused on ensuring that their applications are up and responding to requests. If it’s a web application, most monitoring tools look for an HTTP 200 status with some form of basic validation.
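In a minimal Python sketch (the URL and validation string are hypothetical), that traditional check amounts to little more than this:

```python
import requests

def naive_availability_check(url: str, must_contain: str) -> bool:
    """The traditional check: HTTP 200 plus simple content validation."""
    resp = requests.get(url, timeout=10)
    return resp.status_code == 200 and must_contain in resp.text

# Hypothetical: the page is "up" if it returns 200 and contains its title.
print(naive_availability_check("https://www.example-shop.com", "Example Shop"))
```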
This works to a certain extent, but is that sufficient? The answer is definitely NO! The examples below explain why.
Example 1
Let’s take a look at the Facebook outage that happened on July 2. Monitoring tools that only watched www.facebook.com did not detect any issues. In fact, some monitoring companies tweeted that Facebook was down and that there were no reachability issues, but they could not pinpoint where the problem was.
But here is what the end users saw:
Facebook without images is like a rock concert without speakers. In other words, it’s technically up, but for all practical purposes, it’s down.
In the screenshot below, we see that some of the Facebook hosts that serve images were down. These hosts are mapped to Facebook’s CDN and the requests to these hosts were returning HTTP 500 errors.
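A base-page check alone misses this class of failure. One way to catch it, sketched here with the requests and BeautifulSoup (bs4) libraries against a hypothetical URL, is to validate every image the page references rather than just the page itself:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def check_image_hosts(page_url: str) -> None:
    """Fetch the base page, then verify every image it references."""
    html = requests.get(page_url, timeout=10).text
    for img in BeautifulSoup(html, "html.parser").find_all("img", src=True):
        img_url = urljoin(page_url, img["src"])
        status = requests.get(img_url, timeout=10).status_code
        if status >= 400:  # e.g. the HTTP 500s seen in this outage
            print(f"BROKEN {status}: {img_url}")

check_image_hosts("https://www.example-shop.com")  # hypothetical URL
```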
Example 2
If you run an ecommerce site, you are available only if end users are able to complete a transaction on the website. If they cannot, as in the scenario below, then for the end user the website is as good as DOWN.
In this case, we noticed transaction failures, and when we checked the screenshots that our tests captured, the “buy button” was missing.
Availability monitoring is not just about up or down anymore. With the advent of single-page applications, microservices, AMP, and the like, it is not enough to ensure that the base page or home page is loading. Enterprises have to monitor every single critical user journey in their application.
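As a rough illustration, here is what a scripted journey can look like using a browser automation tool such as Playwright; the URL and selectors are hypothetical and would need to match your own application:

```python
from playwright.sync_api import sync_playwright

# A minimal scripted user journey: search -> product -> add to cart.
# The URL and selectors below are hypothetical placeholders.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.example-shop.com")
    page.fill("#search", "running shoes")
    page.press("#search", "Enter")
    page.click(".product-card:first-child")
    # The journey fails loudly if the buy button never renders,
    # catching exactly the "missing buy button" failure above.
    page.wait_for_selector("#buy-button", timeout=10_000)
    page.click("#buy-button")
    browser.close()
```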
Monitoring Performance
We have all heard the Amazon story multiple times; performance has a direct impact on revenue and user experience. Here is a quick refresher: Amazon famously found that every 100 milliseconds of added latency cost it roughly 1% in sales.
Monitoring performance is important because a slow application significantly damages the brand: end users have no patience for slow applications. A slow application can be even more frustrating than one that is unavailable. Today, bad performance is the new downtime.
Performance monitoring has two key components:
- Establishing a baseline and alerting on any breaches
- Continuous improvement
Establishing a baseline is the first step because it helps you understand current application performance. This enables you to set thresholds and detect when performance deviates from the norm.
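A minimal version of such a breach detector, assuming hypothetical page-load samples, can be as simple as flagging any measurement that falls more than a few standard deviations from the historical mean:

```python
from statistics import mean, stdev

def breaches_baseline(history: list[float], latest: float, k: float = 3.0) -> bool:
    """Alert when the latest sample deviates more than k standard
    deviations from the historical baseline."""
    baseline, spread = mean(history), stdev(history)
    return abs(latest - baseline) > k * spread

# Hypothetical page-load times (ms) from previous test runs:
history = [812, 798, 845, 820, 805, 831, 799, 818]
print(breaches_baseline(history, 1450))  # True: well outside the norm
```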
In the image below, we see a significant increase in Wait Time, which in turn slowed down the page rendering process (represented by Render Start and Speed Index). The root cause of such an increase can be any of several factors, but with a baseline established you can detect it as soon as possible.
There are 50+ metrics recorded every single time a page loads. Enterprises should have the flexibility to determine which metrics matter to them based on their application’s type and design, as well as its delivery chain.
Example 1
In the example below, this company uses a CDN, while some of their non-static content is fetched from the origin servers. End users connect to the CDN, which in turn fetches the content from the origin and serves it back. Detecting problems here requires more than the standard performance metrics.
We can see that the Wait Time was impacted when the edge-to-origin leg slowed down. This level of visibility helps you pinpoint the root cause within seconds.
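One lightweight way to approximate this distinction, sketched below, is to pair a rough time-to-first-byte measurement with the CDN’s cache-status response header. The header names vary by vendor (the ones listed are common examples, not a guarantee for your provider) and the URL is hypothetical:

```python
import requests

# Common CDN cache-status headers; check your vendor's documentation.
CACHE_HEADERS = ("x-cache", "cf-cache-status", "x-cache-status")

def edge_or_origin(url: str) -> None:
    resp = requests.get(url, timeout=10)
    ttfb = resp.elapsed.total_seconds() * 1000  # rough wait-time proxy
    status = next(
        (resp.headers[h] for h in CACHE_HEADERS if h in resp.headers),
        "unknown",
    )
    print(f"{url}: {ttfb:.0f} ms to first byte, cache status: {status}")
    # A cache MISS combined with a high wait time points at the
    # edge-to-origin leg rather than the edge itself.

edge_or_origin("https://www.example-shop.com")  # hypothetical URL
```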
Example 2
This customer noticed a significant increase in the page load time:
The root cause was a third-party font loaded on the website.
Example 3
Finally, below is an example of slow performance impacting revenue. The faster the detection, the lower the impact.
Continuous improvement has become ever more important for businesses to thrive and gain a competitive edge.
Benchmarking your performance against competitors, or even against role models (Apple, Amazon…), plays a major role in helping your business grow. An apples-to-apples comparison of KPIs, like the simple one sketched after this list, will help you identify gaps in application performance, and also helps to:
- Understand best practices in the industry
- Evaluate processes that either need to be implemented or need optimization to improve application performance and efficiency
- Constantly monitor the performance impact due to changes in the application and compare it to competitors
- Measure the progress the business has made and compare it to competitors
- Get a 360-degree performance report of your business, evaluating application performance and business metrics side-by-side
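As a toy illustration of such a benchmark, the sketch below averages a rough time-to-first-byte over a few runs for your site and a competitor’s. The URLs are hypothetical, and a real benchmark would compare many more KPIs from controlled vantage points:

```python
import requests

# Hypothetical sites to benchmark: your own plus a competitor or role model.
SITES = {
    "us":         "https://www.example-shop.com",
    "competitor": "https://www.example-rival.com",
}

def time_to_first_byte_ms(url: str, runs: int = 5) -> float:
    """Average response latency over several runs for a fair comparison."""
    samples = [
        requests.get(url, timeout=10).elapsed.total_seconds() * 1000
        for _ in range(runs)
    ]
    return sum(samples) / len(samples)

for name, url in SITES.items():
    print(f"{name:>12}: {time_to_first_byte_ms(url):.0f} ms")
```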
Monitoring Reliability
Reliability is delivering consistent application availability, performance, and reachability. This is the trickiest part of the monitoring methodology, because you need a monitoring tool that does the job of a data scientist. Data scientists in the performance monitoring world have the difficult job of:
- Understanding and defining business problems
- Data acquisition: determining how the right data can be collected to help solve business problems
- Data preparation: cleaning, sorting, transforming, and modifying data based on rules
- Exploratory data analysis: defining and refining the selection of different variables
- Data visualization: powerful reports and dashboards
- Historical data storage: comparing performance over long periods of time
Catchpoint has made all this easy so that enterprise performance teams can slice and dice the collected data in more than a hundred ways to get the visibility they need. Let’s look at some real examples:
Example 1
Volatile performance was observed for a website in Canada. The root cause was a poorly performing edge server:
Example 2
This customer noticed intermittent DNS failures. We were able to look at the DNS trace and pinpoint the root cause of the failures. The customer worked with their CDN vendor and got the issue resolved.
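For illustration, a basic DNS check along these lines, using the dnspython library against a hypothetical hostname, might look like this:

```python
import dns.exception
import dns.resolver

def check_dns(name: str) -> None:
    """Query the configured resolver and report failures, a much
    simpler cousin of the full trace that exposed the issue above."""
    resolver = dns.resolver.Resolver()
    try:
        answer = resolver.resolve(name, "A")
        print(f"{name}: {[r.address for r in answer]}")
    except dns.resolver.NXDOMAIN:
        print(f"{name}: NXDOMAIN (name does not exist)")
    except dns.exception.Timeout:
        print(f"{name}: resolver timed out")

check_dns("www.example-shop.com")  # hypothetical hostname
```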
Volatility is considered a risk not just in the stock market, but in application performance as well. Visibility into each layer of your digital delivery chain is critical for monitoring and maintaining the stability of each component that powers the application.
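One simple way to quantify that volatility, using hypothetical response-time samples, is the coefficient of variation: the standard deviation expressed as a share of the mean.

```python
from statistics import mean, stdev

def coefficient_of_variation(samples: list[float]) -> float:
    """Relative volatility: standard deviation as a share of the mean."""
    return stdev(samples) / mean(samples)

# Hypothetical response times (ms) from two edge locations:
stable   = [310, 305, 298, 312, 307, 301]
volatile = [290, 710, 305, 655, 298, 720]

for name, series in (("stable", stable), ("volatile", volatile)):
    print(f"{name}: CV = {coefficient_of_variation(series):.2f}")
```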
Monitoring from the end user’s perspective requires a holistic approach, as numerous independent, unrelated components power a single application. The internet is only going to get more complex, and enterprises will continue to lose control and visibility over their infrastructure. This calls for an increased emphasis on the four pillars of end user monitoring: Reachability, Availability, Performance, and Reliability, each as important as the others.