How to accurately monitor distributed systems?

Published
November 6, 2018

Even the largest and best-performing enterprises are not immune to the realities of the internet. Factors like latency, packet loss, and congestion can impact the content delivery chain at any time.

Latency is the time it takes a packet to travel from the sender to its destination. The distance between sender and receiver directly impacts latency.

The speed of light sets a hard lower bound on the transmission time of any network packet.

Under optimal circumstances, with a direct cable connection between the end user and the server, the round-trip time (RTT) between New York and San Francisco is about 42 ms. In reality, no such connections exist. A packet passes through numerous routers and paths on its way from one end to the other, and each hop adds its share of processing, queuing, and transmission delay, plus possible retransmissions. As a result, the actual RTT between New York and San Francisco could be in the range of 80 to 120 ms.
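The best-case figure can be sanity-checked with a few lines of Python. This is a minimal sketch, not a measurement: it assumes the signal travels along the great-circle path in optical fiber at roughly two-thirds of the speed of light in a vacuum, and it uses approximate city-center coordinates. The function names are illustrative.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_VELOCITY_FACTOR = 2 / 3      # assumption: light in glass fiber travels at ~2/3 of c
EARTH_RADIUS_KM = 6371.0           # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km, velocity_factor=FIBER_VELOCITY_FACTOR):
    """Lower bound on round-trip time: propagation delay only, there and back."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * velocity_factor)
    return 2 * one_way_s * 1000


# Approximate city-center coordinates (assumed for illustration).
NEW_YORK = (40.7128, -74.0060)
SAN_FRANCISCO = (37.7749, -122.4194)

distance = great_circle_km(*NEW_YORK, *SAN_FRANCISCO)
print(f"distance: {distance:.0f} km, minimum RTT: {min_rtt_ms(distance):.1f} ms")
```

Under these assumptions the lower bound lands at roughly 41 ms, in line with the best-case figure quoted above. Everything beyond that in a real measurement comes from routing detours and per-hop delays.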

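If a page has 100 objects and each request incurs a 120 ms RTT, the round trips alone add up to 12 seconds when the objects are fetched one after another. Even with parallel connections, the result is a poor user experience.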
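To see how quickly round trips add up, here is a rough back-of-the-envelope sketch. It assumes one round trip per object and ignores connection setup, bandwidth, and server time. The six-connection case reflects a common per-host limit in browsers over HTTP/1.1, used here as an assumption rather than a rule.

```python
import math


def network_wait_ms(num_objects, rtt_ms, parallel_connections=1):
    """Time spent on round trips alone, assuming one round trip per object
    and that each connection fetches its objects one after another."""
    rounds = math.ceil(num_objects / parallel_connections)
    return rounds * rtt_ms


for connections in (1, 6):
    total = network_wait_ms(100, 120, connections)
    print(f"{connections} connection(s): {total / 1000:.2f} s")
```

With 100 objects at 120 ms each, the sketch gives 12 seconds over a single connection and about 2 seconds over six, before a single byte of content is counted.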

[Image: monitoring route times]

Monitoring distributed systems

Companies with many users across different geographies use distributed systems to serve content to their end users. A distributed architecture puts you closer to your audience, so you can avoid much of the congestion, packet loss, and latency along the way and deliver a fast and pleasant experience to your users.

Today there is a wide spectrum of solutions that serve content over the internet from a distributed system, including commercial CDNs and ISP-operated CDNs.

Some deploy their servers in a few large data centers, while others deploy thousands of servers across hundreds of cities around the world. Some use Anycast routing to direct users to the closest location, while others use Unicast addresses and DNS.

Here is the network map of some of the major CDN providers with large deployments across the globe:

Some of the major commercial CDNs form strategic alliances with large ISPs.

There are two reasons for this CDN-ISP collaboration:

  • Informed end-user to edge server assignment
  • In-network server allocation

With informed end-user to edge server assignment, a CDN appropriately assigns an end user to a server based on recommendations from the ISP. The recommendations are based on performance criteria which are mutually agreed upon by the CDN and the ISP.
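As a rough illustration of informed assignment, the sketch below ranks ISP-supplied recommendations and picks the best one the CDN considers healthy. Everything here is hypothetical: the `Recommendation` shape, the server labels, and the two criteria (delay and link utilization) with their weights simply stand in for whatever a CDN and an ISP actually agree on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """One ISP-supplied rating for a candidate edge server (hypothetical shape)."""
    server: str
    delay_ms: float          # ISP's estimate of delay from the user to the server
    link_utilization: float  # 0.0 (idle) to 1.0 (saturated) on the path


def score(rec, delay_weight=1.0, utilization_weight=20.0):
    """Lower is better. The weights stand in for the mutually agreed criteria."""
    return delay_weight * rec.delay_ms + utilization_weight * rec.link_utilization


def assign_server(recommendations, healthy_servers):
    """Pick the best-scoring recommended server that the CDN considers healthy."""
    candidates = [r for r in recommendations if r.server in healthy_servers]
    if not candidates:
        return None  # caller falls back to the CDN's own mapping
    return min(candidates, key=score).server


recs = [
    Recommendation("edge-nyc", delay_ms=8, link_utilization=0.9),
    Recommendation("edge-ewr", delay_ms=11, link_utilization=0.3),
    Recommendation("edge-bos", delay_ms=20, link_utilization=0.1),
]
print(assign_server(recs, healthy_servers={"edge-nyc", "edge-ewr", "edge-bos"}))
```

In this made-up example the lowest-delay server is not chosen because its path is nearly saturated, which is the kind of trade-off only the ISP's view of the network can inform.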

In-network server allocation allows the CDNs to place their edge servers within an ISP network. This allows the CDN to scale and at the same time serve content more efficiently to the end users.

Additional complexity requires CDN evolution

Currently, more than half of internet traffic comes from mobile networks. Delivering good performance for mobile users is even more challenging due to lower network speeds and higher congestion, packet loss, and latency.

To overcome this, CDNs deploy servers close to the mobile gateways and mobile cores, enabling them to bypass the congestion and latency challenges we discussed earlier.

Beyond the shift to mobile, applications have become more complex, with increased use of third-party APIs and dynamic content. This technological evolution has led to the need for dynamic content acceleration, front-end optimization, image optimization, and prefetching.

CDNs now offer all of these services at edge servers close to the end users.

To summarize, through the distributed architecture of a CDN you can:

  • Deliver content from servers closer to the end user by mapping each user to a server over the fastest known network path.
  • Optimize or modify content on servers closer to the end user.

The evolution of CDNs was driven by a need to improve the end user experience.

Measuring the end user experience

The only way to get a true measure of the end-user experience is by monitoring from where your end users are.

For an accurate measure of end user experience, you must use two monitoring methodologies: synthetic monitoring and real user monitoring.

The goal of synthetic monitoring is to proactively detect real issues before they impact the end users and minimize the impact on the business.
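At its simplest, a synthetic check is a scripted request that runs on a schedule and records whether it succeeded and how long it took. The sketch below shows that idea only. The names `ProbeResult`, `http_fetch`, and `run_probe` and the 2,000 ms threshold are illustrative assumptions, not any vendor's API.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool
    elapsed_ms: float
    error: Optional[str] = None


def http_fetch(url: str, timeout_s: float = 10.0) -> int:
    """Default fetcher: download the URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        response.read()
        return response.status


def run_probe(url: str, fetch: Callable[[str], int] = http_fetch,
              max_ms: float = 2000.0) -> ProbeResult:
    """Run one scripted check and flag it as failed if it errors,
    returns a non-2xx status, or takes longer than max_ms."""
    start = time.perf_counter()
    try:
        status = fetch(url)
        error = None if 200 <= status < 300 else f"HTTP {status}"
    except Exception as exc:  # network errors, timeouts, DNS failures
        error = f"{type(exc).__name__}: {exc}"
    elapsed_ms = (time.perf_counter() - start) * 1000
    if error is None and elapsed_ms > max_ms:
        error = f"slow response: {elapsed_ms:.0f} ms"
    return ProbeResult(url, error is None, elapsed_ms, error)
```

Calling `run_probe("https://www.example.com/")` from a scheduler gives one data point. The value comes from running the same check from many locations and networks, because a probe only sees the path between itself and the server.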

The goal of real user monitoring is to see how your customers are interacting with your application and what performance bottlenecks they encounter.

But where are these end users? Are the end users accessing your applications from the cloud?

There’s a new trend where monitoring vendors are deploying their monitoring nodes only on cloud providers like AWS, Azure, and IBM.

One of the problems with a cloud-only strategy is that you can’t see an accurate picture of the end user experience. Our recent blog on Synthetic Monitoring from Cloud, Backbone and Broadband nodes covers this in detail.

Customer-centric CIOs and businesses around the world are making end user experience KPIs a business mandate. The only way to verify that your content is available and delivered fast is to monitor from your end users’ locations, which is why we have deployed our synthetic monitoring nodes widely across every major city and ISP.

[Image: Catchpoint distributed systems]

This vast deployment of Backbone and Broadband Nodes means our customers can measure and monitor their users’ experience by detecting performance, availability, and reachability issues quickly. The three scenarios below show the benefits of testing from distributed locations to identify CDN issues.

Proximity mapping: If it’s a delivery problem, you need to check if the content request reaches the nearest edge server. Our new CDN Mapping Dashboard shows how server proximity impacts performance.

[Image: distributed systems monitoring example]
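A proximity check can be sketched as follows: for each monitoring node, compare the edge that answered with the edge that is geographically nearest. The edge labels and coordinates are invented for illustration, and geographic distance is only a proxy for network distance, so treat a mismatch as a prompt to investigate rather than proof of a mapping fault.

```python
import math


def distance_km(a, b):
    """Haversine distance between two (lat, lon) pairs in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Hypothetical edge locations, keyed by an invented server label.
EDGES = {
    "edge-singapore": (1.3521, 103.8198),
    "edge-mumbai": (19.0760, 72.8777),
    "edge-ashburn": (39.0438, -77.4874),
}


def check_mapping(node_location, served_by, edges=EDGES):
    """Compare the edge that answered with the geographically nearest edge."""
    nearest = min(edges, key=lambda name: distance_km(node_location, edges[name]))
    extra_km = (distance_km(node_location, edges[served_by])
                - distance_km(node_location, edges[nearest]))
    return {"nearest": nearest, "served_by": served_by,
            "mismapped": served_by != nearest, "extra_km": round(extra_km)}


# A monitoring node in Chennai that was answered by the Ashburn edge.
CHENNAI = (13.0827, 80.2707)
print(check_mapping(CHENNAI, "edge-ashburn"))
```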

Regional failures: Failures can occur regionally as well as globally. Below is an example of a CDN feature called “Edge Side Includes” failing in only one city and impacting a major airline. The application failed to load critical modules on the page, making it partially unavailable.

[Image: regional failures example]
[Image: regional failures example two]
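Spotting a failure that affects only one city comes down to grouping test results by location before looking at failure rates. A minimal sketch, assuming each result is a `(city, ok)` pair; the 50% threshold, minimum run count, and sample data are made up for illustration:

```python
from collections import defaultdict


def summarize_by_city(results):
    """results: iterable of (city, ok) pairs. Returns city -> [runs, failures]."""
    summary = defaultdict(lambda: [0, 0])
    for city, ok in results:
        summary[city][0] += 1
        summary[city][1] += 0 if ok else 1
    return summary


def failing_cities(results, threshold=0.5, min_runs=3):
    """Cities with enough runs to judge whose failure rate meets the threshold."""
    return sorted(
        city for city, (runs, failures) in summarize_by_city(results).items()
        if runs >= min_runs and failures / runs >= threshold
    )


# Invented sample: one city fails most runs while the others stay healthy.
sample = [
    ("Chicago", False), ("Chicago", False), ("Chicago", True), ("Chicago", False),
    ("New York", True), ("New York", True), ("New York", True), ("New York", True),
    ("Dallas", False),  # a single run is too little evidence to flag
]
print(failing_cities(sample))
```

A global average over the same sample would show a failure rate under 50% and could easily be dismissed as noise, which is why the per-city view matters.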

ISP issues: Issues at an ISP can result in poor application performance. The world’s largest online payment system recently detected a problem where the applications served by its CDN, as well as the origin, weren’t accessible. The issue was related to packet loss within a particular ISP.

[Image: ISP issues example]
[Image: ISP issues example two]

Whether the cause is congestion, latency, or even a change in routing path, wide node coverage is instrumental in proactively detecting network issues. Our new ASN change dashboard helps you quickly analyze changes in routing paths.

[Image: Catchpoint ASN change dashboard]
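The idea behind detecting a routing change can be sketched in a few lines: reduce each traceroute to the sequence of autonomous systems it crossed, then compare consecutive runs. This is an illustrative sketch, not how the dashboard is implemented. The input format is an assumption, and the AS numbers come from the range reserved for documentation.

```python
def collapse_path(asns):
    """Drop consecutive repeats, since several hops often sit in the same AS."""
    path = []
    for asn in asns:
        if not path or path[-1] != asn:
            path.append(asn)
    return path


def asn_path_changes(observations):
    """observations: list of (timestamp, [asn per hop]) in time order.
    Returns (timestamp, old_path, new_path) for each change in AS path."""
    changes = []
    previous = None
    for timestamp, hops in observations:
        path = collapse_path(hops)
        if previous is not None and path != previous:
            changes.append((timestamp, previous, path))
        previous = path
    return changes


# Invented traceroute results using documentation-only AS numbers.
runs = [
    ("10:00", [64496, 64496, 64500, 64510]),
    ("10:05", [64496, 64500, 64500, 64510]),  # different hops, same AS path
    ("10:10", [64496, 64501, 64510]),         # transit AS changed
]
for when, old, new in asn_path_changes(runs):
    print(f"{when}: {old} -> {new}")
```

Collapsing repeated hops first avoids false alarms when the hop count inside one network changes but the sequence of networks does not.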

Issues at an ISP can lead to DNS failures and slow response times, as shown in the scatterplot chart below. In this example, one ISP routed traffic from India to a server in North America, while all other ISPs routed traffic to the nearest server, in Singapore.

[Image: DNS response time]
[Image: DNS response time example]
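The same pattern can be surfaced numerically by comparing each ISP's median DNS response time against the others. A minimal sketch, assuming samples arrive as `(isp, response_ms)` pairs; the ISP labels, timings, and the 3x factor are invented for illustration:

```python
from collections import defaultdict
from statistics import median


def slow_isps(samples, factor=3.0):
    """samples: iterable of (isp, dns_response_ms).
    Returns ISPs whose median exceeds `factor` times the median of all ISP medians."""
    by_isp = defaultdict(list)
    for isp, ms in samples:
        by_isp[isp].append(ms)
    medians = {isp: median(values) for isp, values in by_isp.items()}
    baseline = median(medians.values())
    return {isp: m for isp, m in medians.items() if m > factor * baseline}


# Invented measurements: three ISPs resolve nearby, one is sent across an ocean.
measurements = [
    ("ISP-A", 28), ("ISP-A", 30), ("ISP-A", 33),
    ("ISP-B", 31), ("ISP-B", 32), ("ISP-B", 40),
    ("ISP-C", 35), ("ISP-C", 34), ("ISP-C", 36),
    ("ISP-D", 240), ("ISP-D", 250), ("ISP-D", 265),
]
print(slow_isps(measurements))
```

Medians are used instead of averages so that one slow lookup does not flag an otherwise healthy ISP. The comparison only works when there are measurements from several ISPs in the same region to begin with.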

We help hundreds of customers detect and identify these and many more issues on a daily basis.

Catchpoint continuously adds nodes closer to your end users, and we do this because we care about customer experience. Customer experience is the product.

Monitoring the customer experience via this vast, complex, ever-changing landscape requires monitoring from where it matters the most.
