Blog Post

Dynatrace admits their synthetic monitoring has always been ineffective

Published July 17, 2019

_“Nothing in the world is worth having or worth doing unless it means effort, pain, difficulty… I have never in my life envied a human being who led an easy life. I have envied a great many people who led difficult lives and led them well.”_

― Theodore Roosevelt

On July 8, Dynatrace confessed in this blog post why they are shutting down the old Gomez/Keynote/Classic synthetic product. As a customer of that platform from 1997 until I left Google in 2008, I can verify that every reason they list is 100% true (hence my decision to start Catchpoint with some amazing co-founders and 200+ dedicated colleagues).

We’ve already addressed why their “new synthetic solution” to these issues doesn’t work and is really just about saving themselves money at the expense of their customers. Now let’s look at their admissions as to why their existing product has always been lacking:

1. Things were slow. When we wanted to add a location, we had to ship hardware and get someone to install that hardware in a rack with power and network. Sound easy? Try doing that in India. This could take 30+ days in easy locations, but months in others.

We have over 500 backbone & broadband nodes, and we have just added 114 new locations in China alone. We manage thousands of servers across hundreds of geographical locations just for our monitoring network worldwide. No, it is not easy… if you don’t know what you’re doing, don’t have the right people, and synthetic is not your number one focus!

Fortunately, we gained a lot of experience running the largest advertising platform in the world (DoubleClick), and we know the data center & ISP industry well. When we want to open a new location, it’s typically just a phone call or two away.

2. Hardware was outdated. With the constant change in memory, CPU, resolution, and other factors, it was hard to keep up. Each time you needed an upgrade, you either needed an entirely new machine or, at minimum, remote folks to upgrade processors and memory. Try doing that across 80+ locations with 15+ machines per location; you can imagine the difficulty.

Synthetic monitoring is like a thermometer; it should always give you the right temperature, and the hardware should never impact the quality of the measurements. We solved this problem at Catchpoint by doing a complete hardware abstraction. Our years at DoubleClick and Google taught us a thing or two about how to scale better and faster.

3. Failures were happening too often. Fixed hardware is a single point of failure – even when we had redundant machines. When a data center had issues, or a box had issues, our customers had issues. And the last thing you want to do with synthetic is introduce false positives (the bane of all synthetic testing) into the system, and yet this was happening too often.

This is my biggest beef with Gomez and Keynote: false positives (i.e. data and alerts that do not represent reality).

As the head of Quality of Services at DoubleClick, I had SLAs that were based on metrics not from our own monitoring, but from an “objective” third-party monitoring vendor. We had selected Gomez as that objective vendor. Back then, their cheap, underpowered routers/firewalls frequently ran out of capacity and caused connection resets. These showed up as false positives, which forced us to pay SLA penalties to our customers and waste another $700K in FTE costs chasing false alerts.

I can tell you firsthand how frustrating it is to give credits for SLA violations, suffer damage to our brand and reputation, and then find out that it was the monitoring company who screwed up, not our own service. Their recommendation was to change my monitoring strategy by monitoring less. This is why I started Catchpoint.

At Catchpoint, we solved this by building an architecture that can scale, where the agents running the tests are completely decoupled from the configurations. Thus, if a machine is having problems, we do not run tests, because in our business, no data is better than bad data. Customers that have switched have seen upwards of a 90% reduction in noise. We take the signal-to-noise ratio pretty seriously at Catchpoint.

We’ve also known from Day 1 that hardware will always fail, so we designed our architecture around that.
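To make the “no data is better than bad data” principle concrete, here is a rough sketch of what health-gated test execution can look like. It is illustrative only: the thresholds, the psutil-based checks, and the function names are invented for this post, not our actual agent code.

```python
# Illustrative sketch only: thresholds, library choice (psutil), and names are
# made up for this post, not production agent code.
import shutil

import psutil  # third-party: pip install psutil

CPU_LIMIT_PCT = 85      # example threshold
MEM_LIMIT_PCT = 90      # example threshold
MIN_FREE_DISK_GB = 5    # example threshold


def node_is_healthy() -> bool:
    """Only report measurements from a machine with enough headroom for clean results."""
    cpu_ok = psutil.cpu_percent(interval=1) < CPU_LIMIT_PCT
    mem_ok = psutil.virtual_memory().percent < MEM_LIMIT_PCT
    disk_ok = shutil.disk_usage("/").free / 1e9 > MIN_FREE_DISK_GB
    return cpu_ok and mem_ok and disk_ok


def run_scheduled_tests(tests):
    """Tests come from a central configuration service, not from the agent itself."""
    if not node_is_healthy():
        # No data is better than bad data: record the skip instead of a noisy measurement.
        return {"status": "skipped", "reason": "node unhealthy"}
    return {"status": "ok", "results": [test() for test in tests]}
```

The key design choice is that the skip itself is recorded, so a missing data point is visible as a node problem rather than silently turning into a false alert about your service.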

4. Scaling up – or down – was a real challenge. I remember when we would sign a new customer who wanted to add hundreds or thousands of tests; we had to slow them down so we had time to add more hardware. Sometimes, if we couldn’t slow them down, other customers’ tests might be skewed as physical machines were at their limit.

Again, this is something that impacted me in my previous life. It was quite humorous to get emails from the account rep asking me to scale down our monitoring strategy.

Instead, Catchpoint has auto-scaling built in so that machines never get overloaded. Our systems are software-controlled, based on CPU, memory, network, and other factors. The agent on a machine will auto-scale and never surpass a limit on the number of tests it runs, because exceeding it can impact the overall performance and integrity of the results. We spent years of R&D tuning our software & hardware to find the right configuration that allows CONSISTENT, CLEAN, and ACCURATE measurements.
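As a rough illustration of that idea, here is a simplified sketch of an agent-side throttle. The concurrency cap and back-off thresholds below are invented for this example; they are not our real configuration.

```python
# Simplified illustration: the concurrency cap and back-off thresholds are
# invented for this example, not real settings.
import threading

import psutil  # third-party: pip install psutil

MAX_CONCURRENT_TESTS = 20  # example cap
CPU_BACKOFF_PCT = 70       # example threshold
MEM_BACKOFF_PCT = 80       # example threshold

_slots = threading.BoundedSemaphore(MAX_CONCURRENT_TESTS)


def _has_headroom() -> bool:
    return (psutil.cpu_percent(interval=0.5) < CPU_BACKOFF_PCT
            and psutil.virtual_memory().percent < MEM_BACKOFF_PCT)


def run_test(test_fn):
    """Run a test only if a slot is free and the box has headroom; otherwise defer it."""
    if not _slots.acquire(blocking=False):
        return {"status": "deferred", "reason": "concurrency cap reached"}
    try:
        if not _has_headroom():
            return {"status": "deferred", "reason": "resource pressure"}
        return {"status": "ok", "result": test_fn()}
    finally:
        _slots.release()
```

The point of the cap is the same as in production: it is better to defer a test than to let a busy machine produce a skewed measurement.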

We also always overprovision our infrastructure. We have a team of capacity management engineers who are dedicated to this, and as soon as we cross the 30% usage mark, we first deploy our extra on-site hardware capacity before ordering new equipment.
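The policy itself is as simple as it sounds. The tiny helper below is purely illustrative of the rule, not a tool we actually run:

```python
# Purely illustrative encoding of the 30% overprovisioning rule described above.
USAGE_THRESHOLD = 0.30


def capacity_action(capacity_in_use: float, spare_units_on_site: int) -> str:
    """capacity_in_use is the fraction of deployed capacity currently consumed (0.0 to 1.0)."""
    if capacity_in_use <= USAGE_THRESHOLD:
        return "no action needed"
    if spare_units_on_site > 0:
        return "deploy spare on-site hardware"
    return "order new equipment"
```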

The bottom line is that synthetic monitoring at scale is not easy, but it is possible if you are competent! As a former Gomez customer, I’m glad to finally get confirmation that our reasoning behind building a next-generation synthetic monitoring platform that actually works was valid.

While shutting down because they couldn’t deliver on the above is the right decision for them, it does not mean that it’s right for their customers.

You should not have to monitor your SLAs from AWS just because your vendor cannot deliver successful nodes on backbone and broadband.

You should not lose visibility of your network from the locations and ISPs where customers come from just because your vendor is trying to lower costs and increase its profit margin.

You should not have to pay for a lower quality product when your business and brand are tied to your reachability, availability, performance, and reliability, especially since you’re already spending more and more with cloud and third-party vendors.

Don’t change your monitoring strategy because your monitoring vendor cannot make its product financially successful for its investors.

The synthetic monitoring methodology is not going anywhere. No matter how much RUM you deploy and AI you invent, you still need to be proactive. You do not need a magic quadrant or a wave or whatever new classification method some analyst will come up with to know that.

Mehdi – CEO & Co-Founder