Webinar

Preventing Outages: Monitor What Matters

Monitoring is a given in IT: you have to monitor your systems to ensure your users are having a good experience. But most organizations are only monitoring the systems they directly control. That makes sense at first glance (it's easier!), but that's not where your users are active.

Your entire network can look ‘green’, and your users can still be unable to connect. That’s because they’re coming to your systems from different geographic locations and via different routes while using different resources – any of which can degrade their experience. This means you MUST monitor what matters to your users, not just to you!

Join Leo Vasiliou and Eknath Reddy as they demonstrate why and how to monitor from ‘the outside in’. By monitoring from your users’ perspective, you’ll be able to prevent outages and improve user experience. You’ll also get a much better understanding of your network and how your users are accessing it so you can optimize performance going forward.

Register Now

Video Transcript

Leo Vasiliou:

Hello, hello, hello. Good morning, good afternoon, good evening. General good day to everybody. Thank you for joining us for Preventing Outages: Monitor What Matters. My name is Leo Vasiliou, former IT ops and infrastructure practitioner for 16 years, author of our annual site reliability engineering report and current product marketing director. And we've got something special for you today. But before we jump in, Eknath, mind if I ask if you say hello to everybody?

Eknath Reddy:

Hey everyone, this is Eknath. I am a performance specialist here at Catchpoint. I am part of the professional services team at Catchpoint.

Leo Vasiliou:

Thank you very much, Eknath. We're going to go ahead and jump in here in just a few seconds. Before we do though, any questions that come up along the way, please type them into the Q&A tab. Otherwise, general banter can be typed in the chat tab. And this is the first time we're showing this session to the world. Eknath and I recorded it a couple of days ago and wanted to do it this way so that we could also interact during the video playback. Without any further ado, let's go ahead and press the play button.

Thank you to everyone for giving us some of your precious time today as we discuss part one in our Preventing Outages series: Monitor What Matters. It's so important to all of us that we wrote a white paper about it, with the sincere intent that we can all learn from these types of outages and incidents, and that learning is a critical part of a larger resilience program. So do feel free to scan this code for that white paper and read it at your leisure. Before we continue, we'd like to ask you to listen through the lens of what would happen if we did not have the World Wide Web and the Internet. Think about some life-changing discoveries and inventions through time: things like fire, the wheel, electricity, and of course the World Wide Web and the Internet.

Now think about what would happen if we didn't have them. Did relatively recent supply chain issues affect your access to basic items? Has high cost and inflation made it difficult to get some of the essentials? Or how about the last time your home Internet was not available, or the last time your business's primary revenue-generating websites were down? I think it's safe to say the Internet is as critical as electricity. And having said that, it's really not a question of if you are going to experience an Internet outage; it's a question of when, and how we can prevent the impact to the business. If it happens to the big companies like Amazon, Facebook, Google, Microsoft, et cetera, then it will happen to small and medium-sized companies too. So let's dig into this a little bit.

So let's talk about monitoring what matters in the context of using an AWS search outage as a talk track. We see Corbin the dog make an appearance: this is what it looked like to the end users. They saw Corbin the dog, and the visuals on the right are what it looked like in our Internet Performance Monitoring, or IPM, data. Red diamonds are errors or unavailable, and blue circles are successes. A couple of critical comments on this dataset: look at how this was not an all-down scenario, and by our estimates, the IPM data says around 20% of users were impacted.

Eknath Reddy:

Yeah. But Leo, didn't they see this on the backend like with an APM tool?

Leo Vasiliou:

Well, not so fast, Eknath. First, if you're looking at this data, you can see that it spanned midnight, across Tuesday into Wednesday. And this incident, if you will, lasted for 22 hours. To make matters worse, it was intermittent. So based on your question, since we're talking about outages for the World Wide Web and Internet disruptions, we should probably take a second to discuss the need for IPM versus APM, or more importantly, the need to monitor what matters. Think about how many millions upon millions of logs or events are generated on Amazon's backend, or forget about that, even your own backend. With that volume of data, teams can drown in it. So to reduce noise, we've seen error thresholds for APM-type tools set so high that intermittent outages like the one we are discussing right now go completely unnoticed in the first place.
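To make that signal-to-noise point concrete, here is a minimal Python sketch. The numbers, the function name, and the thresholds are illustrative assumptions, not Catchpoint's or any APM vendor's actual alerting logic:

```python
# Illustrative only: a windowed error-rate alert with a threshold set
# high enough to "reduce noise" never fires on an intermittent outage
# that affects roughly 20% of checks.

def alert_fired(error_flags, threshold):
    """Return True if the share of failed checks exceeds the threshold."""
    return sum(error_flags) / len(error_flags) > threshold

# Intermittent incident: roughly 1 in 5 synthetic checks fail, similar
# to the ~20% user impact estimated from the IPM data discussed above.
window = [1 if i % 5 == 0 else 0 for i in range(100)]  # 20% error rate

print(alert_fired(window, threshold=0.5))   # noise-reducing 50% threshold: no alert
print(alert_fired(window, threshold=0.05))  # user-centric 5% threshold: alert fires
```

The point is not the arithmetic but the trade-off: a threshold tuned to backend log volume can hide exactly the kind of 22-hour intermittent incident described here.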

And the entire point of IPM is to make sure that doesn't happen: to actually improve those detection times and that signal-to-noise ratio by directly monitoring from the end user's perspective with controlled, active test scenarios across your Internet Stack. And just to put the cherry on top of why we need IPM in addition to APM: sometimes those missing elements might not show up as errors in your APM tool at all, even if you get a 200 OK on that page. And to dig into this a little more: you are one of our top performance consultants here at Catchpoint, right? You are day in and day out defining monitoring strategies for various customers. So can you tell us, is there any set template an organization can follow to set up their end user monitoring?

Eknath Reddy:

Sure, Leo. So monitoring needs vary from organization to organization, and from team to team depending on the applications they intend to monitor and the people using the product. So we cannot have a standard, one-size-fits-all template when it comes to setting up a monitoring strategy for an enterprise. At Catchpoint, to make this process a little simpler, we follow a four-pillar model, where we set up the monitoring strategy around the four pillars of digital experience monitoring: reachability, availability, performance, and reliability.

Leo Vasiliou:

So Eknath, that's real nice; those four pillars sound like a good cornerstone to start the conversation. I was wondering if you could give a brief description of these four pillars and maybe show us some examples.

Eknath Reddy:

So the four pillars of monitoring are reachability, availability, performance, and reliability. Reachability, to give it a one-liner, is monitoring whether requests from your end users are actually reaching your server. Then you have availability: whether the end user is able to access the application as it's intended to be used. Then performance: monitoring whether the application is slow or fast, comparing it to the big players in the industry, benchmarking your performance. And then reliability, which is all about the application's consistency with respect to reachability, availability, and performance. The previous slide showing the Amazon outage, where only 20% of users had an issue, is an example of an unreliable experience, right?

Leo Vasiliou:

Right. So availability isn't just checking if the application is up or down; you've also got these other pillars. I guess a different way to think about it and expand is: once an incident is detected, how do you narrow down the root cause? Maybe show an example.

Eknath Reddy:

Sure. So availability is not only checking if the application is up or down. Enterprises have traditionally focused on ensuring the app is up, mostly by looking for an HTTP 200 OK response with some kind of basic validation. But that is not sufficient: we have to consider the application available not only if it responds with a 200 OK, but after validating that its functionality is working as expected. For example, consider an e-commerce site. You open it trying to buy something, you go to the product details page, and you see this: you have the list of products, but there are no images, nothing, and the site doesn't look right. This is actually a real-world example where one of our e-commerce customers experienced an issue with the availability of their product images. It had a direct impact on revenue, because customers who tried to buy something didn't proceed with the checkout flow: with no images, the site doesn't look trustworthy, it looks like some kind of scam.

And this happened during a holiday event, and none of the APM tools detected it because they were all looking for HTTP 200 OK. We were able to catch it using a Catchpoint transaction test, where we have a user journey in place that mirrors how the end user goes through the flow: open the homepage, navigate to one product on the listing page, go to checkout. So this is one major advantage of monitoring from an outside-in perspective, making the end user your focal point.
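As a rough illustration of that "200 OK but broken" scenario, here is a hedged Python sketch. The HTML snippet and the minimum-images rule are invented for the example; the point is that a functional validation catches what a bare status-code check misses:

```python
# Sketch: availability = correct status code AND the page actually
# renders what users need (here, at least one product image).
from html.parser import HTMLParser

class ImageCounter(HTMLParser):
    """Count <img> tags that carry a non-empty src attribute."""
    def __init__(self):
        super().__init__()
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img" and dict(attrs).get("src"):
            self.images += 1

def page_is_available(status_code, html, min_images=1):
    if status_code != 200:
        return False
    counter = ImageCounter()
    counter.feed(html)
    return counter.images >= min_images

# A product listing that returns 200 OK but renders no images at all.
broken_listing = "<html><body><ul><li>Product A</li><li>Product B</li></ul></body></html>"
print(page_is_available(200, broken_listing))  # False: up, but not truly available
```

A transaction test extends the same idea across a whole journey (homepage, product page, checkout) instead of a single page.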

Leo Vasiliou:

And this is a great little anecdote there and maybe it's time just to kind of see what that looks like.

Eknath Reddy:

Sure. I can show you an example of a major e-commerce site in the fashion industry having a problem with performance. Let me share my screen.

Leo Vasiliou:

All right.

Eknath Reddy:

Yeah. Here you see. Take a look at this 30-second graphic. On the left-hand side, you see the site loading for customers in India. On the right-hand side, you see it loading for customers in Germany. The total time taken to load the site, and the way the visual progress happens, differs between India and Germany. Obviously, you can see that the users in India are experiencing a slower site compared to Germany. So why did this happen? It comes down to wait time, which is an important metric, along with Time to First Byte and Largest Contentful Paint. Largest Contentful Paint is a Core Web Vital metric that Google introduced to measure a site's performance, and Core Web Vitals are used to rank your website in search engine scores and so on.

So it's an important metric, and if you look at the GIF again, you can clearly notice that the site is slower from India. We did an analysis of why the site is slower from India and not from Germany, and we noticed that the Time to First Byte from India is significantly higher than the Time to First Byte from Germany: almost three to four times more. One of the major metrics that impacts Time to First Byte is your wait time. The mean wait time from Germany is around 137 milliseconds, whereas from India it's around 351 milliseconds, more than twice as much. So we dug into why this happened: why the users in Germany are experiencing a lower wait time and a faster page load, while the users in India are experiencing a slower website and a higher wait time.

When we opened up the waterfalls and looked at the page load using the filmstrip and screenshots, you can see here it's taking almost 3.4 seconds for the largest content on the page to load, and the page takes almost nine to 10 seconds to finish its entire layout. As discussed earlier, it was the wait time that was the problem; it's considerably higher from India. And one of the important contributors to wait time is your CDN mapping: where your end users are, and to which location their requests are routed. In an ideal scenario, a user accessing content from India should be routed within the country, within India, and a user accessing from the US should be routed within the US.

But if I go into the waterfall record and check the headers, I can see that the request is from India, and it is actually routed to Japan: NRT is the airport code for Tokyo's Narita airport. Whereas if you take the case of Germany, when a user requests the page from Hamburg, the request is routed within Germany, to Frankfurt: FRA is the airport code for Frankfurt's airport. That is the reason you see a faster page load, right? You can see the Largest Contentful Paint, LCP, has loaded within two seconds.
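The header check Eknath walks through can be sketched in a few lines of Python. The `x-cdn-pop` header name and the PoP-to-country table below are assumptions for illustration; real CDNs expose their serving location in vendor-specific headers:

```python
# Sketch: flag requests served by a CDN PoP outside the user's country,
# using an (assumed) response header carrying the PoP's airport code.

POP_COUNTRY = {"NRT": "JP", "FRA": "DE", "LHR": "GB", "BOM": "IN"}

def routed_out_of_country(user_country, headers):
    """True if the serving PoP is unknown or in a different country."""
    pop = headers.get("x-cdn-pop", "").upper()
    return POP_COUNTRY.get(pop) != user_country

print(routed_out_of_country("IN", {"x-cdn-pop": "NRT"}))  # True: India -> Tokyo
print(routed_out_of_country("DE", {"x-cdn-pop": "FRA"}))  # False: Germany -> Frankfurt
```

Run per test, a check like this turns one-off waterfall spelunking into a signal you can alert on.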

Leo Vasiliou:

Now, this is great. I'm glad we had the opportunity to dig into the guts a little bit here, if you will. You've been showing us this manually, by drilling into the waterfall data, the detailed, granular data, which by the way is impressive. How about looking at some of these mappings, or performance, over time?

Eknath Reddy:

Sure. So this data you see comes from the HTTP headers, and it's not a standard metric that monitoring tools readily have. But in Catchpoint, we have a way to capture text from the header and turn it into an insight, or what we call tracepoints in Catchpoint, and plot it in your Explorer, where you can see a holistic view of the requests. So whatever I explained using a single waterfall record, picking out one instance, I was able to plot for multiple runs. Let me plot it across runs as well.

So if you take a look, we have around 400 runs. On the left-hand side, you have the data for Germany; on the right-hand side, the data for India. If you look at the wait time, which is the major player here causing the delay in Time to First Byte: from India, the wait time is around 500 milliseconds and 323 milliseconds, depending on the edge server it's connected to. But in Germany, it's around 164 milliseconds and 90 milliseconds, again depending on the edge server it's connected to.

On the right-hand side, in the legend, the blue entries are the requests going to Frankfurt, the green ones are going to London (Heathrow), and the orange ones are going to Japan. So if you compare the runs: when users request the page from Germany, they're routed to either Frankfurt or London. Whereas in India, it's never within the country; it's always an out-of-country mapping, which causes a higher wait time, resulting in a higher Time to First Byte, and eventually a higher Largest Contentful Paint. It's all cascaded, right?

Leo Vasiliou:

Such wonderful insight. Thank you so much, Eknath, for talking through and explaining that in great detail. Honestly, it reminds me of what we were talking about a few minutes ago with the APM versus IPM slide and the concept of monitoring what matters, especially for things that don't even show up in your APM data. To go back to your comment about monitoring what matters: it's really about directly monitoring from the end user's perspective, from the outside in, for things that agent-based APM or traditional systems monitoring cannot reach. You can only monitor what matters if you monitor from where it matters. You can't do Internet Performance Monitoring if you don't monitor from points on the Internet, which is why we are proud to offer the world's largest commercially available active monitoring network.

The true beauty of the depth and breadth of our coverage is the flexibility for practitioners and strategists to adjust that coverage as they need, based on whatever the use case is today, but also, more importantly, the ability to adjust for some new use case tomorrow. And the reason I like to talk about it that way, again going back to what the user sees (they saw Corbin the dog), is that these are some of the improvements that our customers, our own users, have told us about: a 95% improvement in performance, which is directly tied to a great customer experience; a 90% improvement in resolution time; the ability to triage six times faster with a properly implemented IPM monitoring strategy; and probably the most important of all, a good night's sleep.

I guess we'll go ahead and work to wrap up this part one. Again, we just want to say thank you to everyone for giving us some of your precious time today.

And before we wrap up part one, we want to call your attention to part two on June 14th, where Mark and Shree will continue the Preventing Outages discussion with Map Your Internet Stack. Do feel free to scan that code to register now; it should be an amazing talk for everyone to listen to. Before we get into questions and answers, we'll take a moment to open up the poll, which very simply asks: would you like to learn more about Catchpoint? Please navigate to the appropriate tab and submit your response. We'll pause intentionally to give people a moment to do that, as we wind up by again saying thank you very much for giving us some of your time today. Please keep an eye on your email for the recording and a link to redeem your attendee swag. Thank you very much, everybody.

All right, Eknath. We hope everyone enjoyed the session. We tried to make it as short, crisp, and succinct as possible; I think 21 minutes was a good time. As I mentioned in the chat, we had a couple of questions come in for you, Eknath, so I guess we'll jump in as we wrap this up. The first one says something to the effect of: on the prior issue where you were talking about pages not loading on the screen, somebody suggested there may have been an issue with some API microservices, which is unintentionally part of what we were talking about, frontend versus backend. So is Catchpoint IPM capable of monitoring those types of endpoints and URIs?

Eknath Reddy:

Yeah, sure, definitely. We have a test type called API testing: we can hit the API endpoint, set some POST data, post it, and validate the response. So we can have API testing in place using Catchpoint.

Hey, Leo, looks like you're talking on mute. Can you unmute?

Leo Vasiliou:

All right, well, I don't want to repeat all of that. I'll say yeah, we can do some cool stuff, and we made an announcement last week specifically related to enhancing our API monitoring capabilities. So apologies to everyone listening there. I'll take this next question. It says: going back to when you were talking about APM versus IPM, could you elaborate on that? So I did misspeak earlier: it's not really APM versus IPM, it's APM plus IPM, right? What is IPM? You've got your application stack, which generates the bits and bytes of your code that fly off of your servers. For the sake of the conversation, let's say you use application performance monitoring to monitor that. But when those bits and bytes do fly off your servers, there's a certain point, a demarcation, or maybe a handoff or handshake, beyond which those agent-based types of monitors can no longer reach.

However, there's still an entire stack, a myriad of components, that those bits and bytes have to traverse in order to reach your users wherever they are in the world. And that's where IPM takes over. Just to make that a little less abstract: you can't install agents on your CDN provider or your third-party API providers, right? But IPM gives you visibility into those, and those are just a couple of narrow examples to hopefully convey the concept. So hopefully that helps clear it up: it's APM plus IPM. Looks like one more. This one says: going back to when Eknath was looking at the HTTP response headers plotted in the graph, it looks like you were capturing codes; is there a way I can capture timing indicators or other numerical values and plot them as a custom KPI? I'll let you take that one, Eknath.

Eknath Reddy:

Yes, definitely. You can capture numerical values as well. For example, say you want to capture server timing: the edge server's response time, or your internal load balancer's response or processing time. Since these are not readily available metrics, it's key to have a mechanism in place to capture them, plot them, and look at the performance over a period of time. That is definitely possible with Catchpoint.
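As a sketch of capturing numeric values like these, here is a small parser for a Server-Timing header value (a standard header format; the metric names and durations below are made up for illustration). Each extracted duration could then be tracked over time as a custom KPI:

```python
# Sketch: pull numeric durations out of a Server-Timing header so they
# can be plotted as custom KPIs. Format: name;dur=<ms>;desc=<label>, ...

def parse_server_timing(value):
    """Return {metric_name: duration_ms} from a Server-Timing header value."""
    metrics = {}
    for entry in value.split(","):
        parts = [p.strip() for p in entry.strip().split(";")]
        name, dur = parts[0], None
        for p in parts[1:]:
            if p.startswith("dur="):
                dur = float(p[4:])
        if dur is not None:  # entries without a duration carry no number to plot
            metrics[name] = dur
    return metrics

header = "edge;dur=12.5, origin;dur=351, cdn-cache;desc=MISS"
print(parse_server_timing(header))  # {'edge': 12.5, 'origin': 351.0}
```

The same approach applies to any header whose value embeds a number: capture, parse, plot.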

Leo Vasiliou:

Okay, so the ability to bring your own KPI is what I'm hearing, and the KPI can be numeric or non-numeric. So I get you: a type of white-box monitoring, but from an external perspective. Makes sense to me. All right, it looks like it was just those couple of questions; I don't see any others. Going once, team... All right, let's go ahead and wrap this up. Looks like they pasted the link to part two in the chat. Otherwise, pay attention to your email for the swag. Right on time. Eknath, I'll let you give the final closing thanks and remarks to everybody.

Eknath Reddy:

Thank you everyone for joining us today. Please do participate in the poll. Have a good day. Bye-bye.

Leo Vasiliou:

Cheers everyone.