Webinar

Solving the SLO Riddle: Why SLOs aren't enough

Traditional service level objectives (SLOs) focus on IT metrics, not customer experiences. An SLO might be met, but the customer could still be unhappy with their digital experience. 

This makes it hard to get actionable insights to continually improve product reliability and customer experiences. 

In this webinar, you will learn:

  • A brief overview of SLIs, SLOs, and error budgets
  • The difference between SLOs & experience level objectives (XLOs) 
  • How implementing customer-focused XLOs can drive timely, data-driven decisions
Register Now
Video Transcript

Leo Vasiliou: Welcome to master class 3, the third in our series of four master classes. This one is entitled Solving the Service Level Objective (SLO) Riddle: Why SLOs Are Not Enough. It may also be known as: let's discuss experience level objectives. Thank you for giving us some of your precious time today.

My name is Leo Vasiliou, former DevOps practitioner and author of our annual SRE report. I have the pleasure of being joined by Brandon today. Brandon, would you mind saying hello to everyone?

Brandon DeLap: Of course. Thanks, Leo. Hi, everyone. My name is Brandon DeLap, senior solutions engineer here at Catchpoint. I've been working at Catchpoint for over 10 years now. I'm looking forward to introducing everyone to a new feature set we've added to Catchpoint and how it can help everyone moving forward.

Leo Vasiliou: Thank you, Brandon. Well met. Before we jump in, just a few housekeeping items. First, I'd like to ask our attendees to locate the chat tab. Feel free to make use of it as we go through this session. Say hello. Tell us where you're from. Second, please make note of the docs tab. We've placed a couple of resources there for you. And last, please make note of the Q&A tab. If you have any questions, need additional clarity, anything that comes to mind, feel free to type them in, and we will address them along the way.

Here is the high-level agenda for today's master class. First, we will discuss the nature of the riddle. What is the riddle? We'll go over it. We'll unpack it a little bit and have a little bit of fun. Then we'll briefly cover some key service level terms and concepts to make sure we are all talking about the same thing. And then last, we'll try to make this transition, augmenting service levels with experience levels. How do we make that transition from talking about service levels to incorporating experience levels?

So you might be familiar with the service level indicator; well, there is this concept of an experience level indicator. For the service level objective, there is the concept of an experience level objective. And then we'll talk about some of the reasons to think about this idea of experience levels and the problems they solve. Brandon, if you don't mind, take it away.

Brandon DeLap: Definitely. As Leo mentioned, we're going to start with that riddle. Before we go any further, if you meet me, you might be happy. If you miss me, you might be sad. If you meet me, you might still be sad. What am I? Give it a moment to ponder and think that one over.

Leo Vasiliou: Hopefully, the title of the master class gives us a major hint.

Brandon DeLap: Drum roll. A service level objective. Now, Leo, what does that riddle mean to you?

Leo Vasiliou: First, Brandon, I think the riddle is brilliant, so kudos to whoever created it. The essence of why we decided to talk about this today is two things come to mind. First, IT's inward focus on internal service components and availability is not necessarily a measure of the user's experience. Things look green, but when you dig in, your users might still be having frustrating experiences. The other thing it makes me think about is the age-old IT to business communications gap, or maybe sometimes we talk about trying to maintain business agility while preserving application stability, reliability, resilience.

Here at Catchpoint, I am fortunate enough to speak with a lot of reliability practitioners. We do our own independent research, publishing pieces like our annual SRE (Site Reliability Engineering) Report and our Internet Resilience Report. When we speak with practitioners, they talk about their various approaches to service and application reliability. They monitor their indicators and track them against objectives. The crux, though, is that they usually talk about IT metrics or indicators rather than customer journeys, business metrics, or business indicators.

Reporting on those IT metrics or indicators really won't provide the business with meaningful or actionable insights. IT and the business end up speaking completely separate languages, mismatched on what's important or what's valuable. If I say that a reliable product or service is critical to business success, I don't think anyone would argue the essence of that point. But if IT and the business are thinking differently about what that actually means, with different metrics, criteria, or perspectives, then they will remain disconnected. IT will have a tough time justifying their reliability investments and getting their budgets approved, and ultimately, there's a risk of customers suffering.

Brandon, those are the two things that come to mind for me. I was also thinking maybe before we get too much further, if you could give us a brief overview of some of the key terms and concepts that we'll be using in this conversation.

Brandon DeLap: Definitely. Let me drill into that then. I will make this as quick as possible. We obviously don't want to bore you with some key terms and concepts, and we want to get into the meat of it. So let's move forward here.

What do we mean when we say key terms? What have we spoken about already? SLIs, SLOs, and SLAs. When we think about an SLI, what is it? Think of it as the metric. This could be availability: how many times was a customer able to reach the site? How many times was the customer able to check out? Or it could be a performance metric: latency, response time, page load time. Are we talking about 1,000 milliseconds? Are we talking about 4 seconds? At the end of the day, when you think SLI, think of the metric.

Moving over to the SLO, this is where that metric value should be. This could be 3 nines, could be 4, 5, 6 nines, or it could be performance related: it could be 2,000 milliseconds, or within 2,000 milliseconds. This is really where your team is striving to get to with that specific metric. And then, last but not least, the agreement. This is going to be the language, the contract, what you're essentially held to, to make sure that your team or your service never breaches a certain threshold that has been set via that objective.

I just want to run through some quick examples of these three key terms. First and foremost, what we're looking at is just a simple web test running in Catchpoint. We're looking at two different metrics here on two different time series or scatter plots; in this case, two different service level indicators. The first one is overall test time, which could be an average or a percentile, and the second is percentage availability. Those are the two service level indicators for this specific test in this specific use case.

Moving forward here over to the objectives, let's introduce two green lines to these charts here. These would be essentially your objectives that you're trying to maintain. If we're looking at the overall test time, this could be less than or equal to 100,000 milliseconds for a user to complete some checkout or some sort of user flow on the page. For availability, this could be greater than or equal to 3 nines from an overall objective perspective.
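
To make those two objectives concrete, here is a minimal sketch in Python (not Catchpoint functionality; the sample values are invented) of checking a series of synthetic-test samples against a test-time objective and an availability objective:

```python
# Minimal sketch (not Catchpoint functionality): checking synthetic-test
# samples against the two objectives described above. Sample data is invented.

test_time_slo_ms = 100_000   # overall test time objective: <= 100,000 ms
availability_slo = 0.999     # availability objective: >= 99.9% ("3 nines")

# Hypothetical samples: (overall test time in ms, did the check succeed?)
samples = [(82_000, True), (91_000, True), (104_000, True), (88_000, False)]

availability = sum(ok for _, ok in samples) / len(samples)
worst_test_time_ms = max(t for t, _ in samples)

print(f"availability: {availability:.1%} ->",
      "met" if availability >= availability_slo else "missed")
print(f"worst test time: {worst_test_time_ms} ms ->",
      "met" if worst_test_time_ms <= test_time_slo_ms else "missed")
```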

Now let's go one step further real quick to the agreement. This is just a fun example of what an agreement could look like. We agree to have a general service availability of those 3 nines between x and y z, and if not, we potentially credit one month's subscription if we're talking business to end user. And if we're talking business to business or business to vendor, we're talking about some credits against that annual contract term.

Hopefully, that helps summarize those three key concepts and terms. Before we transition over to XLOs and their equivalents, I think it's a must at this point to go back and watch the previous two master classes, to fully understand and appreciate why experience levels are needed in today's world when it comes to digital properties and entities, and why they're needed to transform your business and take your organization to the next level from a maturity perspective.

The first one is monitoring from your users' perspective. When we're talking about experience, all that truly matters is your end user's experience, and to get to those XLOs, you're going to have to monitor from your users' perspective. The 2nd master class that we touched on is all about resiliency. I know that's a big term that you hear pretty often nowadays. I actually just heard it during a recent preseason NFL game, talking about being resilient when injuries occur. You also have to be resilient when issues occur to your digital properties. It's all about ensuring availability, reachability, performance, and reliability across the entire Internet stack.

Leo Vasiliou: Agreed. Two things, Brandon. We've got the QR code there on the screen, but the links to the previous two classes are also in the docs tab that we mentioned in the housekeeping at the beginning of the call. Just getting back to what we were talking about before we make this transition: service levels, indicators, objectives. Those are key reliability concepts. They're critical. They always will be. They tell you when you're doing a good job and how much budget you have to do an innovation release. In no way, shape, or form do we want to detract from their value or their necessity. Absolutely critical. All we're saying is that there's room for additional ways to think about them to improve customer experiences. I just want to make that point before we continue any further, because I felt it was important.

Brandon DeLap: So now what? What is that new concept? I know we said it several times already here that we're going to teach you in our 3rd master class session. I'll hand it over to Leo to take over the next section as we introduce you to experience levels.

Leo Vasiliou: Let's say, for the sake of the conversation, that all 4 of those services have to be successful each and every time for one single user journey to complete. Suppose one of those services on the left fails or has an uptime strike; then that user journey on the right also has an uptime strike. The user was not able to complete their journey because one of those 4 critical components failed. The individual components on the left have their respective indicators, in this case availability, and the same goes for the journey on the right. 99.9%, or 3 nines, is what the availability of that user journey would be if one component failed.

Continuing: suppose 2 of those components failed, and within the given time frame of your objective this hypothetical checkout service fails. The critical observation here is that the uptime of the individual services is still, at worst, 3 nines, but the uptime of that journey now has 2 strikes and is at 99.8%. Continuing along, if a third service fails, the indicators of those individual services are, again, at worst 99.9% each. But cumulatively, the availability, the strike rate, or the success rate of that user journey is at 99.7%. If all 4 of those services failed, they are each individually still at 3 nines, but that journey on the right is at 99.6%.
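
A minimal Python sketch of the strike arithmetic Leo walks through here, assuming 1,000 checks in the objective window, one strike per service landing in different intervals, and placeholder service names:

```python
# Sketch of the strike math: a journey check succeeds only if every dependent
# service succeeds in the same interval. Assumes 1,000 checks in the window;
# service names and strike positions are invented.

checks = 1000
service_strikes = {"auth": {17}, "catalog": {204}, "cart": {511}, "checkout": {860}}

journey_failures = set().union(*service_strikes.values())

for name, strikes in service_strikes.items():
    print(f"{name}: {(checks - len(strikes)) / checks:.1%} uptime")
print(f"journey: {(checks - len(journey_failures)) / checks:.1%} uptime")
# Each service still reports 99.9%, but the journey lands at 99.6% because
# the four strikes fell in different intervals.
```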

We're going to transition from this idea of uptime and availability to talk about performance, performance being a critical part of the transition from SLOs to XLOs. Slow is the new down, I guess, is the expression. Suppose that journey on the right was dependent on the performance of all of the services on the left. In this simple hypothetical example, let's say these services are part of your critical path and load in sequence. If they each take 100 milliseconds, then that user journey will take 400 milliseconds. If 2 of those individual services increase their individual response times by 100 milliseconds each, that's a delta of 200 milliseconds: that journey went from 400 milliseconds to 600 milliseconds. If all 4 of those services on the left increased by 100 milliseconds, then that hypothetical user journey now takes double the time, from 400 milliseconds to 800 milliseconds.
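
The latency side of the example can be sketched the same way; the 100 millisecond figures come from the walkthrough, while the service names are placeholders:

```python
# Sketch of the sequential-latency math: services on the critical path load
# in sequence, so the journey time is the sum of the parts.

baseline_ms = {"auth": 100, "catalog": 100, "cart": 100, "checkout": 100}
degraded_ms = {svc: ms + 100 for svc, ms in baseline_ms.items()}   # each slows by 100 ms

print("baseline journey:", sum(baseline_ms.values()), "ms")   # 400 ms
print("degraded journey:", sum(degraded_ms.values()), "ms")   # 800 ms
```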

Now that we understand how individual components might impact how we think about the differences between service levels and experience levels, what could these objectives be? Here's an example uptime objective for your service levels: individual service components should be available 4 nines of the time. On the right, the experience level: users should be able to check out 3 nines of the time. If we do the same thing we did a moment ago and pivot to performance: on the service level, individual service components' response times should be less than 100 milliseconds 95% of the time, and on the experience level, users should be able to complete their checkout in 400 milliseconds or less 90% of the time.
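
As a rough illustration of how those two kinds of objectives could be evaluated, here is a short Python sketch using a simple nearest-rank percentile; the helper and all sample values are purely illustrative:

```python
# Illustrative sketch: a component-level SLO checked at p95 versus a
# journey-level XLO checked at p90. Sample values are invented.

def pct(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    s = sorted(values)
    return s[-(-p * len(s) // 100) - 1]   # ceil(p * n / 100) - 1 as a zero-based index

component_ms = [62, 71, 80, 85, 88, 90, 92, 94, 96, 99]          # one service component
journey_ms = [310, 350, 360, 380, 390, 395, 405, 410, 450, 900]  # end-to-end checkout

print("component p95:", pct(component_ms, 95), "ms; SLO (< 100 ms) met?",
      pct(component_ms, 95) < 100)
print("journey p90:", pct(journey_ms, 90), "ms; XLO (<= 400 ms) met?",
      pct(journey_ms, 90) <= 400)
# The component can look healthy on its own while the journey misses its target.
```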

In the real world, there are a couple of additional things to consider here. First, consider your current state and what the consequences of that current state might be. Then think about how we get from here to there, or how we get from our current state to a better state. What would be required to get from here to there? What type of mappings do we have between these internal components and these experiences that people enjoy? Do we even have these mappings? What are the resources to get them? It's okay if they're not perfect out of the gate. Just follow the three C's of DevOps: culture, communication, collaboration.

Brandon DeLap: Definitely. Along the similar talk track of augmenting your current SLOs with XLOs, before we drill into that, I want to go back to a basic service level graph. Look at this graph again. We're talking about availability. We're talking about server response time. Mainly, internal stuff only. Is it up or down? How long does it take for just the server to respond? We're not talking about full end-to-end user flow or those critical core web vital metrics. Is it responding potentially just from the cloud or your own data center?

If we introduce XLOs, this is where we are potentially going as organizations or where we are as organizations. These following performance metrics are key. That's why we've selected them to track how your end users are actually experiencing your application or server or service from their perspective. Some of these may be new. Some of them may not be. They were introduced as the Google Core Web Vital metrics. Things like first contentful paint. How long does it take for the browser to render that first piece of content on the page? Largest contentful paint. The time the largest content is visible within that browser window. Time to interactive. The time it takes for the page to be responsive to your end users' inputs. Cumulative layout shift. The measure of the layout and how much of it actually shifts unexpectedly during the page load and as it renders. These will be the key metrics as we move forward and show you some more things related to XLOs and the features that we've added to the portal.
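
Here is a small, hypothetical sketch of comparing captured experience metrics against XLO targets; the threshold values and measurements below are illustrative only, not recommendations:

```python
# Hypothetical sketch: comparing captured experience metrics against XLO
# targets. Thresholds and measured values are illustrative only.

xlo_targets = {
    "first_contentful_paint_ms": 1800,
    "largest_contentful_paint_ms": 2500,
    "time_to_interactive_ms": 3800,
    "cumulative_layout_shift": 0.1,
}

measured = {
    "first_contentful_paint_ms": 1350,
    "largest_contentful_paint_ms": 2900,
    "time_to_interactive_ms": 3100,
    "cumulative_layout_shift": 0.04,
}

for metric, target in xlo_targets.items():
    status = "met" if measured[metric] <= target else "violated"
    print(f"{metric}: {measured[metric]} (target <= {target}) -> {status}")
```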

The main thing to point out here is that they account for more than just one single request. There are hundreds and sometimes thousands of requests required to produce that FCP, that LCP, that interaction for the page. I think that's something key to point out there as well.

Leo Vasiliou: I would also remind people of the idea of third parties. I got a question here in the Q&A about the idea of monitoring or measuring your availability and performance indicators: how do we actually track them against objectives? It's one thing to see the little blips, but how do you actually track how you're doing over a period of time? The question came in a few slides ago, but I waited to address it now.

This visual here is one of the primary visuals used in tracking whether you've met or missed your objectives over a period of time. It's what we refer to as a burn-down chart. It visualizes whether you're trending on track to meet your objective. You don't have to stress about building it yourself or figuring out which indicators or metrics to track. This is part of the Catchpoint IPM platform, part of our SLA/SLO tracker capability within our larger service level management use case. If you've never seen this before, the conversational way to read it is: on the left, if your actual indicator is above that dotted red line, you've met or exceeded your objective. On the right, if the line (the blue line in this case) is below that dotted red line, you are not meeting your service level objective.

The other thing we didn't really talk much about is the idea of what's referred to as an error budget. The way to think of it is that if you're meeting your service level objectives, then maybe you can use some of your budget to do another incremental release. If your cache is clearing and you take a performance hit because you're rewarming it, then as long as you're still exceeding your objective, you can spend that budget on things like those additional releases. The delta, the white space between the two lines, also visually conveys how much budget you have. The main takeaway here is that tracking your indicators against your objectives over a period of time will usually be visualized in this burn-down chart.
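
A back-of-the-envelope sketch of the error-budget idea, assuming a 99.9% availability objective over a 30-day window and invented incident durations:

```python
# Back-of-the-envelope error budget: a 99.9% availability objective over a
# 30-day window leaves roughly 43 minutes of allowable downtime. Incident
# durations below are invented.

objective = 0.999
window_minutes = 30 * 24 * 60

budget_minutes = window_minutes * (1 - objective)   # ~43.2 minutes
spent_minutes = sum([12, 7, 5])                      # downtime recorded so far

remaining = budget_minutes - spent_minutes
print(f"budget: {budget_minutes:.1f} min, spent: {spent_minutes} min, "
      f"remaining: {remaining:.1f} min")
# A positive 'remaining' is the white space between the two lines on the
# burn-down chart: room for a risky release or a cache rewarm. Negative
# means the objective for the window has been missed.
```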

Brandon, I think that's a perfect segue to do a brief demo of how we address some of these challenges and what that looks like in the Catchpoint portal.

Brandon DeLap: Definitely. Let me go ahead and stop my share and share my screen. Give it a moment to refresh and settle. Can everyone see my screen okay?

Leo Vasiliou: It does look good on my end, Brandon. Nice, full screen.

Brandon DeLap: Perfect. Thank you, Shana. Let me go ahead and start in the portal here with what you can configure first and foremost to then set the SLIs and SLOs. What you're looking at on my screen here is just a demo account: a demo cinema app. Think of a movie ticket booking app online. I have a suite of tests set up against this application, spanning from basic DNS reachability to HTTP availability, SSL certificates, a network perspective (running a traceroute to my infrastructure externally), all the way to executing a full user transaction through Playwright. As Leo mentioned, it's important for those experience levels and those XLOs to measure the full end user experience. In this case, it's booking a ticket for a movie that's coming out. You can see several different steps here, about 4 steps running through 4 different pages of that user flow, and we're running this from one location every 5 minutes, so we're guaranteed a check every 5 minutes for this Playwright transaction example.
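
For readers unfamiliar with Playwright, here is a stripped-down sketch of what a multi-step booking journey could look like in Playwright's Python API; the URL and selectors are placeholders, and the demo's actual monitor is configured and scheduled inside Catchpoint rather than run as a standalone script:

```python
# Hypothetical sketch of a multi-step booking journey in Playwright's Python
# API. The URL and selectors are placeholders; the demo's actual monitor is
# configured and scheduled inside Catchpoint.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    page.goto("https://cinema.example.com")              # step 1: landing page
    page.click("text=Now Showing")                       # step 2: pick a movie
    page.click("text=Book Tickets")                      # step 3: choose a showing
    page.fill("#email", "synthetic-user@example.com")    # step 4: checkout details
    page.click("button#confirm-booking")

    page.wait_for_selector("text=Booking confirmed")     # journey success condition
    browser.close()
```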

I want to jump over into the new portion here, and it lives within Analysis, in the SLOs module. This is a new module within the Catchpoint IPM platform where you can configure your objectives. Click over into that Objectives tab. From here, you have the ability to set up your SLOs and SLIs. I do have some set up here for the demo. Some of them are related to availability; some are related to performance metrics like DNS time and test time.

Some of them are related to our new XLOs, like largest contentful paint, FCP, time to interactive, and CLS. If I click into one of these examples, you have the ability to, first and foremost, give it a name and specify whether you want this to be an always-on, always-running objective or whether there is a certain time frame that you'd like to point this objective against. The next configuration involves the metric that you want to leverage, and then the violation thresholds and conditions for that indicator. In this case, what is your actual objective? I've set it at 2,500, with anything greater than that counting as a violation for a certain percentage of my nodes over a certain time frame. You can also see you have the ability here to configure your goal; my goal is to meet this threshold 75% of the time.

Last but not least, you're going to apply or assign the tests to this specific SLO or XLO. You just go ahead and drop in your specific Catchpoint monitors, whether it's an API test, a full user experience test, or any one of our monitor types, and then click save. When you click save, it will wait for the next test iteration within that time frame window to update your actual status over here on the left-hand side of my screen. If I click back over to the status here, this is where you'll see my running objectives.
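
To summarize the settings Brandon walks through, here is an illustrative representation of one such objective as plain data; this is not the Catchpoint API or its field names, just a sketch of the inputs involved:

```python
# Illustrative data only (not the Catchpoint API or its field names): the
# ingredients of an LCP objective like the one configured above.

lcp_objective = {
    "name": "Cinema app - Largest Contentful Paint",
    "schedule": "always_on",                  # or a specific time frame
    "indicator": "largest_contentful_paint_ms",
    "violation": {
        "threshold_ms": 2500,                 # greater than this counts as a violation
        "node_percentage": 50,                # ...on at least this share of nodes
        "window_minutes": 15,                 # ...within this evaluation window
    },
    "goal_percent": 75,                       # meet the threshold 75% of the time
    "assigned_tests": ["Cinema booking transaction", "Cinema API check"],
}
```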

Leo Vasiliou: Brandon, I wasn't quite sure when the best time to interrupt your flow was, so forgive me. A comment and then a question from the Q&A. I'll take a second to stress that what we did there was set the objective on a user journey. You could also set it on those internal components and assign those to a different objective, so you can have your service levels and your experience levels in the same portal. The other question came in while you had the transaction up on the screen. Essentially, it says: when looking at the transaction, sometimes there will be third parties on the page that don't affect the user experience. Is there a way to account for that? The way I take that question is, can we specify some configuration so that our objectives aren't affected by third parties that are maybe just a tracking tag or something like that?

Brandon DeLap: You could do it a number of different ways. The best way, and the way I've seen a lot of our customers do it, is configuring an additional synthetic test that you run alongside your typical scenario, where all those third-party tracking pixels fire off. To that test, we can apply what we call request blocks, where you specify the requests or hosts that you want to block during test execution. I've done that in this case. I've grabbed my three third parties. You probably have more than I do; you may have a tag manager that you can block, which would be a lot easier. In that sense, we're now only loading or requesting the first-party objects for my cinema app. Then you can apply the objective to this specific test that you've configured.
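
Request blocks are configured on the Catchpoint test itself; as a rough illustration of the same idea, here is a Playwright (Python) sketch that aborts requests to a few hypothetical third-party hosts so only first-party objects load:

```python
# Sketch of the request-block idea using Playwright's routing API (the real
# Catchpoint feature is configured on the test itself). Blocked hosts are
# hypothetical third-party tag domains.
from playwright.sync_api import sync_playwright

BLOCKED_HOSTS = ("analytics.example.net", "pixels.example.org", "tags.example.io")

def block_third_parties(route):
    host = route.request.url.split("/")[2]   # hostname portion of the URL
    if host.endswith(BLOCKED_HOSTS):
        route.abort()        # drop third-party requests
    else:
        route.continue_()    # let first-party requests through

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", block_third_parties)
    page.goto("https://cinema.example.com")
    browser.close()
```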

Leo Vasiliou: Awesome. I think that's what the question was. Thank you for that.

Brandon DeLap: Perfect. Let me jump back over here into the SLOs tab and back over to the status. We have our objectives configured. They could be availability for the internal components, or they could be more of the experience level objectives with those core web vitals. Once you jump over to the status tab, this is where you'll see the current status of those objectives. Let me reload that. Perfect. You'll see all those objectives here. We're pulling this live; this isn't some point-in-time report where you have to refresh it and so on. As you saw, this pulled fresh data from our database live as I refreshed the page. What we're looking at is the specific tests that you set objectives against: what are those objectives, what's your goal, and what is your current state against that goal on a weekly, monthly, or quarterly basis? All of this is sortable. You have the ability to see where we are seriously exceeding our budgets and where we are currently meeting our budgets on a weekly, monthly, or quarterly basis. You can also go back in time and look at it on a week-by-week basis, and the same goes for the monthly and quarterly views. This is something that you can dive into live in the portal and also download as time goes on if you need to share it outside of Catchpoint.

Leo Vasiliou: I'm going to bring one up for you; looking at the chat there, a question came in. Essentially, what they're saying is: it looks like you're in the portal. Is there API access for this stuff?

Brandon DeLap: I'll have to double-check. I believe there is the ability to pull the SLO reports via the API. I'll have to check to see if we can configure that. That's a good point, Leo.

Leo Vasiliou: Thanks, Susan.

Brandon DeLap: Perfect. Hopefully, that gave you a good idea of how to leverage the new SLO feature, including the XLOs, within the Catchpoint portal. As an existing customer, you can come right into the analysis SLOs tab, get that first set of objectives configured, apply it to the test, and then you'll see your status update as those tests execute and as time goes on.

Any other questions, Leo, before I stop my share?

Leo Vasiliou: I don't see any, Brandon.

First, thank you, Brandon. Appreciate it.

What's the riddle? If you meet me, you might be happy. If you miss me, you might be sad. If you meet me, you might still be sad. What am I? Service level objective.

Talking about augmenting service levels with experience levels, should we do it? We do our own independent research here. We do surveys, like the SRE survey, to publish the SRE report. I was reading an analyst report on what trends might be for this year or next year. We asked, "What do you think your organization should prioritize over the next 12 months?" From our reliability survey 2024 with 301 responses, you can see reliability in general was selected by 42%, but service level and/or experience level perspectives were a close second at 40%. We try to arm people with data, little nuggets, so they can justify their investments or start conversations. This is a relevant stat, a sneak peek at next year's SRE report, to help get your conversations going.

Let's wrap this up and talk through some recommendations. Brandon, you wanna kick us off?

Brandon DeLap: Definitely. It's not a master class unless we give you some nuggets to use in your day-to-day. We're talking about recommendations for setting XLOs. Always remember that slow is the new down. We can no longer focus on just server availability. How long did it take for content to be delivered to the end user? Was it 100 milliseconds or 10 seconds? Depending on the object, users might be upset and might post on social media. So remember, slow is the new down.

Leo Vasiliou: Performance is key. I asked this year's engagement question: "Slow is the new down." Have you heard this expression before? Do you agree or disagree? It's not just something we're making up at Catchpoint. I've heard it in hallway conversations for years, and I was curious if it's more than just an internal anecdote. The answer is yes, it is.

I'll take the next one: monitor what matters from where your users are. Your application stack includes third parties to create the experience, and you also have your Internet stack, which is used to deliver those experiences. These stacks differ in different parts of the world, and you might have to set different indicators or objectives depending on the region. The only way to know whether they are being met or missed is to monitor from where your users are. Watch those first two master classes where we drill into the nitty-gritty of those concepts. But, monitor from where your users are.

Brandon DeLap: Excellent. The next point is, don't consider service levels as your experience levels. Essentially, go beyond your internal availability metrics and logs. Create a new set of indicators and objectives when referring to experience levels. It's key to not consider what you're measuring internally as your external experience levels.

Leo Vasiliou: No silos. Work with the business: product owners, program managers, go-to-market strategists, risk managers, to determine customer-focused requirements. Identify critical processes and journeys. Your product team knows what's important to users. Sales and sales engineers know what's important for growth. Risk managers understand how governance and compliance might influence your objectives. There are rules and regulations that stipulate reliability and resilience requirements, like those in the finance vertical in the U.S. or the EU's DORA (Digital Operational Resilience Act). Don't set objectives in silos. Remember the C's of DevOps: culture, communication, and collaboration, catalyzed by iteration.

Brandon DeLap: That's excellent. Another key point: before setting any agreement, you must know your actual baselines from a user experience perspective externally. You can't set objectives, like a 1-second largest contentful paint, when your current baseline is 3 or 4 seconds. It's crucial to establish baselines before setting XLOs.
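
A minimal sketch of establishing that baseline from historical measurements before committing to an XLO target; the history values below are invented:

```python
# Sketch: establish a baseline before committing to an XLO target. The
# 'history_ms' values stand in for a few weeks of measured LCP samples.
from statistics import median

history_ms = [3100, 2800, 3400, 2950, 3600, 3050, 3300, 2900, 3150, 3500]

p50 = median(history_ms)
p90 = sorted(history_ms)[int(0.9 * len(history_ms)) - 1]   # rough nearest-rank p90

print(f"baseline: p50 {p50} ms, p90 {p90} ms")
# With a roughly 3-second baseline, a 1-second LCP objective is aspirational;
# set the first XLO near the baseline and tighten it over time.
```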

Leo Vasiliou: And continually optimize. We're close to time, so I'll wrap it up. Don't analyze to paralyze—iterate. Get that white space between your actual indicator and your objective, just make it a little better each time. Remember, it's the journey, not the destination. The infinity loop life cycle—continually optimize. Thank you for joining today. I don't see any questions in the Q&A tab, but if you have any, please type them in. Be sure to visit the docs for the resources, including an SLA asset and links to the previous two master classes. Brandon, you wanna sign us off?

Brandon DeLap: Definitely. Thanks everyone for listening to this 3rd master class. Hopefully, you can move forward with configuring your XLOs. Catchpoint has the feature to help you establish those. Looking forward to seeing everyone get their XLOs set up in the future.