A CIO’s Take On Preparing for SaaS Outages

Published
January 16, 2019

As the CIO of Catchpoint, I am responsible not only for the digital services we deliver to our customers but also for the services we deliver to our employees, the services that keep Catchpoint running successfully. Like most customers of the web conferencing service Zoom, we were impacted by its outage last week. And like any vendor outage, it caught us by surprise and shed light on our weaknesses and on how prepared we were for the failure of such critical services.

Setting the stage

We founded Catchpoint in the digital era, so we have always relied on digital services like SaaS and PaaS to operate the business and deliver portions of our services. Over the last ten years we have kept adding services and now use more than 150 of them. With nearly 200 employees, our number of SaaS applications is close to our number of employees.

There are the obvious critical SaaS apps like Office 365, Salesforce, Marketo, NetSuite, and Zendesk. Then there are others, like Google Tag Manager and Google AdWords, that often are not considered SaaS even though vendors deliver them over the cloud. Even access to our physical office is provided as SaaS by OpenPath.

Then there are the cloud services we rely on for our products and services, like Dyn, Fastly, NS1, and Verizon Digital Media. We also have nodes in every public cloud provider: Alicloud, AWS, Azure, Google, IBM, and Tencent.

All these services are essential at various levels to our daily operations. If Office 365 Outlook goes down, the whole company comes to a halt. If Salesforce goes down, our Sales and Customer Success teams have a hard time connecting with customers. If Zendesk goes down, we can’t support our customers. But if AdWords goes down, we aren’t impacted much, as it’s not critical to our day-to-day operations.

For several services, we built redundancy. For instance, when it comes to delivering our core applications to customers, we built redundancy by utilizing three DNS providers, two CDNs, etc. However, for digital services used by our employees, only a few have redundancy. For example, if OpenPath has an outage, employees might not be able to enter the office with their phones—but they can still enter the office via keypad.

For web conferencing, Zoom is primary, but we also have Microsoft Teams for internal use. There are good reasons for not having web conferencing redundancy:

  • We cannot train all employees on two different, complex applications for the same activities (like chat, web conferencing, and file sharing).
  • It’s hard to keep two different apps from two different vendors in sync as they have different data models and workflows (exceptions are file-sharing services).
  • The cost can become enormous when paying per user per month.

Zoom outage

Zoom had a worldwide outage on January 9th, between 12 pm ET and 2 pm ET. It eventually impacted every Zoom client, and it impacted us too. Some users got 504 errors, while others who logged in via Okta were greeted with a message stating they had no access and should contact the “Owner” of the account. Our CEO got that message.

The first reaction was: wait, did I get kicked out of Zoom? The next reaction was: did someone forget to pay the bill? Within minutes, our Corporate IT team and I were flooded with messages asking what was going on. We sent a companywide email to notify everyone of the problem.

Our offices are distributed across six cities and three continents, and we have remote employees around the world. Customer-facing teams were unable to get on calls with customers. Conference rooms were unusable as people could not screen share. The result? People canceled meetings and focused on individual work—or huddled at desks to continue their meetings.

Now you might be wondering what happened to our redundant services. Well, that was the first big failure on our side. Microsoft Teams was available as an alternate, but we didn’t plan for an outage. So, we didn’t communicate to employees to switch to Teams. Nor did we properly train employees on how to use Teams for conferencing. Some power users of Microsoft Teams switched over during the Zoom outage on their own, but most employees did not.

In addition, all conference rooms rely only on Zoom and therefore weren’t usable. Now, we are looking into bringing Microsoft Teams into our conference rooms.

The Public Cloud storm

Zoom, like many digital companies, relies on the public cloud, specifically AWS, to deliver its service. In communications to clients, Zoom stated that the root cause was an AWS service failure. Its failover system and service architecture somehow failed to handle the failure and did not recover. Interestingly, the AWS status page did not show any failures, which could mean either that Zoom has a private cloud or that the AWS failure, whatever it was, was isolated to Zoom’s instance.

For all the SaaS providers and external-facing digital services out there, this should serve as a lesson. We must always ensure that we have correctly identified the single points of failure in our system architecture, and that we have not only built redundancies and failovers but also tested them periodically to ensure they work. At the end of the day, failure is bound to happen, whether due to bad code, bad configuration, hardware failure, or vendor failure.

Lessons for IT teams

1. Determine which services are critical to your business

Your company likely utilizes hundreds of apps. We have customers that have over 1,000 apps or services. Not all are critical. Not all need the same level of attention. Define different levels of importance for your services and have a process to classify services as you buy them going forward. At the simplest, you might categorize them as follows:

Critical – the service impacts revenue directly. For example, if your cloud provider goes down, your own service might go down with it: you cannot sign new customers, you might lose customers, and you might breach client SLAs and have to pay fees.

High – the service impacts employees and work cannot be completed, but there’s no direct impact on revenue or costs.

Medium – the service might impact employees, but you have redundant solutions in place.

Low – there’s no impact on employees, business, or activities. You are indifferent to the service being up or down.

Keep in mind that for some services, timing might change the level of importance. For example, a service you use to complete taxes is only critical during tax season.
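
The four levels above can be written down as a small decision rule. What follows is only a minimal Python sketch under stated assumptions: the names `Criticality` and `classify`, the three yes/no questions, and the `in_season` flag are all hypothetical, and treating an out-of-season service as Low is one possible reading of the tax-season example, not a rule from our process.

```python
from enum import Enum


class Criticality(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


def classify(impacts_revenue: bool, impacts_employees: bool,
             has_redundancy: bool, in_season: bool = True) -> Criticality:
    """Map three yes/no answers onto the four levels (hypothetical helper)."""
    # Assumption: a seasonal service outside its season is treated as Low.
    if not in_season:
        return Criticality.LOW
    # Direct revenue impact always wins.
    if impacts_revenue:
        return Criticality.CRITICAL
    # Employees are blocked and there is nothing to fall back on.
    if impacts_employees and not has_redundancy:
        return Criticality.HIGH
    # Employees are affected, but a redundant solution is in place.
    if impacts_employees:
        return Criticality.MEDIUM
    return Criticality.LOW


# Web conferencing with no usable backup, as in our Zoom outage.
print(classify(impacts_revenue=False, impacts_employees=True, has_redundancy=False))
```

Running the rule when a service is purchased, and again whenever a backup is added or retired, keeps the classification from going stale.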

2. Have a current plan for each critical vendor or service type

Whether the service is web conferencing, sales automation, or your managed DNS, you need to have a well-documented plan in place as to what to do when the service fails.

The plan should define:

  • Who to contact
  • Who to send communications to and what those communications should be
  • What the backup solutions are and who is responsible for managing them
  • What training all employees, or specific employees, need to be ready for a crisis (skip this for low-priority services)

Backup solutions work well when the tools are simple, the learning curve is low, and data either does not need to sync between the two competing solutions or can be synced easily (as with file sharing).
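
One way to keep such a plan honest is to store it as structured data and check it for gaps. The sketch below is illustrative only: the `OutagePlan` class, its field names, and the `missing_items` check are assumptions made for this example, not a format we or any vendor prescribe.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OutagePlan:
    """One plan per critical vendor or service type (hypothetical structure)."""
    service: str
    vendor_contacts: list[str] = field(default_factory=list)  # who to contact
    notify: list[str] = field(default_factory=list)           # who receives communications
    message_template: str = ""                                # what the communication says
    backup_solution: str = ""                                 # what the backup is
    backup_owner: str = ""                                    # who manages the backup
    training: list[str] = field(default_factory=list)         # training needed before a crisis
    low_priority: bool = False

    def missing_items(self) -> list[str]:
        """Return the names of plan sections that are still empty."""
        required = {
            "vendor_contacts": self.vendor_contacts,
            "notify": self.notify,
            "message_template": self.message_template,
            "backup_solution": self.backup_solution,
            "backup_owner": self.backup_owner,
        }
        # Training is skipped for low-priority services.
        if not self.low_priority:
            required["training"] = self.training
        return [name for name, value in required.items() if not value]


# A backup exists, but nothing else was planned.
plan = OutagePlan(service="web conferencing", backup_solution="Microsoft Teams")
print(plan.missing_items())
```

The example mirrors our own situation during the Zoom outage: the backup tool was there, but the contacts, communications, ownership, and training were not written down.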

3. Monitor every critical service

If you are in IT and responsible for the delivery of a service, you must monitor it. You probably don’t want to monitor everything. For instance, we do not monitor if the Google AdWords portal is up or down as we have decided its availability isn’t critical to our business or employees. However, we monitor Office 365, NetSuite, Zendesk, Salesforce, Marketo, Zoom, and every service our product depends on.

Make sure your monitoring solution gives you visibility into the performance of the service and also of the network that you control: ISPs, routers, WiFi, proxies, etc. When a failure is detected, you need to figure out quickly whether the problem is your network, the vendor, or an outage at major backbone providers on the internet.
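
The triage question above can be reduced to a few independent checks. The following Python sketch assumes you already collect three probe results: one against the vendor's service, one against the network you control, and one against unrelated reference sites from the same location. The function name `locate_failure` and the returned labels are hypothetical, and a real monitoring setup would compare many vantage points rather than three booleans.

```python
def locate_failure(vendor_ok: bool, local_network_ok: bool,
                   reference_sites_ok: bool) -> str:
    """Guess where a failure sits from three independent probe results."""
    # The vendor's service answers: nothing to triage.
    if vendor_ok:
        return "no failure detected"
    # Your own routers, WiFi, or proxies are failing: start there.
    if not local_network_ok:
        return "your network"
    # Your network is fine, but unrelated sites are unreachable too:
    # the problem is likely upstream, at an ISP or backbone provider.
    if not reference_sites_ok:
        return "ISP or internet backbone"
    # Everything else works and only the vendor fails.
    return "vendor"


# Local network healthy, other sites reachable, vendor failing.
print(locate_failure(vendor_ok=False, local_network_ok=True, reference_sites_ok=True))
```

The order of the checks matters: ruling out your own network first avoids blaming a vendor for a failure that sits on your side.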

4. Hold performance reviews

As an employee, you might have quarterly, biannual, or annual reviews that range from manager to peer reviews and 360 reviews. It’s interesting to consider how companies have clear performance review processes for employees—with frequent feedback loops—but almost no company has service performance reviews for vendors.

Is the vendor delivering the service they promised? Are we getting what we should be getting out of the vendor? Funnily enough, most of these reviews happen only during budgeting exercises or when the boss decides to review the tools (after reviewing the people). Most of the time, there’s no documentation on the service, the value it provides, or the failures it has accumulated. So decision making often comes down to the most recent event (or the feelings of the decision maker).

Tools and services are probably the second biggest line item in a business, right after people. They are a significant investment and can have an enormous impact on your business. Therefore, make sure you have an ongoing internal review process and ongoing reviews with vendors. We review critical services weekly, other services quarterly, and the least-critical services yearly. Your company likely has security and privacy review processes in place. I highly recommend you align performance reviews with them.

Capture key information on an ongoing basis.

  • Are services performing their intended function?
  • Have they saved money or helped make more money?
  • Have they failed or resulted in your teams spending more money to make them work?

Also, measure their speed and availability, and track any SLAs you have. What is the point of negotiating an SLA if you aren’t monitoring it and cannot enforce it?
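
Tracking an availability SLA is mostly arithmetic on downtime. The sketch below is a minimal illustration in Python; the function names are hypothetical, and the 99.9% target in the example is an assumed figure, not the SLA of any vendor named in this post.

```python
def availability_percent(downtime_minutes: float, period_days: int) -> float:
    """Measured availability over the period, as a percentage."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(sla_percent: float, period_days: int) -> float:
    """Downtime budget that an availability SLA leaves over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


def sla_breached(downtime_minutes: float, sla_percent: float,
                 period_days: int) -> bool:
    """True when measured downtime exceeds the SLA's downtime budget."""
    return downtime_minutes > allowed_downtime_minutes(sla_percent, period_days)


# A two-hour outage in a 31-day month, checked against an assumed 99.9% SLA.
print(round(availability_percent(120, 31), 2))
print(round(allowed_downtime_minutes(99.9, 31), 2))
print(sla_breached(120, 99.9, 31))
```

In this example, a single two-hour outage brings monthly availability down to roughly 99.73%, while a 99.9% target would allow only about 44.6 minutes of downtime in the same month. Without your own measurements of the downtime, you have nothing to bring to the vendor.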

Conclusion

It’s essential that you identify and monitor your critical services. Once you’ve identified those critical services, you need to have plans for dealing with failures—whether they belong to you or your vendors. And finally, keep an open and continuous feedback loop with your vendors so that they aren’t guessing what you need.

Most importantly, always communicate with vendors on how they are doing—because if they are not aware of where they are failing or succeeding, how can they keep doing more of what works and less of what doesn’t? At the end of the day, your vendors are no different than your employees. They are valuable extensions of your company that help it grow, scale, and improve.

