5 Lessons for Managing a Third Party Outage
A third-party outage is a giant pain for both companies and their users. On June 29, Comcast and its customers suffered an Internet outage caused by two fiber cuts at Level3 and Zayo. The outage frustrated millions of consumers across the country for several hours.
Red spots on an outage map are a nightmare for any SaaS, PaaS, IaaS, or enterprise company affected: not only does your internal productivity suffer, but you also risk damaging customer relationships.
No company that offers digital services is safe from an outage. So, we’ve put together a list of 5 lessons on what you can do to make lemonade out of a giant lemon like an outage.
1. Failure is Bound to Happen – Build Redundancy
Modern enterprises rely on a web of many parts for connection – including ISPs. Like any of your other parts, ISPs are sometimes going to have outages or issues. So, you definitely need a backup plan.
Don’t use the same provider for your primary and backup. Diversify so that if your primary provider crashes, you can switch over to the backup provider.
You’ve probably installed critical hardware components like firewalls and load balancers in a high-availability configuration to keep yourself safe. To be fully redundant, you need additional backups for more than just your hardware. You need backups for essential services like ISPs, DNS, or CDNs.
ISP issues aren’t the only risks you should prepare for. Build redundancy for all your 3rd parties and make sure you’re monitoring them in case of an outage:
• Deploy multiple DNS providers on separate networks that are seamlessly integrated using APIs or a custom program.
• Use an active CDN failover strategy – with multiple CDNs at your major points of presence (PoPs). This also provides the fastest delivery of data to your users.
• Use two cloud providers or dual datacenters – run servers in different locations (for example, one in CA and one in VA) or with different providers, such as a few on AWS and some on Rackspace.
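The switch-over idea behind these bullets can be sketched in a few lines of Python. This is only a minimal illustration, not a production failover mechanism: the provider names and the `is_healthy` callback are hypothetical placeholders for whatever health check your own DNS, CDN, or ISP setup exposes.

```python
from typing import Callable, Optional, Sequence


def pick_provider(
    providers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider, in priority order, that passes its health check."""
    for name in providers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A health check that raises is treated the same as a failed check.
            continue
    # Every provider failed: the caller decides what to do next.
    return None


# Hypothetical status table standing in for real health checks.
status = {"primary-isp": False, "backup-isp": True}
chosen = pick_provider(["primary-isp", "backup-isp"], lambda name: status[name])
print(chosen)
```

Listing providers in priority order keeps the primary in use whenever it is healthy and only falls back when its check fails.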
2. Monitor Your ISPs from Key Regions
You need to know what’s going on from your customers’ perspective. Monitoring only from AWS in San Jose isn’t going to give you reliable data on what all consumers in California are experiencing. Nor will it tell you if your ISP is having peering problems with major providers in the region. And it’s especially unhelpful if your services are also hosted in that same AWS region. It will, however, give you a good idea of the API/service experience of clients consuming those services from that AWS region.
Monitoring from multiple points of presence gives you insight into which locations are having issues. Maybe users in San Jose are experiencing an outage because their traffic is routed through ISP A, but users in Virginia aren’t affected because their traffic goes through ISP B.
It’s also important to monitor from key backbone ISPs that deliver content to the region to ensure your ISPs, CDNs, DNS, and cloud providers are not having peering problems that cause regional micro-outages.
Synthetic monitoring continuously tests from multiple locations and key backbone and broadband ISPs, giving you insight into any rising issue – often before it turns into a bigger problem.
If you weren’t monitoring from Comcast, you wouldn’t have known there was an issue there. You need to monitor along the entire journey to pinpoint any break in the chain:
- Make sure your monitoring strategy includes tests on backbones, last mile, and wireless.
- Cover your users from their geographic locations.
- Implement real user monitoring (RUM) for an exact view of your users’ experiences.
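To make the multi-location idea concrete, here is a rough Python sketch that groups synthetic check results by region and ISP and flags the combinations that are mostly failing. The record format, the region and ISP labels, and the 50% threshold are all assumptions chosen for illustration; a real monitoring tool will have its own data model.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Assumed record shape: (region, isp, check_succeeded)
Check = Tuple[str, str, bool]


def failing_vantage_points(
    checks: Iterable[Check],
    max_failure_rate: float = 0.5,
) -> List[Tuple[str, str]]:
    """Return the (region, isp) pairs whose failure rate exceeds the threshold."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for region, isp, ok in checks:
        key = (region, isp)
        totals[key] += 1
        if not ok:
            failures[key] += 1
    return sorted(
        key for key in totals if failures[key] / totals[key] > max_failure_rate
    )


# Hypothetical results: San Jose via ISP A is failing, Virginia via ISP B is fine.
checks = [
    ("San Jose", "ISP A", False),
    ("San Jose", "ISP A", False),
    ("San Jose", "ISP A", True),
    ("Virginia", "ISP B", True),
    ("Virginia", "ISP B", True),
]
affected = failing_vantage_points(checks)
print(affected)
```

Grouping by both region and ISP is what lets you tell a regional ISP problem apart from a problem with your own service, which would show up everywhere.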
3. Keep an Eye on Your Service Delivery Chain
Just because you’ve planned for the worst doesn’t mean your partners, vendors, or suppliers have. They may or may not have built redundancy. They may or may not be monitoring all parts of their infrastructure. And they may or may not execute their redundancy plans properly.
To protect your company and customers:
- Ask your partners and vendors what they are monitoring.
- Ask them about their own redundancy and outage plans.
- Monitor your own infrastructure so that you can hold 3rd parties accountable to their SLAs.
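As a rough illustration of holding a vendor to an SLA, the Python sketch below turns your own check results into an availability percentage and compares it with a target. The one-check-per-minute cadence and the 99.9% target are hypothetical numbers, and real SLAs often define downtime differently, so treat this as a starting point rather than a contract calculation.

```python
from typing import Iterable


def availability_percent(results: Iterable[bool]) -> float:
    """Percentage of checks that succeeded."""
    results = list(results)
    if not results:
        raise ValueError("no check results to evaluate")
    return 100.0 * sum(1 for ok in results if ok) / len(results)


def meets_sla(results: Iterable[bool], target_percent: float) -> bool:
    """True when measured availability is at or above the SLA target."""
    return availability_percent(results) >= target_percent


# Hypothetical day of one-minute checks: 1,440 checks, 30 of them failed.
day = [True] * 1410 + [False] * 30
measured = availability_percent(day)
print(round(measured, 2), meets_sla(day, 99.9))
```

Having your own measurements means the conversation with a vendor starts from data you collected rather than from their status page alone.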
4. Focus on What You Can Control and Execute Quickly
When something big like the Comcast outage occurs, your best bet is to worry about what you can control. Instead of worrying about news and Reddit threads, focus on making sure the other parts of your infrastructure are performing properly.
Routing doesn’t fix itself; you have to make changes (or tell your providers to make the changes).
- Synthetically monitor your ISPs 24/7 so that you can be proactive should an issue arise.
- Have a backup ISP so you can switch over when there’s a problem.
- Have a runbook on what to do when ISP peering is the root cause of an outage.
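One step a runbook might encode is when to stop waiting and switch to the backup ISP. The Python sketch below uses a simple rule, assumed here purely for illustration: trigger failover after a set number of consecutive failed checks, so that a single blip doesn't cause a switch. The class name and the threshold of three are hypothetical.

```python
class FailoverDecider:
    """Signal failover after a run of consecutive failed checks."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ok: bool) -> bool:
        """Record one check result; return True when failover should be triggered."""
        if ok:
            # Any success resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


# Hypothetical sequence of check results from the primary ISP.
decider = FailoverDecider(threshold=3)
results = [True, False, False, True, False, False, False]
decisions = [decider.record(ok) for ok in results]
print(decisions)
```

Only the final check in this sequence triggers a switch, because the earlier failures were interrupted by a success.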
5. Have a Communication Plan Ready
When stuff hits the fan, make sure you’ve got a plan in place so that you can quickly communicate with your team about what needs to get done. You’re also going to need a plan for communicating with your customers and the public.
Internal Communication
Introverted or not, it’s still vital that we’re efficient communicators.
Source: https://www.flickr.com/photos/tbaatar/8367493679
Here are some steps you can take to be effective in communicating with your team should a 3rd party outage occur:
- Identify the stakeholders – who’s who on the team? What are their responsibilities?
If you’re a large organization, you might want a record of this who’s who somewhere on your computer. Most people are terrible with names, so save a list of names, duties, and email addresses so you can get in touch with the right people ASAP.
- Assign the right tasks to the right people, and make sure people are clear on what their tasks are.
- Have a chain of execution – maybe you send a Slack message or email to stakeholders or call a meeting to inform them, and then everyone executes their part of the plan.
- Take your plan for a test drive every 6-12 months so that everyone is prepared when a real outage occurs.
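The who's-who list can be as simple as a small structured roster. Here is one possible shape in Python; the names, duties, and example.com addresses are made up for illustration, and a shared document or on-call tool would work just as well.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stakeholder:
    name: str
    duty: str
    email: str


# Hypothetical roster; replace with your own team.
ROSTER = [
    Stakeholder("Alex Example", "engineering", "alex@example.com"),
    Stakeholder("Sam Sample", "public messaging", "sam@example.com"),
    Stakeholder("Pat Placeholder", "customer success", "pat@example.com"),
]


def contacts_for(duty: str, roster: List[Stakeholder]) -> List[str]:
    """Return the email addresses of everyone responsible for a given duty."""
    return [person.email for person in roster if person.duty == duty]


print(contacts_for("engineering", ROSTER))
```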
External Communication
Your internal plan shouldn’t be the only communication plan you have in place for outages. You’re going to need to communicate with your most important asset: your customers.
Comcast communicated with the public via Twitter shortly after the outage first occurred.
Your external communication plan will likely involve your marketing, sales, customer success, and PR teams. It might look something like this:
- Engineering and product spot the problem and articulate that to the other teams.
- Engineering communicates with 3rd party contacts quickly.
- Marketing and PR decide upon and send public messaging.
- Sales and customer success communicate with current clients.
Your organization should identify the key stakeholders for external communication using the bullets from above – who is responsible for these forms of communication? Who manages the public messaging? The customer communication?
After you’ve identified stakeholders, you need a plan: a chain of events that lays out how communication will unfold. Maybe it’s protocol for your org to inform the board and enterprise customers before you notify the general public. Each company will have a different chain of events based on internal and consumer needs.
Remember that outages make people angry – everyone involved in customer or 3rd party communication should keep cool and flex their empathy muscles. Share charts and reports from your monitoring tools that illustrate where the issue is, and show people that you’re working on a resolution.
In Conclusion
Every company with a complex web system should be monitoring every point along the user’s path. This will ensure you’re able to switch over to backups and often nip problems in the bud.
Monitoring also improves the efficacy of both internal and external communication. The faster you spot a problem, the faster you can make changes, alert 3rd parties, communicate, and fix it.