Incident Review - Slack Outage Impacts A Subset Of Users Worldwide Due To DNS Issue

Published
October 1, 2021

DNS observability is an essential part of any Ops team’s strategy. Looking for proof? It’s happening right now.

It has been a busy week for Ops teams across the globe. Many were forced to urgently rotate SSL certificates after one of Let’s Encrypt’s root certificates expired.

Collaboration is critical in situations like this, when members of one or more teams must communicate to complete a shared task quickly. Unfortunately, things got more challenging this week: Slack, one of the world’s largest collaboration and messaging applications, was inaccessible to some users worldwide during the same period.

DNS misconfiguration is at the core of this issue. If the process of DNS resolution fails, users experience outages like this. However, you can take action to avoid business impact.

Let’s start by breaking down the issue that’s happening currently.

Slack Acknowledges The Issue

Users were unable to access Slack’s desktop, mobile, and web applications from 15:30 UTC onwards. The outage was related to a DNS failure, which Slack later acknowledged. At the time of publication, the issue is still ongoing for some users, and at 06:57 UTC Slack announced that it may take up to 24 hours to resolve completely for all users.

Users struggled to tell whether they were unable to reach Slack because of their device, wireless network, or ISP connectivity. Things got more difficult because Slack’s status page was down due to the same issue.

Why Monitoring From The Cloud Isn’t Enough

When operations teams cannot collaborate efficiently during such incidents, things can easily get out of control, and that can lead to outages that directly impact customers. IT teams in some organizations may already monitor their SaaS applications, but it would not be surprising if none of those monitors triggered an alarm for Slack.

Most monitoring solutions are hosted on cloud instances. Monitoring applications only from cloud instances leaves dangerous blind spots and does not accurately represent the end-user experience.

A good monitoring and observability strategy must include a combination of observation across backbone and last mile networks. The backbone network has predefined bandwidth and consistent network connectivity. This allows you to monitor, measure and benchmark application performance without any network fluctuations. At the same time, the last mile network represents availability and performance for real end users who are trying to access digital services on their home/office networks.
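To make the blind-spot idea concrete, here is a minimal sketch of how test results from different vantage points could be compared. It is an illustration only: the vantage labels (`cloud`, `backbone`, `last_mile`), the function names, and the 10% threshold are assumptions made for this example and do not describe how any particular monitoring product works.

```python
from collections import defaultdict


def failure_rate_by_vantage(results):
    """Return the failure rate per vantage type.

    `results` is an iterable of (vantage_type, succeeded) tuples, e.g.
    ("last_mile", False). The labels are hypothetical.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for vantage, succeeded in results:
        totals[vantage] += 1
        if not succeeded:
            failures[vantage] += 1
    return {vantage: failures[vantage] / totals[vantage] for vantage in totals}


def cloud_blind_spots(rates, threshold=0.1):
    """List vantage types that fail at or above `threshold` while the
    cloud-based tests stay below it (an assumed, illustrative rule)."""
    if rates.get("cloud", 0.0) >= threshold:
        return []  # cloud tests already see the problem, so no blind spot
    return sorted(
        vantage
        for vantage, rate in rates.items()
        if vantage != "cloud" and rate >= threshold
    )
```

With results where every cloud test passes but most last mile tests fail, `cloud_blind_spots` would return `["last_mile"]`, which is the kind of failure a cloud-only monitor would miss.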

Catchpoint’s Last Mile Tests Detected DNS Issues As The Root Cause

Catchpoint’s last mile tests detected Slack DNS issues, allowing the platform to proactively notify respective teams.

Scatterplot data from Catchpoint showing intermittent failures for Slack DNS tests

Catchpoint records showing server failure while resolving slack.com domain

Even 15 hours into the outage, some users still cannot access Slack. Those who are aware of the issue and its root cause can mitigate it by overriding their default DNS resolver with a public DNS resolver such as 8.8.8.8 or 1.1.1.1.
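One way to check whether a given resolver answers for slack.com is to send it a DNS query directly. The sketch below uses only the Python standard library and the DNS wire format from RFC 1035. The function names are made up for this example, it handles only a plain A-record query over UDP, and it is not how Catchpoint's tests are implemented.

```python
import socket
import struct

# A few common DNS response codes (RFC 1035, section 4.1.1).
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def parse_rcode(response):
    """Extract the response code from the low 4 bits of the flags field."""
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F
    return RCODES.get(rcode, f"RCODE{rcode}")


def check_resolver(resolver_ip, hostname, timeout=3.0):
    """Ask one resolver for `hostname` and return the response code name."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (resolver_ip, 53))
            response, _ = sock.recvfrom(512)
        except OSError as exc:  # includes timeouts and unreachable networks
            return f"ERROR ({exc})"
    return parse_rcode(response)
```

Calling `check_resolver("8.8.8.8", "slack.com")` and `check_resolver("1.1.1.1", "slack.com")`, then repeating the call with the IP address of your default resolver, gives a quick comparison. A `SERVFAIL` from one resolver alongside `NOERROR` from the others suggests that the failing resolver is still serving a bad cached answer.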

Slack has now confirmed the outage was “caused by our own change and not related to any third-party DNS software and services.” This was related to Slack’s TTL allowing for caching of responses for up to two days.

The lesson here? DNS might be a small service in the delivery chain, but minor configuration mistakes can take hours to recover from if your records have a large TTL. Read more about how TTL can impact DNS responses.
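The arithmetic behind that lesson is simple: a resolver that cached a bad answer may keep serving it until the TTL runs out, so the worst-case recovery time is the moment the last bad answer was handed out plus the TTL. The sketch below works through that calculation. The timestamp and the five-minute TTL are hypothetical values chosen for illustration, and the two-day TTL mirrors the caching window mentioned above.

```python
from datetime import datetime, timedelta, timezone


def worst_case_expiry(last_bad_answer_at, ttl_seconds):
    """Latest time a resolver could still serve a cached bad answer.

    Assumes resolvers honor the TTL exactly; real resolvers may cap or
    extend it, so treat the result as an estimate.
    """
    return last_bad_answer_at + timedelta(seconds=ttl_seconds)


# Hypothetical moment at which the faulty record was corrected.
fixed_at = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)

short_ttl = worst_case_expiry(fixed_at, 300)           # 5-minute TTL
long_ttl = worst_case_expiry(fixed_at, 2 * 24 * 3600)  # 2-day TTL

print("5-minute TTL clears by:", short_ttl.isoformat())
print("2-day TTL clears by:   ", long_ttl.isoformat())
```

The same mistake that clears within minutes under a short TTL can linger for up to two days under a long one, which is the trade-off to weigh when choosing TTL values for critical records.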

Understand How to Resolve DNS Issues More Quickly

DNS is at the core of the Internet. If the process of DNS resolution fails, users will experience outages such as this one. Observing the DNS of all your essential SaaS services from the cloud, backbone and last mile is essential to understanding the true performance of DNS. The fix is easy, but only if you know what needs to be fixed!

Watch our DNS How-To video series to find out how to verify DNS server mapping, along with other DNS-related tips!

For further information on major incidents in 2021, please check out our new report. You’ll find detailed analysis, as well as a checklist of best practices to prevent, prepare for, and respond to an outage.

Download “2021 Internet Outages: A compendium of the year’s mischiefs and miseries – with a dose of actionable insights.”
