Facebook outage analysis
Wednesday, March 13, 2019, was an interesting day: the world got to see what happens when a social media giant that is also a technology leader goes down. The Facebook outage affected users around the world for an extended period.
At Catchpoint, we monitor and benchmark the availability and performance of major websites across multiple segments. Both are key to delivering an amazing digital experience to end users.
First warning sign: a micro-outage
The first alert indicating that Facebook was having issues came in at 4:02:40 UTC. This micro-outage lasted around 30 minutes and didn’t have a widespread impact.
During this period, we saw Facebook servers return an HTTP 503 error.
When we receive notifications like this, we follow a few troubleshooting steps:
- We look at other social networks to see if users are complaining.
- We manually try loading the site to see if we experience issues.
At this time, we found no matching complaints from users on other social networks, there was no news of a Facebook outage, and we were able to open both Facebook and Instagram. We decided to keep an eye on the situation.
Critical alerts received
We started receiving critical alerts again at 3/13/2019 16:06:58 UTC. This time, the server was returning an HTTP 500 error.
The issue was global. It was not tied to any particular ISP or region. Based on our network monitoring data, we quickly ruled out a network-level issue with DNS or BGP.
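As a rough illustration of the "global, not tied to an ISP or region" check (this is a sketch, not Catchpoint's actual tooling), one can group check results by city and by ISP and see whether every group is failing. The sample rows, the ISP names, the `failure_rate_by` and `looks_global` helpers, and the 50% threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical check results: (city, isp, http_status). Not real measurements.
SAMPLE_RESULTS = [
    ("New York", "ISP-A", 500),
    ("New York", "ISP-B", 500),
    ("London", "ISP-A", 500),
    ("London", "ISP-C", 200),
    ("Tokyo", "ISP-B", 500),
    ("Tokyo", "ISP-C", 500),
]


def failure_rate_by(results, key_index):
    """Return {group: fraction of checks that got a 5xx status}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for row in results:
        group = row[key_index]
        totals[group] += 1
        if 500 <= row[2] <= 599:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


def looks_global(results, threshold=0.5):
    """True when every city and every ISP fails at or above the threshold."""
    by_city = failure_rate_by(results, 0)
    by_isp = failure_rate_by(results, 1)
    rates = list(by_city.values()) + list(by_isp.values())
    return all(rate >= threshold for rate in rates)


print(failure_rate_by(SAMPLE_RESULTS, 0))
print(looks_global(SAMPLE_RESULTS))
```

If failures were concentrated in one ISP or one region, some groups would show a rate near zero and the check would return False, which would point toward a network-level cause rather than an application-level one.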
Impact on end-user experience
While we waited for Facebook to post an update on the outage, we started looking at the impact on end-user experience.
Users across the globe were impacted. While users in some cities on certain ISPs could load at least the homepage, the majority of users could not.
We then tried to understand why the outage was spotty, that is, why the site wasn’t hard down.
The chart above shows that users in some city and ISP combinations, such as Milan (Telia), Atlanta (Zayo), and Mumbai (Telia), were able to view the homepage. We looked at the IPs serving these nodes to understand whether this was specific to certain VIPs, PoPs, or servers.
We noted that some servers never returned failures.
For the locations with no failures, we analyzed the server IPs to see whether those IPs produced 100% successful page loads everywhere. They did not.
This also didn’t look like a DDoS attack, as we didn’t see servers go down one after the other or the service degrade gradually on a specific server.
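The per-server analysis described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample page loads are made up, the IP addresses come from the RFC 5737 documentation range, and `success_rate_by_ip` and `split_by_behaviour` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical (server_ip, succeeded) pairs. The IPs are documentation
# addresses (RFC 5737), not real Facebook servers.
SAMPLE_LOADS = [
    ("192.0.2.10", True),
    ("192.0.2.10", True),
    ("192.0.2.20", True),
    ("192.0.2.20", False),
    ("192.0.2.30", False),
    ("192.0.2.30", False),
]


def success_rate_by_ip(loads):
    """Return {server_ip: fraction of page loads that succeeded}."""
    attempts = Counter(ip for ip, _ in loads)
    successes = Counter(ip for ip, ok in loads if ok)
    return {ip: successes[ip] / attempts[ip] for ip in attempts}


def split_by_behaviour(rates):
    """Split IPs into always-successful, mixed, and always-failing."""
    always_ok = sorted(ip for ip, rate in rates.items() if rate == 1.0)
    mixed = sorted(ip for ip, rate in rates.items() if 0.0 < rate < 1.0)
    always_failed = sorted(ip for ip, rate in rates.items() if rate == 0.0)
    return always_ok, mixed, always_failed


rates = success_rate_by_ip(SAMPLE_LOADS)
print(split_by_behaviour(rates))
```

An IP that lands in the "mixed" group is the interesting case: it serves some requests successfully and fails others, which is the kind of spotty behavior that a clean per-server failure would not explain.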
Communication during the Facebook outage
Facebook tweeted for the first time at 10:49 AM PT and acknowledged the issue.
That was more than 1.5 hours after Catchpoint first caught the issue at 9:06 AM, and after Facebook’s users had already taken to Twitter to complain.
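The gap between detection and acknowledgement can be checked with a little date arithmetic. The sketch below assumes Pacific time was UTC-7 on 3/13/2019 (daylight time was in effect, which is consistent with 16:06:58 UTC mapping to 9:06 AM), and it treats the tweet time as exactly 10:49:00 because only the minute is known.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific time on 3/13/2019 was UTC-7.
PACIFIC = timezone(timedelta(hours=-7), "PT")

# First critical alert, as reported in UTC.
first_alert_utc = datetime(2019, 3, 13, 16, 6, 58, tzinfo=timezone.utc)
first_alert_pt = first_alert_utc.astimezone(PACIFIC)

# First Facebook tweet; seconds are unknown, so 10:49:00 is assumed.
first_tweet_pt = datetime(2019, 3, 13, 10, 49, tzinfo=PACIFIC)

gap = first_tweet_pt - first_alert_pt

print(first_alert_pt.strftime("%H:%M:%S %Z"))
print(gap)
```

Under these assumptions the gap comes out at about 1 hour 42 minutes, in line with the "over 1.5 hours" figure.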
We also took a look at how Facebook was handling user experience during the outage. Most organizations put up a maintenance page or a “Sorry, we will be back soon” page.
During the outage, users saw a variety of error pages.
Some saw:
And a few others saw:
Those who checked too often to see whether Facebook was back up were shown the message below:
This, together with the lack of communication about what was going on, resulted in outrage among users.
Frequent updates on what was going on would have helped the end-user experience. Facebook did assure users that it wasn’t a DDoS attack.
Incident resolution
Facebook fixed the issue around 15:00 PT. After an incident is resolved, it is important to look at how the remediation takes effect:
1. Are all users able to load the page successfully at the same time?
2. Are there any corner cases that still need to be addressed?
Below are the last failure timestamps for each country:
United Kingdom – 3/13/2019 14:58:11 PT
Switzerland – 3/13/2019 15:35:50 PT
Canada – 3/13/2019 18:55:31 PT
United States – 3/14/2019 01:10:16 PT
Hong Kong – 3/14/2019 05:40:53 PT
Australia – 3/13/2019 22:42:19
India – 3/14/2019 06:44:26 PT
Singapore – 3/14/2019 04:41:55
In the United States, failures continued into this morning.
The last failure we noted there was in Philadelphia at 3/14/2019 01:10:16 PT.
Across the globe, the last failure we noted was from the Delhi (Airtel) node in India at 3/14/2019 06:44:26 PT.
Facebook tweeted at 9:24 AM on Thursday that a server configuration issue had caused the problems. The total downtime window noted in Catchpoint ran from 3/13/2019 09:06:58 to 3/14/2019 06:44:26 PT, more than 21 hours.
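To see how unevenly the fix rolled out, the last-failure timestamps listed earlier can be sorted and compared with the first critical alert. This is a minimal sketch: it assumes every timestamp in the list is in the same time zone (the source labels only some of them), and it uses 09:06:58 as the start of the window.

```python
from datetime import datetime

FORMAT = "%m/%d/%Y %H:%M:%S"

# Last-failure timestamps per country, as listed in this post.
# Assumption: all of them share one time zone.
LAST_FAILURES = {
    "United Kingdom": "3/13/2019 14:58:11",
    "Switzerland": "3/13/2019 15:35:50",
    "Canada": "3/13/2019 18:55:31",
    "United States": "3/14/2019 01:10:16",
    "Hong Kong": "3/14/2019 05:40:53",
    "Australia": "3/13/2019 22:42:19",
    "India": "3/14/2019 06:44:26",
    "Singapore": "3/14/2019 04:41:55",
}

# Start of the downtime window (the first critical alert).
FIRST_FAILURE = "3/13/2019 09:06:58"


def parse(stamp):
    """Parse a month/day/year timestamp into a naive datetime."""
    return datetime.strptime(stamp, FORMAT)


# Countries ordered from earliest to latest recovery.
recovery_order = sorted(LAST_FAILURES, key=lambda country: parse(LAST_FAILURES[country]))
last_country = recovery_order[-1]

# Length of the window from the first alert to the last failure anywhere.
window = parse(LAST_FAILURES[last_country]) - parse(FIRST_FAILURE)
hours, remainder = divmod(int(window.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)

print(recovery_order)
print(f"{last_country}: {hours}h {minutes}m {seconds}s after the first alert")
```

Under that assumption, the United Kingdom recovered first and India last, with the final failure arriving a little over 21.5 hours after the first alert.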
2018 saw some major outages; this is the biggest outage of 2019 so far.
Lesson learned:
Outages happen, even to the best and biggest companies out there. Customer notification and communication are key in moments like this.