Incident Review – Google Outage
When something as ubiquitous as Google goes down, there is an online frenzy, with users tweeting and searching for updates on the issue. That’s exactly what we witnessed today between 9/24/2020 17:59:44 PDT and 9/24/2020 18:23:20 PDT. Multiple Google services, including Mail, Drive, Meet, and Hangouts, experienced downtime.
Frustrated users took to Twitter to report the outage and the tweets were captured by Websee.
Users trying to access Google services got a 502 error screen.
Incident Timeline
At Catchpoint, we synthetically monitor various Google services like Meet, Hangouts, Calendar, Drive, and Mail. Login to Google was impacted, and hence users didn’t have access to the suite of services.
We received the first alert from our monitoring at 9/24/2020 17:59:51, and the data showed that the issue began at 9/24/2020 17:59:44.
Fig 1: Downtime detected across all nodes
Fig 2: Outage Scatterplot
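As a quick sanity check on the timeline, the timestamps quoted in this review can be turned into a detection delay and an outage duration. This is a minimal sketch: the variable names are ours, and it assumes all three timestamps are expressed in the same time zone.

```python
from datetime import datetime

# Timestamp format used in this review, e.g. "9/24/2020 17:59:44".
FMT = "%m/%d/%Y %H:%M:%S"

def parse(ts: str) -> datetime:
    """Parse a timestamp as written in this review (naive, same zone assumed)."""
    return datetime.strptime(ts, FMT)

onset = parse("9/24/2020 17:59:44")        # first failing test run
first_alert = parse("9/24/2020 17:59:51")  # first alert received
recovery = parse("9/24/2020 18:23:20")     # end of the outage window

detection_delay = first_alert - onset
outage_duration = recovery - onset

print(detection_delay)  # 0:00:07
print(outage_duration)  # 0:23:36
```

The alert fired seven seconds after the onset, and the outage window lasted a little under 24 minutes.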
The Google status page reported the issue shortly afterward.
Fig 3: Google Status Dashboard
The outage was widespread: the impact was seen from offices in Bangalore, Los Angeles, New York, and Boston.
Fig 4: Global Impact
The issues seen by the end users can be broadly categorized into two groups:
1. The server returning a 502 HTTP response code.
Fig 5: Waterfall and Header details showing 502 Error
2. Connection timeout to the server and high latency.
Fig 6: High Latency
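The two groups above can be expressed as a small classification rule over individual test results. The sketch below is illustrative only: `ProbeResult`, `classify`, and the 5000 ms latency threshold are hypothetical names and values chosen for this example, not part of any monitoring product's API.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoff for "high latency"; tune this against your own baseline.
LATENCY_THRESHOLD_MS = 5000.0

@dataclass
class ProbeResult:
    status_code: Optional[int]    # None when no HTTP response was received
    response_ms: Optional[float]  # None when the request never completed
    timed_out: bool = False

def classify(result: ProbeResult) -> str:
    """Bucket a single test run into the failure groups described above."""
    if result.status_code == 502:
        return "http_502"
    if result.timed_out or result.status_code is None:
        return "timeout_or_high_latency"
    if result.response_ms is not None and result.response_ms > LATENCY_THRESHOLD_MS:
        return "timeout_or_high_latency"
    return "ok"

print(classify(ProbeResult(status_code=502, response_ms=120.0)))
print(classify(ProbeResult(status_code=None, response_ms=None, timed_out=True)))
print(classify(ProbeResult(status_code=200, response_ms=9000.0)))
print(classify(ProbeResult(status_code=200, response_ms=300.0)))
```

Checking the 502 case first keeps a slow 502 response in the first group rather than counting it as a latency problem.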
Looking at the server IP breakdown, we noted that certain IP ranges were impacted. Here is a breakdown of the IPs of the impacted servers per city:
Fig 7: IP Breakdown by City
What was interesting to note was that some of the servers that historically serve certain cities did not receive any requests during the outage.
Fig 8: Change in Servers
Baselines are thus very important, as they help us identify anomalies. For a highly distributed network like Google's, this level of visibility gets you one step closer to the source of the issue.
The traceroute tests that were running in parallel also offered some useful insights. Before the outage, we noted no packet loss at the Google AS.
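One way to make this kind of baseline comparison concrete is to keep, per city, the set of server IPs seen historically and diff it against the set seen during the incident. The sketch below is an assumption-laden illustration: `missing_servers` is a hypothetical helper, and the addresses come from the reserved documentation ranges, not from the data behind the figures.

```python
from typing import Dict, List, Set

def missing_servers(
    baseline: Dict[str, Set[str]],
    observed: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Return, per city, baseline server IPs that received no requests."""
    gaps: Dict[str, List[str]] = {}
    for city, expected_ips in baseline.items():
        absent = expected_ips - observed.get(city, set())
        if absent:
            gaps[city] = sorted(absent)
    return gaps

# Placeholder data (documentation address ranges), not measured values.
baseline = {
    "Boston": {"192.0.2.10", "192.0.2.11"},
    "Bangalore": {"198.51.100.5"},
}
during_outage = {
    "Boston": {"192.0.2.10"},
    "Bangalore": {"198.51.100.5", "198.51.100.9"},
}

print(missing_servers(baseline, during_outage))
```

Swapping the two arguments would instead list servers that appeared only during the incident, which can hint at traffic being rerouted.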
Fig 9: Traceroute before Outage
However, during the outage, we saw packet loss at the Google AS.
Fig 10: Traceroute during Outage
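The before-and-after comparison can be reduced to a single number: the share of traceroute probes lost at hops belonging to a given AS. The sketch below assumes each hop has already been annotated with its AS number and probe counts; `Hop` and `loss_in_as` are hypothetical names, and the AS numbers are reserved documentation values rather than real network identifiers.

```python
from typing import List, NamedTuple

class Hop(NamedTuple):
    asn: int       # AS number the hop belongs to
    sent: int      # probes sent to this hop
    received: int  # replies received from this hop

def loss_in_as(hops: List[Hop], asn: int) -> float:
    """Percentage of probes lost across all hops in the given AS."""
    in_as = [hop for hop in hops if hop.asn == asn]
    sent = sum(hop.sent for hop in in_as)
    received = sum(hop.received for hop in in_as)
    if sent == 0:
        return 0.0
    return 100.0 * (sent - received) / sent

TARGET_ASN = 64501  # placeholder documentation ASN

before = [Hop(64500, 3, 3), Hop(64501, 3, 3), Hop(64501, 3, 3)]
during = [Hop(64500, 3, 3), Hop(64501, 3, 1), Hop(64501, 3, 0)]

print(round(loss_in_as(before, TARGET_ASN), 2))
print(round(loss_in_as(during, TARGET_ASN), 2))
```

Note that intermediate routers often rate-limit replies to traceroute probes, so loss at a single hop is weaker evidence than loss that persists through to the destination.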
We were actively monitoring the outage on social media as well. Urs Hölzle, senior vice president of technical infrastructure and Google Fellow at Google, tweeted about the root cause: “As has been noticed, several Google services were down for some users from 6:00 to 6:23 p.m. PDT. A pool of servers that route traffic to application backends crashed and users on that particular pool experienced the outage.”
This is the fourth outage or service disruption for Google services in September 2020, following issues on September 18 (Google Chat), September 15 (Google Drive), and September 8 (Google Drive). User experience has certainly been impacted, and service reliability remains a big question for all companies providing services.
Summary
We rely on important services like Google Drive, Mail, Calendar, Maps, and YouTube, which are always in demand and used by almost everyone who has internet access on any device. In the current global situation, with the majority of the workforce working from home, these tools are even more crucial for communication and collaboration. The whole delivery chain and infrastructure have to scale up to cope with the surge in active users accessing these services and ensure that the user experience is not impacted. When services like Google go down, the consequences are immediate. The incident was resolved by 6:23 p.m. PDT, but the impact on end-user experience was significant.
Google has always paved the way in the field of reliability and monitoring. A number of SRE practices and philosophies originated at Google. But when it comes to technology, things are bound to break; this is simply the inherent nature of technology. With Google services dominating a large part of our daily life, any impact on the services is amplified. Handling an outage of this magnitude is a testament to the reliability, operations, network, and monitoring systems at Google.