Anatomy of the Recent Salesforce Outage
At Catchpoint, I work as a Solutions Engineer. Being on the sales side, one of the applications I use a lot is Salesforce, the CRM platform used at Catchpoint and thousands of other organizations. According to Catchpoint’s Endpoint data, Salesforce is my fourth most visited site.
When Salesforce had an outage for about five hours yesterday, the first thought that came to mind was “Thank God it’s not quarter-end or, worse, year-end.” Quarter-end is when sales teams are hustling to close deals. Year-end, the last quarter, is when every deal matters because it affects the company’s revenue. Salesforce is critical for sales teams during this time, and its availability and performance directly impact not only employee productivity but also business results. Since efficiency is key in any organization, our sales processes are automated using Salesforce. We rely on Salesforce for requesting a service order, routing contracts for signature, getting billing information, and so on. So yes, a Salesforce outage can result in missed quarterly targets!
Working for a monitoring company also means I have the fortune of getting notified about any issue before it impacts users. We have implemented a sound monitoring strategy for our systems and applications.
As our CEO Mehdi Daoudi rightly puts it, “We drink our own champagne at Catchpoint”.
We leverage Active, Real User, Endpoint, and Network monitoring to monitor our backend systems, our customer-facing portal, our APIs and web services, and every SaaS application used in the organization. We go into the details of implementing an effective monitoring strategy for Salesforce in one of our previous blog posts.
Full visibility and performance control of an application is best achieved by leveraging multiple data sources, giving you complementary perspectives.
At a high level, our monitoring strategy for Salesforce can be summarized in the diagram below:
Our overview dashboard for Salesforce (Fig 1) combines various data sources to give us a holistic view into Salesforce availability and performance.
Fig 1: Salesforce Overview Dashboard.
The dashboard shows that Salesforce was hard down from May 11, 2021, 14:03:42 to May 11, 2021, 19:19:48.
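As a quick sanity check on the length of that window, the two timestamps above can be subtracted directly. This is only a small sketch: the function name `outage_duration` is made up for illustration, and the time zone is left out because the dashboard reading does not state one.

```python
from datetime import datetime

# Outage window as read off the dashboard (time zone not stated in the post).
OUTAGE_START = datetime(2021, 5, 11, 14, 3, 42)
OUTAGE_END = datetime(2021, 5, 11, 19, 19, 48)


def outage_duration(start: datetime, end: datetime) -> str:
    """Return the elapsed time between two timestamps as 'Hh Mm Ss'."""
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


print(outage_duration(OUTAGE_START, OUTAGE_END))  # 5h 16m 6s
```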
Let me take you through the perspectives we get from combining proactive and reactive monitoring.
- Active Monitoring from Catchpoint Public and Enterprise Nodes: We use this to proactively monitor end-to-end user journeys. Before the pandemic, we monitored only from on-prem nodes deployed in our branch offices. Post-COVID, we added monitoring from backbone and broadband nodes, since employees are working from home and accessing SaaS applications over the public internet.
This is the monitoring that tells us there is an issue even before our IT teams see a ticket from an employee.
Here is the Active monitoring data for a critical user journey for Salesforce from yesterday. The red diamonds show the outage (Fig 2).
Fig 2: Salesforce outage.
We were also able to drill down to know exactly what was going on (Fig 3).
Fig 3: HTTP 504 error on the landing page.
Salesforce Home, the landing page you see after you log in, failed to load, with the server returning an HTTP 504 error. Users saw the maintenance page.
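To make the idea of a proactive check concrete, here is a minimal sketch of how a single synthetic probe might turn a response into a pass or fail result. It is not how Catchpoint nodes are implemented. The names `classify_response` and `run_check`, the labels "failed", "slow", and "ok", and the 5-second threshold are all assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Optional


def classify_response(status_code: Optional[int], elapsed_ms: float,
                      slow_threshold_ms: float = 5000.0) -> str:
    """Map one check result to 'failed', 'slow', or 'ok'.

    status_code is None when no HTTP response was received at all.
    The threshold is an illustrative default, not a recommended value.
    """
    if status_code is None or status_code >= 400:
        return "failed"
    if elapsed_ms > slow_threshold_ms:
        return "slow"
    return "ok"


def run_check(url: str, timeout_s: float = 10.0) -> str:
    """Request a URL once and classify the outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status: Optional[int] = response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx; the code is still available.
        status = error.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        status = None
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return classify_response(status, elapsed_ms)


# A 504 like the one in Fig 3 is classified as a failure.
print(classify_response(504, 1200.0))  # failed
```

Running a check like this on a schedule from several locations is what lets a failure show up before the first employee ticket arrives.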
- Endpoint Monitoring: Every employee at Catchpoint has the Catchpoint endpoint agent deployed on their laptop. This gives us data on the performance and availability of every SaaS application from the real user's perspective. We also love the power of Synthetic monitoring, so we run proactive tests for critical applications from the endpoint agents. Yes, you read that right: our endpoint agent can collect data both reactively and proactively!
During the outage yesterday, we saw a dip in page visits to Salesforce (Fig 4) since the application was down at the landing page itself.
Fig 4: Dip in page visits.
Looking at data from the last three days, the Salesforce outage fell in a period when the application is normally used a lot. You can see the surge (Fig 5) in page views after the outage. It looks like several of my colleagues had to put in some late hours to wrap up work yesterday!
Fig 5: Outage at peak hours.
It doesn’t stop there. With endpoint monitoring, we can also answer questions such as:
- Who was impacted?
- Where were they accessing the application from?
- Is it their network or is it the application?
- What parts of the application are impacted?
Sometimes only parts of an application are slow or down, and having this level of granularity helps IT teams better assist employees when there is an issue.
Endpoint data below shows employees were seeing high response times (Fig 6) and errors across all parts of the Salesforce application.
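One way to picture that granularity is to group request samples by application area and compare error rates. The sketch below is an assumption-laden illustration, not the endpoint agent's logic: the sample shape, the area names, the function names, and the 50% threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One sample: (application area, whether the request succeeded).
Sample = Tuple[str, bool]


def error_rate_by_area(samples: Iterable[Sample]) -> Dict[str, float]:
    """Compute the fraction of failed requests for each application area."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for area, succeeded in samples:
        totals[area] += 1
        if not succeeded:
            failures[area] += 1
    return {area: failures[area] / totals[area] for area in totals}


def outage_scope(rates: Dict[str, float], threshold: float = 0.5) -> str:
    """Label the situation based on how many areas exceed the threshold."""
    failing = [area for area, rate in rates.items() if rate >= threshold]
    if not failing:
        return "healthy"
    if len(failing) == len(rates):
        return "full outage"
    return "partial outage"


# Made-up samples; the area names are placeholders.
samples = [("home", False), ("home", False), ("reports", True), ("reports", True)]
print(outage_scope(error_rate_by_area(samples)))  # partial outage
```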
Fig 6: High response time.
We can also see exactly who was hit the worst:
Fig 7: Outage analysis.
Outages are difficult. Thank you to the Salesforce team for working hard to resolve the issue. Below is the summary of the outage that was posted on the Salesforce status page.
As organizations adopt cloud and microservices architectures and increase their use of SaaS applications, traditional forms of monitoring using APM, NPM, logging, and tracing fall short. How can you instrument code and applications you don’t own?
Yesterday, the Salesforce status page was down for a while as well, making it difficult for organizations to know what was really going on.
Given that the objective of any SaaS provider is to ensure optimal, uninterrupted service and a great user experience, providers need to broaden their horizons and pick approaches that truly reflect what their end users are seeing. As we continue to adopt more SaaS applications, employee productivity and business results depend on the availability, performance, reachability, and reliability of these applications. Thus, it becomes crucial for organizations to have a sound monitoring strategy for these applications as well.
The past 20 years have shown us that understanding what your end users are experiencing should be the number one item on the agenda of any monitoring strategy.