When SSL issues aren’t just about SSL: A deep dive into the TIBCO Mashery outage
On October 1, 2024, TIBCO Mashery, an enterprise API management platform leveraged by some of the world’s most recognizable brands, experienced a significant outage. At around 7:10 AM ET, users began encountering SSL connection errors that appeared straightforward at first glance.
Internet Sonar, one of the tools in our Internet Performance Monitoring (IPM) arsenal, successfully captured the incident. While other solutions may have missed it, Internet Sonar was able to pinpoint the issue because it monitors all the layers of the Internet Stack, including DNS, SSL, response times, and reachability from “eyeball” user networks. This comprehensive view revealed that the root cause wasn’t an SSL failure, but a DNS misconfiguration affecting access to key services.
What happened?
The SSL error in the browser (shown in the image below) indicates that the certificate points to pantheonsite.io.
Looking at the details in the Catchpoint platform, we observed the same issue.
When we attempted to connect to developer.mashery.com, DNS resolution succeeded, but it directed the connection to an IP address that did not identify the Mashery domain. In some regions, the connection still worked and returned the correct certificate.
The correct SSL handshake should have seen mashery.com as the Common Name or Subject Alternative Name (CN/SAN) as we can see in the screenshot below.
An SSL test confirmed the failure: we could not connect to the site because the certificate’s CN/SAN did not match the requested hostname.
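You can see the same mismatch outside of a monitoring platform. The minimal Python sketch below (illustrative only, not how Catchpoint’s SSL test is implemented) performs a TLS handshake against the affected hostname and prints either the certificate’s SANs or the verification error; the hostname and port are assumptions based on this incident.

```python
import socket
import ssl

HOST = "developer.mashery.com"  # hostname from this incident
PORT = 443

# create_default_context() enables certificate and hostname verification,
# so a CN/SAN mismatch surfaces as an SSLCertVerificationError.
context = ssl.create_default_context()

try:
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            cert = tls.getpeercert()
            # subjectAltName lists the DNS names the certificate is valid for
            sans = [value for key, value in cert.get("subjectAltName", ()) if key == "DNS"]
            print(f"Handshake OK; certificate SANs: {sans}")
except ssl.SSLCertVerificationError as exc:
    # This is the failure mode seen during the outage: the server presented a
    # certificate for pantheonsite.io instead of mashery.com.
    print(f"Certificate verification failed: {exc}")
```

During the outage window, this check would have raised a hostname-mismatch error; once the DNS fix propagated, the same code reports the expected mashery.com SANs.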
Understanding the Incident
Initially, users encountered SSL errors caused by connections being directed to a Pantheon IP (23.185.0.3) instead of the expected AWS ELB IPs (54.160.170.229, 54.235.15.197, and 44.211.103.199). This misrouting occurred when recursive resolvers queried the name servers "ns65.worldnic.com" and "ns66.worldnic.com." In contrast, alternative DNS resolvers like 8.8.8.8 (Google) and 1.1.1.1 (Cloudflare) correctly directed traffic to AWS.
We can see the same in this DNS Experience Test record below:
This chart shows how resolution should have been working (some cities where it was resolving correctly):
Query using Google DNS Resolver (8.8.8.8)
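To reproduce this kind of comparison yourself, the rough sketch below (using the third-party dnspython library, not Catchpoint’s DNS Experience Test) resolves the domain against the system resolver, Google, and Cloudflare, and flags any answer that falls outside the AWS ELB IPs mentioned above.

```python
# Requires the third-party dnspython package: pip install dnspython
import dns.resolver

DOMAIN = "developer.mashery.com"
# Expected AWS ELB IPs reported in this incident
EXPECTED_AWS_IPS = {"54.160.170.229", "54.235.15.197", "44.211.103.199"}

def resolve_a_records(nameserver=None):
    """Resolve A records for DOMAIN, optionally via a specific recursive resolver."""
    resolver = dns.resolver.Resolver()
    if nameserver:
        resolver.nameservers = [nameserver]
    answer = resolver.resolve(DOMAIN, "A")
    return {rr.address for rr in answer}

for label, server in [("system default", None), ("Google", "8.8.8.8"), ("Cloudflare", "1.1.1.1")]:
    ips = resolve_a_records(server)
    status = "OK" if ips & EXPECTED_AWS_IPS else "MISMATCH (possible misrouting)"
    print(f"{label:>14}: {sorted(ips)} -> {status}")
```

Run from a single vantage point this only tells part of the story, which is exactly why the geographic variability described next matters.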
Crucially, the issue manifested differently across geographical locations, likely due to geo-IP based configurations affecting how DNS records were served. This variability underlines the importance of a global monitoring strategy: relying solely on monitoring from a single cloud instance would never have captured the full scope of the problem. It’s imperative to get close to eyeball networks to truly understand how users experience services across regions.
As the hours progressed, we witnessed DNS errors such as '101 Not Implemented,' 'Query Refused,' and 'Server Failure' from different parts of the world, indicating ongoing changes within the system. Catchpoint’s DNS monitoring captured these issues, and after almost 4.5 hours the problem began to resolve as corrected name server and A record changes propagated globally.
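If you want to surface these response codes directly rather than just seeing a failed lookup, a small dnspython sketch like this one (again illustrative, with the resolver IP chosen arbitrarily) sends a raw query and prints the DNS RCODE the server returns.

```python
# Requires dnspython: pip install dnspython
import dns.message
import dns.query
import dns.rcode

DOMAIN = "developer.mashery.com"
RESOLVER = "8.8.8.8"  # any recursive resolver you want to probe

query = dns.message.make_query(DOMAIN, "A")
response = dns.query.udp(query, RESOLVER, timeout=5)

# rcode() exposes the raw DNS response code: NOERROR, SERVFAIL, REFUSED,
# NOTIMP, and so on -- the same classes of errors observed during the outage.
code = dns.rcode.to_text(response.rcode())
print(f"{RESOLVER} answered with rcode {code}")
if code != "NOERROR":
    print("Resolution problem detected; alert or retry against another resolver.")
```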
The real-world impact
For users relying on Mashery for seamless API management, this incident had serious consequences. Requests were inconsistently routed—some ending up at Fastly IPs instead of the intended AWS ELB—potentially leading to service disruptions. This highlights the fragile nature of Internet infrastructure, where a single DNS misconfiguration can immediately impact user experience and service availability.
What seemed like a simple SSL error at first quickly revealed a much bigger issue. The incident exposed critical weaknesses in DNS reliability, SSL configurations, and CDN performance. It’s yet another reminder that the Internet is deeply interconnected, and problems can appear in one region while other areas remain unaffected, making global visibility and proactive monitoring essential.
What we can learn from the Mashery outage
The Mashery outage reveals a crucial lesson: SSL errors can be just the tip of the iceberg. The real issue often lies deeper, like in this case, with a DNS misconfiguration. If DNS isn’t properly configured or monitored, the entire system can fail, and what seems like a simple SSL error can spiral into a much bigger problem.
This incident is a wake-up call. The interconnected nature of the Internet means that a single point of failure—like DNS—can disrupt services across the globe. Geographic differences only make it harder to detect and resolve these issues, which is why a global monitoring strategy is essential. To truly safeguard against the fragility of the Internet, you need full visibility into every layer of the Internet Stack, from DNS to SSL and beyond.
Monitor the entire Internet Stack to stay ahead of outages
This incident underscores the necessity of monitoring every layer of the Internet Stack—DNS, SSL, CDN, and third-party services. By using robust IPM tools like Internet Sonar, companies can achieve resilience across all these dependencies.
Internet Sonar provides:
- Unparalleled worldwide and regional visibility leveraging Catchpoint’s Global Observability Network with over 2700 nodes from more than 300 providers in over 100 countries – with more being added all the time.
- Hundreds of the most popular Internet services monitored, including Internet Infrastructure (CDN, DNS, Cloud, ISP), SaaS (email, SaaS, UCaaS, SECaaS), and MarTech (Ad serving, Analytics, Video).
- Real-time email alerts as well as webhook or API access for easy integration into any application.
- Automatic, AI-powered data correlation with active monitoring for simple, real-time status information.
Today it was Mashery; tomorrow, it could be your service. The need for strong, continuous monitoring practices cannot be overstated.
Check out our demo hub to see Internet Sonar at work, or contact us to learn more.