BGP Beyond Preventing Outages

Published
April 13, 2023

The Importance of BGP

Since its initial conception on the back of a napkin, BGP has been an essential part of the Internet. However, its ubiquity and simplicity also make it a potential weak spot in any organization's Internet stack. As an open, near-universal protocol, it's a vector for malicious attacks. It can also cause just as many problems through simple misconfiguration (in fact, telling the difference between the two can be a challenge in and of itself). But one thing BGP is often undervalued for is its use in network monitoring and issue analysis. I was recently reading this white paper about preventing outages and noticed a few specific points around BGP that I thought should be highlighted.

Malice or Misconfiguration?

When there's a BGP problem, it can be hard to tell if it's a deliberate attempt at harm by a malicious third party or just something that wasn't properly configured. This is a particular problem because BGP was never designed with any sort of security in mind at all:

“BGP hijacking may be the result of a configuration mistake or a malicious act; in either case it is an attack on the common routing system that we all use… The problem is, BGP was created long before security was a major concern. BGP assumes that all networks are trustworthy. Technically, there are no built-in security mechanisms to validate that routes are legitimate.”

Megan Kruse, Director, Partner Engagement and Communications, Internet Society

BGP hijacks can be an incredibly serious security threat, especially for financial organizations, healthcare providers or anyone who absolutely must keep customer data secure. After all, if you're experiencing a hijack, where is all that data going? And what kind of financial or regulatory penalties are you likely to incur if you aren't able to answer that question? A fast response and resolution become paramount in these cases.

Between malice and misconfiguration, it's usually the latter. But looking directly at BGP data will tell you for sure. As discussed in the Preventing Outages white paper, Telia, a major backbone carrier in Europe, suffered from a network routing issue. By reviewing the IPv4 and IPv6 routes announced and withdrawn when the outage began, the problem was traced to a misconfiguration in the Telia Carrier IP Core network. A simple rollback to an earlier version of the routing policy resolved the issue and services were gradually restored. While it's impossible to avoid BGP misconfiguration completely, network operators should follow common-sense rules and apply the best practices advocated by MANRS (Mutually Agreed Norms for Routing Security). This will help minimize the chances of a BGP misconfiguration.
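To make "reviewing the routes announced and withdrawn" a little more concrete, here is a minimal Python sketch of the kind of timeline you might build from BGP update records. It is not the method used in the white paper. The record shape (timestamp, "A" or "W", prefix) and the sample data are assumptions made purely for illustration, and the prefixes are documentation ranges.

```python
from collections import Counter
from datetime import datetime, timezone
import ipaddress


def bucket_updates(updates, bucket_seconds=60):
    """Count announcements ("A") and withdrawals ("W") per time bucket and address family.

    `updates` is an iterable of (unix_timestamp, kind, prefix) tuples. This
    record shape is an assumption for the example, not a standard format.
    """
    counts = Counter()
    for ts, kind, prefix in updates:
        family = f"IPv{ipaddress.ip_network(prefix, strict=False).version}"
        bucket = int(ts) - int(ts) % bucket_seconds  # round down to the bucket start
        counts[(bucket, family, kind)] += 1
    return counts


def print_timeline(counts):
    """Print one line per (bucket, family, kind), oldest bucket first."""
    for (bucket, family, kind), n in sorted(counts.items()):
        when = datetime.fromtimestamp(bucket, tz=timezone.utc).strftime("%H:%M")
        label = "announce" if kind == "A" else "withdraw"
        print(f"{when} UTC  {family}  {label}  {n}")


# Made-up sample records using documentation prefixes.
sample = [
    (1700000000, "A", "192.0.2.0/24"),
    (1700000010, "W", "192.0.2.0/24"),
    (1700000020, "W", "2001:db8::/32"),
    (1700000075, "W", "198.51.100.0/24"),
]
timeline = bucket_updates(sample)
print_timeline(timeline)
```

A sudden jump in withdrawals in one bucket, across both address families, is the sort of pattern that points to when an incident started.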

Impacts Beyond BGP  

In October of 2021, a routine maintenance job performed by Facebook staff backfired, taking down all the connections in their backbone network. Consequently, the Facebook routers couldn’t speak to their data centers. This triggered a safety mechanism in which the BGP routes towards their DNS servers were withdrawn from the network. Details can be found in the Preventing Outages white paper, but this example is particularly noteworthy because of what else was impacted by the BGP issue. There's (unconfirmed) evidence that the outage resolution was slowed by IT staff being unable to access server rooms because their access badges no longer worked. More importantly, a logic problem in automation led to all their DNS servers being taken out of the BGP announcement, because the servers were not in a proper state. The result was that there were no DNS servers available, and Facebook couldn’t do the simplest and most important thing any organization should during an outage: issue a notice to their users explaining that they are down and working on fixing the situation.
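The failure mode described above is easy to miss because each server's rule looks sensible on its own. The toy Python sketch below illustrates the idea only. It is not Facebook's actual automation, and the server names and the `should_announce` rule are hypothetical.

```python
def should_announce(server_can_reach_datacenter):
    # Hypothetical per-server rule: only advertise the DNS route while healthy.
    return bool(server_can_reach_datacenter)


def announcing_servers(reachability):
    """Return the sorted names of servers that would keep announcing their route."""
    return sorted(name for name, ok in reachability.items() if should_announce(ok))


# Normal day: one unhealthy server steps aside and the others keep serving.
normal = {"dns-a": True, "dns-b": True, "dns-c": False}

# Backbone failure: every server fails the same health check at the same time.
backbone_down = {name: False for name in normal}

print(announcing_servers(normal))
print(announcing_servers(backbone_down))
```

When the health check depends on something every server shares, a rule meant to remove one bad server can remove all of them at once.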

This isn't the direct result of a BGP issue, but it does reinforce how important BGP is to just about every single system in your network. It's also a reminder that while resolving an outage is the number one priority, taking care of your customers is a close second. Studies show that customer dissatisfaction is mitigated enormously just by knowing that their provider knows about a problem and is working on it. Even if the problem isn't fixed, customer sentiment is FAR more favorable if they're kept informed and understand that an effort was made. Remember, customers are (usually) reasonable people and not just screaming outrage monkeys: they can cut you a surprising amount of slack if they see you are trying to fix the problem. IT is (rightly) focused on making things work again, but there's tremendous value in ensuring that the users impacted by the outage are kept informed and that their pain is, at the very least, acknowledged. It costs nothing to be polite, considerate and transparent. It could also dramatically improve customer satisfaction and reduce customer churn. And transparency can bring additional benefits to the entire industry as well.

Use BGP Data to Learn from Others' Mistakes

We know as much as we do about the Facebook outage because the Facebook team publicly released a very good post-mortem analysis of the incident. In fact, that's a trend that should be applauded and emulated by everyone in the industry:

“An engineer in another department of a large company may change their own process for the better after reading about an incident written by someone in an unrelated department that didn’t directly impact them at the time. This is where distribution comes into play. At the extreme end of this, which I’m hoping we’re trending towards as an industry overall, is making postmortems public to get the maximum downstream learning impacts across the entire industry and not just within a single company.”  

John Egan, Former co-founder and Product Lead Workplace, Facebook

Post-mortems are good for the entire industry, but not every company reports details of an outage. However, BGP data is public and can often give surprising insight into the timeline of what happened and what went wrong. Time and again, the authors of our Preventing Outages white paper were able to determine an astonishing amount about the outages of a wide variety of disparate organizations simply by analyzing publicly available BGP data. And there's no reason you can't do the same! Take a detailed look at BGP data whenever a system or service goes down so you can learn either what NOT to do or how to react better if you're ever in a similar situation. Even if the organization is tight-lipped about causes, you'll be able to understand a lot of details about exactly what was impacted and where.  
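If you want to try this yourself, one publicly accessible source is the RIPEstat Data API, which exposes RIPE RIS data. The Python sketch below is a starting point under stated assumptions: the `bgp-updates` endpoint and the response field names (`updates`, `type`, `timestamp`, `attrs`, `target_prefix`, `path`) reflect my understanding of that API and should be checked against the current RIPEstat documentation. The helper names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the RIPEstat Data API documentation.
RIPESTAT_BGP_UPDATES = "https://stat.ripe.net/data/bgp-updates/data.json"


def build_url(resource, start, end):
    """Build a bgp-updates query for a prefix or ASN over a time window."""
    query = urlencode({"resource": resource, "starttime": start, "endtime": end})
    return f"{RIPESTAT_BGP_UPDATES}?{query}"


def parse_updates(payload):
    """Flatten a response into (timestamp, type, prefix, as_path) tuples.

    Field names are assumptions; missing fields become None instead of raising.
    """
    rows = []
    for update in payload.get("data", {}).get("updates", []):
        attrs = update.get("attrs", {})
        rows.append((
            update.get("timestamp"),
            update.get("type"),          # expected "A" (announce) or "W" (withdraw)
            attrs.get("target_prefix"),
            attrs.get("path"),
        ))
    return rows


def fetch_updates(resource, start, end, timeout=30):
    """Download and parse updates. Makes a network call, so it is not run here."""
    with urlopen(build_url(resource, start, end), timeout=timeout) as response:
        return parse_updates(json.load(response))
```

Calling `fetch_updates("192.0.2.0/24", "2023-01-01T00:00", "2023-01-01T01:00")` with a real prefix and the window around an outage would give you the raw material for a timeline of what was announced and withdrawn.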

Mitigate BGP Risk

So, how exactly can you view publicly available BGP data? And given all of this information, what can you do to mitigate BGP risks for your organization? This is where Catchpoint can help. First and most importantly, monitor BGP in real time with our unique global observability network, the world's largest. As of February 2023, we receive and analyze routing data from more than 140 peers on all five continents. Collected BGP data is combined with RIPE RIS and Route Views data and presented to our customers through Catchpoint’s Internet Performance Monitoring (IPM) platform. This gives you the most comprehensive view available of BGP activity anywhere in the world.

You can also take advantage of some unique Catchpoint features designed specifically to help identify and resolve BGP issues:

  • Route Hijack Detection via Catchpoint's control center library, which stores a list of customer ASNs. The BGP Overview Dashboard flags any prefix announced from unexpected ASNs, so IT is immediately alerted to potential hijacks.
  • A Customizable BGP Smartboard to identify incidents and root cause faster and with fewer clicks for improved MTTR. The Catchpoint BGP Smartboard lets IT teams investigate BGP peer event data across selected timeframes, view announcements and withdrawals, then drill down to the details of each event. The result is faster and more effective troubleshooting.
  • Advanced BGP Dashboard & Score Metrics to see the health of the networks you rely upon at a glance. Information presented includes visibility for reachability, hijacks, peer visibility, mass withdrawals, RPKI status, and BGP data by region.  
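The idea behind flagging a prefix announced from an unexpected ASN can be sketched in a few lines of Python. This is a generic illustration, not Catchpoint's implementation. The function name, the data shapes and the sample ASNs and prefixes (all from documentation ranges) are assumptions.

```python
import ipaddress


def find_unexpected_origins(expected, announcements):
    """Return announcements whose origin AS is not allowed for the prefix.

    expected:      {prefix_string: set_of_allowed_origin_asns}
    announcements: iterable of (prefix_string, as_path), where as_path is a
                   list of ASNs with the origin last.
    Each announced prefix is checked against the first expected prefix that
    covers it, so more-specific announcements are caught as well.
    """
    monitored = [(ipaddress.ip_network(p), asns) for p, asns in expected.items()]
    alerts = []
    for prefix, as_path in announcements:
        if not as_path:
            continue  # nothing to check without an origin
        net = ipaddress.ip_network(prefix)
        origin = as_path[-1]
        for covering, allowed in monitored:
            # subnet_of() requires both networks to share an IP version.
            if net.version == covering.version and net.subnet_of(covering):
                if origin not in allowed:
                    alerts.append((prefix, origin, str(covering)))
                break
    return alerts


expected = {"192.0.2.0/24": {64500}}
observed = [
    ("192.0.2.0/24", [64511, 64500]),    # expected origin
    ("192.0.2.128/25", [64511, 64501]),  # more-specific from another origin
    ("198.51.100.0/24", [64496]),        # not a monitored prefix
]
alerts = find_unexpected_origins(expected, observed)
print(alerts)
```

An origin check like this cannot tell malice from misconfiguration on its own, but it does tell you quickly that something needs a closer look.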

Catchpoint can help you mitigate the impact of BGP issues with a platform that makes troubleshooting fast and easy – all while integrating seamlessly into the applications you already use. Give Catchpoint a try for free or get in touch with us for details.
