The Changing Nature of Network Monitoring
The end user is the driving force behind the evolution of IT architectures and systems. As third parties such as DNS providers, CDN vendors, and cloud providers become more deeply integrated into the delivery chain, all in the name of getting content to the end user faster than ever, the way these systems are maintained and monitored changes with them. As a result, the IT professionals responsible for that delivery face the daunting challenge of staying up to date on new technologies and methodologies.
There’s perhaps no discipline that’s been more affected by this evolution than network engineering. Whereas network engineers, architects, and operations centers used to be able to focus on the hardware that delivers content to end users, the changing nature of delivery systems has shifted the focus to the software that runs the network.
To better understand how the networking discipline has evolved along with the technology, we spoke to two of the experts included in our list of top network professionals, Tom Hollingsworth and Amy Arnold, about how their roles have changed in recent years, what trends they see as having the most impact in the years to come, and new methodologies for managing the ever-changing systems for which they’re responsible.
“Network engineering roles break down silos and blur technology boundaries,” says Arnold. “Network engineers face the challenge of understanding a world full of overlays, clouds, SaaS providers, and containers, in order to meet the demands of rapid IoT growth, ever-expanding network perimeters, and elastic application deployment. It’s not enough these days for an engineer to understand the fundamentals of a packet and how TCP/IP transit works. We must understand the when/where/why a packet exists in a larger ecosystem of business infrastructure and applications.”
Both Arnold and Hollingsworth agree that the expanding role of cloud providers is the biggest trend that will continue to affect their responsibilities in coming years. “How on earth can you go into AWS to plug in a port?” wonders Hollingsworth. “You’re going to have to start thinking about networking being abstracted away from hardware.”
“The move to and from cloud creates interesting challenges for network engineers,” agrees Arnold. “As the myth of everything-in-a-single-cloud dies off, in its place lives a reality of hybrid environments where some services are best suited for cloud or multi-cloud, and some services best fit the traditional on-premises deployment model.”
[Figure: The application delivery chain is a complex beast.]
She also emphasizes the need to correlate the performance of these systems with the all-important business metrics that fuel nearly every decision made by organizations with public-facing digital services. “Understanding how business services are consumed remains key for designing, deploying, and most critically, troubleshooting network connectivity and performance issues,” she says.
A good example of this business-driven evolution is the decoupling of infrastructure and services. While decoupling is the best available defense against single points of failure in a world full of security and performance threats (both malicious and accidental), it has also complicated the way these systems must be monitored and maintained, making monitoring far more expensive to implement and its data more costly to collect.
“Multiple layers of network generate multiple layers of monitoring,” Arnold points out. “Overlay networking means you can’t just watch one logical path between endpoints; you must also account for the devices that make that path possible in the underlay. With redundantly designed systems for both overlay and underlay, the number of devices monitored is considerable. More often than not, the monitoring systems for those network paths are different as well, further increasing the monitoring overhead for network engineers.”
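In practice, that can mean probing the same service at two layers at once. Below is a minimal sketch of the idea in Python, checking TCP reachability of a hypothetical overlay endpoint alongside the underlay devices that carry it; the hostnames, ports, and the choice of a simple TCP test are illustrative assumptions, not a prescribed tool.

```python
import socket

# Hypothetical targets: the logical service endpoint (overlay) and the
# management interfaces of the switches beneath it (underlay).
OVERLAY_ENDPOINTS = [("app.internal.example", 443)]
UNDERLAY_DEVICES = [("leaf1.mgmt.example", 22),
                    ("leaf2.mgmt.example", 22),
                    ("spine1.mgmt.example", 22)]

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A healthy overlay path with a failing underlay device is exactly the
# kind of partial failure that single-layer monitoring misses.
for layer, targets in [("overlay", OVERLAY_ENDPOINTS),
                       ("underlay", UNDERLAY_DEVICES)]:
    for host, port in targets:
        status = "up" if tcp_reachable(host, port) else "DOWN"
        print(f"[{layer}] {host}:{port} {status}")
```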
Hollingsworth agrees with her about the difficulties created by the wider distribution of these systems. “Decoupling of infrastructure and services has made it much more difficult to implement low-level monitoring,” he says. “It’s forced the monitoring and analytics capabilities to grow up and become more robust in the applications themselves. Instead of just relying on packet headers and NetFlow data, we now are requiring our vendors to build in more capabilities to report on the entire system. It’s harder to capture that data, but it’s a much better payoff in the long run.”
Yet he insists that while teams no longer have control over many of their systems, that doesn’t mean they’re absolved of responsibility when performance issues arise. “Having the infrastructure outside your control doesn’t mean you can’t troubleshoot. It just means you have to be smarter about it,” he says. “You have to have a method, and you have to quickly test the areas that you can before you start eliminating things. Just because you can’t log into the cloud provider’s network switches doesn’t mean you can’t isolate the problem. But it does mean you have to investigate the entirety of the infrastructure. You can’t just isolate it to a slow Internet connection and call it a day.”
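As a rough illustration of that kind of methodical elimination, the sketch below times each stage of a web request that you can test from your own side, DNS resolution, TCP connect, TLS handshake, and time to first byte, to narrow down where the slowness lives. It uses only the Python standard library; the target host and the five-second timeouts are illustrative assumptions.

```python
import socket, ssl, time
import http.client

HOST, PORT, PATH = "www.example.com", 443, "/"

# Stage 1: DNS resolution
t0 = time.perf_counter()
addr = socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)[0][4][0]
t_dns = time.perf_counter() - t0

# Stage 2: TCP connect to the resolved address
t0 = time.perf_counter()
sock = socket.create_connection((addr, PORT), timeout=5)
t_tcp = time.perf_counter() - t0

# Stage 3: TLS handshake (SNI and certificate checks use the hostname)
t0 = time.perf_counter()
tls = ssl.create_default_context().wrap_socket(sock, server_hostname=HOST)
t_tls = time.perf_counter() - t0

# Stage 4: HTTP request and time to first byte of the response
t0 = time.perf_counter()
conn = http.client.HTTPSConnection(HOST, timeout=5)
conn.sock = tls                       # reuse the already-timed TLS socket
conn.request("GET", PATH)
resp = conn.getresponse()
resp.read(1)
t_ttfb = time.perf_counter() - t0

print(f"dns={t_dns*1000:.1f}ms  tcp={t_tcp*1000:.1f}ms  "
      f"tls={t_tls*1000:.1f}ms  ttfb={t_ttfb*1000:.1f}ms")
tls.close()
```

Whichever stage dominates points you at a different team or provider: slow DNS implicates the resolver chain, slow TCP or TLS the network path, and a long time to first byte the application behind it.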
Arnold adds that establishing benchmarks and baselines, and then alerting when performance trends exceed those baselines, is critical to catching performance issues and discovering the root cause in a timely manner. “Monitoring and creating baselines for business networks, applications, and even endpoints provides the best first step in an environment that leverages cloud applications and infrastructure,” she says.
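One lightweight way to put that into practice is a rolling statistical baseline: keep a window of recent measurements and alert when a new sample drifts well past the established norm. The sketch below is a minimal Python version of the idea; the window size, minimum history, and three-sigma threshold are illustrative assumptions rather than recommended values.

```python
from collections import deque
from statistics import mean, stdev

class Baseline:
    """Rolling baseline: flag samples that exceed mean + 3 * stdev."""

    def __init__(self, window: int = 288):    # e.g. a day of 5-minute samples
        self.samples = deque(maxlen=window)

    def check(self, value: float) -> bool:
        """Record a sample; return True if it breaches the current baseline."""
        breach = False
        if len(self.samples) >= 30:            # require some history first
            mu, sigma = mean(self.samples), stdev(self.samples)
            breach = value > mu + 3 * sigma
        self.samples.append(value)
        return breach

latency = Baseline()
for ms in (42 + i % 5 for i in range(40)):     # simulate normal traffic
    latency.check(ms)
print(latency.check(120.0))                    # True: well outside the baseline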
Of course, understanding the baseline performance of the applications and endpoints in addition to the network requires being able to communicate and share data across teams quickly and efficiently. “You have to be willing to bring in other teams right away when you suspect their systems are involved so you can spend less time finding the root cause and more time fixing it,” adds Hollingsworth.
By working with other teams, network professionals can lower their mean time to innocence when troubleshooting issues, which is vital when you consider that the network is always the first component to get blamed, likely because it’s the most challenging part to understand and the hardest to get visibility into. “Even if the application runs locally on a machine, someone is going to say it’s the fault of the network,” agrees Hollingsworth. “That’s because it’s very easy to blame something as complex as a network, since it’s almost always the fault of the network in some way. There are no error messages when the network is running slow.”
Adds Arnold, “Users have no knowledge of CPU, RAM, application servers, load balancers, SQL query times, etc. All the end user knows is that a webpage response is slow and it’s the network’s fault. As engineers, we want to inform our users that our packet captures, monitoring, and common sense indicate the blame lies squarely elsewhere, but the truth is users aren’t interested in the network’s mean time to innocence, nor should we expect them to care. It often becomes the unsolicited role of the network engineer to find the resource to solve the user’s problem, even when it’s not the network.”
Both Arnold and Hollingsworth also agree that when it comes to resolving incidents, early detection is critical not only for minimizing the impact on the business, but also for preventing the issue from spiraling out of control.
“Early detection is critical because it prevents secondary issues,” Hollingsworth points out. “If DNS is down, lots of other things will start failing too. If BGP is on the fritz, you’re going to start seeing issues all over the place. You may not be able to fix an upstream provider issue with either of those protocols right away, but knowing where the problem is will give you the ability to try other fixes.”
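As a concrete example of watching DNS before its failures cascade, the short sketch below queries each core name server directly and reports which ones answer. It assumes the third-party dnspython package is installed; the server addresses and the test record are hypothetical.

```python
import dns.resolver   # third-party: pip install dnspython

CORE_NAME_SERVERS = ["10.0.0.53", "10.0.1.53"]   # hypothetical resolver IPs
TEST_NAME = "www.example.com"                    # hypothetical test record

for server in CORE_NAME_SERVERS:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]              # query this server only
    resolver.lifetime = 2.0                      # total time budget per query
    try:
        answer = resolver.resolve(TEST_NAME, "A")
        print(f"{server}: OK ({answer[0].address})")
    except Exception as exc:                     # timeouts, SERVFAIL, etc.
        print(f"{server}: FAILING ({type(exc).__name__})")
```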
They also agree that effective monitoring of all these systems is the best way to catch issues before they become major incidents. “Ensuring that the right services are being monitored and that alert thresholds are appropriate is crucial to early detection,” insists Arnold. “With the first, you prioritize what needs to be alerted on, and with the second, you avoid creating alert fatigue. Network engineers are asked to do more and more with less and less; proper monitoring and alerting maximizes the time spent troubleshooting and leaves more time to focus on learning, re-tooling, and understanding new technologies.”
“Have thresholds that are tuned to alert the instant you see issues,” adds Hollingsworth. “But don’t configure them like that for every system. Pick your edge routers and your core name servers and configure the alerts so that they fire as soon as an issue is detected. And when that alert goes off, don’t snooze it!”
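One way to encode that advice is a tiered alert policy: critical device classes page on the first breach, while everything else must fail several checks in a row before anyone is woken up. The sketch below is one hedged interpretation of the idea; the class names and breach counts are illustrative assumptions.

```python
from collections import defaultdict

# Consecutive breaches required before paging, per device class
# (illustrative values, not recommendations).
ALERT_POLICY = {
    "edge-router": 1,   # fire the instant an issue is detected
    "core-dns": 1,
    "default": 3,       # tolerate transient blips everywhere else
}

breach_streak: dict[str, int] = defaultdict(int)

def should_alert(device: str, device_class: str, breached: bool) -> bool:
    """Track consecutive breaches and decide whether to page."""
    if not breached:
        breach_streak[device] = 0
        return False
    breach_streak[device] += 1
    needed = ALERT_POLICY.get(device_class, ALERT_POLICY["default"])
    return breach_streak[device] >= needed

print(should_alert("edge1", "edge-router", True))   # True: page immediately
print(should_alert("branch-sw3", "access", True))   # False: first blip only
```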
As for the future of the networking discipline, both of our experts agree that engineers, architects, and operations pros must be able to adapt with the changing systems, or else they run the risk of becoming obsolete. “The biggest challenge facing networking pros right now is the fact that teams are starting to find ways to do work without them,” says Hollingsworth. “Rather than the networking department being a gatekeeper for change, operations and other teams are just implementing their functions and applications in places where they have total control. If you’re left standing in the dust, it’s hard to justify your existence.”
To combat this, Arnold implores her fellow networking pros to invest in ongoing education and to learn from their peers around the industry (like by following top experts on social media!). “The pace and scope of changing technologies means non-stop learning in the field of network engineering,” she insists. “The biggest challenge remains the ability to glean required knowledge while shedding old ways of thinking. Network engineers must assemble knowledge from new technologies quickly, and resist any complacency in their current expertise. The ability to learn, re-learn, and adapt is crucial, but requires discipline, time, and vital curiosity.”