AIOps—Artificial Intelligence for IT Operations—is an exciting and fast-evolving field rapidly gaining traction in IT groups as teams see its power and deploy it to improve and automate IT operations.
Its arrival couldn’t have come at a better time: IT environments, while simplifying in some ways, are growing more complex in others, which complicates the operational side of managing a modern network.
A seemingly minor issue can cascade into a series of incidents and outages. Armed with legacy tools, network teams often find themselves navigating a sea of alerts, many of which turn out to be false positives or multiple expressions of the same incident. The risks associated with managing these high-stakes networks are soaring. Seamless and efficient network management is critical.
AIOps, when implemented thoughtfully, is addressing these operational challenges—marrying machine learning and automation to, among other things:
- Pluck out critical insights from a torrent of noisy data.
- Correlate multiple alarms pointing to the same incident.
- Automate routine tasks, freeing up staff to focus on more complex troubleshooting and project work.
- Automatically fetch and attach essential information to tickets so incidents can be managed better and faster.
While much has been said about what AIOps promises to do, more needs to be said about how network managers are actually using machine learning and automation today.
This guide does just that. We dive into the problems plaguing conventional approaches, examine their impact, and explore how AIOps addresses them, with firsthand examples from our own experience using AIOps in our Network Operations Center (NOC). We cover five areas:
- Detecting root causes and the impact of incidents in hybrid, interrelated, and dynamic infrastructures
- Reducing alert noise
- De-siloing systems and teams for faster incident response
- Breaking through the limitations of traditional ITOM systems
- Seamless integration with multiple systems
Take advantage of the only NOC support service applying powerful AIOps capabilities to the NOC operations environment.
Rather than wasting time gathering information across a set of siloed and fragmented tools, we’re leading the industry in utilizing (and expanding) a suite of AIOps tools at strategic points in the NOC operational workflow.
We draw on our deep operational expertise and innovative tools to bring all your monitoring, change, and topology data into one place and utilize that data without limitations. In short, we're applying AIOps to enable human teams to make better decisions faster.
1. Detecting root causes and the impact of incidents in hybrid, interrelated, and dynamic infrastructures
In today’s application-heavy IT environments, interdependencies are everywhere, making infrastructures complex and fragile. Conventional and cloud applications are often interwoven with containerization, serverless computing, microservices, and orchestration tools.
Add in network technologies such as Software-Defined Networking and Network Functions Virtualization, and we’re looking at many devices with high levels of interdependence to support various organizational services and applications.
In such a scenario, a single outage can diminish organizational knowledge and prevent engineers from accessing essential systems. Teams need good visibility into these networks to identify the root cause of incidents and understand the overall impact of an outage spanning multiple environments.
AIOps tools, such as those we’ve implemented here at INOC, give teams better visibility into these environments by pulling together network, server, cloud, and application data into a single platform and analyzing topology, metrics, and traces for dependencies and correlations. This is exactly what we’re doing with AIOps right now within our own NOC platform, which serves as our operating system for delivering NOC support.
Let’s break this down further into a few component parts to discuss them in more detail:
A super-charged Configuration Management Database (CMDB)
Our CMDB is the single source of truth at the center of our NOC service operation. It aids in correlating incidents and isolating circuits and devices that are down.
Unlike basic CMDBs that contain only device or asset data, ours stores a ton of information, including customer data, third-party contact info, and knowledge articles. While all this data would overwhelm a human engineer manually combing through it trying to cut a ticket, our AIOps tools have no problem connecting the dots across all of these data sources to correlate events, create incident tickets, assess impact, and guide NOC engineers in resolving issues.
We’re using these tools to perform lookups on each alert to see what has been affected and then automatically attach the affected Configuration Items (CIs) based on the data in the alerts we receive. This automated process ensures that our human NOC engineers have all the necessary information at their fingertips as soon as they pick up a ticket—vastly improving the speed and quality of support.
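The lookup-and-attach step described above can be pictured with a minimal sketch. It assumes a hypothetical in-memory CMDB keyed by hostname; the field names and record IDs are illustrative only and do not reflect the actual schema of our platform.

```python
# Hypothetical in-memory CMDB keyed by hostname; field names are illustrative only.
CMDB = {
    "edge-rtr-01": {"ci_id": "CI-1001", "type": "router", "site": "Site A"},
    "core-sw-02": {"ci_id": "CI-1002", "type": "switch", "site": "Site B"},
}


def attach_cis(alert: dict, cmdb: dict) -> dict:
    """Return a ticket draft carrying the CI record matched from the alert's host field."""
    host = alert.get("host")
    ci = cmdb.get(host)
    return {
        "summary": alert.get("message", ""),
        "host": host,
        # An empty list signals that no CI matched and a human should review.
        "cis": [ci] if ci else [],
    }


ticket = attach_cis({"host": "edge-rtr-01", "message": "Interface down"}, CMDB)
print(ticket["cis"])
```

In this sketch an unmatched host yields an empty CI list rather than an error, so the ticket is still created and the gap is visible to the engineer.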
Advanced incident correlation
To better and more quickly identify singular networking incidents expressed as multiple alerts or events, we let our AIOps toolset ingest large amounts of data (including data stored in our CMDB) so it can correlate those alerts and events beyond what even a sizeable human team could ever be expected to accomplish on its own.
We also use this data to study past incidents, see how their alerts were correlated, and make necessary adjustments for better correlation in the future—a continuous cycle of improvement enabled by AIOps.
One significant aspect of our approach here is the concept of dynamic severity, which layers in urgency based on data from the CMDB. Depending on the criticality of the specific device raising the alert, a higher or lower urgency is assigned; combined with the alert's severity, this lets us automate incident prioritization based on impact.
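To make the idea of dynamic severity concrete, here is a small sketch. The severity levels, criticality tiers, scoring formula, and priority cut-offs are all assumptions chosen for illustration; the text does not specify how the two inputs are actually combined.

```python
# Illustrative scales; the real severity levels and criticality tiers are not given in the text.
SEVERITY = {"info": 1, "minor": 2, "major": 3, "critical": 4}
CRITICALITY = {"low": 1, "medium": 2, "high": 3}


def dynamic_priority(alert_severity: str, ci_criticality: str) -> str:
    """Combine alert severity with CMDB criticality into a ticket priority."""
    # Assumption: a simple product of the two scores, bucketed into four priorities.
    score = SEVERITY[alert_severity] * CRITICALITY[ci_criticality]
    if score >= 9:
        return "P1"
    if score >= 6:
        return "P2"
    if score >= 3:
        return "P3"
    return "P4"


# The same "major" alert lands at different priorities depending on the device.
print(dynamic_priority("major", "high"), dynamic_priority("major", "low"))
```

The point of the sketch is only that the same alert severity can produce a different priority once device criticality from the CMDB is taken into account.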
“Our CMDB helps correlate incidents and to isolate circuits and devices that are down by doing lookups on alerts to see what has been affected. We have automation that automatically attaches the CIs. The configuration items based on the data in the alerts that come into our ServiceNow will attach those CI records to the ticket. By the time a NOC technician views that ticket, they already have all that information at their fingertips.”
— Lindsey Logsdon, Platform Integration Specialist, INOC
Automated closure for self-clearing incidents
In addition to detecting and prioritizing incidents, we’re also using AIOps to automate several aspects of our incident management and troubleshooting; for instance, automatically closing lower-priority incidents that clear on their own.
This “auto-closure” automation frees up engineers from self-resolving incidents and streamlines their incident management process.
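A minimal sketch of an auto-closure rule might look like the following. The `Incident` fields and the set of priorities eligible for unattended closure are hypothetical; the text only says that lower-priority incidents that clear on their own are closed automatically.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    ticket_id: str
    priority: str        # "P1" (highest) to "P4" (lowest); an assumed scale
    alerts_active: int   # number of alerts still firing
    state: str = "open"


# Assumption: only these priorities are eligible for unattended closure.
AUTO_CLOSE_PRIORITIES = {"P3", "P4"}


def auto_close(incident: Incident) -> Incident:
    """Close the incident if it is low priority and every alert has cleared."""
    if (incident.state == "open"
            and incident.priority in AUTO_CLOSE_PRIORITIES
            and incident.alerts_active == 0):
        incident.state = "closed"
    return incident


print(auto_close(Incident("T-1", "P4", alerts_active=0)).state)
```

Higher-priority incidents stay open in this sketch even after their alerts clear, so an engineer still reviews them.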
Automated root cause determination from pattern recognition
We’re currently working on automating probable cause determinations based on alert patterns and providing documentation to help troubleshoot and streamline incidents. Alerts sometimes arrive at the NOC with a probable cause designation, so we're building documentation around those likely causes to create even better playbooks for the NOC.
“Really, what happens is we have several key alerts that heavily indicate larger problems. And then we have patterns of alerts where they indicate particular things. We're working right now to go through a step-by-step process when we see these patterns in these specific alerts to help streamline the troubleshooting process for our NOC engineers. So, as soon as the alert comes in, it's going to go into the ticket, and the ticket’s going to say, ‘Hey, this is probably a fiber outage between this and that,’ and then it'll take the NOC engineer through the steps of reaching out to the carrier, properly updating the ticket, what necessary information they can grab, light levels, things of that nature.”
— Erik Rose, Change and Configuration Specialist, INOC
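The pattern-to-playbook idea in the quote above could be sketched roughly as follows. The alert type names, the second pattern, and the matching rule (most specific pattern wins) are assumptions for illustration; only the fiber-outage example and its follow-up steps come from the quote.

```python
from typing import Optional

# Hypothetical patterns: each maps a set of alert types to a probable cause and playbook steps.
PATTERNS = [
    {
        "requires": {"loss_of_signal", "link_down"},
        "cause": "Probable fiber outage",
        "steps": ["Record light levels", "Contact the carrier", "Update the ticket"],
    },
    {
        "requires": {"link_down"},
        "cause": "Probable port or interface fault",  # invented example pattern
        "steps": ["Check interface status", "Update the ticket"],
    },
]


def probable_cause(alert_types: set) -> Optional[dict]:
    """Return the most specific pattern whose required alert types are all present."""
    matches = [p for p in PATTERNS if p["requires"] <= alert_types]
    # Prefer the pattern that requires the most alert types; None if nothing matches.
    return max(matches, key=lambda p: len(p["requires"]), default=None)


match = probable_cause({"loss_of_signal", "link_down"})
print(match["cause"], match["steps"])
```

When no pattern matches, the function returns `None`, which would leave the ticket without a probable-cause note rather than guessing.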
2. Reducing alert noise
Network engineers typically field a mix of genuine alerts and false positives, and the volume can be overwhelming. High alert noise distracts support teams from real issues. Over time, alert fatigue sets in. In the best case, the team takes a long time to find the actual issue. More likely, too many alerts have the same effect as no alerts: they are simply ignored.
Also, most threshold alerts are static. For example, an alert may be created if CPU usage is above 90% for five minutes. Because such alerts lack contextual awareness (of a scheduled batch job, for example), they create unnecessary incidents, adding to the NOC staff workload without improving infrastructure availability. High alert volumes ultimately lead to the same outcome: higher MTTR and a poor reputation for the NOC, because the critical alerts that need quick action are harder to spot.
Reducing alert noise sharpens the alerts that require action, enabling engineers to focus on the right issue instead of spending time on false positives.
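The difference between a static threshold and a context-aware one can be shown with a short sketch based on the CPU example above. The batch window times and the one-sample-per-minute assumption are hypothetical.

```python
from datetime import datetime, time

# Assumption: a nightly batch job runs 01:00-03:00 and is expected to drive CPU high.
BATCH_WINDOW = (time(1, 0), time(3, 0))


def breaches_static(samples: list, threshold: float = 90.0) -> bool:
    """Static rule: every sample in the five-minute window exceeds the threshold."""
    return bool(samples) and all(s > threshold for s in samples)


def should_alert(samples: list, at: datetime) -> bool:
    """Context-aware rule: same threshold, but skipped during the known batch window."""
    in_batch = BATCH_WINDOW[0] <= at.time() < BATCH_WINDOW[1]
    return breaches_static(samples) and not in_batch


cpu = [95.0, 96.0, 97.0, 93.0, 94.0]  # one sample per minute (assumed)
print(should_alert(cpu, datetime(2024, 1, 1, 2, 0)))   # during the batch window
print(should_alert(cpu, datetime(2024, 1, 1, 14, 0)))  # outside the batch window
```

The same readings breach the static rule in both cases; only the added context decides whether an incident is worth creating.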
Here are a few ways we’re applying AIOps to “bring down the noise level” in the NOC:
Automated alert correlation
To take a slightly different angle on alert correlation, we’re using AIOps to find relevant correlated data from a large volume of alerts in real time at machine speed. This allows the NOC engineer to focus on the right alerts at the right time without “ignoring” data that would otherwise be treated as “noise” despite there being genuine insights contained within it.
Put another way, we’re able to start listening to what all of a client’s data has to say about their environment and use those insights to deliver genuinely proactive NOC support. Continuous data collection allows for thorough analysis and problem management later.
Automated alert filtering
We’re also leveraging AIOps to filter out alerts that aren’t actionable, which significantly reduces alert noise, too. Our integrations are meticulously set up, maintained, and tuned to facilitate this. Our latest platform also enables us to filter specific alerts on particular devices or facilities, providing much greater granularity than before. This helps ensure only actionable alerts reach the NOC, improving its ability to respond quickly and efficiently to genuine issues.
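Filtering by device or facility, as described above, might be expressed as a small rule table. This is a sketch under assumptions: the rule format, the wildcard convention, and the device and facility names are all hypothetical.

```python
# Hypothetical suppression rules; None acts as a wildcard for that field.
FILTER_RULES = [
    {"device": "lab-sw-01", "facility": None, "check": None},
    {"device": None, "facility": "Site C", "check": "fan_speed"},
]


def is_actionable(alert: dict, rules: list = FILTER_RULES) -> bool:
    """An alert is actionable unless some rule matches it on every non-wildcard field."""
    for rule in rules:
        if all(v is None or alert.get(k) == v for k, v in rule.items()):
            return False
    return True


print(is_actionable({"device": "rtr-9", "facility": "Site C", "check": "ping"}))
```

Here the first rule silences everything from one lab device, while the second silences a single check at one facility and leaves other checks there untouched.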
Maintenance plans are another AIOps-enabled “feature” we use to minimize alert noise, which operates more from the ticketing side. Our platform uses AIOps to check for ongoing maintenance when alerts flow into our tickets. If maintenance is detected, the system suppresses the ticket from the NOC until the maintenance is completed.
At the end of the maintenance window, our automation checks all the alerts again. If they’re clear, the ticket is automatically closed, and any necessary customer notifications are sent. As you might imagine, this simple process removes the need for manual intervention from the NOC, significantly reducing both alert noise and the human labor spent chasing frivolous alerts.
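The maintenance flow above can be sketched in two small functions: one decision when an alert arrives, and one at the end of the window. The window record format and the action names are hypothetical, and what happens to tickets whose alerts are still firing after the window is an assumption, since the text does not say.

```python
from datetime import datetime


def on_alert(device: str, now: datetime, windows: list) -> str:
    """Suppress the ticket if the device is inside a maintenance window."""
    active = any(
        w["device"] == device and w["start"] <= now < w["end"]
        for w in windows
    )
    return "suppress" if active else "route_to_noc"


def on_window_end(alerts_clear: bool) -> str:
    """Re-check at the end of the window: close if clear, otherwise hand to the NOC."""
    # Assumption: tickets whose alerts are still firing are released to the NOC.
    return "close_and_notify" if alerts_clear else "route_to_noc"


# Hypothetical maintenance window for one device.
windows = [{"device": "edge-rtr-01",
            "start": datetime(2024, 1, 1, 1, 0),
            "end": datetime(2024, 1, 1, 3, 0)}]
print(on_alert("edge-rtr-01", datetime(2024, 1, 1, 2, 0), windows))
print(on_window_end(alerts_clear=True))
```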
3. De-siloing systems and teams for faster incident response
The typical enterprise (or IT-complex) organization uses multiple tools to monitor and manage its environment, each collecting IT operations data and retaining it in silos. (This may happen, for example, because DevOps teams use tools specific to their environment or because new tools are added when new technologies are adopted.)
With these separate data environments, it becomes tough to understand underlying infrastructure or application issues in current technology stacks with high levels of interdependence; siloed systems make centralized insight difficult or impossible.
We continue to see organizations expend many resources to maintain these disparate systems yet lack a single-pane view for monitoring and managing incidents. Typically, each support team within the organization will investigate issues independently. The NOC then combines these teams’ tools and processes, eventually merging the data and analysis to determine the root cause. This leads to significant delays in incident resolution. Senior management escalation and involvement in troubleshooting is common in these siloed environments, causing significant lost productivity for the organization.
Today, the industry-leading AIOps tools typically integrate with many of the major monitoring systems available on the market—allowing teams to use best-of-breed solutions for various technologies and pull together disparate information from these tools without a ton of work.
By capturing data from these tools and integrating them into a central database, AIOps makes it easier for machine learning algorithms to do a deeper analysis across the hybrid and dynamic infrastructure and can provide crucial insights that help identify root causes in real time.
Unified monitoring system
At INOC, we’ve used AIOps to establish a unified monitoring system that brings together all of our clients' alerts so we can format, normalize, enrich, process, and correlate them into proper incidents.
When this monitoring isn’t unified, and separate systems raise alerts, having two or more tickets for the same issue is common, resulting in more noise and wasted labor. Our approach requires less work, decreases noise, and consolidates multiple notifications for the same issue into a single ticket.
For instance: One of our clients monitors two similar systems. Technically these belong to two different entities and monitoring sources, but they feed into the same exact alarm stream. Before AIOps, this situation often resulted in duplicate alerts. But with our current platform, we’re able to consolidate these through automated cross-correlation that runs at machine speed.
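A simplified sketch of this kind of cross-source consolidation follows. It assumes alerts have already been reduced to `host`, `check`, and a numeric timestamp `ts`, and it uses a fixed five-minute window as the correlation rule; real correlation patterns are richer than this, and all names here are hypothetical.

```python
def correlate(alerts: list, window_s: int = 300) -> list:
    """Group alerts that share host and check and arrive within `window_s` seconds."""
    incidents = []
    open_by_key = {}
    for a in sorted(alerts, key=lambda x: x["ts"]):
        # Case-insensitive key so two sources naming the same device differently still match.
        key = (a["host"].lower(), a["check"].lower())
        inc = open_by_key.get(key)
        if inc and a["ts"] - inc["last_ts"] <= window_s:
            inc["alerts"].append(a)
            inc["last_ts"] = a["ts"]
        else:
            inc = {"key": key, "alerts": [a], "last_ts": a["ts"]}
            open_by_key[key] = inc
            incidents.append(inc)
    return incidents


alerts = [
    {"source": "monitor_a", "host": "OLT-7", "check": "LOS", "ts": 100},
    {"source": "monitor_b", "host": "olt-7", "check": "los", "ts": 130},
    {"source": "monitor_a", "host": "olt-7", "check": "los", "ts": 2000},
]
incidents = correlate(alerts)
print(len(incidents), [len(i["alerts"]) for i in incidents])
```

The first two alerts come from different monitoring sources but describe the same condition, so they end up on one incident instead of two tickets; the third arrives well outside the window and opens a new one.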
Thanks to the many out-of-the-box integrations built into our AIOps tool, we can connect to almost every major monitoring platform through a relatively simple GUI.
Rather than programmatically modifying our system to accept the alarm feeds from our clients’ tools (leading to development bottlenecks), integrations are implemented with little more than an exchange of keys, clicking a few buttons, and writing some correlation patterns. Plus, the platform itself offers a feature that suggests correlation patterns based on alert trends and patterns.
“There are always ways that technology can make your life better in ways you can't see coming. But when it comes to management, with the right level of machine learning and enough data, I think it could start to suggest trends.”
— Lindsey Logsdon, Platform Integration Specialist, INOC
4. Breaking through the limitations of traditional ITOM systems
Traditional ITOM systems depend on an accurately populated CMDB with clearly defined relationships and dependencies between parent and child configuration items. But, in many support organizations, the CMDB quickly becomes outdated, given the usually incomplete implementation of reliable change management processes and procedures.
AIOps now allows organizations to preserve existing investments in ITOM tools and bring together data from diverse sources for processing and correlation. It can further enrich the information on alerts with data from NOC runbooks.
Here’s how we’re doing this today:
Advanced correlation and ticket enrichment
As we’ve covered a few times in this guide, we use AIOps with our CMDB and runbooks to enrich tickets and allow engineers to see and correlate circuits and circuit values. We also use location data from the CMDB to enrich alerts. When these enriched alerts flow into our ticketing system, the location in the ticket is automatically set. This has driven some of our other automations and processes and provided an efficient way to track and address incidents.
Proactive “pre-incident” issue management
With AIOps and rigorous data collection, we can now proactively address issues before they escalate into incidents.
We collect and analyze data to find trends and conduct problem management, which in turn allows AIOps to detect the precursors of future incidents so we can address them before they materialize.
Maintaining CMDB accuracy
We maintain a human check element when updating our CMDB to ensure accuracy. But for some devices we monitor, particularly the more “optically-centric” ones, we have intelligent automation to collect specific information, facilities, assets, and other relevant data to populate the CMDB tables. The use of AIOps here aids in maintaining the CMDB's relevancy and accuracy, making it a reliable resource for our NOC operation as each client environment changes with time.
5. Seamless integration with multiple systems
Traditional outsourced NOC systems tend to require long timelines to accommodate custom development and are limited in their capacity to integrate with various data sources. One key reason we were interested in adopting the AIOps tool we ultimately used was its ease of use and, more importantly, its easy integration capabilities. The tooling was purposefully designed to facilitate easy integration, which has had a transformative impact on service provision.
Previously, onboarding a new client was a time-intensive process involving significant custom development work. This long timeline was, understandably, a cause for concern for clients who wanted to start utilizing NOC services as quickly as possible. Our newest platform, however, dramatically speeds up this process. With just a few clicks, clients' alarm feeds can be set up to flow directly to INOC.
This easy integration capability also brings another significant advantage: the ability to integrate with systems that were difficult or impossible to work with before. Our previous platform had significant limitations, with only email or SNMP traps available for integration. There was an API-based solution, but it wasn’t as efficient. Our new platform effectively utilizes APIs, enabling us to integrate with systems we previously struggled with.
In short, this ease of integration doesn't just simplify previously challenging tasks; it turns integrations that were once prohibitively difficult into manageable ones, letting us connect and work with a wider range of systems than ever before.
Looking into the future: using the unified monitoring system to normalize data
As we look to the near future of AIOps in network management, one new capability we’re looking to unlock is using our unified monitoring solution to actually normalize alarm data.
The goal is to standardize the interpretation of data, so regardless of where it comes from, a check value remains a check value. In other words, data should always come in with the same label, irrespective of its source.
While we’re still working on implementing this capability, it’s a highly desired feature and would allow data to be processed separately but then unified in a way that can be reported on more straightforwardly.
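As a rough illustration of what label normalization could look like, the sketch below maps source-specific field names onto one common schema. The source names, field names, and mapping tables are hypothetical, and this is not a description of the capability as it will eventually be built.

```python
# Hypothetical per-source field mappings onto one common schema.
FIELD_MAPS = {
    "source_a": {"hostname": "host", "check_name": "check", "sev": "severity"},
    "source_b": {"node": "host", "metric": "check", "level": "severity"},
}


def normalize(source: str, raw: dict) -> dict:
    """Rename source-specific fields to the common labels; keep unmapped fields as-is."""
    mapping = FIELD_MAPS[source]
    return {mapping.get(k, k): v for k, v in raw.items()}


a = normalize("source_a", {"hostname": "rtr-1", "check_name": "ping", "sev": "major"})
b = normalize("source_b", {"node": "rtr-1", "metric": "ping", "level": "major"})
print(a == b)
```

Once both feeds carry the same labels, a report or correlation rule can be written once against `host`, `check`, and `severity` instead of once per source.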
Final Thoughts and Next Steps
To pull all of these threads together: Network management teams are using AIOps to reduce human workloads, eliminate alarm noise, prioritize critical alarms, increase efficiency and standardization, improve data quality and response times, and shorten and simplify onboarding processes. These improvements can result in higher quality service and a better customer experience.
Here at INOC, we’re already applying these tools to improve network monitoring and management as we expand our service to provide additional value.
- Event Monitoring & Management: We're the first, and so far the only, NOC support provider using AIOps to consolidate and process alarm and event data from all sources and help the NOC understand the significance of an alarm or event in the proper context, as well as its possible impact on infrastructure services and application availability.
- Incident Management: We’re autonomously surfacing the probable cause of incidents and allowing our NOC engineers to confirm the analysis and the data behind it. This further enables engineers to follow through on the response and beat expectations for resolution times.
- Problem Management: Analyzing the root cause of incidents is critical for preventing future incidents and making infrastructure and applications more reliably available. We’re leveraging AIOps’s immense processing power to analyze a trove of historical information to identify root causes with ease.
- Change Management: Managing changes is time-consuming and fraught with risk. We’re using AIOps to reduce both time and risk significantly by automatically sending notifications for upcoming maintenance windows and analyzing incidents in the context of previous change management activities.
If you’re interested in diving deeper into AIOps:
- Watch the free recorded webinar we co-hosted with BigPanda, which identifies several strategies for bringing AIOps into your workflow.
- Read our 101 guide to NOC automation.
- Download our free white paper, which gives you a comprehensive, practical look at the role of AIOps in enhancing NOC support.
Need to take your existing support infrastructure to the next level with an outsourced NOC solution? Schedule a NOC consultation with our Solution Engineers and start the conversation. Want to learn more about applying advanced tools to the NOC? Grab our free white paper below and learn how much you stand to gain from adding AIOps to your support workflows.
FREE WHITE PAPER
The Role of AIOps in Enhancing NOC Support
Download our free white paper and learn how your NOC support stands to gain from AIOps by overcoming operational challenges and delivering outstanding service. Use the free included worksheet to contextualize the value of AIOps for your organization.