The Network Operations Center (NOC) is a critical element in detecting, isolating, and resolving network, IT, and cloud infrastructure faults that inevitably happen due to operational realities, and can potentially result in expensive downtime.
As IT infrastructures continue to evolve rapidly, the NOC must keep up with a robust toolset to handle new and old technologies and changing operational requirements.
In addition to the tools that have been a mainstay of the NOC for years, IT organizations are looking at a new breed of technologies that bring machine learning and automation into the NOC to better handle workloads and re-focus staff on revenue-generating projects rather than reactive support tasks.
But choosing the right tools for a NOC isn’t an easy process, especially for enterprises, communications service providers, and OEMs with large networks, IT, and cloud environments. The price tags are high. The stakes are high. Blind spots are everywhere, and the list of questions can seem endless.
- “What functionality is operationally important to us?”
- “How do the features of this tool map to the support operation workflows we want?”
- “Do we have everything we need to operationalize this tool?”
- “Will this tool continue to work the way we want it to as we expand service?”
- “Does this tool offer upgrade options to make the solution ‘future-proof’?”
- “Is the pricing for this tool transparent? And do the licensing models fit our organization's requirements?”
- “Will this tool integrate with other tools? And do we know how to set up that integration correctly?”
- “How much time do we need to invest before we see a benefit?”
The list goes on.
Here, we offer a brief look at some of the NOC tools hard at work inside some of the most complex and multi-faceted IT organizations. This is certainly not an exhaustive list; rather, a quick look into four categories of tools one would likely find hard at work inside any high-performing NOC: Network Management Systems (NMSs), machine learning and automation (AIOps), ticketing, and reporting platforms.
- Network Management Systems: SolarWinds | LogicMonitor | OpenNMS
- Machine Learning and Automation (AIOps): BigPanda | Moogsoft
- Ticketing: ServiceNow | Jira | ConnectWise
- Reporting: Power BI | Tableau | Snowflake | AWS Redshift
Need expert assistance figuring out which NOC tools are best for you or how to better utilize, configure, integrate, or operationalize your tools to improve service? Schedule a NOC consultation with our Solution Engineering Team and get the conversation started.
1. Network Management Systems (More than just network monitoring)
An NMS constantly monitors elements and services across the network, IT, and cloud infrastructure. It also conducts analyses and notifies the appropriate personnel when an issue arises or when critical values have been exceeded. With the right NMS, the correct and actionable events, trends, and metrics are available to trigger the appropriate personnel’s response.
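The threshold check described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular NMS implements it: the metric names, critical values, team names, and the `Threshold` and `evaluate` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str      # hypothetical metric name, e.g. "cpu_percent"
    critical: float  # value above which an alert is raised
    notify: str      # hypothetical team that should be notified


def evaluate(samples, thresholds):
    """Return one alert per (device, metric, value) sample that exceeds its critical value."""
    by_metric = {t.metric: t for t in thresholds}
    alerts = []
    for device, metric, value in samples:
        rule = by_metric.get(metric)
        # Samples for metrics without a configured threshold are ignored.
        if rule is not None and value > rule.critical:
            alerts.append(
                {"device": device, "metric": metric, "value": value, "notify": rule.notify}
            )
    return alerts


# Illustrative values only.
THRESHOLDS = [
    Threshold("cpu_percent", 90.0, "server-team"),
    Threshold("link_utilization_percent", 85.0, "network-team"),
]

SAMPLES = [
    ("core-sw-01", "link_utilization_percent", 92.5),
    ("app-srv-07", "cpu_percent", 41.0),
    ("app-srv-09", "cpu_percent", 97.2),
]

if __name__ == "__main__":
    for alert in evaluate(SAMPLES, THRESHOLDS):
        print(alert)
```

In a real deployment the samples would come from polling (SNMP, WMI, or an API) and the alerts would be routed to a paging or ticketing system rather than printed.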
Of course, every organization has different requirements for a network monitoring solution. There are many different tools and solutions on the market, so careful consideration is key.
SolarWinds is the market leader in network monitoring and network management systems. It’s an on-premise system that primarily uses fairly standard protocols such as SNMP and WMI to check on infrastructure element statuses.
Its autodiscovery capability is a huge advantage, compiling an asset inventory and automatically drawing up a network topology map. Since it’s on-premise, so long as your organization has sufficiently blocked and controlled access to the SolarWinds servers and has the monitoring network sufficiently locked down, it can maintain control of its data without having to worry about it leaking out to the cloud or any other external sources—a big plus.
Now, the downsides. There are some challenges with SolarWinds, particularly on the backend. The massive SolarWinds hack prompts serious concerns around security. However, based on the news available at the time of writing, the vulnerabilities do not appear to be a long-term problem.
Besides this particular concern, being an on-premise system has its costs. SolarWinds requires a significant amount of hardware resources for it to simply run (and even more to run efficiently). So beyond the high price tag of purchasing and licensing the system itself, there’s an additional cost of managing more on-premise or virtual assets specific to your monitoring infrastructure.
The third big “up-front” cost of SolarWinds is setup. Configuring and optimizing SolarWinds in a specific environment is an art unto itself and an entire consulting industry has sprung up to help implement and fine-tune it to extract maximum value. Enterprises, communications service providers, OEMs, and other organizations with expansive IT and cloud infrastructures should carefully consider these other upfront costs as well.
LogicMonitor is a great cloud-based monitoring platform that differentiates itself from SolarWinds through its use of “collectors.” These are essentially small applications that run on server hardware, connect to assets, and communicate back to the cloud over TLS, a method that offers reliable, secure communication. The use of collectors centralizes monitoring at one or a few points in your network, and only those points need to talk to the cloud. The cloud and internet resources that collect your data don't have direct access to the rest of your network, which is itself a big plus.
While LogicMonitor also tends to share the same “heavy lift” in configuration as SolarWinds, your organization very likely won’t have to worry about optimizing it. Optimization is handled by the vendor in this case and covered in the monthly subscription fee.
LogicMonitor has proved to be an attractive tool for organizations that are “lifting and shifting” their physical on-premise resources to the cloud. It can manage both on-premise and cloud-based assets all while sitting in the cloud itself. Since it’s cloud-native, it’s one less piece of infrastructure you have to move. Again, this is especially beneficial for enterprises migrating from on-premise to a primarily or fully cloud-based environment.
Of course, being a cloud-native system has its downsides, too. Since the platform lives in the cloud, so does all of its data. While data is available via APIs, it doesn't necessarily live natively where you can easily manipulate it. As a result, if your organization needs customization, it will almost certainly have to make development requests with the vendor.
And then there’s cost. Those monthly subscription fees can get pricey on a per-element basis with LogicMonitor—a reason to carefully consider and compare prices relative to capabilities when making decisions around any of these tools.
OpenNMS is the open-source solution on our list. Like many open-source tools, the big advantage here is not having to pay upfront for licensing. Depending on your internal capabilities and the complexity of your environment, you may incur some consulting fees for implementation and setup, but the product itself is free.
In terms of capability, OpenNMS can do pretty much anything that your organization could do with SolarWinds or LogicMonitor. The downside is that while it does do quite a bit “out of the box,” enterprises and other large organizations should expect to do much more work to configure the tool to bring those capabilities to life. It’s an even heavier lift than SolarWinds and LogicMonitor to get up and running right.
OpenNMS can be run as a standalone system with the main OpenNMS server doing the network and infrastructure monitoring, or it can be run with what it calls “minions.” These are similar to LogicMonitor’s “collectors”: each one runs on a server with network connectivity and an encrypted (SSL/TLS) connection back to your OpenNMS box, wherever it lives.
This option offers a lot of flexibility in designing and operationalizing your monitoring architecture. For example, you can easily place minions on your local on-premise locations and securely send data back to your OpenNMS server. At the same time, your organization can have its OpenNMS server live inside its cloud compute environment. In this scenario, your team is able to utilize that host to do its direct monitoring on its cloud and compute assets.
The point here is that OpenNMS is extremely flexible—more so than the other tools we’ve mentioned—as it gives your organization much more control over how you want your monitoring set up and run. Again, however, that flexibility comes with the cost of heavy configuration.
2. Machine Learning and Automation (AIOps)
AIOps—Artificial Intelligence for IT Operations—combines machine learning (ML) and automation to identify and automate low-risk tasks and unlock the insights contained within massive amounts of data generated across an environment. With vastly superior data processing and machine learning power, the NOC can perform correlation much faster and identify the subtle indicators of approaching issues within a torrent of mostly noisy data.
Here at INOC, we’ve made significant strides in applying AIOps at strategic points in the NOC operational workflow. (Talk to us if you’re interested in learning more about the potential service enhancements achieved by putting AIOps to work for your organization.)
Read our free white paper for the proper deep dive into AIOps and the NOC: The Role of AIOps in Enhancing NOC Support.
While our white paper explains a number of ways AIOps can be applied in NOC workflows, currently, the most common application is in enhancing the NOC’s ability to correlate data faster and more accurately than humans. In this application, an AIOps tool is fed events (via API, SNMP, email, or whatever else is coming out of the NMS) along with some initial correlation rules to tie related events together. Over time, ML will recognize patterns and provide feedback, which can then be used to grow and improve the rule-set and make automated monitoring and ticketing increasingly effective.
The value here can be massive, especially in enterprise environments where incidents and events need to be correlated across perhaps three, four, or five different monitoring platforms. A well-tuned AIOps platform can be fed information from all of those platforms and make incredibly effective correlations across them—consolidating all of those feeds from disparate systems and providing remarkable intelligence onto the incident ticket.
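To make the idea of an initial correlation rule concrete, here is a minimal Python sketch of one such rule: events from different monitoring platforms that share a site and arrive within a time window are grouped into a single incident. The field names (`ts`, `site`, `source`, `msg`), the five-minute window, and the `correlate` function are assumptions for illustration only; they do not represent how BigPanda, Moogsoft, or any other product works internally, and no ML is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: str          # value of the correlation field, e.g. the site name
    first_seen: int   # timestamp (seconds) of the first event
    last_seen: int    # timestamp (seconds) of the most recent event
    events: list = field(default_factory=list)


def correlate(events, key_field="site", window_seconds=300):
    """Group events sharing `key_field` when each arrives within
    `window_seconds` of the previous event in the same incident."""
    incidents = []
    open_by_key = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        key = event[key_field]
        incident = open_by_key.get(key)
        # Start a new incident if none is open or the window has lapsed.
        if incident is None or event["ts"] - incident.last_seen > window_seconds:
            incident = Incident(key=key, first_seen=event["ts"], last_seen=event["ts"])
            incidents.append(incident)
            open_by_key[key] = incident
        incident.events.append(event)
        incident.last_seen = event["ts"]
    return incidents


# Illustrative events from two hypothetical monitoring platforms.
EVENTS = [
    {"ts": 0, "site": "dal-01", "source": "nms-a", "msg": "link down"},
    {"ts": 40, "site": "dal-01", "source": "nms-b", "msg": "BGP peer lost"},
    {"ts": 65, "site": "chi-02", "source": "nms-a", "msg": "high CPU"},
    {"ts": 900, "site": "dal-01", "source": "nms-a", "msg": "link down"},
]

if __name__ == "__main__":
    for inc in correlate(EVENTS):
        print(inc.key, inc.first_seen, len(inc.events))
```

A rule like this is only a starting point. In the workflow described above, feedback from the ML layer would be used to refine which fields and windows produce useful groupings.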
BigPanda is an industry-leading event correlation platform powered by AIOps. The platform helps enterprises significantly reduce IT noise and detects incidents in real-time (before they escalate into outages).
One of BigPanda’s greatest strengths is that while being a product it operates in many ways like a service—bringing some unique value-adds and a higher level of support. It also has a huge library of integrations and—in some instances—can be implemented very quickly thanks to a solid onboarding process and compatibility with many existing monitoring, change, topology, collaboration, and ticketing tools.
Compared to other tools of its class, it demands considerably less effort to get to the point where information is flowing through it. However, getting the platform to the point where its output is useful may be a more involved process. That’s especially true of functions such as alarm enhancement—where the ML needs somewhat extensive training to become excellent at recognizing specific patterns and improving decision-making.
When we set out to identify the AIOps platform best suited for our particular workflow requirements some years ago, we ultimately determined BigPanda to be a good fit. Here’s a quick summary of the drivers of that decision:
- Scalable pricing: Rather than a large upfront cost, BigPanda’s pricing was associated with the number of clients/devices we had running through it. For us, this pricing model was, and continues to be, most advantageous as it scales well with our business.
- ML as “recommender” rather than “decider”: BigPanda enables us to use its ML capabilities to augment and improve our decision-making rather than stepping in to make those decisions itself. Through its impressive data analysis capabilities, the platform helps us recognize patterns and make suggestions to which we can apply our expertise to validate, modify, and possibly work further before taking action. This prevents potentially bad decisions the system would make on its own while still delivering incredibly helpful insights to our human engineers.
- Ability to ingest additional metadata: The ability to incorporate and analyze metadata further enhances the correlations we build and strengthens the information flowing out of our CMDB.
Moogsoft is a cloud-native observability solution designed for DevOps professionals and SRE teams. It offers intelligent noise-reduction, alert correlation, and native observability capabilities, including metrics collection and anomaly detection.
Moogsoft delivers out-of-the-box workflows and integrations with notification and alerting tools to help teams resolve incidents faster and deliver continuous assurance for their critical digital services.
When we evaluated Moogsoft as a potential event correlation tool for our own workflow a couple of years ago, it was still an on-premise solution. At that time, we were impressed by its excellent UI. However, its API capabilities offered less than what we needed in our particular use case. Back then, we considered it a tool that an organization would likely run in its data center and configure through that wonderful UI.
At the time of our evaluation, Moogsoft was a tool that was arguably set in an earlier generation. There was less focus on implementation as code and more focus on UI. However, as of the end of 2020, Moogsoft has relaunched its platform—breaking it into microservices with a cloud-native product. The system is no doubt much stronger and more capable than it was during our evaluation and certainly deserves to be treated as a contender with BigPanda.
Again, like with any of these tools, while one platform ultimately proved best for our use-case and workflow, each organization is unique and should evaluate for suitability and value within the context of its specific environment.
One last important thing to note about AIOps tools
As of now, no tool can contextualize the event impact for an IT service without additional instrumentation. Also, the ML algorithms that power these tools can’t establish the priority and urgency of a specific event without service context, which is why starting from scratch with such a tool on your own not only means high costs but sometimes quite a bit of lead time to train the system to understand or “learn” patterns and deliver real value.
Here at INOC, we’re already applying these tools to improve event monitoring and management as we continue to expand our service to provide additional value.
Talk to us about turning up support on our NOC to put our AIOps investments to work for you.
3. Ticketing
Ticketing systems are core to ITSM in the NOC. They track issues by their urgency, severity, and personnel assignment and create tickets that describe issues so they can be processed and assigned to the appropriate resource. When a person or group assigned to a task can’t complete it, the ticket will move to the next level for correction.
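The escalation behavior described above can be sketched as a small data structure. This is an illustrative model only: the tier names, the severity convention, and the `Ticket` class are hypothetical and are not taken from ServiceNow, Jira, or ConnectWise.

```python
from dataclasses import dataclass, field

TIERS = ["tier-1", "tier-2", "tier-3"]  # hypothetical support levels


@dataclass
class Ticket:
    summary: str
    severity: int        # assumed convention: 1 is the most severe
    urgency: int         # assumed convention: 1 is the most urgent
    tier_index: int = 0  # every ticket starts at the first tier
    history: list = field(default_factory=list)

    @property
    def assigned_tier(self):
        return TIERS[self.tier_index]

    def escalate(self, reason):
        """Move the ticket to the next tier. Returns False if it is already at the top tier."""
        if self.tier_index >= len(TIERS) - 1:
            return False
        # Record which tier handed the ticket off, and why.
        self.history.append((self.assigned_tier, reason))
        self.tier_index += 1
        return True


if __name__ == "__main__":
    ticket = Ticket("Site dal-01 unreachable", severity=1, urgency=1)
    ticket.escalate("Remote reboot did not restore service")
    print(ticket.assigned_tier, ticket.history)
```

Keeping the hand-off reason in the ticket history means the next tier does not have to repeat troubleshooting steps that were already tried.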
ServiceNow is a comprehensive enterprise workflow and ITSM platform. It allows your organization to set up and handle proper flows and configurations for incidents, changes, problems, and much more right out of the box. It's an incredibly robust and multifaceted ITSM tool—the “gold standard” as far as ticketing systems are concerned.
Because it’s the gold standard, it’s an expensive toolset to purchase and implement. It’s also, as we’ve seen firsthand, quite “generic.” It takes time and energy to customize and optimize the workflows to meet the specific needs of your organization. Like some of the other tools we’ve mentioned here, ServiceNow has a wide array of powerful capabilities, but it doesn’t dictate those workflows. You have to build out the service catalog and each of the various workflows.
Once the system has been fine-tuned, however, the sky’s the limit in terms of powerful configurations. For example, your organization can integrate an AIOps tool to feed in the incidents for a whole new level of workflow efficiency.
Another capability is adding intelligence that auto-attaches configuration items that are impacted from your CMDB. It’s even possible to trigger scripts from ServiceNow to log into and collect data from your equipment or infrastructure and include it in the ticket. In this use case, ServiceNow (integrated with the appropriate pre-incident tools) can retrieve and present so much useful information that it can actually isolate an incident before a NOC engineer lays eyes on the ticket. The efficiency opportunities here can’t be overstated.
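As a rough sketch of what auto-attaching impacted configuration items involves, the Python below looks up each host named on a ticket in a CMDB and attaches the matching records. A plain dictionary stands in for the CMDB here, and the record fields, hostnames, and the `enrich_ticket` function are hypothetical. This is not the ServiceNow API.

```python
# Hypothetical CMDB records keyed by hostname.
CMDB = {
    "edge-rtr-01": {"ci_id": "CI-1001", "site": "dal-01", "services": ["internet-edge"]},
    "core-sw-02": {"ci_id": "CI-1002", "site": "dal-01", "services": ["campus-lan", "voice"]},
}


def enrich_ticket(ticket, cmdb):
    """Return a copy of the ticket with matching configuration items attached."""
    hosts = ticket.get("hosts", [])
    matched = [dict(cmdb[h], hostname=h) for h in hosts if h in cmdb]
    enriched = dict(ticket)  # shallow copy so the caller's ticket is not modified
    enriched["configuration_items"] = matched
    # Deduplicated list of services that depend on the matched items.
    enriched["impacted_services"] = sorted({s for ci in matched for s in ci["services"]})
    # Hosts with no CMDB record are surfaced rather than silently dropped.
    enriched["unmatched_hosts"] = [h for h in hosts if h not in cmdb]
    return enriched


if __name__ == "__main__":
    ticket = {"summary": "Packet loss at dal-01", "hosts": ["edge-rtr-01", "core-sw-02", "lab-fw-09"]}
    print(enrich_ticket(ticket, CMDB))
```

Listing unmatched hosts separately is a deliberate choice in this sketch: gaps in the CMDB are themselves useful information for the engineer reading the ticket.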
Jira differentiates itself from ServiceNow and other ITSM/ticketing platforms because it does have quite a few robust workflows built into it. Also, it’s less “expansive” than ServiceNow. Instead of offering a platform on which many different organizations can do many different things, Jira’s capabilities are more focused on giving developers the ability to track and manage activities throughout the development lifecycle.
It’s pointed first and foremost at enterprises that are partially or fully cloud-based, especially those that have adopted a DevOps model. Jira offers both on-premise and cloud-based solutions, which makes it pretty versatile.
Jira can fall short when too much is asked of it beyond the scope of a development team. If you need to control NOC incident workflow, for example, or customize communications to clients and their customers, you may be stretching Jira beyond its capability unless you have the operational intelligence needed to make it work well outside its main focus.
We realize this is a big frustration for a lot of organizations whose DevOps teams otherwise love Jira, which is why we routinely fill that operational gap by integrating Jira into our NOC platform. This way, teams can keep the DevOps models they know and love for developing, deploying, and changing code while gaining the intelligence needed to work incidents.
ConnectWise offers a suite of products that can be purchased together as a complete ITSM solution or as targeted solutions based on the need, such as NMS or recovery. When used as a suite of tools for the NOC, ConnectWise offers some useful value-adds, namely, an AIOps capability positioned between the NMS and the ITSM product to enhance correlation.
While ConnectWise has some limitations to its monitoring capabilities and may not be ideal for every monitoring environment, it’s quite robust from a ticketing perspective. Similar to the value-adds that come with tying, say, Microsoft products together in a single environment, ConnectWise has found its niche among organizations that find value in its ability to tie its various tools together and become a ConnectWise “shop.”
Also, similar to ServiceNow, ConnectWise offers the ability to provide customer or end-user portals, which is particularly useful for organizations looking to provide visibility to other stakeholders.
4. Reporting
Reporting has two primary functions in the NOC. One is to understand how the NOC is operating to better manage its components (tools, staff, and processes) for day-to-day operations and to understand trends for mid to long-term planning. The second is to identify patterns that point to chronic issues so teams can conduct long-term problem management to fix them.
To get reporting robust enough to achieve both of these goals well, two important components have to be in place. One is the backend, which consists of the data lake and the data warehouse. The other is the front end, which consists of a visualization component and a data exploration component.
Power BI is a Microsoft reporting product that is popular because its desktop version is free and quite powerful (sharing and collaboration features require paid licenses). And then there's Tableau, which is decidedly not free, but somewhat more powerful. Both are available as cloud-based and on-premise solutions. Fundamentally, both are business intelligence tools that allow you to build dashboards and visualizations to help you understand data. They can take complex data and allow you to analyze and present it in a simple, digestible way.
In the NOC, these tools help teams understand, for example, how much time is being spent handling issues, how many tickets are being generated over time, or more granularly, how individual engineers are performing.
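The kinds of figures mentioned above are straightforward aggregations over ticket records. The Python sketch below computes tickets opened per day and mean time to resolve per engineer. The ticket fields, names, and timestamps are made-up sample data, and in practice a BI tool would run equivalent queries against the data warehouse rather than a Python list.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical ticket export; timestamps are ISO 8601 strings.
TICKETS = [
    {"engineer": "alice", "opened": "2024-03-01T08:00", "resolved": "2024-03-01T09:30"},
    {"engineer": "bob", "opened": "2024-03-01T10:00", "resolved": "2024-03-01T10:45"},
    {"engineer": "alice", "opened": "2024-03-02T14:00", "resolved": "2024-03-02T14:30"},
]


def minutes_to_resolve(ticket):
    """Elapsed minutes between the opened and resolved timestamps."""
    opened = datetime.fromisoformat(ticket["opened"])
    resolved = datetime.fromisoformat(ticket["resolved"])
    return (resolved - opened).total_seconds() / 60


def tickets_per_day(tickets):
    """Count tickets by the calendar date on which they were opened."""
    return Counter(t["opened"][:10] for t in tickets)


def mean_minutes_by_engineer(tickets):
    """Average time to resolve, in minutes, for each engineer."""
    grouped = {}
    for t in tickets:
        grouped.setdefault(t["engineer"], []).append(minutes_to_resolve(t))
    return {name: mean(values) for name, values in grouped.items()}


if __name__ == "__main__":
    print(dict(tickets_per_day(TICKETS)))
    print(mean_minutes_by_engineer(TICKETS))
```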
Snowflake is a cloud-based data warehouse, and Redshift is the managed data warehouse service within AWS. These tools serve as data stores optimized to deliver data into front-end platforms like those we mentioned above. They can be used as data lakes (for storing raw data) or data warehouses (for storing processed data that has been normalized to a common format so that front-end tools can easily consume it).
An ETL tool is used between the data lake and data warehouse to pull in and transform the data into a normalized format. This enables your organization to do simple reporting projects like generating graphs as well as much more sophisticated reporting such as using machine learning against the data to discern patterns and trends.
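The transform step can be illustrated with a short Python sketch that maps records from two differently shaped sources onto one common schema. Both source formats, the severity mappings, and all function and field names are hypothetical. A real ETL tool would add validation, error handling, and loading into the warehouse.

```python
from datetime import datetime, timezone

# Assumed numeric severity scale for hypothetical source A.
SOURCE_A_SEVERITY = {1: "critical", 2: "warning"}


def from_source_a(row):
    """Source A (hypothetical): epoch seconds and a numeric severity."""
    return {
        "device": row["hostname"].lower(),
        "severity": SOURCE_A_SEVERITY.get(row["sev"], "info"),
        "occurred_at": datetime.fromtimestamp(row["epoch"], tz=timezone.utc).isoformat(),
    }


def from_source_b(row):
    """Source B (hypothetical): ISO timestamps with an offset and a text severity."""
    occurred = datetime.fromisoformat(row["time"]).astimezone(timezone.utc)
    return {
        "device": row["node"].lower(),
        "severity": row["level"].lower(),
        "occurred_at": occurred.isoformat(),
    }


def normalize(rows_a, rows_b):
    """Convert both sources to the common schema and order by UTC timestamp."""
    rows = [from_source_a(r) for r in rows_a] + [from_source_b(r) for r in rows_b]
    return sorted(rows, key=lambda r: r["occurred_at"])


if __name__ == "__main__":
    raw_a = [{"hostname": "CORE-SW-01", "sev": 1, "epoch": 1709294400}]
    raw_b = [{"node": "app-srv-07", "level": "WARNING", "time": "2024-03-01T06:00:00-05:00"}]
    for record in normalize(raw_a, raw_b):
        print(record)
```

Converting every timestamp to UTC and every severity to one vocabulary is what lets front-end tools report across sources without per-source logic.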
This level of reporting enables a NOC to mature its operation by examining itself in extreme detail. What specifically is taking up most of an engineer’s time? How can that task be re-examined or re-tooled to be more efficient? What root causes can we address through problem management?
It’s important to note that, as we’ve seen many times over, IT organizations can face a huge upfront investment in normalizing disparate data sources to be able to report across them. Once that hurdle has been cleared, however, the reporting opportunities deliver a level of consistent value that far exceeds that upfront cost.
Here at INOC, we first help organizations shrink that hurdle by drawing on years of experience to streamline normalization, and then present a wealth of established reports, dashboards, and visualizations to start delivering value immediately.
Zooming Out: Properly Operationalizing Tools Through Strategic NOC Outsourcing
Outsourcing NOC support to an operationally mature NOC services provider offers a number of advantages over building out a NOC in-house. It often lowers both up-front and ongoing costs. It enables organizations to utilize their own IT resources better. It makes it incredibly easy to scale up and down to reflect changes in the business.
Outsourcing NOC service to a highly capable support provider is attractive from a tools perspective, too. Take the NMS, for example. The high cost of purchasing and integrating a monitoring and management solution only gets higher when multiple, disparate monitoring and management systems sow confusion, create tension between teams, and steal valuable time from revenue-generating projects.
This problem is extremely common among enterprises and communications service providers—and it's one that strategic NOC outsourcing is perfectly suited to solve. Rather than replacing otherwise well-functioning monitoring systems, an expert outsourced NOC service provider can simply fill the operational gap between them, enhance the insights that flow out of them, and standardize everything through a “single pane of glass.”
Here are a few of our own capabilities as an outsourced NOC support partner that have proven to be massive value-adds for organizations struggling to make their tools work for them, rather than the other way around:
- Alarming interface integrations: When monitoring tools are already in place, we integrate downstream of an NMS, EMS, and/or devices through an alarming interface—the mechanism by which your systems tell ours that an event has occurred.
- Event correlation and ticketing integrations: Once we’ve received an alarm, we employ both human and automated ticket correlation processes to create appropriate incident tickets, problem tickets, and other records, which can be synchronized to the ticketing system for troubleshooting and resolution.
- CMDB integrations: A seamless CMDB integration ensures our configurations are a perfect match. For each alarm we receive and each subsequent ticket we create, CMDB integration associates the appropriate meta information, arming the NOC engineer with the actionable information they need to make informed decisions. When necessary, we also draw on years of experience to enhance existing CMDB structures and capabilities, further enhancing efficiency and effectiveness.
Final Thoughts and Next Steps
We realize organizations make significant investments in their IT infrastructures and the tools they use to support them, which is why we built our outsourced NOC support services to be highly capable, highly flexible, and highly integrable. No matter what tools you’re currently using or where your operational gaps lie, we take the time to discover exactly where technology or intelligence is needed to make your service work better and tailor a support solution to fit.
When it comes to NOC tools specifically, we help IT organizations save time and money that would otherwise need to be spent configuring tools, integrating them with other systems, developing processes and procedures around them, and training staff to use them effectively. By feeding the right intelligence into your operational model, the end result is almost always the same: increased accuracy, increased productivity, higher success rate for resolution, and ultimately, reduced cost of operations.
Need to take your existing support infrastructure to the next level with an outsourced NOC solution? Schedule a NOC consultation with our Solution Engineers and start the conversation. Want to learn more about applying advanced tools to the NOC? Grab our free white paper below and learn how much you stand to gain from adding AIOps to your support workflows.
FREE WHITE PAPER
The Role of AIOps in Enhancing NOC Support
Download our free white paper and learn how your NOC support stands to gain from AIOps by overcoming operational challenges and delivering outstanding service. Use the free included worksheet to contextualize the value of AIOps for your organization.