In today’s expanding landscape of threats, both security-related and operational, network monitoring is an essential part of protecting your infrastructure. The right monitoring solution not only serves to maximize uptime and efficiency but alerts you to potential issues before they cause devastating—and costly—outages.

But especially now, as enterprises and large service providers continue to “lift and shift” their workloads to the cloud and move to distributed workforce models in light of the pandemic, the task of monitoring and management has stretched well beyond the confines of the “network.” This evolution has brought a whole new set of challenges and complexity.

What’s often referred to as network monitoring has functionally evolved into infrastructure monitoring. Businesses need eyes on a variety of elements across networks and the cloud, as well as databases, applications, and other types of assets. And only “monitoring” these environments isn’t enough, either. Organizations need efficient, real-time management to keep IT operations up and running at peak performance.

To do that, teams are asking new and tougher questions of their support vendors to find partners capable of meeting these changing and expanding service demands. A capable partner needs to understand the complex operational challenges that are keeping teams up at night and come prepared with a service catalog that takes on all or some of that work to meet service level requirements.

Organizations need more than just basic support capabilities from a partner. They need process engineers who can look at and solve problems holistically.

Here, we lay out a playbook your team can use to find a Network Operations Center (NOC) support partner with the technical capability, operational sophistication, and expertise to protect and optimize your IT operations in today’s evolving environments. The right partner should also help you better utilize your own resources so you can focus on forward-thinking, revenue-generating activities.

Look for the following elements in any prospective NOC service partner:

A Structured Operational Framework
An Integrated and Integrable Platform
Continual Service Improvement (CSI)

1. A Structured Operational Framework

Turning up outsourced monitoring support through a third-party NOC support provider, especially for larger organizations, should never be treated like a flip of the switch.

Whether it’s just monitoring or more comprehensive NOC support, an effective service provider should conduct a comprehensive assessment and apply its findings to design an operation that exercises a trusted framework such as ITIL* or FCAPS, and organizes your NOC activities and workflows to properly utilize resources and lower operational expenses.

Turning up support on a properly-tiered IT support structure enables IT managers to leverage the lower-cost first-level or Tier 1 NOC to perform routine activities and free up higher-level or Tier 2 and 3 IT support engineers to focus on more advanced issues and implement strategic, revenue-generating initiatives for the organization.

The figure below illustrates a well-organized tiered NOC support structure in action. Here, the Tier 1 team uses monitoring tools and interacts with end-user help desks and Tier 2 and 3 engineers and third parties. Information flows between the various entities within a well-defined process framework.

Figure: Tiered NOC Support Structure

Grab our free white paper, Empowering the IT Service Manager, to learn more about how a well-organized support structure enables teams to cost-effectively address their IT support needs by leveraging lower-cost first-level (Tier 1) support.

Having such a structure for managing workflows prevents the NOC from being overwhelmed by the “wall of red” that leaves teams exhausted and your business vulnerable. In any NOC supporting an enterprise or large service provider, incoming work should be prioritized and organized into a set of queues so each item can be handled by the appropriate group. With this approach, “monitoring” isn’t planned for as a siloed activity, but as a piece of the broader system it serves.
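To make the queueing idea concrete, here is a minimal sketch of severity-ordered queues with simple category-based routing. The severity ranks, queue names, and routing table are assumptions chosen for illustration; a real NOC would derive them from its own runbooks and tier definitions.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical severity ranks (lower number = handled first).
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "info": 3}


@dataclass(order=True)
class Ticket:
    priority: int
    seq: int
    summary: str = field(compare=False)
    category: str = field(compare=False)


class TriageBoard:
    """Routes tickets into per-group queues, each ordered by severity."""

    def __init__(self, routing):
        self.routing = routing   # category -> queue name (hypothetical mapping)
        self.queues = {}         # queue name -> heap of Tickets
        self._seq = count()      # tie-breaker that preserves arrival order

    def submit(self, summary, category, severity):
        # Anything without an explicit route lands in the Tier 1 queue.
        queue = self.routing.get(category, "tier1")
        ticket = Ticket(SEVERITY_RANK[severity], next(self._seq), summary, category)
        heapq.heappush(self.queues.setdefault(queue, []), ticket)
        return queue

    def next_ticket(self, queue):
        heap = self.queues.get(queue)
        return heapq.heappop(heap) if heap else None


board = TriageBoard({"routing": "tier2", "database": "tier3"})
board.submit("Disk usage at 80%", "server", "minor")
board.submit("Core router unreachable", "server", "critical")
board.submit("BGP session flapping", "routing", "major")

first = board.next_ticket("tier1")
print(first.summary)
```

In this sketch the critical ticket is served before the earlier minor one, and the routing-protocol issue goes straight to the Tier 2 queue instead of sitting in front of the Tier 1 team.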

So how can you know if a potential support partner offers the structured operational framework your organization deserves? The following questions are just a few that can be instructive here.

First, do they have a structured operational framework for designing and delivering support? Don’t assume all NOC support providers think about operations holistically and intentionally. Many surprisingly don’t.
Do they conduct a comprehensive assessment? A capable service partner should go beyond simply “collecting” your support requirements. Here at INOC, for example, a certified project manager and engagement lead explores and clarifies your support needs in full detail before any piece of the service blueprint comes together.
Do they complement assessment with a dedicated design and onboarding process? Following the assessment, do they draw on their findings and knowledge of operational best practices to provide tiered organization structures, define appropriate workflows, and provide for full operational visibility into your infrastructure and support systems? Once support is turned up, you should get access to a client portal to track metrics that help both teams benchmark performance and determine the root cause of ongoing issues for more effective resolution and prevention.
Do they align with frameworks such as ITIL and FCAPS? Many organizations, especially enterprises and service providers, invest themselves in popular operational frameworks. That makes it important not only to have a structured NOC but also to ensure it aligns with the practices and processes a framework such as ITIL prescribes. This is a big differentiator between service providers. Find a partner that can demonstrate the same or an even higher level of investment in the operational framework you rely on.
Do they offer a service catalog? A fully-capable support partner should offer a service catalog that covers a wide range of operational support functions for proactively monitoring, detecting, and measuring service availability and performance across your infrastructure and its support operations.

2. An Integrated and Integrable Platform

An effective support partner should offer a platform that’s both highly integrated in its own right and able to integrate with your environment without eliminating the policies, processes, and procedures that make your team effective.

There’s a lot to consider here, so we broke this point down into four important areas:

Integration Strategy
Advanced Correlation, Machine Learning, and Automation Capabilities
Comprehensive Incident Management
KPIs and Reporting

Integration Strategy

Most IT organizations already have significant investments in tools, technology, and other support resources. Your NOC support partner should be able to integrate with those resources to augment and enhance your support capabilities without forcing you out of your investments or changing the way you do business or the environment you’re comfortable working in.

To that end, their platform should integrate with all your support systems, both technical and procedural, to fill gaps and enhance current capabilities. Whether you need a support partner to augment your current service activities (to escalate activities to your engineers, for example), or a NOC team that can take a more active role in working and resolving issues, their platform should be flexible and powerful enough to meet your needs precisely.

In addition to having a highly integrable platform, a capable NOC support partner should also have a centralized visibility system (a “single pane of glass”) in place to ensure alarms are presented to highly-skilled engineers who can make meaningful correlations, reduce response and resolution times, and ultimately achieve (or exceed) SLA and SLO windows.

As we explore further in the next section, a lack of platform integration can leave tools like the NMS and EMS isolated, which in turn can trigger a cascade of issues, most notably forcing staff to gather data from multiple sources and manually correlate alarm and incident data for proper ticketing.

Any managed service provider you look to for monitoring or broader management support should offer a platform that can seamlessly integrate with your NMS or EMS—receiving event data and polling your infrastructure elements (network, cloud, and applications) using a variety of mechanisms and protocols.

Whether you’re leaning on them for direct monitoring or plan to integrate your own monitoring tools with their platform, your support partner should be able to seamlessly and flexibly connect their NOC platform to your infrastructure or monitoring tools so alarms flow freely with the appropriate integrations across both monitoring and ticketing systems.

Monitoring Systems:

SNMP
Email Events
REST API
Streaming telemetry data such as gNMI or other similar protocols

Ticketing Systems:

REST API
SOAP API
Email integration

Integration Capabilities:

Alarming interface integrations: If you run your own monitoring tools, an outsourced NOC support provider should be able to integrate downstream of your NMS, EMS, and/or devices through an alarming interface, the mechanism by which your systems tell theirs that an event has occurred.
Event correlation and ticketing integrations: An outsourced NOC support provider should also have the tools and processes set up so that once they receive your alarm, they’re able to employ both human and automated ticket correlation processes to create appropriate incident tickets, problem tickets, and other records that can be synchronized to your ticketing system for troubleshooting and resolution. More on advanced correlation below.
CMDB integrations: A seamless CMDB integration ensures both parties’ configurations are a perfect match. For each alarm your service provider receives and each subsequent ticket they create, the CMDB integration should associate the appropriate meta information, arming the NOC engineer with the actionable information they need to make informed decisions. Taking this one step further, a highly-capable support partner should also be able to draw on years of experience to enhance your existing CMDB structures and capabilities, further improving efficiency and effectiveness.
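The integration points above can be sketched as two small steps: normalizing an incoming event into a common alarm shape, then attaching CMDB metadata to it. This is an illustrative sketch only. The payload fields (`host`, `text`, `level`), the device names, and the in-memory `CMDB` dictionary are hypothetical stand-ins for whatever your monitoring tools and CMDB actually expose.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    source: str      # e.g. "snmp", "email", "rest"
    device: str
    message: str
    severity: str
    context: dict = field(default_factory=dict)


# Hypothetical CMDB extract keyed by device name.
CMDB = {
    "edge-rtr-01": {"site": "Chicago", "owner": "network-ops", "service": "WAN"},
}


def normalize_rest_event(payload):
    """Map a hypothetical REST payload onto the common Alarm shape."""
    return Alarm(
        source="rest",
        device=payload["host"],
        message=payload["text"],
        severity=payload.get("level", "minor").lower(),
    )


def enrich(alarm, cmdb):
    """Attach CMDB metadata; flag devices the CMDB does not know about."""
    record = cmdb.get(alarm.device)
    if record is None:
        alarm.context = {"cmdb_match": False}
    else:
        alarm.context = {"cmdb_match": True, **record}
    return alarm


alarm = enrich(
    normalize_rest_event({"host": "edge-rtr-01", "text": "Link down", "level": "MAJOR"}),
    CMDB,
)
unknown = enrich(
    normalize_rest_event({"host": "lab-sw-99", "text": "Fan failure"}),
    CMDB,
)
print(alarm.context)
print(unknown.context)
```

Flagging unmatched devices explicitly, rather than silently dropping them, is one way to surface the CMDB mismatches the integration is meant to prevent.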

Advanced Correlation, Machine Learning, and Automation Capabilities

Especially in large organizations, IT teams may receive dozens or even hundreds of alarms, all emanating from a single device. Without the means to collect and correlate that information efficiently, the scene is predictable: a team of engineers spending valuable time looking in different places trying to figure out what the problem is—all while downtime accrues and the business suffers.

As we mentioned earlier, solving this problem often starts with integration. But it also requires an effective correlation engine. An “isolated” NMS or EMS forces staff to spend valuable time looking at different systems to gather data and even more time manually correlating alarm and incident data for proper ticketing.

Band-aid solutions, such as automated ticketing platforms that don’t also offer advanced correlation capabilities, simply move the problem around rather than solve it. The issue re-emerges as a deluge of unnecessary tickets that overwhelm staff and threaten SLAs, because unsophisticated or completely absent correlation engines generate multiple tickets for a single event or incident that presents itself through multiple alarms.
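As a rough illustration of the difference correlation makes, the sketch below groups alarms from the same device into a single incident when they arrive close together in time. It is a deliberately simple rule-based example, not a machine learning engine, and the five-minute window, device names, and alarm messages are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    device: str
    first_seen: float
    last_seen: float
    alarms: list = field(default_factory=list)


def correlate(alarms, window_seconds=300):
    """Group (timestamp, device, message) alarms into incidents.

    An alarm joins the open incident for its device if it arrives within
    window_seconds of that incident's most recent alarm; otherwise it
    opens a new incident.
    """
    open_by_device = {}
    incidents = []
    for timestamp, device, message in sorted(alarms):
        current = open_by_device.get(device)
        if current is not None and timestamp - current.last_seen <= window_seconds:
            current.alarms.append(message)
            current.last_seen = timestamp
        else:
            current = Incident(device, timestamp, timestamp, [message])
            open_by_device[device] = current
            incidents.append(current)
    return incidents


# Timestamps are seconds from an arbitrary start; all values are made up.
alarms = [
    (0, "core-sw-01", "Interface Gi0/1 down"),
    (4, "core-sw-01", "OSPF neighbor lost"),
    (9, "core-sw-01", "BGP peer down"),
    (20, "db-02", "Replication lag high"),
    (900, "core-sw-01", "Interface Gi0/1 down"),
]
incidents = correlate(alarms)
print(len(alarms), "alarms ->", len(incidents), "tickets")
```

Without the grouping step, a naive auto-ticketing rule would open five tickets here; with it, the three near-simultaneous alarms from one switch become a single incident for an engineer to work.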

A few key questions can reveal the need for better correlation capabilities:

Is your staff manually collecting information to decide which alarms are actionable?
Is your staff manually correlating alarms to generate appropriate tickets?
Is your staff able to capture and capitalize on correlation lessons from the past to improve future performance? Is this “tribal” knowledge, or is it documented and distributed?

Especially at the enterprise and service provider level, a reliable NOC support partner should use advanced correlation and machine learning tools to improve accuracy and identify network, infrastructure, and application issues faster. It should also implement automation to respond to incidents wherever possible to reduce resolution times.

Here at INOC, for example, we’ve implemented AIOps (a blend of automation and machine learning) into our workflows to gather and correlate data better and faster than even the best human teams can manage so staff can spend their time where it matters most: taking action toward a resolution. We devoted an entire white paper to explain what this technology offers, how it applies to the NOC, and how it can be operationalized to unlock new opportunities for accuracy and efficiency. Grab it here.

Over time, both through machine learning and other inputs into the correlation platform, teams that implement these capabilities well can generate appropriately-correlated tickets in under a minute. Achieving this level of efficiency can solve some of the toughest operational challenges and bring results you can measure in dollars:

Reducing IT operational expenditures by managing more data and potential incident volume without having to grow headcount.
Avoiding outages and downtime by detecting incidents before they become outages. AIOps tools have unlocked a whole new ability to quickly uncover the root cause of incidents and get the right teams involved for rapid resolution.
Better utilizing resources by boosting Tier 1 resolution and reducing unnecessary escalations so advanced engineers can work on revenue-generating projects without incurring new operational risks.

Comprehensive Incident Management

Many incident management problems stem from poor workflows and an inadequate or absent operational framework. Ineffective runbooks can lead to a slow, inadequate ticket response. Without these critical operational components, teams can struggle to simply find information and operate the systems in front of them.

A capable monitoring and NOC support partner should demonstrate that they have the runbook and process sophistication to manage the incidents your business encounters. Just as importantly, that sophistication should be baked into the incident management platform as workflows within the ticketing platform.

Read on: Incident Management: The Foundation of a Successful NOC

KPIs and Reporting

Simply put, you can't manage services without identifying, tracking, and acting on the right metrics and KPIs.

A NOC support partner should securely warehouse all event, ticket, incident, performance, and CMDB data and feed it into a client portal to provide clear reporting overviews into KPIs through dashboards, as well as detailed information on NOC support activity, real-time network status, and performance. Standard and custom reports based on event, incident, and performance data should be available from the portal.

Expertise informed by experience is critical here. A capable partner should have in-depth knowledge in developing metrics and KPIs and analyze that performance to find and seize opportunities to improve service.
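For a sense of what such reporting involves, here is a minimal sketch that derives three common KPIs from ticket records: mean time to resolve, SLA compliance, and Tier 1 resolution rate. The ticket fields and the four-hour SLA target are hypothetical; actual definitions would come from your service level agreements.

```python
from datetime import datetime, timedelta
from statistics import mean


def kpi_summary(tickets, sla=timedelta(hours=4)):
    """Summarize resolved tickets; open tickets are excluded from every KPI."""
    resolved = [t for t in tickets if t["resolved"] is not None]
    if not resolved:
        return {"mttr_minutes": None, "sla_compliance": None, "tier1_resolution_rate": None}
    durations = [t["resolved"] - t["opened"] for t in resolved]
    return {
        # Mean time to resolve, in minutes.
        "mttr_minutes": mean(d.total_seconds() for d in durations) / 60,
        # Share of resolved tickets closed within the SLA window.
        "sla_compliance": sum(d <= sla for d in durations) / len(resolved),
        # Share of resolved tickets closed without escalation past Tier 1.
        "tier1_resolution_rate": sum(t["resolved_by_tier"] == 1 for t in resolved) / len(resolved),
    }


t0 = datetime(2024, 1, 1, 9, 0)
tickets = [
    {"opened": t0, "resolved": t0 + timedelta(minutes=30), "resolved_by_tier": 1},
    {"opened": t0, "resolved": t0 + timedelta(hours=2), "resolved_by_tier": 1},
    {"opened": t0, "resolved": t0 + timedelta(hours=6), "resolved_by_tier": 2},
    {"opened": t0, "resolved": None, "resolved_by_tier": None},
]
summary = kpi_summary(tickets)
print(summary)
```

Whether open tickets should count against SLA compliance is a design choice. This sketch leaves them out, which is worth stating explicitly on any dashboard built this way.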

3. Continual Service Improvement (CSI)

Change is inevitable and often constant in any IT organization. It’s also a perennial challenge that speaks to the core project of ITSM: providing support for the devices and broader infrastructure you manage.

Without a thoughtfully structured approach to ITSM, the service quality will almost certainly suffer and degrade over time. And without a robust CSI program to capture those changes and detect potential or existing issues, that degradation can fly under the radar until the organization slowly and painfully becomes aware as problems become more frequent, severe, and costly.

This scenario is all too common when service decisions are driven primarily by price. Put bluntly, if cheap service results in poor service, those costs will come back (often many times over) financially and as damage to the brand.

To keep up with changes, an effective service provider needs to capture changes both in the infrastructure and in the service and support environment tasked with monitoring and managing that infrastructure.

Here at INOC, for example, our service catalog breaks requests for change (RFCs) into two types specifically to address this need:

Infrastructure: This addresses, for instance, a server or cloud instance being decommissioned, a new network device being added, or an existing network device being changed.
Monitoring and management: The other type of RFC addresses changes to how we monitor and manage. Here, we’re considering changes to, say, an NMS or EMS platform, whether it’s addressing a change to a process, a person, or something else.

The critical point here is that the types of changes are clearly defined and delineated to manage them appropriately.
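One way to picture this delineation is as an explicit type on every change record, so each kind of change triggers its own follow-up work. The sketch below is illustrative only: the class names and the follow-up task lists are assumptions, not a description of any particular service catalog.

```python
from dataclasses import dataclass
from enum import Enum


class RfcType(Enum):
    INFRASTRUCTURE = "infrastructure"
    MONITORING_AND_MANAGEMENT = "monitoring_and_management"


@dataclass(frozen=True)
class ChangeRequest:
    rfc_type: RfcType
    target: str
    description: str


def follow_up_actions(rfc):
    """Return illustrative (hypothetical) follow-up tasks for each RFC type."""
    if rfc.rfc_type is RfcType.INFRASTRUCTURE:
        return ["update CMDB record", "adjust monitoring coverage", "review runbook"]
    return ["update NMS/EMS configuration", "revise process documentation", "notify NOC staff"]


decommission = ChangeRequest(
    RfcType.INFRASTRUCTURE, "app-server-07", "Server being decommissioned"
)
new_process = ChangeRequest(
    RfcType.MONITORING_AND_MANAGEMENT, "escalation process", "New on-call contact"
)
print(follow_up_actions(decommission))
print(follow_up_actions(new_process))
```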

Capturing Infrastructure Changes

A support partner can capture infrastructure changes in multiple ways. One way is for you, the client, to simply inform them of an impending change and ask, “how are you going to monitor and manage this?”

Receiving basic change notifications like this should be considered table stakes for the NOC. The difference between good and great service providers here lies in the continual service improvement (CSI) program.

First, do they have a CSI program? And if so, does it offer the mechanisms needed to ensure your team can sleep at night knowing changes won’t slip through the cracks and create problems?

These mechanisms should include:

Comprehensive metrics and KPIs
A comprehensive understanding of how to develop and analyze metrics and KPIs
Periodic business reviews (including runbook reviews)
Dedicated quality control and quality assurance programs
A dedicated client engagement team
A dedicated account management team

The service provider should spot gaps in service and understand if that gap is widening (and therefore requires broader discovery and solutioning). Quality assurance and client engagement are other huge value-adds that work both reactively and proactively to resolve issues.
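A simple way to judge whether a service gap is widening is to track the shortfall against a target across review periods and look at its trend. The sketch below fits a least-squares slope to that shortfall. The monthly figures and the 98% target are made-up values, and a real CSI program would weigh many more signals than a single metric.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def gap_is_widening(actuals, target, tolerance=0.0):
    """True if the shortfall against target trends upward across periods."""
    gaps = [max(target - actual, 0.0) for actual in actuals]
    return trend_slope(gaps) > tolerance


# Hypothetical monthly SLA compliance figures against a 98% target.
declining = [0.97, 0.95, 0.94, 0.91]
recovering = [0.91, 0.94, 0.97]
print(gap_is_widening(declining, target=0.98))
print(gap_is_widening(recovering, target=0.98))
```

A positive slope on the shortfall is the cue for the broader discovery described above, while a flat or negative slope suggests the gap is holding steady or closing.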

The takeaway here is that any capable service provider should acknowledge that change happens and needs to be monitored for and addressed—even if someone forgets to pick up the phone in preparation. How do they capture those service gaps?

Read on: ITIL CSI: A Guide and Checklist for IT Support and the NOC

Final Thoughts and Next Steps

IT support managers and staff face significant challenges as rising service demands call for faster time to resolution, higher service availability, and increased end-user satisfaction, all while costs stay under control.

Effective infrastructure support requires understanding and proactively monitoring and quantifying IT support activities that include event and incident management, processing of service requests, maintenance of documentation, and periodic review of the service performance.

Whether for NOC monitoring or a more comprehensive management function, the right managed service provider should empower your team and bring you closer to your business goals.

Looking for a partner that brings all of these capabilities to improve uptime and performance for your business? Contact us to see how we can help you improve IT service and NOC support, or check out our other resources and download our free white paper below.

*Originally developed by the UK government’s Office of Government Commerce (OGC) - now known as the Cabinet Office - and currently managed and developed by AXELOS, ITIL is a framework of best practices for delivering efficient and effective support services.