NOC Operations: A Guide to People, Platforms and Processes

Written by Prasad Rao | Sep 20, 2022 4:36:00 PM

Organizations that rely on IT infrastructure to keep their core business activities functioning face many challenges within their IT support or NOC group.

How many of these sound familiar?

Burnout of IT support personnel
Impaired off-hour coverage
Lack of proper tools and a structured, process-oriented approach
Lack of accountability
Slow response and high resolution times
High costs of supporting IT infrastructure

From a business standpoint, these problems can result in poor end-user or customer experience, loss of productivity, delayed strategic initiatives, and employee attrition.

NOC teams often underestimate the amount of support activity performed by IT staff due to the absence of defined metrics, appropriate data collection, and a process-oriented support system. Moreover, a lack of visibility into the nature and types of support activities leads to inefficient utilization of qualified IT staff.

All of these challenges are expressions of a bigger, central problem: a suboptimal NOC operation. Simply put, if a NOC can’t operationalize itself effectively, it can’t work effectively—or efficiently for that matter. The challenges we just mentioned are, in a way, symptoms of an underpowered operation.

Here, we tackle a few of the most common and significant challenges that relate to a deeper operational problem and offer practical advice for overcoming them. Since there are many dimensions of a NOC operation, we’ve organized them into three core categories:

Process
People
Platform (Technology)

Investigating these areas is the first step in developing a NOC improvement plan to resolve or relieve performance issues. We identify common challenges in each of these three areas and offer recommendations for overcoming them.

Process

Challenge: Lack of meaningful operational/utilization metrics

Many NOCs don’t have service level agreements (SLAs) or internal utilization metrics and are therefore operating without clearly defined service requirements or methods of monitoring their performance.

First, a quick review of these concepts:

SLAs are expectations of performance measures codified in a contractual agreement that specify what a service is, who’s responsible for it, and what the target expectations are.
Utilization metrics are measurements that reveal why the NOC is or isn’t busy at any point in time, and how staffing levels should be set accordingly. Ironically, they are one of the biggest missing pieces we see in NOCs today, even though they are essential for operational efficiency.

If SLAs are missing or in need of improvement, support leaders should first consider what they want or need to deliver. What, specifically, do your end-users or customers expect from the NOC? From there, a NOC can develop SLAs if it hasn’t already or sharpen any existing agreements that are in place.

Once you know how quickly and accurately the NOC needs to perform, start by measuring the more precise components of a service level: service level objectives (SLOs) and service level indicators (SLIs). An SLI is the measured value, an SLO is the target for that value, and together they are the building blocks of an SLA. By starting with these elements, you can essentially build your SLAs from the bottom up and have an accurate way to determine whether you are meeting or missing them.
After implementing these metrics, establish the necessary reporting components to track them over time. Put a program in place to ensure you’re continually asking yourself: Where are we succeeding? Where are we not succeeding? How do we enact change to improve upon these metrics?
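
To make the relationship between an SLI and an SLO concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a prescribed method: the `Ticket` fields, the choice of time-to-acknowledge as the indicator, the 15-minute target, and the 95% objective are all hypothetical and would come from your own service requirements.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    # Hypothetical fields; real ticketing systems name these differently.
    opened_at: datetime
    acknowledged_at: datetime


def time_to_acknowledge_sli(tickets, target):
    """SLI: fraction of tickets acknowledged within the target duration."""
    if not tickets:
        raise ValueError("no tickets in the reporting period")
    within_target = sum(
        1 for t in tickets if t.acknowledged_at - t.opened_at <= target
    )
    return within_target / len(tickets)


def meets_slo(sli, objective):
    """SLO check: the measured indicator must reach the agreed objective."""
    return sli >= objective


# Four sample tickets opened at the same time, acknowledged after 3, 12, 5, and 20 minutes.
base = datetime(2022, 9, 1, 8, 0)
tickets = [
    Ticket(base, base + timedelta(minutes=m)) for m in (3, 12, 5, 20)
]

sli = time_to_acknowledge_sli(tickets, target=timedelta(minutes=15))
print(f"SLI: {sli:.0%}, meets 95% SLO: {meets_slo(sli, 0.95)}")
```

Running the same calculation on each reporting period gives you the trend line needed to answer where you are and are not succeeding.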

📄 Read our other post to learn more about SLAs, SLOs, and SLIs: NOC Service Level Agreements: A Guide to Service Level Management

To start measuring utilization metrics, you first need to determine what metrics you need to measure and then configure your reporting tools to capture and present that data in a way that’s clear and actionable.

These metrics aren’t always obvious, and figuring out which specific metrics a team should track is a task we help NOCs with all the time.

Here are three of the most important utilization metrics virtually every NOC should be measuring and acting on:

How much human labor is devoted to each edit of a particular type of ticket
How many ticket edits are typically processed/performed per hour
How many edits are typically made by time of day and day of week
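
The three metrics above can be sketched from a simple log of ticket edits. This is only an illustration under stated assumptions: the `TicketEdit` record and its fields are hypothetical, labor is assumed to be logged in minutes per edit, and "per hour" is interpreted here as the average over clock hours in which at least one edit occurred.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TicketEdit:
    # Hypothetical record; assumes your tooling logs labor time per edit.
    ticket_type: str
    edited_at: datetime
    labor_minutes: float


def labor_per_edit_by_type(edits):
    """Average labor minutes spent on each edit, per ticket type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for e in edits:
        totals[e.ticket_type] += e.labor_minutes
        counts[e.ticket_type] += 1
    return {t: totals[t] / counts[t] for t in totals}


def mean_edits_per_active_hour(edits):
    """Average number of edits per clock hour that saw at least one edit."""
    buckets = Counter((e.edited_at.date(), e.edited_at.hour) for e in edits)
    return len(edits) / len(buckets)


def edits_by_weekday_and_hour(edits):
    """Edit counts keyed by (weekday, hour), where Monday is 0."""
    return Counter((e.edited_at.weekday(), e.edited_at.hour) for e in edits)


edits = [
    TicketEdit("incident", datetime(2022, 9, 19, 9, 5), 10),
    TicketEdit("incident", datetime(2022, 9, 19, 9, 40), 20),
    TicketEdit("change", datetime(2022, 9, 19, 14, 15), 30),
    TicketEdit("incident", datetime(2022, 9, 20, 9, 30), 15),
]

labor = labor_per_edit_by_type(edits)
per_hour = mean_edits_per_active_hour(edits)
heatmap = edits_by_weekday_and_hour(edits)
print(labor, round(per_hour, 2), dict(heatmap))
```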

Again, measuring these metrics shouldn’t be particularly burdensome—it’s a matter of configuring your tools to capture them. Reporting on and visualizing them, however, can be a much bigger lift if the team is metrics-deficient in general. NOCs often collect data from multiple sources and struggle to connect the dots.

For example:

The ticketing system has a database.
Email is another database.
Alarms are another database.
Maybe phone calls are another database.

Many if not most NOC managers aren’t equipped to pull those disparate data sources together into a single dashboard or visualization—one “single pane of glass.”

The solution here can look different from one organization to another. Generally, though, it boils down to ensuring your data is properly warehoused and that you have a reporting engine capable of collecting, analyzing, and displaying that data so it’s actionable. (Tools here include Power BI, Tableau, and Cognos.)
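
Before a reporting engine can display disparate sources together, the records usually need a common shape. The sketch below shows one way to normalize two of the sources mentioned above into a single timeline. The input field names (`updated`, `title`, `epoch`, `message`) are hypothetical stand-ins for whatever your ticketing and alarm systems actually export, and a real warehouse would handle this at far larger scale.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Activity:
    """Common record shape shared by every source."""
    source: str
    occurred_at: datetime
    summary: str


def from_ticket(row):
    # Assumes the ticketing export uses ISO 8601 timestamps with a UTC offset.
    return Activity("ticketing", datetime.fromisoformat(row["updated"]), row["title"])


def from_alarm(row):
    # Assumes the alarm feed uses Unix epoch seconds.
    occurred = datetime.fromtimestamp(row["epoch"], tz=timezone.utc)
    return Activity("alarm", occurred, row["message"])


def build_timeline(ticket_rows, alarm_rows):
    """Merge all sources into one list ordered by time."""
    activities = [from_ticket(r) for r in ticket_rows]
    activities += [from_alarm(r) for r in alarm_rows]
    return sorted(activities, key=lambda a: a.occurred_at)


ticket_rows = [{"updated": "2022-09-20T09:05:00+00:00", "title": "Router down"}]
alarm_rows = [{"epoch": 1663664400, "message": "Link flap on core-sw1"}]

timeline = build_timeline(ticket_rows, alarm_rows)
volume_by_source = Counter(a.source for a in timeline)
for activity in timeline:
    print(activity.occurred_at.isoformat(), activity.source, activity.summary)
```

Email and phone records would get their own small adapter functions in the same style, which keeps source-specific quirks out of the reporting layer.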

📄 Read our other post to learn more about establishing critical utilization metrics in the NOC: NOC Performance Metrics: How to Measure and Optimize Your Operation

To understand if you have work to do in either of these areas, ask yourself the following questions:

Are you measuring both the effectiveness and quality of the service you’re providing?
How are you evaluating whether you are delivering the service according to service levels?
Do you find yourself struggling to deliver acceptable services—and don’t know why?
Can you concisely articulate what your NOC is trying to accomplish and deliver?

Challenge: No standardized process framework

Another common issue we see is that businesses do not use a standardized process framework like ITIL to apply established ITSM best practices and maintain consistency across their operation. This leads to NOCs failing to perform at optimal levels.

Businesses can, and most often should, choose a framework such as MOF, FCAPS, or ITIL and use it to standardize NOC procedures, starting with specific areas that are particularly troublesome, such as incident management, problem management, or the service desk. We’ve helped support teams do just that, and we know just how big and multifaceted a project it can be.

Applying a framework to your operation is hard enough on paper. Ensuring that you’re training staff on these procedures can complicate things considerably. Learn more about our NOC Operations Consulting services if you’re in need of expert help here.

📄 Read our other post to learn more about putting ITIL practices into use in the NOC: ITIL Service Operation and the NOC: A Quick-Guide and Checklist

Challenge: No quality control or assurance

To meet customer or end-user expectations, quality assurance is essential and should be fully ingrained into your operation. To do this, the NOC must put in place robust, documented, and trained-on incident and problem management procedures. These are table stakes for preventing quality inconsistencies.

When developing incident and problem management procedures, document processes suited to the environment in question and select appropriate performance metrics. These metrics can be used to catch issues before they affect the client, as well as to respond to client complaints appropriately.

Once these components are in place, the NOC can begin refining its approach to QA/QC.

Here are some additional steps and measurements a NOC can take, track, and act on to control and gradually improve its performance:

Perform monthly audits of a subset of tickets
Ensure staff are properly trained and developed as environments evolve
Convene monthly and quarterly service reviews with stakeholders
Review tickets, incidents, and escalations
Review chronic alarms and incidents
Coordinate a maintenance calendar
Report on service and operations metrics
Perform contractor assessments of performance
Forecast planned activities
Review the quantity and causes for NOC support activities
Identify root causes and preventive actions
Review client escalations and open items
Review staffing and proficiency levels
Validate assumptions for contract
Identify and mitigate risks within infrastructure and operations
Review previous action items and status

Challenge: Lack of documentation and runbooks

Runbooks—standardized processes documented and accessible to staff—are essential for ensuring NOCs function consistently. In our view, the runbook is the single source of truth for everyone inside and outside the NOC.

An effective runbook is thoughtfully planned and produced: technical writers document the tools and procedures needed to deliver NOC services successfully, so the team does not have to rely on institutional knowledge to train new employees. Once that documentation is in place, it’s absolutely essential to keep it up to date, as out-of-date information can in some cases be more harmful than none at all.

Maintaining an accurate repository of the network diagrams and asset details, including configurations and support service levels, allows the IT support group to prioritize incidents appropriately and resolve them quickly and efficiently. A controlled change management process and a routine patch and configuration management procedure are essential for preventing unnecessary downtime.

People

Challenge: Difficulty in staff hiring, training, and retention

The NOC can be notoriously difficult to keep staffed given the high-stress work involved. This churn is often made far worse by a poorly run operation, which can hurt morale and motivate otherwise excellent talent to find other opportunities. (Read our guide to building and managing an effective NOC team for a much deeper dive here.)

If staffing is a chronic issue in your operation, begin by analyzing how you’re hiring, training, developing, retaining, and utilizing your staff. Homegrown NOCs may have smaller staffs that are sometimes more than sufficient to resolve issues, and at other times are furiously busy or constantly on-call.

This whipsawing between extremes can lead to inefficient staff utilization and burnout, which can in turn lead to turnover.

To analyze your practices around staffing and orient yourself for improvement, ask:

How are we training our staff?
What does our staff do during downtime?
When do we have too many staff—and why?
When is our staff overloaded—and why?
Is quality affected by the way we manage after-hours work or shift schedules?
Are we giving staff clear, individual goals they work towards?
Do we have a way to track the effort each individual engineer is putting in?

By implementing the utilization metrics we discussed above, you can identify when your busiest days and times are, and when the fewest issues need to be resolved, allowing you to better schedule your available staff rather than having to rely on assumptions.
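
As a rough illustration of turning those utilization metrics into a schedule, the sketch below estimates headcount per hour from edit volume. The formula, the 15 minutes of labor per edit, and the 75% target utilization are assumptions chosen for the example, not figures from this guide; substitute the values your own measurements produce.

```python
import math


def staff_needed(edits_per_hour, minutes_per_edit,
                 target_utilization=0.75, minimum_staff=1):
    """Estimate headcount for one hour of the day.

    Assumed formula: workload minutes divided by the productive minutes
    one person contributes per hour at the target utilization.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    workload_minutes = edits_per_hour * minutes_per_edit
    productive_minutes_per_person = 60 * target_utilization
    estimate = math.ceil(workload_minutes / productive_minutes_per_person)
    return max(minimum_staff, estimate)


# Hypothetical average edits observed at 03:00, 09:00, and 14:00.
hourly_volume = {3: 1, 9: 12, 14: 6}

schedule = {
    hour: staff_needed(volume, minutes_per_edit=15)
    for hour, volume in hourly_volume.items()
}
print(schedule)
```

The `minimum_staff` floor reflects that a NOC typically cannot drop to zero coverage even in its quietest hours.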

In addition, many small, understaffed NOCs may never be able to justify scaling up their operations due to perpetually low incident volume. In these situations, outsourcing some or all of the NOC can relieve staff from being constantly on call and from the problems that come with too little sleep.

📄 Read our other post for a closer look at how outsourced NOC service models offer a solution here: Shared vs. Dedicated NOC Support: A Quick-Guide

Platform (Technology)

Challenge: Failing to organize NOC activities and the subsequent workflow based on technology and skill level

Many NOCs are inefficient because they immediately escalate routine tasks to advanced staff instead of reserving those engineers for the most complex issues. Others send every issue through the lower (less expensive) tiers first in the hope that it will be resolved before it reaches the upper levels. This approach has its pitfalls too, however, since it can lead to misassignment of tasks and compounding inefficiencies.

Instead, NOCs should consider how to move tickets through tiers most efficiently, providing higher quality service. This may look different for each operation based on many factors, but the figure below, excerpted from our free white paper, may offer an instructive starting point.

Challenge: Disparate tools and platforms — i.e., no “single pane of glass”

To maximize NOC efficiency, companies should be mindful of what technology they use, and how they utilize it. One of the biggest operational challenges is simply how disparate systems become over time.

This is not an issue that typically calls attention to itself, but quietly exacts a huge toll on efficiency over time. Switching between systems not only slows the NOC down but increases the risk of something being missed. NOCs should gather and visualize all critical data in a single, easy-to-access dashboard—the “single pane of glass” we mentioned earlier.

Here are some questions to consider to see if you’ve got some work to do here:

Is your NOC fully integrated?
Are you automating any workflows?
Do you ever revise workflows to make them simpler (and thus easier to do consistently)?
Are your workflows designed to fix issues automatically without a human touching them, or to detect problems and consolidate ticketing?
How do your processes in this area affect your reporting capabilities?
Are there any steps that could be fully automated?
Do your clients require tickets to meet specific requirements each time (e.g. specific verbiage) that might be readily automated?

Let’s close by lingering on automation for a moment. How to implement automation carefully and efficiently can be a particularly hairy question. Many companies think of automation as a way of fixing problems without a human needing to touch them.

This can be hazardous, since the system may sometimes do things you did not intend. Instead, we encourage teams to think of automation as a way of detecting problems and collecting data.

Identify situations where a human does not need to be involved, such as when you are strictly following a documented process, or when information simply needs to be gathered and sent on. For example, it might be useful to automate the standard elements of ticket generation, such as the verbiage, or logging the device to ensure consistency and save time.
Ticketing is another area where automation can be thoughtfully applied. There are drawbacks both to ignoring alarms that you are aware of and to ticketing every alarm. With the first approach, you undermine the validity of your metrics, which can hamper problem detection and reporting. With the second approach, a single issue may result in a multitude of alarms, and ticketing each one rather than grouping them could be a waste of time.
Instead, we like to consolidate tickets that are related to a single incident to maximize efficiency without compromising valuable data. We recommend identifying alarms that are not actionable and using filtering to eliminate them to reduce noise in your NOC.
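
The filter-then-consolidate idea can be sketched as follows. This is a simplified illustration, not a description of any particular platform: the `Alarm` fields, the set of non-actionable alarm kinds, and the rule that alarms from the same device within ten minutes of each other belong to one incident are all assumptions made for the example. Real correlation logic is usually richer, for instance topology-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alarm:
    device: str
    kind: str
    raised_at: datetime


# Hypothetical alarm kinds the team has decided are never actionable.
NON_ACTIONABLE = {"heartbeat", "info"}


def consolidate(alarms, window=timedelta(minutes=10)):
    """Drop non-actionable alarms, then group the rest into incidents.

    An alarm joins the device's current group when it arrives within
    `window` of that group's most recent alarm; otherwise it opens a new one.
    """
    actionable = sorted(
        (a for a in alarms if a.kind not in NON_ACTIONABLE),
        key=lambda a: a.raised_at,
    )
    current_group = {}  # device -> most recent group for that device
    incidents = []
    for alarm in actionable:
        group = current_group.get(alarm.device)
        if group and alarm.raised_at - group[-1].raised_at <= window:
            group.append(alarm)
        else:
            group = [alarm]
            current_group[alarm.device] = group
            incidents.append(group)
    return incidents


start = datetime(2022, 9, 20, 9, 0)
alarms = [
    Alarm("core-sw1", "link_down", start),
    Alarm("core-sw1", "bgp_down", start + timedelta(minutes=2)),
    Alarm("core-sw1", "heartbeat", start + timedelta(minutes=3)),
    Alarm("edge-rtr2", "high_cpu", start + timedelta(minutes=4)),
    Alarm("core-sw1", "link_down", start + timedelta(minutes=30)),
]

incidents = consolidate(alarms)
for group in incidents:
    print(group[0].device, [a.kind for a in group])
```

Each resulting group would become one ticket, with every contributing alarm still attached, so the consolidation does not discard data needed for reporting.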

Final Thoughts and Next Steps

Bringing your NOC to peak performance is not something you can accomplish by setting aside a couple of hours a week. This is a situation where bringing in experienced engineers who have seen many NOCs and learned from their failures can be a real asset, as opposed to trying to sort things out by yourself.

INOC offers two comprehensive solutions to help organizations maximize their NOC capabilities:

NOC Support Services

Our award-winning NOC support services, powered by the INOC Ops 3.0 Platform, provide comprehensive monitoring and management of your infrastructure through a sophisticated multi-tiered support structure. This advanced platform combines AIOps, automated workflows, and intelligent correlation to help you:

Achieve maximum uptime through proactive monitoring and accelerated incident response
Reduce manual intervention with automated event correlation and ticket creation
Scale your support capabilities without the complexity of building internal NOC infrastructure
Access real-time insights through a single pane of glass for efficient incident and problem management
Leverage our deep expertise across technologies while maintaining complete visibility through our client portal

NOC Operations Consulting

Our consulting team provides tactical, results-driven guidance for organizations looking to optimize their existing NOC or build a new one from the ground up. We help you:

Assess your current operations and identify opportunities for improvement
Develop standardized processes and runbooks that enhance efficiency
Implement best practices for event management, incident response, and problem management
Design scalable operational frameworks that grow with your business
Transform your NOC into a proactive, high-performance operation

Both services are backed by INOC's extensive experience serving enterprises, communications service providers, and OEMs worldwide. Our team brings proven methodologies and deep technical expertise to help you achieve your operational goals, whether through direct support or strategic guidance.

Learn more about NOC services and schedule a NOC consultation with our Solution Engineers to start the conversation. Want to learn more best practices for running a NOC at peak performance? Grab our free white paper below.

FREE WHITE PAPER

A Practical Guide to Running an Effective NOC

Download our free white paper and learn how to build, optimize, and manage your NOC to maximize performance and uptime.
