Efficient workflows and fast response times are critical to maximizing IT infrastructure performance and uptime—the core goals of any Network Operations Center (NOC).
Whether your infrastructure is running in the cloud, on-premises, or a hybrid of the two, the impact of service unavailability can be disastrous. Your NOC needs to be able to detect and respond to issues within acceptable service levels to ensure the impact on your business is minimal.
Top-tier NOCs utilize a Service Level Management (SLM) framework to make and measure progress toward these goals. SLM serves as the foundation for gathering service requirements, establishing service levels, and monitoring and reporting performance according to those service levels.
However, implementing an SLM framework to manage NOC service levels isn’t a straightforward process. There’s no handy guidebook for NOCs to follow. As a result, many NOC teams face the following challenges:
A more comprehensive approach to SLM enables the NOC to paint a complete and accurate picture of the quality of service provided, which enables Continual Service Improvement (CSI).
Applying an SLM framework to the NOC at this level takes thoughtful planning and diligent management. Here at INOC, we’ve tuned this powerful framework to unleash its full potential not only for establishing SLAs but also for measuring and improving service proactively.
This guide explains SLM concepts as they apply to the NOC and shows how NOCs can apply SLM beyond the standard operational measures to provide greater value and achieve better outcomes.
📄 Read our other guide—NOC Service Level Reporting: Basics, Best Practices, and Examples—for a closer look at the reporting side of service level management.
At a high level, Service Level Management is the practice of getting everyone to agree on how an IT service should work. SLM ensures that service levels are measured and reported. Practically speaking, SLM involves defining, documenting, and managing service levels, in this case in the NOC.
To carry out SLM, we must define the performance measures in a Service Level Agreement.
An SLA, at its simplest, is an agreement between an IT service provider (internal or external) and the customer that defines the characteristics used to measure the performance of the service. It also establishes responsibilities, the means of measurement, and the cadence for reporting actual outcomes against those agreements.
An SLA can contain one or more performance measures called Service Level Objectives for which the service provider is responsible. The SLA also contains reporting responsibilities, credits, and penalties.
Service Level Objectives (SLOs) specify the service, responsibilities, and service level targets that comprise an SLA. In other words, they’re the “substance” of an SLA.
Here’s an example: A NOC service provider may establish an SLO that sets the response time for phone calls. Here, a substantive SLO might be answering calls within an average of 30 seconds, measured over a month. Another SLO, this one for call handling, might state that no call may wait more than five minutes to be answered.
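As a rough illustration, the two call-handling SLOs above could be checked against a month of call data along these lines. This is a minimal Python sketch, not a description of any particular NOC tooling: the function name `evaluate_call_slos`, the constant names, and the sample wait times are all hypothetical.

```python
from statistics import mean

# Targets taken from the example above; the constant names are hypothetical.
AVG_ANSWER_TARGET_S = 30      # average answer time, measured over the month
MAX_WAIT_TARGET_S = 5 * 60    # longest any single call may wait

def evaluate_call_slos(answer_times_s):
    """Return the measured values and a met/missed flag for both call SLOs."""
    if not answer_times_s:
        raise ValueError("need at least one call to evaluate")
    average = mean(answer_times_s)
    longest = max(answer_times_s)
    return {
        "average_answer_s": average,
        "average_met": average <= AVG_ANSWER_TARGET_S,
        "max_wait_s": longest,
        "max_wait_met": longest <= MAX_WAIT_TARGET_S,
    }

# Illustrative data only: seconds each call waited before being answered.
sample_month = [10] * 19 + [310]
report = evaluate_call_slos(sample_month)
print(report)
```

In this made-up sample, the average target is met while the maximum-wait target is missed, which is why the two objectives are worth tracking separately.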
Now that we have some hard objectives defined, we need the means to measure and report on them.
Service Level Indicators (SLIs) are the components of an SLO. An example is shown below.
Each of the SLIs can be measured, and in total, they reflect the SLO. These measures show how well the NOC is actually performing against the SLO.
So, how do these IT service concepts apply specifically to the NOC? SLAs are the formal agreements that document service level targets and specify the responsibilities between a NOC and its customers.
Most ITIL-aligned NOCs use SLM to establish, monitor, and report on the standard service levels expected of virtually any NOC. These include response times, average and maximum call hold times, notification and escalation times, and troubleshooting windows.
Here at INOC, we complement standard KPI reporting, which includes monthly SLA measurements, with an array of additional SLOs to better measure performance and keep both teams aligned on success.
In our view, limiting reporting to just a handful of rigid service levels rarely tells the full story about the quality of NOC service being provided. Limited reporting also ignores important operational signals that serve as inputs for continual improvement.
Our SLM model combines critical KPI reporting with a broader, often more meaningful set of objectives that bring additional data and context into view. In short, we analyze each SLO, break it into its components, and measure each of those. Rather than focusing on a composite metric, we focus on addressing and optimizing each of its component parts.
Take the critical SLO of Mean Time to Restore (MTTR) set at four hours, for example. This measure contains several more granular SLIs:
We break down and address each of the discrete indicators that together make up MTTR.
These include:
So, how does this approach to SLM translate into tangible value for a client? Put simply, it drives a constant state of continual improvement. We want to take every opportunity to make processes and activities as efficient as possible. That means closely examining each component of an SLO, spotting those opportunities, and, for example, adding automation to make incremental improvements that contribute to greater availability and less downtime.
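To make the idea of examining each component concrete, here is a minimal Python sketch that averages each stage of incident handling separately and flags the slowest one. The stage names borrow the phases used in the examples later in this article; the durations are invented for illustration, and the helper `average_by_stage` is hypothetical.

```python
from collections import defaultdict

# Illustrative values in minutes, one record per incident.
# These are made-up numbers, not real NOC measurements.
incidents = [
    {"alarm_to_ticket": 2, "diagnosis_and_response": 10, "restoration": 45},
    {"alarm_to_ticket": 3, "diagnosis_and_response": 25, "restoration": 30},
    {"alarm_to_ticket": 1, "diagnosis_and_response": 40, "restoration": 20},
]

def average_by_stage(records):
    """Average the minutes spent in each stage across all incidents."""
    totals = defaultdict(float)
    for record in records:
        for stage, minutes in record.items():
            totals[stage] += minutes
    return {stage: total / len(records) for stage, total in totals.items()}

averages = average_by_stage(incidents)

# The stage averages add up to the composite mean time to restore.
mean_time_to_restore = sum(averages.values())

# The stage with the largest average is the first candidate for improvement.
slowest_stage = max(averages, key=averages.get)

for stage, minutes in averages.items():
    print(f"{stage}: {minutes:.1f} min")
print(f"mean time to restore: {mean_time_to_restore:.1f} min")
print(f"largest contributor: {slowest_stage}")
```

The composite figure alone would show only a single number; the per-stage view shows where the time actually goes.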
With this expanded approach to SLM, each monthly report we produce presents both precise reporting around key service levels and a big picture perspective that can inform proactive enhancements and optimization.
Describing the function and value of SLM in the NOC is one thing. Demonstrating it by example is another. Below, we summarize two instances of SLM in action to bring the concept down to earth and into the NOC.
Both examples demonstrate the value of going beyond the basic SLOs (like those shown in this example dashboard) to understand what specific factors are contributing to them.
Shortly after turning up a monitoring service for a client’s AWS infrastructure, an alert was raised to the NOC: Disk Write Input/Output Operations Per Second (IOPS) had risen significantly.
Thanks to automation, the alert reached the NOC virtually instantly. But once there, as with any alert, the response must be measured.
Alarm to Ticket
Diagnosis and Response
Now that the event is ticketed, the NOC team needs to get to work determining what happened so they can direct the correct response and restore the service:
Restoration
While the NOC diagnosed this issue very quickly, IOPS remained high 12 minutes after the alert came in. Cue service restoration:
This whole process took the NOC engineer about 45 minutes to complete. Since this was handled within the NOC and the MTTR on this incident was 57 minutes, the NOC was well within the four-hour SLO.
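The arithmetic behind that conclusion can be sketched in a few lines of Python. This assumes the 57 minutes is the 12 minutes that elapsed before restoration began plus the roughly 45 minutes of restoration work; the variable names are illustrative.

```python
from datetime import timedelta

# The four-hour MTTR objective from this article.
MTTR_SLO = timedelta(hours=4)

# Assumed split of the incident timeline (see the note above).
time_before_restoration = timedelta(minutes=12)
restoration_work = timedelta(minutes=45)

time_to_restore = time_before_restoration + restoration_work
within_slo = time_to_restore <= MTTR_SLO
headroom = MTTR_SLO - time_to_restore

print(f"Time to restore: {time_to_restore}")  # 0:57:00
print(f"Within SLO: {within_slo}")            # True
print(f"Headroom: {headroom}")                # 3:03:00
```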
It’s ten o’clock in the morning, and a NOC-supported retail location finds itself unable to process credit card transactions. The store manager calls the NOC for support.
Call to Ticket
Escalation
Diagnosis
Resolution
In this case, the NOC could quickly resolve the issue because the correct SLOs were in place.
These examples show how SLM is key to measuring, quantifying, and ensuring the NOC performs to its full potential.
Beyond these functions, it’s also essential to understand how the service levels established and managed through SLM will impact the response time needed for various scenarios.
SLM is instrumental in tracking, reporting, and reviewing service performance so that the NOC can better understand every dimension of the service being provided and improve how it handles the next incident.
For IT leaders considering SLM in the context of their own IT service, the following questions can help reveal whether action is needed:
Not satisfied with answers to these questions, or need help working through the correct Service Level Management components for your organization? Schedule a free NOC consultation or contact us to see how we can help you improve your IT service strategy and NOC support.