There are many metrics NOCs and service desks use to measure their performance against service level targets and drive improvement when it’s needed.
The challenge is sifting through those metrics and data points to identify the ones that best capture true operational performance, and measuring them accurately so improvement efforts can be targeted where they're needed.
This guide dives into one of the “MTTR” metrics most impacted by the NOC and service desk versus other stakeholders in the ITSM workflow: mean time to restore.
Having spent the last 20+ years helping enterprises, service providers, and OEMs improve the support provided to their customers and end-users through a suite of NOC solutions, we wrote this guide to demystify mean time to restore, put it in a useful context, and identify a few challenges and strategies in measuring and improving it. Here's what we'll cover:
- What mean time to restore is
- How it compares to the other “MTTR” metrics
- Its relevance to the NOC and service desk
- Challenges in measuring it accurately
- A few ways to reduce MTTR in the NOC/service desk
What is Mean Time to Restore?
Mean time to restore (sometimes called mean time to recovery) is one of the four big MTTR metrics, the others being mean time to “repair,” “resolve,” and “respond.” Mean time to restore is the average time spent getting a downed service back up and running following a performance issue or downtime incident.
This KPI enables teams to measure the speed of their restoration or recovery process, specifically, as opposed to how quickly they’re completing repairs or fully resolving issues. It indicates how quickly a team is recovering systems from failure but says nothing about the nature of the problems making that metric higher than it could be.
Mean time to restore is an essential metric in incident management as it shows—on the whole—how quickly you’re restoring service following downtime incidents and getting systems back up and running, which is often the first priority for the IT support group when downtime occurs.
In this way, this metric is a good aggregate measure of a NOC or service desk’s ability to maintain control over service levels and restore services as quickly as customers or end-users expect in the event of unplanned incidents.
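To make the arithmetic concrete, here's a minimal sketch in Python of how restore times might be aggregated; the incident records and field names are hypothetical, not any particular ticketing system's schema:

```python
from datetime import datetime

# Hypothetical incident records: when service went down and when it was
# restored (restoration, not full resolution, stops the clock).
incidents = [
    {"down": datetime(2024, 5, 1, 9, 0),   "restored": datetime(2024, 5, 1, 9, 45)},
    {"down": datetime(2024, 5, 3, 14, 10), "restored": datetime(2024, 5, 3, 14, 40)},
    {"down": datetime(2024, 5, 7, 22, 5),  "restored": datetime(2024, 5, 7, 23, 35)},
]

def mean_time_to_restore(incidents):
    """Average outage duration across all incidents, in minutes."""
    total = sum((i["restored"] - i["down"]).total_seconds() / 60 for i in incidents)
    return total / len(incidents)

print(mean_time_to_restore(incidents))  # 55.0 minutes for this sample
```

Note that outages of 45, 30, and 90 minutes average out to 55 even though no single incident took 55 minutes; the mean hides the distribution, which is part of why this metric can flag a problem but not explain it.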
When this metric isn’t landing where it ought to be, it should trigger the support team to, among other things:
- Examine its incident management process to spot weak points that are inflating restore times across the incident management lifecycle (detection, ticketing, alarming, etc.).
- Use problem management to identify and resolve the root causes of recurring incidents.
- Develop network redundancies and service workarounds that can restore services in multiple ways when incidents bring them down.
However, as we explore later in this guide, the more pressing MTTR challenge isn’t figuring out how to improve it, but how to accurately measure it in the first place. A lack of operational structure and “selective ticketing” practices are two of the most common roadblocks to capturing real recovery times, which we’ll discuss in some depth later.
Mean Time to Restore vs. Other MTTR Metrics
Before diving deeper into MTTR, it’s important to level-set on what “R” we’re talking about since the acronym can stand for four different measurements:
- Mean time to repair
- Mean time to restore (or recover)
- Mean time to respond
- Mean time to resolve
Let’s quickly define each of these to tease out their differences, and why it makes the most sense to focus on the speed of service restoration in measuring a NOC’s performance.
Mean time to repair
Mean time to repair is the average time it takes to repair a system (as in getting it functioning again). It includes both the time it takes to make a repair and the time it takes to test that repair to make sure it worked. The clock only stops once the system that’s not functional becomes fully functional once again.
It’s a useful metric for support and maintenance teams to track their repair times—ideally getting it as low as possible through process improvements, training, and other inputs that affect efficiency.
Mean time to restore (or recover)
Mean time to restore (or recover) is the average time it takes a team to restore a service after a system or product failure. This metric includes the full duration of the outage, from the moment of failure to the moment the downed service is restored. Diagnosing why this number is high, though, requires digging deeper.
The nuance here, and the reason it has so much import to the NOC, is that restoring a given service might be achieved separately from repairing what broke or completely resolving the issue that led to an outage. A skilled NOC team will help establish redundancies and workarounds that enable a downed service to be brought back up and running in more than one way, if possible.
Importantly, and germane to this discussion of MTTR in the NOC, mean time to resolve, by contrast, is an aggregate measure of all the teams involved in repair and remediation, not just the NOC or service desk. The NOC can’t make field technicians drive their trucks faster or work smarter once they’re on-site, so it’s just one of many actors and inputs that contribute to resolution time.
Mean time to respond
Mean time to respond is the average amount of time it takes a team to respond to a product or system failure from the moment it’s first alerted to that failure (not counting system-caused lag time).
Mean time to resolve
Mean time to resolve is the average time it takes to fully resolve an issue (more specifically, a failure of some kind). This includes not only the time spent detecting the failure, diagnosing the problem, and repairing the issue but also the time spent ensuring that the failure won’t happen again.
It’s typically used to measure the time it’s taking to fully resolve unplanned incidents.
To more clearly illustrate the difference between restoration and resolution in this context, an example of restoration may be using temporary fiber to get service back up and running, while full resolution may involve opening a maintenance window to replace the fiber permanently and bury it.
The Importance of Measuring and Optimizing Mean Time to Restore in the NOC
Mean time to restore is often held up as one of the north star performance metrics in the NOC and service desk since service recovery/restoration is central to the type of incident management these teams do. As we just mentioned, though, it’s a “high-level” metric that only indicates good or bad recovery performance without expressing anything about what part of the incident management process or what specific external factors are extending recovery times.
This metric is an aggregate measure of multiple steps in managing an incident, from detection, ticketing, and alerting all the way to response and actual service restoration. So it can’t tell you which of those steps can or should be improved—just that you should start digging into those sub-processes to find inefficiencies and smooth them out.
For example: High recovery time could be due to a misconfigured alerting or ticketing system within the NOC, which is dragging down performance early in the lifecycle of incidents.
But it could also be the case that the NOC’s ticketing and alerting workflows are excellent, and the slow-down is actually the result of bad runbooks, which are leaving engineers scratching their heads during initial troubleshooting.
The point here is that your recovery times can’t diagnose performance issues—they can only tell you they exist. Without more context, it’s impossible to know what’s wrong and what to do about it.
To identify and address the problems indicated by high recovery/restore times, teams can use other metrics, including the other types of “MTTR,” to identify which parts of the process are problematic (response or repair, for example) as well as additional KPIs like mean time to acknowledge (MTTA).
Once teams know where to look within the process, they can zoom in to find the precise root cause of a performance drag—and then fix it—to improve recovery times.
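As a sketch of that "zooming in," the same ticket data used to compute overall restore time can be split into per-stage averages. The stage names and timestamps below are hypothetical, chosen just to illustrate the decomposition:

```python
from datetime import datetime
from statistics import mean

# Hypothetical lifecycle timestamps per incident: fault occurs, ticket is
# raised, an engineer responds, service is restored.
incidents = [
    {"fault": datetime(2024, 6, 1, 10, 0), "ticketed": datetime(2024, 6, 1, 10, 8),
     "responded": datetime(2024, 6, 1, 10, 20), "restored": datetime(2024, 6, 1, 11, 0)},
    {"fault": datetime(2024, 6, 2, 2, 0), "ticketed": datetime(2024, 6, 2, 2, 4),
     "responded": datetime(2024, 6, 2, 2, 30), "restored": datetime(2024, 6, 2, 2, 50)},
]

STAGES = [("fault", "ticketed"), ("ticketed", "responded"), ("responded", "restored")]

def stage_breakdown(incidents):
    """Mean minutes spent in each stage, to show where the time is going."""
    return {
        f"{start} -> {end}": mean(
            (i[end] - i[start]).total_seconds() / 60 for i in incidents
        )
        for start, end in STAGES
    }

print(stage_breakdown(incidents))
# {'fault -> ticketed': 6.0, 'ticketed -> responded': 19.0, 'responded -> restored': 30.0}
```

In this sample, ticketing is fast and the time from response to restoration dominates, which would point the investigation toward troubleshooting aids like runbooks rather than the alerting pipeline.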
The Challenges of Measuring MTTR Accurately and Completely
Mean time to restore is arguably one of the best measures of the overall speed and effectiveness of the NOC in its duties to keep services up for the customers or end-users that rely on those services. But one of the central problems NOCs struggle with is measuring MTTR well. Simply put, if you can’t measure it, you can’t know where to work to improve it.
The difficulty in measuring MTTR often boils down to two primary problems: an unstructured support environment, and what we call “selective ticketing.”
Let’s briefly unpack each of them.
MTTR Challenge #1: The unstructured support environment
Once again, mean time to restore captures the aggregate performance of several sub-processes in the broader incident management process. So, only when you can measure that performance end-to-end can you start to understand it and use it as a tool to interrogate the way you handle incidents to find opportunities to restore service faster.
To do that, though, teams need to map out the actual steps they take in managing incidents from beginning to end so they can begin looking for opportunities to improve within those steps.
In other words:
- How have you organized yourself to manage incidents from start to finish?
- What does your incident management process look like in practice?
This initial step of mapping out how you execute incident management is often where NOCs discover they have a ton of “basic” work to do to put a documented process in place first so they can get to improving it later.
Many teams realize they don’t actually have a robust, documented structure that governs the way each person manages incidents. Instead, they take an ad hoc approach—manually passing incidents to whoever is there at the time and best equipped with the tribal knowledge to handle it. When support teams see they’re operating without structure, they quickly realize they’re not in a position to accurately measure MTTR because there’s no clear process to derive that measurement from.
Compare that to a highly structured NOC with clear roles, assignments, and standardized steps to follow to detect, categorize, prioritize, investigate, escalate, resolve, and close incidents, and the difference in the ability to capture a good metric becomes clear.
Only when these processes formally exist and are documented can you really begin to look for ways to fully capture MTTR and then use it to run those processes faster.
“When teams set out to improve their restoration times, they often find that a more or less ‘free-for-all’ approach to incident management makes it nearly impossible to actually capture a good metric, let alone improve it. Performance is left to the best intentions and experience of the person actioning that incident.
Contrast that with a highly structured NOC where certain people are focused on the event ticket, another group taking that ticket to diagnosis, and yet another group is carrying it forward from diagnosis to repair. With this kind of structure in place, you can really start to analyze those moving parts and zero in on areas that need attention.”
— Mark Biegler, Senior Operations Architect, INOC
If you’re operating in an unstructured support environment, stop here to read a few of our other guides below that can help you structure your operation so you can start gathering accurate metrics for optimizing performance:
MTTR Challenge #2: Selective ticketing
Selective ticketing—meaning not ticketing every incident that ought to be ticketed—is another hindrance to measuring MTTR as it prevents teams from gathering incident data for analysis and capturing restoration times for support motions deemed unworthy of a ticket.
While many teams “over ticket” by creating tickets for every event, many others swing the pendulum in the opposite direction, under-ticketing in an attempt to avoid tedious documentation for quick fixes or perceived non-issues.
In our experience, some teams decide not to ticket incidents that don’t register an immediate business impact. But this fails to acknowledge that many incidents do impact the business, just not right away.
When “ticket complacency” sets in, teams set their future selves up for headaches when they eventually find themselves struggling to respond to a major failure after ignoring the early warning signs for weeks or months.
“Ticketing is what gives you visibility into what's really happening across your network. Before you can talk about improving restoration performance or taking a preventive approach to support, you really have to know what’s occurring in fine detail throughout the network.
If you're not ticketing enough, you'll never have that baseline on which to measure things. Teams that really nail their ticketing can start to measure things like how long it takes to actually raise a ticket, for example. If that’s taking too long, you know where to focus your efforts to bring those recovery times down.”
— Mark Biegler, Senior Operations Architect, INOC
INOC’s Automated Alarm Correlation and Auto-Resolution of Short Duration Incidents capabilities
Here at INOC, the next iteration of our 24x7 NOC support platform combines next-level support capabilities like Automated Alarm Correlation and Auto-Resolution of Short Duration Incidents to automate ticketing and dramatically reduce the time from initial alarm to incident ticket creation. Because all incoming incident data is ingested, nothing is missed and precise metrics can be measured.
Auto-Resolution of Short Duration Incidents automatically resolves any ticket where recovery is achieved without intervention within a few minutes—a common occurrence. This provides faster updates to our clients and reduces non-productive work for the NOC, all while driving recovery times to new lows.
For clients that utilize our problem management service, all alarms that fall under this category are still reported on and reviewed by our Advanced Technical Services team as part of problem management.
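In spirit, auto-resolution of short-duration incidents boils down to a simple rule. This is an illustrative sketch only, not INOC's actual implementation; the five-minute threshold and the ticket fields are assumptions:

```python
from datetime import datetime, timedelta

AUTO_RESOLVE_WINDOW = timedelta(minutes=5)  # assumed threshold

def auto_resolve(tickets):
    """Mark tickets whose alarm cleared on its own, within the window and
    without human intervention, as resolved. Flag them for later
    problem-management review rather than silently dropping them."""
    for t in tickets:
        cleared = t.get("alarm_cleared")
        untouched = not t.get("human_touched", False)
        if cleared and untouched and cleared - t["alarm_raised"] <= AUTO_RESOLVE_WINDOW:
            t["status"] = "auto-resolved"
            t["needs_pm_review"] = True  # still reported and reviewed
    return tickets

tickets = [
    {"id": 1, "alarm_raised": datetime(2024, 6, 1, 3, 0),
     "alarm_cleared": datetime(2024, 6, 1, 3, 3), "status": "open"},
    {"id": 2, "alarm_raised": datetime(2024, 6, 1, 4, 0),
     "alarm_cleared": None, "status": "open"},
]
print([t["status"] for t in auto_resolve(tickets)])  # ['auto-resolved', 'open']
```

The key design point is the review flag: the tickets leave the active queue, but the underlying alarms stay visible to problem management so recurring short blips can still surface a root cause.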
📄 Read more about how we’re applying AIOps in the NOC to dramatically improve support performance in our free white paper: The Role of AIOps in Enhancing NOC Support
How to Improve Mean Time to Restore
If downtimes are too long and too frequent, your ability to deliver the service levels your customers or end-users expect is in jeopardy. Reducing MTTR becomes a core business imperative. Once you can accurately measure MTTR, making that number go down requires clear, detailed visibility into the problems that cause things to go down and need to be restored in the first place.
Once those root causes are known, they can be actioned proactively—thus making restoration go quicker and easier as time goes on. This is where processes like problem management become indispensable.
Here are a few common levers teams often pull to improve their MTTR once they’re in a position to measure it.
1. Refine your own incident management process
We’ve talked about this one at length already. When recovery times are trending in the wrong direction, it’s time to put your incident management process under the microscope to spot inefficiencies and iron them out.
If you’re currently running an ad hoc approach to incident management, this is likely your sign that you’ve outgrown it and need a structured operation in line with traditional ITSM.
Check out the resources we linked to above to lay down that blueprint and incorporate some best practices or contact us to put that project in the hands of NOC experts who can immediately assess your operation and apply proven frameworks to improve its performance.
2. Create or improve your runbooks
As you develop or refine your incident response procedures, make sure to document everything in runbooks built to take the friction out of those processes while making execution consistent from one person to another. The runbook should provide your team with a step-by-step guide for understanding what alerts mean and how to respond to them in a way that standardizes actions regardless of who’s working a ticket.
But the purpose of runbooks doesn’t stop there. As INOC’s Pete Prosen explains, the runbook isn’t just about executing processes consistently—it’s also about addressing the shortcomings of what hardware can tell you about certain failures, so troubleshooting is focused and informed.
"There’s a deeper point about the importance of runbooks that isn’t always obvious. Hardware is built to tell you about hardware failures. But hardware is not built to tell you about network failures because most devices can’t provide data beyond themselves. In real-world environments, true hardware failures are comparatively rare, usually around 2% across our client book. It’s almost always something external that the hardware alone can’t tell you about.
Take a fiber network, for example. If you’re not receiving light, you can’t really infer anything from an alert other than, “I need to figure out what I can do to fix this.” Your runbook should be the resource you can count on to tell you what to do based on what you know is most likely to be happening.
Maybe you can bounce port A; maybe you can bounce port B. Maybe it’s a Cisco device, and a port reset is all it takes. Or maybe you need someone on-site to put a light meter on a fiber link and see if you’re getting light off of it. The runbook documents all these actions engineers should try to put their finger on the problem and the fix. With a structured process, you can start to build runbooks based on which problems are most likely to blame for incidents and train staff on them.”
— Pete Prosen, VP of NOC Operations, INOC
When it comes to writing effective runbooks, don’t overcomplicate them: run the processes and write them down. Take the tribal knowledge out of your team’s heads and make it accessible for everyone in a guide that acknowledges what problems alerts are indicating most often and prioritizes action accordingly.
Many times, teams will come to find that 80% of their incidents look the same each time, so everyone in the NOC should know those alarm-to-action guides forward and back. The runbook gives you the opportunity to write down the very best set of steps to troubleshoot and work those incidents the same way each time.
📄 Read our other guide for more on writing excellent NOC runbooks: The Anatomy of an Effective NOC Runbook
3. Use problem management to prevent incidents
ITIL defines a problem as “a cause, or potential cause, of one or more incidents.” Problem management provides a framework to control problems by finding the root cause of those incidents and providing a solution to prevent their recurrence.
To do that, staff first need to have strong analytical skills and the technical expertise to be able to parse incidents to find those root causes. They also need historical incident data to see trends emerge and do those deep investigations.
The output of problem management may be true root cause solutions that prevent incidents. But it may also be an insight that helps teams implement workarounds that can be deployed in the event of an incident to restore services in different ways quickly.
4. Consider automated recovery mechanisms
Here at INOC, one of the things our newest ticketing system will be able to do is execute simple recovery commands based on alarms it recognizes.
"One of the Wi-Fi systems we support for a client sometimes suffers from stuck wireless remotes. The fix is simple: Issue a command to shut down the power over Ethernet port, wait five minutes, and turn it back on. Issue resolved—beautiful MTTR.
Our new system sees the alarm, recognizes the lost access to a remote, bounces it, and recovers—all without a human touch. This is where the machine learning behind AIOps really starts to help.
You can take your incident data and runbooks and implement an algorithm that recognizes known incidents and implements simple recovery you can trust a machine to do right.”
— Pete Prosen, VP of NOC Operations, INOC
When such a capability comes online, these simple, recurring incidents will be put “out of the way” by AIOps so engineers’ attention can be focused elsewhere, all while achieving extremely fast recovery times.
While AIOps-powered recovery mechanisms are on the bleeding edge of support automation, you may find opportunities to implement simple brute-force scripts to automate common recovery motions. If one of the problems you're detecting is too many incidents for the size of your team, and many of those incidents prompt the same recovery step, simple scripts can give you a similar ability to take that “noise” out of the system so engineers don't have to handle them manually.
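A brute-force version of that idea can be as small as an alarm-to-action lookup. Everything here (the alarm type, port name, and CLI commands) is hypothetical, with a stub class standing in for real device access:

```python
import time

class FakeDevice:
    """Stand-in for real device access (e.g. an SSH session); it just
    records the commands it's given."""
    def __init__(self):
        self.commands = []
    def run(self, command):
        self.commands.append(command)

def bounce_poe_port(device, port, wait_seconds=0.01):
    """Power-cycle a PoE port: disable it, wait, re-enable it.
    In production the wait might be several minutes."""
    device.run(f"interface {port} poe disable")
    time.sleep(wait_seconds)
    device.run(f"interface {port} poe enable")

# Recognized alarm types mapped to scripted recovery actions.
PLAYBOOK = {"wifi-remote-unreachable": bounce_poe_port}

def handle_alarm(alarm, device):
    """Run the scripted recovery if the alarm is recognized; otherwise
    hand the incident to an engineer."""
    action = PLAYBOOK.get(alarm["type"])
    if action is None:
        return "escalate-to-engineer"
    action(device, alarm["port"])
    return "auto-recovered"

dev = FakeDevice()
print(handle_alarm({"type": "wifi-remote-unreachable", "port": "ge-0/0/1"}, dev))
# auto-recovered
```

The playbook dictionary is doing the same job as a runbook page: encoding "when you see this alarm, take this action," just in a form a machine can execute for the handful of cases you trust it with.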
5. Add redundancies and workarounds to reduce the impact of incidents
Network and IT infrastructure must incorporate an appropriate degree of redundancy to be resilient to failure. Failures will occur no matter what. Links break; devices fail. Teams must design the network to gracefully handle those failures so service availability isn’t reliant on a single incident not happening.
Redundancies within the network or infrastructure establish multiple fail-safes to keep a service up and running when an incident occurs. Put another way, redundancies protect services from incidents by providing more than one way for that service to operate when something breaks—for example, automatically rerouting traffic along an alternate network path.
Workarounds may be necessary when full resolution is difficult or time-consuming. In certain cases, a workaround can become permanent if the resolution would not be viable or cost-effective. In this case, the problem should remain in a known error database so the documented workaround can be used when related incidents occur.
Workarounds can sometimes be made more efficient by finding opportunities to automate them. In today’s systems, if there’s a way to reroute traffic, for example, a system can often be configured to do it automatically. In the past, a network engineer would have to manually change the paths in a network to reroute traffic away from a fault; systems have gotten smarter, and many can now trigger workarounds themselves.
Final Thoughts and Next Steps
INOC’s 24x7 NOC services offer a highly-mature service platform that delivers faster, smarter, more efficient incident response—driving significant improvements to MTTR and other performance metrics.
Our application of AIOps to key points in our support workflows empowers our team and our client teams with the intelligence and automation to find, troubleshoot, and restore services with incredible speed and accuracy. Our structured NOC demonstrates its value most clearly when it’s implemented in a NOC environment where little to no intentional structure existed before.
In just weeks or months, teams will see response and restoration times steadily drop and support activities migrate to appropriate tiers—lightening the load on advanced engineers while enabling the NOC to restore and resolve issues faster and more effectively across the board.
When a structured NOC is implemented, we generally see 60% to 80% of all NOC issues addressed by Tier 1 staff, rather than involving advanced engineers in nearly all issues.
Want to learn more about building, optimizing, or outsourcing your NOC? Our NOC solutions enable you to meet demanding infrastructure support requirements and gain full control of your technology, support, and operations. Contact us and get the conversation started or download our free white paper below to learn practical steps you can take to build, optimize, and manage your NOC for maximum uptime and performance.
FREE WHITE PAPER
A Practical Guide to Running an Effective NOC
Download our free white paper and learn how to build, optimize, and manage your NOC to maximize performance and uptime.