In network operations, every second of downtime can mean lost revenue, frustrated customers, and damaged reputations. At the heart of measuring a NOC's effectiveness lies a critical metric: Mean Time to Resolution (MTTR). But MTTR is more than just a number—it reflects your team's efficiency, your network's resilience, and your ability to deliver on client promises.
However, MTTR is often misunderstood and, more critically, mismeasured. Many NOCs are flying blind, operating on incomplete data and false assumptions about their true performance. In this guide, we'll pull back the curtain on MTTR, revealing why your current measurements might be wildly inaccurate and how this inaccuracy could be costing you dearly.
We'll explore how top-performing NOCs are leveraging automation, standardized processes, and a culture of continuous improvement to drive their MTTR down to unprecedented levels. Whether you're struggling with "selective ticketing," grappling with recurring issues, or simply looking to take your NOC's performance to the next level, this guide will provide you with actionable strategies to transform your approach to incident resolution.
First things first, we need to clarify what we're talking about when we say MTTR. There are actually several "R"s that get thrown around: Mean Time to Restore, Mean Time to Repair, Mean Time to Resolve, and Mean Time to Respond.
Each of these has a slightly different focus, and those distinctions are critical.
For instance, Mean Time to Restore is about getting services back up for customers, while Mean Time to Repair is about fixing the underlying network issue. In our NOC, we often focus on restoration for our clients' services, but we also track repair times for our network.
Let's level set here by looking at these metrics through the lens of the NOC:
In NOC operations, we often focus on Mean Time to Restore. This metric measures the time it takes to restore services for customers. It's particularly relevant when you have redundancy in your network. You might have a fiber cut that takes a long time to physically repair, but if you can restore service quickly through redundant paths, that's what matters most to your customers.
Read our other guide for more on this metric.
This metric is more about fixing the underlying network issue. In the context of NOC operations, this might involve dispatching technicians to physically repair equipment or infrastructure. The NOC often can't control how quickly these repairs happen — you can't make a technician drive their truck any faster or splice fiber more quickly.
Mean Time to Resolve is the average time it takes to fully resolve an issue (more specifically, a failure of some kind). This includes not only the time spent detecting the failure, diagnosing the problem, and repairing the issue but also the time spent ensuring that the failure won't happen again. It's typically used to measure how long it takes to fully resolve unplanned incidents.
To more clearly illustrate the difference between restoration and resolution in this context, an example of restoration may be using temporary fiber to get the service back up and running. In contrast, full resolution may require opening a maintenance window to permanently replace the fiber and bury it.
Mean Time to Respond is the average amount of time it takes a team to respond to a product or system failure once they're first alerted to it (not counting lag caused by the alerting system itself).
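To make those distinctions concrete, here's a quick sketch in Python using made-up timestamps for a single fiber-cut incident. The field names are purely illustrative, not from any particular ticketing system, but they show how the same event produces four very different numbers depending on which "R" you're measuring.

```python
from datetime import datetime

# Illustrative timestamps for a single fiber-cut incident (all values are made up).
incident = {
    "alerted":   datetime(2024, 5, 1, 2, 0, 0),   # NOC receives the first alarm
    "responded": datetime(2024, 5, 1, 2, 4, 0),   # an engineer acknowledges and starts work
    "restored":  datetime(2024, 5, 1, 2, 25, 0),  # traffic rerouted over a redundant path
    "repaired":  datetime(2024, 5, 1, 9, 40, 0),  # field tech finishes splicing the damaged fiber
    "resolved":  datetime(2024, 5, 2, 1, 0, 0),   # permanent fix verified in a maintenance window
}

def minutes_between(start: datetime, end: datetime) -> float:
    """Return the elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60

print("Time to respond:", minutes_between(incident["alerted"], incident["responded"]), "min")
print("Time to restore:", minutes_between(incident["alerted"], incident["restored"]), "min")
print("Time to repair: ", minutes_between(incident["alerted"], incident["repaired"]), "min")
print("Time to resolve:", minutes_between(incident["alerted"], incident["resolved"]), "min")

# Averaging each interval across many incidents gives the corresponding "mean time to" metric.
```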
It's important to note that different stakeholders might be interested in different metrics. Your network engineers might be more concerned with repair times, while your customers are focused on restoration times. As a NOC, we need to be able to track and report on both, but our primary focus is on how quickly we can get services back up and running for customers.
You can't improve what you don't measure, and you can't measure what you don't see. This is where a lot of NOCs and ITOps teams more generally fall short. We call it "selective ticketing" — only creating tickets for big, obvious issues and letting smaller ones go unticketed and, therefore, untracked. But here's the thing: those small issues add up, and they're crucial for understanding your true MTTR.
In many NOCs, especially smaller or homegrown operations, there's a tendency to overlook minor issues. Maybe it's a brief outage that clears up on its own, or a recurring problem that the team knows how to fix quickly. Often, these don't get ticketed because it seems like extra work for something that's already resolved.
But this approach can severely skew your MTTR measurements. Let's say you have ten 5-minute outages and one 1-hour outage. If you only ticket the big one, your MTTR looks like it's an hour. But if you're tracking everything, your actual MTTR is closer to 10 minutes. That's a huge difference — and one every team ought to care about.
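Here's that math spelled out, using the hypothetical numbers from the example above: eleven incidents in total, only one of which gets ticketed under a selective-ticketing habit.

```python
# Durations in minutes: ten small self-clearing outages plus one major one.
small_outages = [5] * 10   # ten 5-minute outages that never get ticketed
major_outage = [60]        # the one 1-hour outage that does

ticketed_only = major_outage
everything = small_outages + major_outage

mttr_ticketed = sum(ticketed_only) / len(ticketed_only)   # 60.0 minutes
mttr_actual = sum(everything) / len(everything)           # 10.0 minutes

print(f"MTTR if you only ticket the big outage: {mttr_ticketed:.0f} minutes")
print(f"MTTR if you ticket everything:          {mttr_actual:.0f} minutes")
```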
This discrepancy can lead to several problems:
We've seen this firsthand with clients who come to us saying they have about 20 incidents a month. Once we implement comprehensive ticketing, that number often jumps to 50 or more. It's not that their network suddenly got worse — we're just capturing everything that was already happening.
This can be a shock for IT and network teams. They might say, "We never had these problems before!" The reality is, they did — they just didn't know about them. Or worse, they were relying on team members to handle issues informally, like waking up in the middle of the night to fix something without logging it.
Simply put, to get an accurate MTTR, you need to:
Ticket every alarm and every issue, no matter how small. Even if it clears up on its own in a few minutes, it should be logged.
Capture start and end times accurately, down to the second. In our old system, we rounded to the nearest minute, which could add an average of 30 seconds to each MTTR. Over thousands of tickets, that adds up.
Many brief outages resolve themselves without intervention. Automated systems can log these and close them out, giving you a true picture of your network's stability without creating extra work for your team.
Don't rely on NOC staff to manually enter resolution times. Human error and the tendency to round times can significantly skew your data. Automated systems can capture the exact time an alarm clears, giving you much more accurate measurements.
Don't just measure from when a ticket was opened to when it was closed. Measure from the first alarm to the final clear. This gives you the full picture of the incident's impact.
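Pulling those practices together, here's a minimal sketch of what alarm-to-clear tracking can look like. The alarm feed and field names are hypothetical stand-ins rather than any specific product's API, but the logic is the point: every alarm gets an incident record, timestamps are captured to the second by the system rather than typed in by a person, and MTTR is measured from the first alarm to the final clear.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Incident:
    """One incident, timestamped automatically from the alarm stream (not hand-entered)."""
    alarm_id: str
    first_alarm: datetime                    # captured to the second when the first alarm fires
    final_clear: Optional[datetime] = None   # captured to the second when the last alarm clears
    auto_closed: bool = False                # True if the issue cleared on its own

    def duration_minutes(self) -> Optional[float]:
        if self.final_clear is None:
            return None  # still open
        return (self.final_clear - self.first_alarm).total_seconds() / 60

def on_alarm_raised(alarm_id: str, incidents: dict) -> None:
    """Ticket everything: even a blip that clears in two minutes gets an incident record."""
    if alarm_id not in incidents:
        incidents[alarm_id] = Incident(alarm_id=alarm_id, first_alarm=datetime.utcnow())

def on_alarm_cleared(alarm_id: str, incidents: dict, by_human: bool = False) -> None:
    """Close the record the moment the alarm clears; flag self-clearing issues."""
    incident = incidents.get(alarm_id)
    if incident and incident.final_clear is None:
        incident.final_clear = datetime.utcnow()
        incident.auto_closed = not by_human

def mean_time_to_restore(incidents: dict) -> float:
    """First alarm to final clear, averaged over every closed incident, big or small."""
    durations = [i.duration_minutes() for i in incidents.values() if i.final_clear]
    return sum(durations) / len(durations) if durations else 0.0
```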
Implementing these practices does more than just give you a more accurate MTTR. It provides several key benefits. You'll have a much clearer picture of how stable your network really is, including all those small hiccups that might have been flying under the radar. You'll also be able to do better capacity planning. If you're seeing a lot more incidents than you thought, for example, it might be time to expand your team or invest in more automation.
It also helps with problem management. With a complete record of all issues, you're better equipped to spot patterns and address root causes.
Perhaps most importantly, you get more accurate SLA reporting: you can confidently report your uptime to clients, knowing you're capturing every second of downtime. Lastly, with hard data on all your outages, it's easier to make the case for network upgrades or new tools.
Remember, the goal isn't to make your MTTR look good on paper. It's to truly understand how your network is performing so you can make it better. Sometimes, implementing comprehensive measurement might make your MTTR look worse in the short term. But that's okay — you're now working with real data, and that's the first step to real improvement.
Once you're measuring accurately, you can start improving.
Here are some key strategies we typically employ ourselves and recommend to the NOCs we advise through our consulting service.
Your runbooks establish consistent, step-by-step instructions for specific sets of alarms, regardless of who works the ticket. This is crucial because it ensures that no matter who's on duty, they know exactly what to do for common issues.
Runbooks should follow a logical progression, whether that's bouncing or resetting a port or dispatching a field tech with a light meter to a fiber link. The key is to start with the most likely and easiest solutions and progress to more complex ones.
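As a rough sketch of how that progression can be captured in something machine-readable, a runbook can be as simple as an ordered list of checks per alarm type that the on-duty engineer (or an automation engine) walks through in order. The alarm names and steps below are invented for illustration.

```python
# A runbook as an ordered list of steps per alarm type.
# Alarm names and steps are illustrative; a real runbook would reference
# your own device naming, tooling, and escalation contacts.
RUNBOOKS = {
    "optical_link_down": [
        "Check for a parent alarm upstream (is the whole site down?)",
        "Bounce the affected port and wait two minutes for the link to retrain",
        "Check light levels from the NMS and compare against the circuit's baseline",
        "If light is low or absent, dispatch a field tech with a light meter",
        "Escalate to the carrier with the circuit ID and test results",
    ],
    "device_unreachable": [
        "Ping the management address and the next-hop gateway",
        "Check the upstream switch port status and PoE draw",
        "Bounce the access port, wait five minutes, and re-test",
        "If still unreachable, engage on-site hands or dispatch",
    ],
}

def next_step(alarm_type: str, steps_completed: int) -> str:
    """Return the next action for this alarm, easiest and most likely fixes first."""
    steps = RUNBOOKS.get(alarm_type, ["No runbook found: escalate to a senior engineer"])
    return steps[min(steps_completed, len(steps) - 1)]
```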
Read our runbook guide for more.
Set up automated responses for recurring issues with known fixes. Take an AT&T Wi-Fi product, for example. We know that the wireless remotes get stuck occasionally. All you have to do is issue a command to shut down that Power over Ethernet port, wait five minutes, and turn it back on, which resolves the issue.
By automating these simple fixes, you can dramatically reduce your MTTR for these common issues. As Pete said, "Beautiful MTTR, you know, because it'll see the alarm, I lost access to a remote. It'll go bounce it and recover."
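A stripped-down sketch of that kind of auto-remediation might look like the following. The device-command and health-check helpers are hypothetical stand-ins (in practice they'd be whatever your automation platform or CLI library provides), and the exact commands depend on your gear. The pattern is what matters: match the alarm, apply the known fix, verify, and only escalate to a human if it doesn't recover.

```python
import time

def send_cli(device: str, command: str) -> str:
    """Hypothetical helper: send one command to a device and return its output.
    In practice this would be your automation platform or a CLI library such as Netmiko."""
    raise NotImplementedError

def remote_is_reachable(ip: str) -> bool:
    """Hypothetical health check: an ICMP ping or SNMP poll against the remote."""
    raise NotImplementedError

def bounce_poe_port(switch: str, interface: str, wait_seconds: int = 300) -> None:
    """Known fix for a stuck wireless remote: power-cycle its PoE port."""
    send_cli(switch, f"disable poe {interface}")   # placeholder command: shut off PoE to the remote
    time.sleep(wait_seconds)                       # wait five minutes
    send_cli(switch, f"enable poe {interface}")    # placeholder command: power it back up

def handle_alarm(alarm: dict) -> str:
    """Auto-remediate the 'lost access to remote' alarm; escalate anything it can't fix."""
    if alarm["type"] == "remote_unreachable":
        bounce_poe_port(alarm["upstream_switch"], alarm["poe_interface"])
        if remote_is_reachable(alarm["remote_ip"]):
            return "auto-resolved"   # the ticket closes itself, with exact timestamps
    return "escalate to engineer"
```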
Don't just fix issues as they come up. Look for patterns and address root causes.
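As a sketch of what that pattern-hunting can look like once you have a complete ticket history to work with, even a simple count of incidents per device and alarm type over a rolling window will surface the chronic offenders that deserve a root-cause investigation. The ticket fields here are assumed, not taken from a specific system.

```python
from collections import Counter
from datetime import datetime, timedelta

def chronic_offenders(tickets: list, days: int = 30, threshold: int = 5) -> list:
    """Flag (device, alarm_type) pairs that recur often enough to deserve root-cause analysis.
    Each ticket is a dict with 'device', 'alarm_type', and 'opened_at' (a datetime)."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    recent = [t for t in tickets if t["opened_at"] >= cutoff]
    counts = Counter((t["device"], t["alarm_type"]) for t in recent)
    return [(pair, n) for pair, n in counts.most_common() if n >= threshold]

# Anything this returns is a candidate "problem" record: fix the root cause once
# instead of restoring the same service dozens of times.
```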
More on that here.
While this doesn't technically improve MTTR, it can improve service uptime for your customers. A redundant network can often switch over automatically, restoring service quickly even if the underlying issue takes longer to resolve. This is particularly important for critical services or high-paying customers who expect maximum uptime.
Advanced systems can predict potential failures before they happen, allowing for proactive maintenance.
Improving MTTR isn't just about fixing things faster; it's about seeing the whole picture. If you're not documenting everything, you can't measure it, and if you can't measure it, you can't fix it.
Good ticketing practices and accurate measurements give you visibility into what's really happening in your network.
This visibility is crucial for several reasons:
Also, with better visibility, you can better inform your customers. Instead of relying on ad hoc processes or individual memory, a clear process ensures that issues are properly tracked and customers are kept in the loop.
Remember, visibility isn't just about looking good on paper. It's about truly understanding your network's performance so you can make informed decisions and provide the best possible service to your customers. If you sell a service without knowing about frequent brief outages, you'll end up paying for it later.
Improving visibility might make your stats look worse in the short term, but it's the first step towards real, sustainable improvements in your network's performance and reliability. It allows you to see the full picture of your network's health, identify areas for improvement, and ultimately provide better service to your customers.
While we've talked a lot about systems and processes, it's crucial not to overlook the human element in improving MTTR. Complacency can be a significant issue in NOCs, and it manifests in various ways that can negatively impact your network's performance and your ability to resolve issues quickly.
One common form of complacency is getting used to recurring issues. For instance, a T1 line might go down every day at a specific time, and over time, the team just accepts it as normal. This acceptance can lead to missed improvement opportunities and mask larger underlying issues.
Another aspect of the human factor is the tendency to take shortcuts or bypass established processes. In smaller or less structured NOCs, team members might handle issues informally. They might wake up in the middle of the night to fix something without logging it, or they might know a quick fix for a common problem and implement it without creating a ticket. While this might seem efficient in the moment, it leads to incomplete data and can hide systemic issues.
The human factor also plays a role in how issues are prioritized and handled. Without clear processes, individuals might make subjective decisions about what's important enough to ticket or which issues to escalate. This can lead to inconsistent service and missed opportunities for improvement.
To address these human factors, we've found it's important to:
Sometimes the quickest way to improve MTTR is simply to ensure everyone is following the processes you already have in place. By addressing the human factors and fostering a culture of discipline and continuous improvement, you can significantly enhance your NOC's performance and responsiveness.
Improving your Mean Time to Resolution isn't just about making numbers look good on a spreadsheet — it's about delivering better service to your customers, reducing the strain on your team, and, ultimately, protecting your bottom line. The journey to optimal MTTR is ongoing, but the steps are clear:
Start with accurate measurement. You can't improve what you don't measure, and you can't measure what you don't see. Implement comprehensive ticketing for all issues, no matter how small.
Standardize your processes. Develop clear runbooks that guide your team through common issues, ensuring consistent and efficient responses.
Automate where possible. From ticketing to simple resolutions, automation can dramatically reduce your MTTR for common issues.
Invest in redundancy. While it doesn't directly improve MTTR, it can significantly enhance service uptime—which is what your customers ultimately care about.
Implement robust problem management. Don't just fix issues as they arise; dig deep to find and address root causes.
Foster a culture of continuous improvement. Encourage your team to always be looking for ways to enhance processes and reduce resolution times.
If you're feeling overwhelmed by the challenge, you're not alone. Many organizations struggle to implement these changes effectively while still managing their day-to-day operations. That's where INOC comes in:
Our award-winning NOC support services, powered by the INOC Ops 3.0 Platform, provide comprehensive monitoring and management of your infrastructure through a sophisticated multi-tiered support structure. This advanced platform combines AIOps, automated workflows, and intelligent correlation to help you:
Our consulting team provides tactical, results-driven guidance for organizations looking to optimize their existing NOC or build a new one from the ground up. We help you: