Incident management and problem management are commonly confused, even by technicians. To some, the terms seem interchangeable, but they are not.
While incident management aims to restore services as quickly as possible, managing problems involves determining and addressing the root cause(s) of an incident or a series of incidents by identifying, tracking, and resolving the problems that express themselves as incidents.
As ITIL 4 puts it, the purpose of problem management is to “reduce the likelihood and impact of incidents by identifying causes of incidents as well as managing workarounds and known errors.”
Put another way, when something breaks and a technician fixes it, that’s incident management: the technician is managing a specific incident. When similar issues occur multiple times within a particular area, there is likely a recurring issue or series of issues within the system. In this case, problem management is necessary.
For example, unavailable seconds are incidents, while many unavailable seconds within a particular path in the network are a problem. Problems tend to cause situations where the network may be significantly degraded or down—and where incidents that express the same root causes recur over time.
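The distinction above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular tool: the `(path, description)` tuple shape, the path names, and the threshold of three incidents are all assumptions made for the example.

```python
from collections import Counter


def find_problem_candidates(incidents, threshold=3):
    """Return paths whose incident count meets or exceeds `threshold`.

    `incidents` is an iterable of (path, description) tuples. Both the
    tuple shape and the default threshold are illustrative assumptions.
    """
    counts = Counter(path for path, _ in incidents)
    return {path: n for path, n in counts.items() if n >= threshold}


# Hypothetical incident log: each entry is a single incident.
incidents = [
    ("core-a/edge-1", "unavailable second"),
    ("core-a/edge-1", "unavailable second"),
    ("core-b/edge-7", "unavailable second"),
    ("core-a/edge-1", "unavailable second"),
]

# Paths with repeated incidents are candidates for problem management.
print(find_problem_candidates(incidents))  # {'core-a/edge-1': 3}
```

Each entry on its own is an incident to restore; the path that keeps reappearing is what a problem management process would investigate for a root cause.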
Many challenges can stand in the way of a robust and streamlined incident management workflow. And while trying to identify everything that holds incident management back isn’t quite possible, a more helpful way to approach these challenges—and perhaps start looking for them within your organization—is to break them down by category.
Having looked under the hood of many different incident management workflows, we find that the challenges or problems typically fall into one of three categories: staff utilization, cost, and communications.
With staff utilization, the problem usually isn’t the people on a team, but rather how they’re utilized and how they coordinate with one another to manage incidents. Many teams don’t lay down a framework that acknowledges the types of incidents they receive, the structure of the group receiving them, and how to give order to that work in a repeatable and thoughtful way.
The chaotic consequences of a poorly utilized and coordinated support team often compound as inefficiencies flood teams with too many incidents—many of which may be tied to the same problems. An example is having highly trained and experienced support staff watching alarm screens and calling out to underlying service providers.
When this is the case, the organization usually lacks a tiered support structure that can allocate tickets according to skill level and root cause. Such a support structure provides a framework that governs who is tasked with certain activities based on the nature of the incident.
📄 Want to see what just such a framework looks like? Download our white paper, Top 10 Challenges to Running a Successful NOC — And How to Solve Them, for a high-level breakdown of how incident management can be operationalized appropriately within your organization and what it takes to get there.
Cost is another big challenge that undermines incident management. To implement incident management appropriately, especially at the enterprise and mid-market levels, companies must invest in support staff and tools for:
However, many companies that need these tools and the staff qualified to run them effectively don’t make the required upfront investments to implement and orchestrate these service components well.
A common example is a basic integration between the monitoring platform and the ticketing system in which each alarm, actionable or not, creates its own ticket. During a large incident, the NOC in these cases often spends more time figuring out which tickets are related than resolving the issue. The automation and intelligence of a correlation tool, such as those provided in an AIOps platform, would provide a much-needed lift. Yet it’s another tool that requires tuning and maintenance.
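To make the contrast with one-ticket-per-alarm concrete, here is a minimal sketch of time-window correlation. It is far simpler than what an AIOps platform does, and everything in it is an assumption for illustration: the `Alarm` and `Ticket` classes, the rule of grouping by site, and the five-minute window.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    site: str
    timestamp: float  # seconds, on any consistent clock
    message: str
    actionable: bool = True


@dataclass
class Ticket:
    site: str
    opened_at: float
    alarms: list = field(default_factory=list)


def correlate(alarms, window_seconds=300):
    """Group actionable alarms from the same site into shared tickets.

    An alarm joins the site's most recent ticket if it arrives within
    `window_seconds` of that ticket's latest alarm; otherwise it opens
    a new ticket. Non-actionable alarms never create tickets.
    """
    tickets = []
    latest = {}  # site -> most recently opened ticket for that site
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        if not alarm.actionable:
            continue
        ticket = latest.get(alarm.site)
        if ticket is None or alarm.timestamp - ticket.alarms[-1].timestamp > window_seconds:
            ticket = Ticket(site=alarm.site, opened_at=alarm.timestamp)
            tickets.append(ticket)
            latest[alarm.site] = ticket
        ticket.alarms.append(alarm)
    return tickets


# Hypothetical alarm feed: five alarms, one of them not actionable.
feed = [
    Alarm("dc-east", 0, "link down"),
    Alarm("dc-east", 30, "BGP peer lost"),
    Alarm("dc-east", 45, "fan speed notice", actionable=False),
    Alarm("dc-west", 60, "high latency"),
    Alarm("dc-east", 1000, "link down"),
]

tickets = correlate(feed)
print(len(tickets))  # 3 tickets instead of 5
```

Even a rule this crude reduces the ticket count, but it also shows why such tools need tuning: the grouping key and the window length decide which alarms are treated as related.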
The resulting dysfunction only creates additional costs as ill-equipped teams struggle to manage incidents and spend more to deal with the second-order effects of that inefficiency.
In short, investing in robust incident management almost certainly costs less than dealing with the expensive mess that results when it doesn’t work.
Another common challenge is communication—specifically, timely and accurate communication.
When incidents and problems occur, it’s critical for the NOC or other support team to identify them quickly and alert stakeholders in a timely and effective fashion.
Yet this is often easier said than done. Usually, if resources are devoted to improving incident management, teams focus only on optimizing the accuracy and speed of identifying and fully characterizing incidents without investing in the communications afterward.
“One of the biggest pain points we see all the time is, first, getting incidents identified quickly enough, and, second, making those initial incident communications effective and then continuing to keep all of the appropriate stakeholders involved. That stretches all the way to communicating with third-party vendors. Making communication effective and consistent is a huge challenge.”
— Ben Cone, Senior Solutions Engineer, INOC
Introducing a tiered NOC support structure can dramatically increase efficiency in managing events, processing service requests, and solving problems while reducing the time higher-level engineers spend on break-fix.
Instead of trying to make sense of the dreaded "Wall of Red", NOCs can prioritize tickets based on urgency. Issues are funneled into different queues and handled by engineers at the minimum skill level necessary to resolve them, instead of "paying professors to do janitorial work," so to speak. Additionally, setting a deadline for engineers to address a ticket can help ensure tickets are resolved as quickly as possible. Either a lower-level engineer resolves the issue at the minimum cost to the client, or it is escalated to a higher skill level after a set amount of time has elapsed, keeping the process moving smoothly.
As a result of a tiered system like the one discussed above, INOC resolves the majority of tickets (65 to 75%) at Tier 1, while more complex tickets are escalated to specialized Tier 2 or 3 teams.
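The deadline-based escalation rule described above can be sketched as follows. This is an illustration under stated assumptions, not INOC's actual workflow: the three-tier limit, the per-tier deadlines in minutes, and the `Ticket` fields are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical per-tier time limits in minutes. Real values would come
# from an organization's own service-level targets.
TIER_DEADLINES_MIN = {1: 30, 2: 60, 3: 120}
MAX_TIER = 3


@dataclass
class Ticket:
    ticket_id: str
    tier: int = 1
    minutes_at_tier: int = 0
    resolved: bool = False


def escalate_if_overdue(ticket):
    """Move an unresolved ticket up one tier once its deadline has passed.

    Resolved tickets and tickets already at the top tier are left alone.
    """
    if ticket.resolved or ticket.tier >= MAX_TIER:
        return ticket
    if ticket.minutes_at_tier > TIER_DEADLINES_MIN[ticket.tier]:
        ticket.tier += 1
        ticket.minutes_at_tier = 0  # the clock restarts at the new tier
    return ticket


overdue = escalate_if_overdue(Ticket("T-1", tier=1, minutes_at_tier=45))
on_time = escalate_if_overdue(Ticket("T-2", tier=1, minutes_at_tier=10))
print(overdue.tier, on_time.tier)  # 2 1
```

Tickets start at Tier 1 by default, so higher tiers only see work that a lower tier could not close within its time limit.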
So how can a NOC ensure they are managing incidents most efficiently? Here are a few best practices we’ve seen emerge over the years.
Especially in larger organizations that handle more (and more complex) incidents, communication needs to be present throughout the entire incident lifecycle.
Top-performing support teams communicate the status of their incidents from the moment an issue is identified to the end of the incident’s life, assuring users and stakeholders that all incidents are being properly handled. This level of communication also helps manage stakeholder expectations and encourages stakeholders to follow up if they have additional questions or comments.
In short, more communication can reduce and often eliminate many of the problems that arise from a user, customer, or another stakeholder not knowing what’s happening now and what’s happening next.
To achieve whole-lifecycle communication, make sure you have a documented process framework for incidents that includes:
After this process is established and in use, you may identify areas that can be automated for greater efficiency—improvements can be iterated as opportunities arise. Automating various parts of the incident management process is not only possible but highly advantageous to companies that can do it thoughtfully.
📄 Read our white paper, The Role of AIOps in Enhancing NOC Support, for a complete explanation of how incident management and other processes can be automated with today’s AIOps tools and all the advantages automation can bring.
Read more on how we apply AIOps in the NOC here.
In general, incidents should always be managed and resolved by the lowest-tiered team possible so more specialized higher-level teams can focus on more complex issues. When done right, this can significantly improve resolution times and have an enormous positive impact on customers or end-users.
A documented process that defines how and when to escalate an incident, and who should do so, ensures that incoming incidents end up in the most capable and efficient hands so they can be resolved as soon as possible.
Also, ensuring that need-to-know information is coming in with alarms allows technicians to save time that might have been spent hunting for follow-up information. Again, when these enhancements are made thoughtfully, the impact on speed can be huge.
Another critical part of successful incident management within your NOC is having a well-maintained and robust knowledge base. The knowledge base is a handy reference for staff to use when troubleshooting issues that have come up and been solved before.
This knowledge base should include supplemental support documentation, such as runbooks and flowcharts. These help technical staff quickly identify the next steps and probable causes, again avoiding unnecessary rework and research. Successful resolutions for known issues can be recorded and redeployed.
A robust knowledge base also ensures that first-level technical staff has the proper resources to resolve incidents, rather than needlessly escalating incidents to more specialized staff.
To do this, staff must identify all the sources of incidents (e.g., phone, email, alarm, portal, instant message), pinpoint the most common types of issues, and develop alarm-to-action procedures for staff to follow when they encounter each type of incident.
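In its simplest form, an alarm-to-action procedure is a lookup from incident type to documented next steps. The sketch below assumes a small in-memory table; the alarm types, the steps, and the fallback action are placeholders for illustration and do not come from any real runbook.

```python
# Hypothetical alarm-to-action table. Keys and steps are placeholders.
ALARM_ACTIONS = {
    "link_down": [
        "Confirm the alarm on the device",
        "Check for a known provider outage",
        "Open a ticket with the service provider",
    ],
    "high_cpu": [
        "Capture the current process list",
        "Compare against the device baseline",
    ],
}

# Fallback for incident types with no documented procedure yet.
DEFAULT_ACTION = ["Escalate to the next tier for triage"]


def next_steps(alarm_type):
    """Return the documented steps for an alarm type, or the fallback."""
    return ALARM_ACTIONS.get(alarm_type, DEFAULT_ACTION)


for step in next_steps("link_down"):
    print(step)
```

Incident types that fall through to the fallback are a useful signal in their own right: they show where the knowledge base still needs a procedure.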
Most teams (especially larger ones) need to bring voice, email, text, customer portals, knowledge bases, documentation, and workflow management tools into the NOC to manage incidents.
The tricky part is that each of these might have its own platform—and integrating them to work together can be a tripping point that slows teams down and injects all kinds of opportunities for mistakes.
Without proper integrations connecting these tools, support teams have to track and manage multiple screens for incident and event information, manually collect data from various sources to document what’s happening, notify/escalate information and issues to the appropriate parties, and work toward service restoration.
An effective NOC receives notifications (like alarms) and information from multiple sources and presents them to staff in a single, consolidated view. It’s hard to be prescriptive here since the way an organization should integrate its tools depends on which specific tools they have, as well as other factors unique to their operation.
If your team finds itself without the integrations or “single pane of glass” it needs to be efficient, schedule a free NOC consultation with us and get the conversation started.
Identify KPIs (key performance indicators) that can help you measure and track your performance. Set performance goals for metrics such as time to action on critical incidents, time to ticket, MTTR, and notification.
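As a rough sketch of how two of these metrics might be computed from ticket timestamps, consider the example below. It assumes each incident record carries `detected`, `ticketed`, and `resolved` times (field names are hypothetical), and it measures MTTR from detection to resolution, which is one common convention among several.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60


def incident_kpis(incidents):
    """Compute mean time to ticket and MTTR, both in minutes.

    Each incident is a dict with 'detected', 'ticketed', and 'resolved'
    datetimes. The field names are assumptions for this example.
    """
    to_ticket = [minutes_between(i["detected"], i["ticketed"]) for i in incidents]
    to_resolve = [minutes_between(i["detected"], i["resolved"]) for i in incidents]
    return {
        "mean_time_to_ticket_min": mean(to_ticket),
        "mttr_min": mean(to_resolve),
    }


# Hypothetical incident records.
records = [
    {
        "detected": datetime(2024, 1, 8, 9, 0),
        "ticketed": datetime(2024, 1, 8, 9, 4),
        "resolved": datetime(2024, 1, 8, 10, 0),
    },
    {
        "detected": datetime(2024, 1, 8, 13, 0),
        "ticketed": datetime(2024, 1, 8, 13, 6),
        "resolved": datetime(2024, 1, 8, 13, 40),
    },
]

kpis = incident_kpis(records)
print(kpis)  # {'mean_time_to_ticket_min': 5.0, 'mttr_min': 50.0}
```

Tracking these values per week or per incident priority makes it easier to see whether performance goals are being met.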
Ask yourself what your:
For IT leaders wondering if, and to what degree, their incident management workflow is ripe for enhancement or would be better served by a third-party NOC support provider, the following questions may be helpful to consider:
Inefficient incident management can be costly, but refining your processes and having an efficient team handling incidents in a timely fashion can lead to fewer complaints, more satisfied customers, and happier employees.
Of course, the support structures, operational workflow setup, and staffing requirements may seem like an expensive proposition for many companies, often requiring eight to ten or more 24x7 staff, purchasing additional tools, and taking the time to develop more efficient processes. In these situations, an outsourced solution can sometimes make sense and reduce the cost of in-house staff and hiring.
For IT leaders weighing the benefits of in-house incident management vs. outsourcing, the following questions may be helpful to consider:
Interested in learning more about ITIL-aligned NOC operations support? Contact us to see how we can help you improve your IT service strategy and NOC support and download our free white paper below.
*Originally developed by the UK government’s Office of Government Commerce (OGC), now part of the Cabinet Office, and currently managed and developed by AXELOS, ITIL is a framework of best practices for delivering efficient and effective support services.