Incident management is among the most commonly adopted ITIL* v3 processes, now ITIL 4 practices. According to the ITIL 4 Foundation book, the purpose of incident management is to “minimize the negative impact of incidents by restoring normal service operation as quickly as possible.”
Ideally, you restore normal service operation with little to no negative impact on your business services. That sometimes means relying on temporary workarounds while you identify and resolve the root cause of the problem.
Optimizing your incident management process is critical for keeping a NOC (or any support operation) running efficiently. The strength of your incident management process can directly impact MTTR (mean time to recovery or mean time to restore), unscheduled downtime, staff morale, and customer satisfaction. Yet many companies struggle to implement workflows that use their staff’s time effectively, often because they haven’t thought creatively about how best to manage incidents or adopted recognized best practices.
If you’re unsatisfied with your MTTR, time to first-action, or time to impact, it might be time to take a closer look at the systems and workflows you have in place for incident management and see where there may be opportunities to improve.
First, however, a point of clarification on “incidents” versus “problems,” as it’s essential to this discussion.
ITIL Incident Management vs. ITIL Problem Management
These terms are commonly confused, even by technicians. To some, they seem interchangeable, but this is not the case.
While incident management aims to restore services as quickly as possible, managing problems involves determining and addressing the root cause(s) of an incident or a series of incidents by identifying, tracking, and resolving the problems that express themselves as incidents.
As ITIL 4 puts it, the purpose of problem management is to “reduce the likelihood and impact of incidents by identifying causes of incidents as well as managing workarounds and known errors.”
Put another way, when something breaks and a technician fixes it, that’s incident management: they are managing a specific incident. When similar issues occur multiple times within a particular area, there is likely a recurring issue or series of issues within the system, in which case problem management is necessary.
For example, unavailable seconds are incidents, while many unavailable seconds within a particular path in the network are a problem. Problems tend to cause situations where the network may be significantly degraded or down—and where incidents that express the same root causes recur over time.
The 3 Most Common Challenges of Incident Management
Many challenges can stand in the way of a robust and streamlined incident management workflow. And while trying to identify everything that holds incident management back isn’t quite possible, a more helpful way to approach these challenges—and perhaps start looking for them within your organization—is to break them down by category.
Having looked under the hood of many different incident management workflows, we find that the challenges or problems typically fall into one of three categories: staff utilization, cost, and communications.
1. Staff utilization
The problem here usually isn’t with the people on a team, but rather how they’re utilized and how they coordinate with one another to manage incidents. Many teams don’t lay down a framework that acknowledges the types of incidents they receive, the structure of the group receiving them, and how to give order to that work in a repeatable and thoughtful way.
The chaotic consequences of a poorly utilized and coordinated support team often compound as inefficiencies flood teams with too many incidents—many of which may be tied to the same problems. An example is having highly trained and experienced support staff watching alarm screens and calling out to underlying service providers.
When this is the case, the organization usually lacks a tiered support structure that can allocate tickets according to skill level and root cause. Such a support structure provides a framework that governs who is tasked with certain activities based on the nature of the incident.
📄 Want to see what just such a framework looks like? Download our white paper, Top 10 Challenges to Running a Successful NOC — And How to Solve Them, for a high-level breakdown of how incident management can be operationalized appropriately within your organization and what it takes to get there.
2. Cost
Cost is another big challenge that undermines incident management. To implement incident management appropriately, especially at the enterprise and mid-market levels, companies must invest in support staff and in tools such as:
- Ticketing platforms
- Monitoring platforms
- Phone systems
- Email systems
- Knowledge Management platforms
- Chat/Instant Messaging platforms
- Analytics and Reporting systems
However, many companies that need these tools and the staff qualified to run them effectively don’t make the required upfront investments to put these service components in place and orchestrate them well.
A common example is applying only a basic integration between the monitoring platform and the ticketing system. In this setup, each alarm creates its own ticket, regardless of whether it is actionable. During a large incident, the NOC often spends more time trying to figure out which tickets are related than it does resolving the issue. The automation and intelligence of a correlation tool, such as those provided in an AIOps platform, would provide a much-needed lift. Yet it’s another tool that requires tuning and maintenance.
The resulting dysfunction only creates additional costs as ill-equipped teams struggle to manage incidents and spend more to deal with the second-order effects of that inefficiency.
In short, investing in robust incident management is almost certainly cheaper than dealing with the expensive mess that results when incident management doesn’t work.
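To make the contrast with the one-ticket-per-alarm integration described above more concrete, here is a minimal Python sketch of time-window correlation. The `Alarm` and `Ticket` fields, the choice of `site` as the grouping key, and the five-minute window are all assumptions made for illustration; real correlation tools use far richer logic and need the tuning mentioned above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alarm:
    source: str            # hypothetical device name
    site: str              # hypothetical grouping key
    raised_at: datetime
    actionable: bool = True


@dataclass
class Ticket:
    site: str
    opened_at: datetime
    last_alarm_at: datetime
    alarms: list = field(default_factory=list)


def correlate(alarms, window=timedelta(minutes=5)):
    """Group actionable alarms from the same site into one ticket when each
    alarm arrives within `window` of the previous alarm on that ticket."""
    latest_by_site = {}  # site -> most recent ticket opened for that site
    tickets = []
    for alarm in sorted(alarms, key=lambda a: a.raised_at):
        if not alarm.actionable:
            continue  # non-actionable alarms never create tickets
        ticket = latest_by_site.get(alarm.site)
        if ticket is None or alarm.raised_at - ticket.last_alarm_at > window:
            # No recent ticket for this site, so open a new one.
            ticket = Ticket(alarm.site, alarm.raised_at, alarm.raised_at)
            latest_by_site[alarm.site] = ticket
            tickets.append(ticket)
        ticket.alarms.append(alarm)
        ticket.last_alarm_at = alarm.raised_at
    return tickets
```

Even a simple rule like this means a burst of related alarms lands on one ticket instead of dozens, which is the time savings the example above is pointing at.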
3. Communication
Another common challenge is communication, specifically timely and accurate communication.
When incidents and problems occur, it’s critical for the NOC or other support team to identify them quickly and alert stakeholders in a timely and effective fashion.
Yet this is often easier said than done. Usually, if resources are devoted to improving incident management, teams focus only on optimizing the accuracy and speed of identifying and fully characterizing incidents without investing in the communications afterward.
“One of the biggest pain points we see all the time is, first, getting incidents identified quickly enough, and, second, making those initial incident communications effective and then continuing to keep all of the appropriate stakeholders involved. That stretches all the way to communicating with third-party vendors. Making communication effective and consistent is a huge challenge.”
— Ben Cone, Senior Solutions Engineer, INOC
Best Practices for ITIL Incident Management
So how can a NOC ensure they are managing incidents most efficiently? Here are a few best practices we’ve seen emerge over the years.
1. Communicate to stakeholders throughout the entire lifecycle of an incident
Especially in larger organizations that handle more (and more complex) incidents, communication needs to be present throughout the entire incident lifecycle.
Top-performing support teams communicate the status of their incidents from the moment an issue is identified to the end of the incident’s life, assuring users and stakeholders that all incidents are being properly handled. This level of communication also helps manage stakeholder expectations and encourages stakeholders to follow up if they have additional questions or comments.
In short, more communication can reduce and often eliminate many of the problems that arise from a user, customer, or another stakeholder not knowing what’s happening now and what’s happening next.
To achieve whole-lifecycle communication, make sure you have a documented process framework for incidents that includes:
- The stakeholder you need to contact
- A preferred method of communication (e.g., SMS)
- A set of technical questions that must be answered before contacting the stakeholder
- A templated communication to fill out and send
After this process is established and in use, you may identify areas that can be automated for greater efficiency—improvements can be iterated as opportunities arise. Automating various parts of the incident management process is not only possible but highly advantageous to companies that can do it thoughtfully.
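As one example of a step that lends itself to automation, the framework above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed format: the template wording, the field names in `REQUIRED_FIELDS`, and the `build_notice` helper are all hypothetical.

```python
from string import Template

# Hypothetical initial-notification template; wording and fields are assumptions.
INITIAL_NOTICE = Template(
    "[$severity] Incident $incident_id: $summary\n"
    "Impact: $impact\n"
    "Current status: $status\n"
    "Next update by: $next_update"
)

# The technical questions that must be answered before contacting the stakeholder.
REQUIRED_FIELDS = (
    "incident_id", "severity", "summary", "impact", "status", "next_update",
)


def build_notice(details, stakeholder, channel="sms"):
    """Fill in the template, refusing to proceed if any required answer is missing."""
    missing = [name for name in REQUIRED_FIELDS if not details.get(name)]
    if missing:
        raise ValueError("Cannot notify yet; missing: " + ", ".join(missing))
    return {
        "to": stakeholder,      # the stakeholder you need to contact
        "channel": channel,     # their preferred method of communication
        "body": INITIAL_NOTICE.substitute(details),
    }
```

Raising an error on missing fields is a deliberate choice in this sketch: it forces the technical questions to be answered before anything is sent.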
📄 Read our white paper, The Role of AIOps in Enhancing NOC Support, for a complete explanation of how incident management and other processes can be automated with today’s AIOps tools and all the advantages automation can bring.
2. Minimize escalations whenever possible
In general, incidents should be managed and resolved by the lowest-tier team possible so that more specialized, higher-level teams can focus on more complex issues. This can significantly improve resolution times and have an enormous positive impact on customers or end-users when done right.
A documented process that defines how and when to escalate an incident, and who should do so, ensures that incoming incidents end up in the most capable and efficient hands to resolve them as soon as possible.
Also, ensuring that need-to-know information is coming in with alarms allows technicians to save time that might have been spent hunting for follow-up information. Again, when these enhancements are made thoughtfully, the impact on speed can be huge.
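A documented escalation process can also be written down as explicit rules, which makes it easier to review and apply consistently. The Python sketch below is one hypothetical way to express "how and when to escalate"; the three-tier model, the severity scale (1 is most severe), and every time limit in `MAX_MINUTES_AT_TIER` are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    category: str            # e.g. "link-down"; hypothetical label
    severity: int            # 1 = most severe, 3 = least (assumed scale)
    minutes_open: int        # time spent at the current tier
    runbook_available: bool  # does Tier 1 have a documented procedure?


# Hypothetical policy: minutes each tier may hold an incident, by severity.
MAX_MINUTES_AT_TIER = {
    1: {1: 15, 2: 30, 3: 120},
    2: {1: 60, 2: 120, 3: 480},
}


def should_escalate(incident, current_tier):
    """Return True when the incident should move up one tier."""
    if current_tier >= 3:
        return False  # already at the top tier in this three-tier model
    if current_tier == 1 and not incident.runbook_available:
        return True   # Tier 1 has no documented procedure to follow
    limit = MAX_MINUTES_AT_TIER[current_tier][incident.severity]
    return incident.minutes_open > limit
```

Note that the default answer is "do not escalate": an incident stays with the lowest tier unless a specific, documented condition is met.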
3. Build a robust knowledge base
Another critical part of successful incident management within your NOC is having a well-maintained and robust knowledge base. The knowledge base is a handy reference for staff to use when troubleshooting issues that have come up and been solved before.
This knowledge base should include supplemental support documentation, such as runbooks and flowcharts. These help technical staff quickly identify the next steps and probable causes, again avoiding unnecessary rework and research. Successful resolutions for known issues can be recorded and re-deployed.
A robust knowledge base also ensures that first-level technical staff has the proper resources to resolve incidents, rather than needlessly escalating incidents to more specialized staff.
To do this, staff must identify all the sources of incidents (e.g., phone, email, alarm, portal, instant message), pinpoint the most common types of issues, and develop alarm-to-action procedures for staff to follow when they encounter each type of incident.
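In its simplest form, an alarm-to-action procedure is a lookup from incident source and issue type to an ordered list of steps. The sketch below shows that idea in Python; the table entries, the `DEFAULT_STEPS` fallback, and the `next_steps` helper are illustrative assumptions rather than recommended procedures.

```python
# Hypothetical alarm-to-action table: (source, issue type) -> ordered steps.
# Every entry is an illustrative assumption, not a recommended procedure.
ALARM_TO_ACTION = {
    ("alarm", "link-down"): [
        "Confirm the alarm is still active",
        "Check for related alarms on the same path",
        "Open a ticket with the circuit provider",
    ],
    ("email", "password-reset"): [
        "Verify the requester's identity",
        "Reset the password per the runbook",
    ],
}

# Fallback when no procedure exists yet; hitting it flags a knowledge base gap.
DEFAULT_STEPS = [
    "Record the details in the ticket",
    "Escalate per the documented escalation process",
]


def next_steps(source, issue_type):
    """Look up the documented procedure, falling back to the default steps."""
    key = (source.lower(), issue_type.lower())
    return ALARM_TO_ACTION.get(key, DEFAULT_STEPS)
```

Tracking how often the fallback is used is one way to decide which procedures to document next.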
4. Integrate your NOC’s tools for maximum efficiency
Most teams (especially larger ones) need to bring voice, email, text, customer portals, knowledge bases, documentation, and workflow management tools into the NOC to manage incidents.
The tricky part is, each of these might have its own platform—and integrating them to work together can be a tripping point that slows teams down and injects all kinds of opportunities for mistakes.
Without proper integrations connecting these tools, support teams have to track and manage multiple screens for incident and event information, manually collect data from various sources to document what’s happening, notify/escalate information and issues to the appropriate parties, and work toward service restoration.
An effective NOC receives notifications (like alarms) and information from multiple sources and presents them to staff in a single, consolidated view. It’s hard to be prescriptive here since the way an organization should integrate their tools depends on which specific tools they have, as well as other factors unique to their operation.
If your team finds itself without the integrations or “single pane of glass” it needs to be efficient, schedule a free NOC consultation with us and get the conversation started.
5. Establish a framework for operational service levels
Identify KPIs (key performance indicators) that can help you measure and track your performance. Set performance goals for metrics such as time to action on critical incidents, time to ticket, MTTR, and notification.
Ask yourself what your:
- Update frequencies are based on
- Level of severity is
- Follow-ups are on incidents
- Priorities are
Then, craft a plan to capture these metrics. Finally, schedule regular times to analyze them. For instance, you may want to see them in real time or review them monthly.
This should all feed into a continual service improvement (CSI) program, which we explain in detail here.
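Several of the metrics named in this section fall directly out of ticket timestamps. As a rough sketch, assuming each incident record carries four timestamps (the `IncidentRecord` fields below are hypothetical, and your ticketing platform will name them differently), time to ticket, time to action, and MTTR can be averaged like this in Python:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentRecord:
    detected_at: datetime      # alarm raised or issue reported
    ticketed_at: datetime      # ticket created
    first_action_at: datetime  # first troubleshooting step taken
    restored_at: datetime      # normal service operation restored


def _minutes(delta):
    return delta.total_seconds() / 60


def kpi_summary(records):
    """Average the core timing metrics, in minutes, over a non-empty list."""
    return {
        "time_to_ticket_min": mean(
            _minutes(r.ticketed_at - r.detected_at) for r in records
        ),
        "time_to_action_min": mean(
            _minutes(r.first_action_at - r.detected_at) for r in records
        ),
        "mttr_min": mean(
            _minutes(r.restored_at - r.detected_at) for r in records
        ),
    }
```

In this sketch every interval is measured from detection. That is a design choice, and whichever starting point you pick should be applied consistently so that month-to-month comparisons stay meaningful.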
A Simple ITIL Incident Management Checklist
For IT leaders wondering whether, and to what degree, their incident management workflow is ripe for enhancement or would be better served by a third-party NOC support provider, the following questions may be helpful to consider:
- Is your team communicating effectively through the entire lifecycle of an incident, or are key parties left out of the loop in some stages?
- Is your incident management workflow designed to ensure that each incident is handled by the lowest-tier team appropriate for the issue at hand, or are advanced resources routinely distracted with lower-tier incidents?
- Are you using templates for troubleshooting and communication to ensure all the necessary information has been gathered before reaching out to another stakeholder or escalating an issue?
- Are those responsible for managing incidents cataloging their successes into a knowledge base, and are staff routinely referencing that knowledge base when managing incidents?
- Are your monitoring and management tools adequately integrated, and is the data gathered across them made available through a single view?
- Are you capturing all the metrics you need to gauge the success of your incident management workflow accurately?
Final Thoughts and Next Steps
Inefficient incident management can be costly, but refining your processes and having an efficient team handling incidents in a timely fashion can lead to fewer complaints, more satisfied customers, and happier employees.
Of course, the support structures, operational workflow setup, and staffing requirements can be an expensive proposition for many companies, often requiring eight to ten or more 24x7 staff, additional tools, and time to develop more efficient processes. In these situations, an outsourced solution can sometimes make sense and reduce the cost of in-house staff and hiring.
For IT leaders weighing the benefits of in-house incident management vs. outsourcing, the following questions may be helpful to consider:
- Are you satisfied with the communications that are sent from your incident management group?
- Are you happy with your response times?
- Are you resolving incidents quickly enough?
- Are your customers and employees satisfied?
Interested in learning more about ITIL-aligned NOC operations support? Contact us to see how we can help you improve your IT service strategy and NOC support.
*Originally developed by the UK government’s Office of Government Commerce (OGC), now part of the Cabinet Office, and currently managed and developed by AXELOS, ITIL is a framework of best practices for delivering efficient and effective support services.