How to Elevate Your ITIL Incident Management Process (in 2024)

By Peter Prosen

Vice President of NOC Operations, INOC. Pete has 30 years’ experience in engineering and sales, including systems engineering, software development, OEM and channel sales, product management, process design and improvement, and system quality assurance. He has worked for a range of companies in the cable, telephony, cellular data and wireless markets, including E-Band Communications, Motorola, FlexLight Networks and ADC Telecommunications. The high-level wireless expertise Pete brought to INOC enabled the development of new and innovative solutions for clients. As manager of both INOC’s NOC and field services teams, he ensures quality and efficiency across the operation.

As IT support technology and software platforms continue to evolve, a well-conceived incident management process that incorporates a proven framework and abides by long-standing best practices has become essential in designing a support operation that can adapt to change and keep up with innovation.

The Information Technology Infrastructure Library’s (ITIL’s)* well-established best practices can help you set up a solid incident management framework for your organization. ITIL’s incident management lifecycle provides steps to process incidents from beginning to end.

Service desk teams, often working within a network operations center (NOC), are typically the frontline of support for managing incidents. Their role in incident management is to identify, diagnose, and work incidents to restore the defined service levels as quickly as possible.

The best incident management teams work through a dedicated process flow with a formal ticketing system. They rely on a clear, documented process with defined steps to work through each incident. That approach varies between organizations and teams, as does how rigidly they follow the ITIL framework.

Despite being a central, daily task for service desk teams, incident management processes are often fraught with inefficiencies. Over time, these inefficiencies can compound into a significant drag on performance and satisfaction—both among customers or end-users and support staff.

We sat down with two of INOC’s resident incident management experts to understand how IT support teams can take ITIL incident management to the next level through each step within ITIL’s incident management process.

Meet our contributors:

Eric Idler, INOC’s Director of Shared NOC

Eric has worked at INOC for six years, starting with the Service Desk team before progressing into numerous leadership roles, including Manager of Advanced Incident Management and Senior NOC Manager. With hands-on experience in Operations and Service Desk Management, he currently manages the Shared NOC. He ensures that incidents are handled efficiently and effectively through their life cycle to support INOC clients.

Peter Prosen, INOC’s VP of NOC Operations

Pete has 30 years of experience in engineering and sales, including systems engineering, software development, OEM and channel sales, product management, process design and improvement, and system quality assurance. He has worked for a range of companies in the cable, telephony, cellular data, and wireless markets, including E-Band Communications, Motorola, FlexLight Networks, and ADC Telecommunications. The high-level wireless expertise Pete brought to INOC enabled the development of new and innovative solutions for clients. As manager of both INOC’s NOC and field services teams, he ensures quality and efficiency across the operation.

1. Incident Identification and Prioritization

The key to identifying and prioritizing incidents is discerning critical, actionable alarms from less severe or unactionable ones.

A lack of operational maturity is a common problem support teams encounter here. Identifying incidents worthy of attention and prioritizing them so that attention goes where it ought to can be a struggle in support environments (like NOCs) when processes are less operationally mature than they should be. Teams need to acknowledge that not every incident is equally urgent. And when human attention and bandwidth are limited, it’s important to prioritize incidents accordingly.

  • Problems here almost always scale with incident volume: The more incidents that a team encounters, the more time they waste giving attention to those that aren’t actionable.
  • Many incidents simply indicate that something is happening without helping engineers understand what is happening or how to resolve it. Treating these “unactionable” incidents with the same urgency as actionable ones often results in sleepless nights for staff woken up in the middle of the night by alarms that don’t require their attention, or that do not suggest a solution. Over time—and especially as a company scales—this can wear on morale and lead to turnover, not to mention hurt the metrics a support team’s performance is measured by.

Equipment can also get in the way when determining whether an incident is actionable, how severe it is, and how to prioritize it accordingly.

  • For example, someone powering off a piece of personal equipment as they head home after work likely shouldn’t trigger a critical alarm. But your backhaul going down should, because multiple customers are affected. The point is, no one wants to roll trucks and wake up a boss for an issue that isn’t truly critical.

💡 How to elevate your incident identification and prioritization process:

  • To identify and prioritize incidents more efficiently, start by examining your alarms in detail. Identify which are actionable and which aren’t. (Again, “actionable” in this context means helping you understand and resolve the issue expressed by the incident rather than simply indicating an issue is occurring without telling you more.)
  • Use what you learn to help you classify future alarms to help engineers more quickly determine whether an event signaled by an alarm is truly a business-impacting incident, and how severe it is.
  • Then use your severity determinations to establish a priority level that can help engineers understand what takes precedence in moments where incidents compete with each other for attention.
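As a rough illustration of the steps above, here is a minimal Python sketch of a rule-based alarm classifier. The `Alarm` fields, the alarm kinds, and the `ACTIONABLE_KINDS` set are all hypothetical; in practice they would come out of your own review of past alarms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str              # device or circuit that raised the alarm
    kind: str                # e.g. "link_down", "power_off"
    customers_affected: int  # how many customers depend on the source


# Hypothetical rule set: alarm kinds that, on review, pointed to a
# clear corrective action. Everything else is treated as noise.
ACTIONABLE_KINDS = {"link_down", "high_optical_loss"}


def classify(alarm: Alarm) -> str:
    """Return 'critical', 'actionable', or 'informational'."""
    if alarm.kind not in ACTIONABLE_KINDS:
        return "informational"
    if alarm.customers_affected > 1:
        return "critical"
    return "actionable"


desk_pc = Alarm("desk-pc-17", "power_off", 0)
backhaul = Alarm("backhaul-1", "link_down", 40)
print(classify(desk_pc))   # informational
print(classify(backhaul))  # critical
```

The point of keeping the rules in one small, reviewable place is that the team can adjust them as it learns which alarms actually lead to action.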

At INOC, we help teams separate noise from meaningful incidents, determine which alarms are actionable, and use a robust database of port utilization to identify and prioritize each alarm.

To us, incident prioritization largely boils down to two elements: impact and urgency.

  • Impact refers to the consequences of the incident on individuals, such as how many individuals it affects. For example, an incident that affects a single individual would likely be a lower priority than an incident that affects an entire town.
  • Urgency refers to the financial repercussions of the incident continuing. For example, if a trading circuit actively used by a trading company is down, it could lose a significant amount of money every minute it’s down. 

Together, the urgency and impact will determine an appropriate priority designation. This in turn will help you use your resources efficiently so that your engineers aren’t scrambling to resolve every incident with the same level of fervor.
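One common way to combine the two is a simple priority matrix. The sketch below is illustrative only: the three-level scale, the additive scoring, and the P1 to P4 labels are assumptions, and your own thresholds may differ.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map impact and urgency to a priority label (P1 is most urgent)."""
    score = impact + urgency  # ranges from 2 to 6
    if score >= 6:
        return "P1"
    if score == 5:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"


# A town-wide outage on a revenue-critical service
print(priority(Level.HIGH, Level.HIGH))  # P1
# A single user with a minor, non-urgent issue
print(priority(Level.LOW, Level.LOW))    # P4
```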

📄 Download our free white paper, The Role of AIOps in Enhancing NOC Support, for a detailed look at the ways we’re applying bleeding-edge machine learning and automation to simultaneously reduce human effort and better identify and prioritize incidents. And schedule a free NOC consultation to see how we can bring these capabilities to improve your support workflow.

2. Incident Logging

Many teams narrowly define the incident logging process as capturing information on a ticket. But for most teams—especially larger ones—logging should extend to how incident data is stored for later analysis. Simply put, if a team isn’t storing its incident data, it can’t use it later.

NOCs that can store a significant amount of historical incident data, such as alarms and ticket history, can squeeze insights from this historical data to conduct problem management: preventing problems from occurring in the first place by identifying incidents’ root causes, which ultimately results in greater stability and better support.

Unfortunately, many of the tools support teams use to manage incidents automatically clear their logs once they reach a set number of entries or a set file size.

An external data storage solution is often necessary here. Moreover, data often must be consolidated from multiple sources and transferred from devices, and stored in such a way that it won’t be cumbersome to access in the future. This can be a challenge in and of itself if you don’t have an NMS (like SolarWinds or Nagios) capable of processing data from multiple sources at the same time (and actually configured to do so).


💡 How to elevate your incident logging process:

  • Conduct a review to see if you’re capturing and storing historical incident data. You may need an external storage solution to house that data.
  • Once you’ve captured a good amount of historical incident data (over several weeks or months), use what you learn to help you classify future alarms to help engineers more quickly determine whether an event signaled by an alarm is truly a business-impacting incident, and how severe it is.
  • If you haven’t already, establish a formal problem management program to pull important trends out of your incident data to drive proactive continuous improvement.
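To make the idea of storing incident data for later analysis concrete, here is a minimal sketch using Python’s built-in `sqlite3` module as a stand-in for an external store. The table layout, field names, and sample records are assumptions for illustration, not a recommended schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small incident history store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS incidents (
               id INTEGER PRIMARY KEY,
               opened_at TEXT NOT NULL,
               source TEXT NOT NULL,
               category TEXT NOT NULL,
               summary TEXT NOT NULL
           )"""
    )
    return conn


def log_incident(conn, opened_at, source, category, summary):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO incidents (opened_at, source, category, summary) "
            "VALUES (?, ?, ?, ?)",
            (opened_at, source, category, summary),
        )


def counts_by_category(conn):
    """Return (category, count) pairs, most frequent first."""
    rows = conn.execute(
        "SELECT category, COUNT(*) FROM incidents "
        "GROUP BY category ORDER BY COUNT(*) DESC, category"
    )
    return rows.fetchall()


conn = open_store()
log_incident(conn, "2024-01-05T02:14:00", "nms-a", "power", "Site 12 on battery")
log_incident(conn, "2024-01-09T11:40:00", "nms-b", "carrier", "Circuit down")
log_incident(conn, "2024-02-01T23:05:00", "nms-a", "power", "Site 12 on battery")
print(counts_by_category(conn))  # [('power', 2), ('carrier', 1)]
```

Passing a file path instead of `":memory:"` would persist the history across restarts, which is the property the tools described above often lack.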

If investing in a solution like that is out of reach at the moment, this may be a situation where partnering with a third-party NOC support provider like INOC can be beneficial. 

At INOC, we have a data storage vault and mature ITIL incident management system, including a robust, “single pane of glass” solution like the one described above. We also help teams extract the valuable insights stored within their historical incident data to drive continual improvement.

Schedule a free NOC consultation to start the conversation with our Solutions Engineers.

3. Incident Categorization

When categorizing incidents for reporting purposes, it can be difficult to find a happy medium between too much and not enough detail.

It can also be difficult to appropriately label incidents by category since you often run the risk of getting too granular and creating incident categories that are so specific that they only occur, say, once every six months—and are therefore not very helpful for trending problems.

  • Keep in mind that in addition to helping support staff sort and prioritize incidents when they occur, the other important purpose of categorization (the one teams often struggle the most to capitalize on) is to track incidents over time and see patterns that you can act on via problem management. 

Incident tracking is one of the biggest components we see missing in modern support operations. Without a thoughtful approach to categorizing your incidents, it’s difficult or even impossible to know how often particular incidents arise so you can see the trends that require some work—whether via problem management, training, or another lever you can pull.

Take it from us—it’s much easier to get buy-in from leaders on new tools, better training, or other resources when the data makes the business case for you.

💡 How to elevate your incident categorization process:

  • When thinking about categorization, start with your end goals and derive categories that serve them.
  • Take a page from our incident categorization playbook by sorting incidents into categories such as carrier incidents, power incidents, hardware incidents, software incidents, etc.
  • Also, consider using subcategories to further aid prioritization. One way to do this would be to separate incidents by provider types, such as telco provider, fiber provider, or network provider.
  • Make sure these categories aren’t arbitrary or useless; they should provide actionable context that helps you work smarter, faster, and more proactively.
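One quick way to check whether a category scheme is too granular is to count how often each category actually occurs. The sketch below uses made-up ticket data and a hypothetical `MIN_COUNT` floor; the right threshold depends on your volume and reporting period.

```python
from collections import Counter

# (category, subcategory) pairs as they might appear on closed tickets.
tickets = [
    ("carrier", "fiber provider"),
    ("carrier", "fiber provider"),
    ("carrier", "telco provider"),
    ("power", None),
    ("power", None),
    ("hardware", None),
]

MIN_COUNT = 2  # hypothetical floor for a category to be useful for trending

by_category = Counter(category for category, _ in tickets)
too_rare = sorted(c for c, n in by_category.items() if n < MIN_COUNT)

print(by_category.most_common())  # [('carrier', 3), ('power', 2), ('hardware', 1)]
print(too_rare)                   # ['hardware']
```

Categories that land in `too_rare` period after period are candidates for merging into a broader category.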

 

4. Incident Response

ITIL breaks incident response into a few steps: initial diagnosis, escalation if necessary, investigation and diagnosis, resolution and recovery, and closure.

Here are some quick tips in these areas:

Initial diagnosis 

Effective incident diagnosis requires visibility into a variety of metrics and past incident data to determine the cause and quickly restore service. It’s essentially a triage function. 

Service Desk staff have to rely on the information immediately in front of them to come up with an informed hypothesis about what’s likely happening. Based on this, they make decisions: start working to resolve it themselves or follow the right procedure to gather information, notify the appropriate parties, and take whatever action is necessary to get it resolved.

💡 How to elevate your initial incident diagnosis process:

  • Consider using your incident categories to develop communication scripts the service desk can use to help isolate the cause of a given incident and gather information for further investigation and diagnosis.
  • Utilize a known error database for quick information gathering. Many ITSM tools enable teams to document and store information about known errors and solutions for quick retrieval. Frontline support staff can use search tools to match an incident with a known error to diagnose and work to resolve it quickly.
  • An easily navigable knowledge base can also put crucial information in front of engineers to make informed initial diagnoses.
  • Consider how applying modern AIOps tools could profoundly improve incident diagnosis (and other parts of your workflow). Here at INOC, we’re applying machine learning and automation to auto-correlate alarms and events and automatically surface likely root causes, enabling engineers to analyze incidents much faster. Read much more about that in our free white paper: The Role of AIOps in Enhancing NOC Support
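As a simplified picture of matching an incident against a known error database, the sketch below scores word overlap between an incident summary and stored symptoms. The entries, IDs, and `min_overlap` threshold are hypothetical, and real ITSM tools typically offer far richer search than this.

```python
# Hypothetical known error records: id -> (symptom, documented workaround)
KNOWN_ERRORS = {
    "KE-001": ("bgp session flapping after firmware upgrade", "Roll back firmware"),
    "KE-002": ("ups on battery after utility power loss", "Dispatch generator"),
}


def tokens(text: str) -> set:
    return set(text.lower().split())


def best_match(summary: str, min_overlap: int = 2):
    """Return the id of the known error sharing the most words, or None."""
    words = tokens(summary)
    best_id, best_score = None, 0
    for ke_id, (symptom, _workaround) in KNOWN_ERRORS.items():
        score = len(words & tokens(symptom))
        if score > best_score:
            best_id, best_score = ke_id, score
    return best_id if best_score >= min_overlap else None


print(best_match("BGP session flapping on edge router"))  # KE-001
print(best_match("printer out of paper"))                 # None
```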

 

Incident escalation

If your frontline (or Tier 1) support team is properly trained and operationalized, it should be able to perform about 65% to 75% (or more) of all support activities on its own (thereby increasing productivity and enabling advanced staff to focus on strategic business initiatives).

But even the best frontline teams encounter incidents that need to be escalated. For incidents that require escalation, the goal is to collect and log as much information as is needed for Tier 2 or 3 engineers to quickly understand, diagnose, and resolve them.

  • For escalated cases, IT specialist engineers should be able to respond to and resolve issues in a systematic and timely manner. Every minute saved compounds into significant cost savings from improved staff efficiency, reduced mean time to resolution, and a much better end-user experience.
  • The biggest challenge we see here is that frontline staff aren’t trained or operationalized as well as they ought to be. As a result, they escalate way more incidents than they ought to, burdening advanced engineers with what’s often simple break-fix work. 
  • A similar challenge is that there simply is no frontline. Support isn’t organized, so incidents fall into everyone’s lap. This can exact a massive toll on productivity and job satisfaction, not to mention the business itself. 

Our other white paper, Empowering the IT Support Manager, clearly explains the purpose and value of having a Tier 1 NOC to handle incidents.

Here’s the key takeaway:

“By utilizing a skilled internal or outsourced 24x7 Tier 1 NOC service that consistently monitors, records, and manages events and incidents, IT Support Managers can ensure that 60% or more of their support issues are resolved at the front line. For escalated cases, the IT specialist engineers of an organization are able to respond to and resolve issues in a systematic and timely manner. This results in significant cost savings from improved staff efficiency, reduced mean time to resolution, and a much better end-user experience.”

Talk to us about outsourcing this vital frontline of support so your team can get back to revenue-generating projects.

💡 How to elevate your incident escalation process:

  • One simple way to achieve a better balance in how incidents are escalated is to use a combination of engineers’ judgment and a timed system. For example, you could set a time limit for an engineer to resolve a ticket, and when that limit is reached, it must be escalated to the next level. Meanwhile, leads and supervisors can monitor engineers’ activity and reach out to individuals who are struggling to improve their performance over time.
  • If advanced staff are especially burdened by break-fix work, the bigger solution is a structural one: implementing a tiered organization/workflow. Organizing your NOC activities and workflows based on your specific technologies and skill levels is one of the biggest operational unlocks in the NOC. Teams that implement a structure similar to the one visualized below virtually always handle events and service requests, and resolve incidents at the appropriate tier, faster than before.
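The timed escalation rule described above can be sketched in a few lines of Python. The per-tier limits here are invented for illustration; choose values that fit your own SLAs.

```python
from datetime import datetime, timedelta

# Hypothetical time limits per tier before a ticket must move up.
TIER_LIMITS = {1: timedelta(minutes=30), 2: timedelta(hours=2)}


def should_escalate(tier, assigned_at, now, engineer_requests=False):
    """Escalate on engineer judgment or when the tier's time limit is reached."""
    if engineer_requests:
        return True
    limit = TIER_LIMITS.get(tier)  # the top tier has no limit
    return limit is not None and now - assigned_at >= limit


start = datetime(2024, 1, 1, 9, 0)
print(should_escalate(1, start, start + timedelta(minutes=10)))  # False
print(should_escalate(1, start, start + timedelta(minutes=30)))  # True
print(should_escalate(1, start, start + timedelta(minutes=5),
                      engineer_requests=True))                   # True
```

Combining the timer with an explicit engineer override keeps judgment in the loop while still putting a ceiling on how long a ticket can sit at one tier.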

[Figure: Tiered NOC Support Structure]

Having such a structure for properly managing your workflow can prevent your NOC from being overwhelmed by the “wall of red” NOC teams strive to avoid at all costs. In most NOCs, issues should be prioritized and organized into a set of queues, so each of them can be handled by the appropriate group.

Download our white paper, Top 10 Challenges to Running a Successful NOC, for a set of example workflow queues you can use to break up issues and assign them to groups based on skillset.

Investigation and diagnosis

ITIL formally separates this as a discrete step, but in real-world operations, it’s happening across the entire incident lifecycle.

By collecting information, your frontline support staff are already investigating to a certain extent and may even successfully diagnose and resolve the incident right then and there. If not, staff will be investigating and diagnosing the issue as it’s escalated and worked until it’s resolved.

Resolution and recovery

After a diagnosis has been reached–and hopefully, within your SLAs–you will proceed to resolve the issue. 

The term "recovery" in this case simply refers to the time it takes to fully restore operations once the proper fix has been identified, since some fixes require testing and deployment before service is back to normal.

Incident closure

At the end of its lifecycle, an incident is passed back to the service desk (if it was escalated) to be closed. Incidents should only be closed by service desk employees to maintain quality. To close the incident, the incident owner should verify with the person who reported the incident that the resolution is satisfactory.

One of the challenges with closure is ensuring that a given issue was completely resolved. A prime example here is a fiber cut on an optical network. Just because a fiber cut has been repaired and the fiber is back online doesn't necessarily mean the issue is resolved. You could have created other problems during the fix that need to be addressed.

“One of the things that we see a lot here at INOC is something like a fiber cut getting fixed—and everything going back online with alarms cleared. But before we can resolve it, we have to do a health check on that circuit or that particular fiber.

And what we find a lot of times is that the splice, even though it was performed in the circuit as a backup line, what we're seeing now is some degradation across the line.

That degradation could be enough that even though it's working, it's potentially going to cause future problems. So we have to be cognizant of the second-order effects of a given fix—and quick enough to be able to say—while crews are still there in the field—“this looks like it’s back up, but that splice isn't up to standard. Let’s re-splice that to avoid loss across that span.”

— Peter Prosen, INOC’s VP of NOC Operations

💡 How to elevate your incident closure process:

  • Identify possible second-order problems that may arise from actions taken to resolve incidents and control for them.
  • Develop a procedure to confirm total resolution before incident closure.
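A closure procedure like this can be as simple as a checklist that must come back empty before a ticket is closed. The three checks below are assumptions drawn from the fiber cut example; a real list would reflect your own services.

```python
from dataclasses import dataclass


@dataclass
class ClosureChecks:
    alarms_cleared: bool
    health_check_passed: bool  # e.g. the repaired span meets the team's standard
    reporter_confirmed: bool   # the person who reported it accepts the fix


def blocking_items(checks: ClosureChecks) -> list:
    """Return the checks that still block closure (empty list means ready)."""
    return [name for name, ok in vars(checks).items() if not ok]


# Fiber is back up and alarms have cleared, but the splice shows degradation.
after_splice = ClosureChecks(alarms_cleared=True,
                             health_check_passed=False,
                             reporter_confirmed=True)
print(blocking_items(after_splice))  # ['health_check_passed']
```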


5. Communication With the User Community Throughout the Life of the Incident

Determining how much to communicate with the user community throughout the life of an incident is something of an art, but it’s a critical component of incident management.

If you communicate too much, you risk taxing a carrier, overwhelming a client or end-user, or losing important messages in the noise of endless notifications. But if you don’t provide updates frequently enough, you can stress those stakeholders and force them to operate uninformed.

💡 How to elevate your incident communication process:

  • Start with what the “user” actually needs to know. There are a number of ways to act on this information.
  • For some, notifying the user only when necessary is enough. For example, a lot of action tends to happen when you are on a P3, so notifying them when this is going on makes sense.
  • On the other end of the spectrum, it’s useful to know when a client likes to sort and filter messages themselves. In this case, they often want every email, so notifying them more frequently is appropriate because it will be delegated internally.
  • By the same token, some clients may prefer to deal with minor issues that occurred during 3rd shift first thing in the morning if they aren’t critical.
  • Another way to limit the volume of notifications is to set a time limit for resolution. For example, if the incident is not resolved in under 3 minutes, then the customer should be notified. This can cut back on unnecessary notifications for routine bounces.
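The time-limit idea in the last bullet amounts to a hold-down timer on notifications. Here is a minimal sketch, assuming the 3-minute window from the example; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

HOLD_DOWN = timedelta(minutes=3)  # window taken from the example above


def should_notify(opened_at, now, resolved_at=None):
    """Notify only if the incident was not resolved within the hold-down window."""
    deadline = opened_at + HOLD_DOWN
    if resolved_at is not None and resolved_at < deadline:
        return False  # routine bounce that cleared in time
    return now >= deadline


opened = datetime(2024, 1, 1, 3, 0)
four_min = opened + timedelta(minutes=4)
print(should_notify(opened, four_min))                                  # True
print(should_notify(opened, four_min,
                    resolved_at=opened + timedelta(minutes=1)))         # False
```

Clients who prefer every message, or who want minor overnight issues held until morning, could be handled by making the window a per-client setting.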

Greater context surrounding the issue, such as which parties are involved in the resolution process, is another factor in client communication. If the client lacks the information needed to understand the decisions you are making regarding urgency and impact, it is in your best interest to fill them in. 

Going back to our fiber cut example, if your client is experiencing issues due to a cut fiber, and you know that the fiber carrier is fixing it, and cannot fix it any faster, you may only check in with them every hour to get a status report. 

At the same time, you may not have a higher team working on it, because it’s not necessary. In these situations, the client may need to be informed why you aren’t escalating the issue and the fact that it’s pointless to update them more than every hour since you understand how the third-party provider works.

📋 A simple ITIL incident management checklist:

For IT leaders wondering whether, and to what degree, their incident management workflow is ripe for enhancement or would be better served by a third-party NOC support provider, the following questions may be helpful to consider:

  • Is your team communicating effectively through the entire lifecycle of an incident, or are key parties left out of the loop in some stages?
  • Is your incident management workflow designed to ensure that each incident is handled by the lowest tier appropriate for the issue at hand, or are advanced resources routinely distracted by lower-tier incidents?
  • Are you using templates for troubleshooting and communication to ensure all the necessary information has been gathered before reaching out to another stakeholder or escalating an issue?
  • Are those responsible for managing incidents cataloging their successes into a knowledge base, and are staff routinely referencing that knowledge base when managing incidents?
  • Are your monitoring and management tools adequately integrated, and is the data gathered across them made available through a single view?
  • Are you capturing all the metrics you need to gauge the success of your incident management workflow accurately?
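For the metrics question, two measures mentioned earlier in this article are mean time to resolution and the share of incidents resolved at the front line. The sketch below computes both from made-up ticket records; the tuple layout is an assumption for illustration.

```python
from datetime import datetime
from statistics import mean

# (opened, resolved, tier that resolved it); made-up sample records.
tickets = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30), 1),
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 12, 0), 2),
    (datetime(2024, 1, 2, 8, 0), datetime(2024, 1, 2, 8, 30), 1),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 0), 1),
]

# Mean time to resolution, in minutes
mttr_minutes = mean(
    (resolved - opened).total_seconds() / 60 for opened, resolved, _ in tickets
)

# Fraction of incidents resolved at Tier 1
tier1_rate = sum(1 for *_, tier in tickets if tier == 1) / len(tickets)

print(f"MTTR: {mttr_minutes:.0f} min")            # MTTR: 60 min
print(f"Tier 1 resolution: {tier1_rate:.0%}")     # Tier 1 resolution: 75%
```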


Why INOC for Incident Management?

20+ years of experience in NOC services have helped us deliver outsourced incident management and expert consulting that helps teams hit their SLAs by bringing next-level efficiencies into their workflows.

We realize incident management relies heavily on processes documented in the runbook and delivered through tools like ticketing systems. These processes are essential to unlocking the value of a tiered NOC organization and its resources. Effectively receiving, diagnosing, and then responding to an incident requires a thoughtfully operationalized support structure and visibility into a variety of metrics and past incident data to determine true root causes, restore service rapidly, and prevent their recurrence.

Here at INOC, we apply the very latest in machine learning and automation capabilities (AIOps) to radically improve the speed and quality of service to hit service-level targets while reducing human effort.

Teams that entrust their incident management to us free themselves from the bulk of their break-fix work while achieving things like:

  • Faster incident analysis and response. Our AIOps-enabled incident workflows autonomously surface the probable cause of incidents and allow NOC engineers to confirm that the analysis and data are sound before implementing a plan for resolution.
  • Automated response. Thoughtfully automating low-risk tasks for routine alerts in non-business-critical workloads frees engineers to focus on complex infrastructure support issues rather than simple break-fix work. By examining existing incident response procedures, we identify the most time-consuming repetitive actions and apply automation. When implemented well, AIOps can reduce resolution times substantially.
  • Predictive alerting. By correlating real-time event and performance data with past event data that resulted in outages, AIOps can identify developing problems before they require reactive response. This advantage helps the NOC move from a reactive and proactive support model to a predictive one. Impending failures are identified for further action, saving customers downtime. In addition, by identifying potential remediation paths based on incident similarity, AIOps can help ensure insights from past remediation efforts are not wasted.

Want to learn more about effective incident management? Contact us to see how we can help you improve your IT service strategy and NOC support or download our free white paper below.

A Practical Guide to Running an Effective NOC

Download our free white paper and learn how to build, optimize, and manage your NOC to maximize performance and uptime.

*Originally developed by the UK government’s Office of Government Commerce (OGC) - now known as the Cabinet Office - and currently managed and developed by AXELOS, ITIL is a framework of best practices for delivering efficient and effective support services.
