As IT support technology and software platforms continue to evolve, a well-conceived incident management process that incorporates a proven framework and abides by long-standing best practices has become essential in designing a support operation that can adapt to change and keep up with innovation.
The Information Technology Infrastructure Library’s (ITIL’s)* well-established best practices can help you set up a solid incident management framework for your organization. ITIL’s incident management lifecycle provides steps to process incidents from beginning to end.
Service desk teams, often working within a network operations center (NOC), are typically the frontline of support for managing incidents. Their role in incident management is to identify, diagnose, and work incidents to restore the defined service levels as quickly as possible.
The best incident management teams work through a dedicated process flow with a formal ticketing system. They rely on a clear, documented process with defined steps to work through each incident. The specifics vary between organizations and teams, depending on how rigidly they follow the ITIL framework.
Despite being a central, daily task for service desk teams, incident management processes are often fraught with inefficiencies. Over time, these inefficiencies can compound into a significant drag on performance and satisfaction—both among customers or end-users and support staff.
We sat down with two of INOC’s resident incident management experts to understand how IT support teams can take ITIL incident management to the next level through each step within ITIL’s incident management process.
We cover:
- Incident Identification and Prioritization
- Incident Logging
- Incident Categorization
- Incident Response
- Communication With the User Community Throughout the Life of the Incident
Meet our contributors:
Eric Idler, INOC’s Director of Shared NOC
Eric has worked at INOC for six years, starting with the Service Desk team before progressing into numerous leadership roles, including Manager of Advanced Incident Management and Senior NOC Manager. With hands-on experience in Operations and Service Desk Management, he currently manages the Shared NOC. He ensures that incidents are handled efficiently and effectively through their life cycle to support INOC clients.
Peter Prosen, INOC’s VP of NOC Operations
Pete has 30 years of experience in engineering and sales, including systems engineering, software development, OEM and channel sales, product management, process design and improvement, and system quality assurance. He has worked for a range of companies in the cable, telephony, cellular data, and wireless markets, including E-Band Communications, Motorola, FlexLight Networks, and ADC Telecommunications. The high-level wireless expertise Pete brought to INOC enabled the development of new and innovative solutions for clients. As manager of both INOC’s NOC and field services teams, he ensures quality and efficiency across the operation.
1. Incident Identification and Prioritization
The key to identifying and prioritizing incidents is discerning critical, actionable alarms from less severe or unactionable ones.
A lack of operational maturity is a common problem support teams encounter here. In support environments like NOCs, immature processes make it a struggle to identify which incidents deserve attention and to prioritize them so that attention goes where it’s needed. Teams need to acknowledge that not every incident is equally urgent. And when human attention and bandwidth are limited, it’s important to prioritize incidents accordingly.
- Problems here almost always scale with incident volume: The more incidents that a team encounters, the more time they waste giving attention to those that aren’t actionable.
- Many incidents simply indicate that something is happening without helping engineers understand what is happening or how to resolve it. Treating these “unactionable” incidents with the same urgency as actionable ones often means staff are woken in the middle of the night by alarms that don’t require their attention or don’t suggest a solution. Over time, and especially as a company scales, this can wear on morale and lead to turnover, not to mention hurt the metrics by which a support team’s performance is measured.
Equipment can also get in the way when determining whether an incident is actionable, how severe it is, and how to prioritize it accordingly.
- For example, an alarm triggered by someone powering off a piece of personal equipment as they head home after work likely shouldn’t be causing a critical alarm. But your backhaul going down should—because multiple customers have been affected. The point is, no one wants to roll trucks and wake up a boss for an issue that isn’t truly critical.
💡 How to elevate your incident identification and prioritization process:
At INOC, we help teams separate noise from meaningful incidents, determine which alarms are actionable, and use a robust database of port utilization to identify and prioritize each alarm.
To us, incident prioritization largely boils down to two elements: impact and urgency.
- Impact refers to the consequences of the incident on individuals, such as how many individuals it affects. For example, an incident that affects a single individual would likely be a lower priority than an incident that affects an entire town.
- Urgency refers to the financial repercussions of the incident continuing. For example, if a circuit actively used by a trading company is down, the company could lose a significant amount of money every minute the outage lasts.
Together, the urgency and impact will determine an appropriate priority designation. This in turn will help you use your resources efficiently so that your engineers aren’t scrambling to resolve every incident with the same level of fervor.
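To make the impact-and-urgency idea concrete, here is a minimal Python sketch of a priority matrix. The three-level scale, the equal weighting, and the P1 to P4 labels are illustrative assumptions on our part, not a scheme ITIL prescribes; your own matrix should reflect your SLAs.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map an (impact, urgency) pair to a priority label.

    Assumption: impact and urgency carry equal weight, so the two
    levels are simply added (range 2..6) and bucketed.
    """
    score = int(impact) + int(urgency)
    if score == 6:
        return "P1"  # high impact and high urgency
    if score == 5:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"  # score of 2 or 3


# A town-wide outage on a revenue-critical circuit vs. one user's minor issue.
print(priority(Level.HIGH, Level.HIGH))  # P1
print(priority(Level.LOW, Level.LOW))    # P4
```

Adding the two levels is only one possible design. Many teams prefer an explicit lookup table so that, say, high urgency always outranks high impact.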
📄 Download our free white paper—The Role of AIOps in Enhancing NOC Support—for a detailed look at the ways we’re applying bleeding-edge machine learning and automation to simultaneously reduce human effort and better identify and prioritize incidents. And schedule a free NOC consultation to see how we can bring these capabilities to improve your support workflow.
2. Incident Logging
Many teams narrowly define the incident logging process as capturing information on a ticket. But for most teams—especially larger ones—logging should extend to how incident data is stored for later analysis. Simply put, if a team isn’t storing its incident data, it can’t use it later.
NOCs that can store a significant amount of historical incident data, such as alarms and ticket history, can squeeze insights from it to conduct problem management: identifying incidents’ root causes so that problems are prevented from occurring in the first place, which ultimately results in greater stability and better support.
Unfortunately, many of the tools support teams use to manage incidents automatically clear their logs once they reach a set number of entries or a set file size.
An external data storage solution is often necessary here. Moreover, data often must be consolidated from multiple sources and transferred from devices, and stored in such a way that it won’t be cumbersome to access in the future. This can be a challenge in and of itself if you don’t have an NMS (like SolarWinds or Nagios) capable of processing data from multiple sources at the same time (and actually configured to do so).
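As a rough illustration of consolidating incident data from several tools into one external store, the sketch below uses Python’s built-in `sqlite3` module. The table layout, the source names (`nms`, `ticketing`), and the record fields are all hypothetical, and the sample records are made up; a production archive would more likely sit on a dedicated database fed by your tools’ export mechanisms.

```python
import sqlite3


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the archive. ':memory:' keeps this demo self-contained."""
    conn = sqlite3.connect(path)
    # One row per record, keyed by the originating tool plus that tool's own id,
    # so re-exporting the same record never creates a duplicate.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS incidents (
            source     TEXT NOT NULL,
            source_id  TEXT NOT NULL,
            opened_at  TEXT NOT NULL,
            category   TEXT,
            summary    TEXT,
            PRIMARY KEY (source, source_id)
        )
        """
    )
    return conn


def archive(conn: sqlite3.Connection, source: str, records: list) -> None:
    """Copy records exported from one tool into the shared archive."""
    rows = [
        (source, r["id"], r["opened_at"], r.get("category"), r.get("summary"))
        for r in records
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO incidents VALUES (?, ?, ?, ?, ?)", rows
        )


def count(conn: sqlite3.Connection) -> int:
    """Number of distinct records currently archived."""
    return conn.execute("SELECT COUNT(*) FROM incidents").fetchone()[0]


conn = open_archive()
alarm = {"id": "A-1", "opened_at": "2024-01-05T02:14:00Z", "category": "link-down"}
ticket = {"id": "T-9", "opened_at": "2024-01-05T02:20:00Z", "summary": "Outage reported"}
archive(conn, "nms", [alarm])
archive(conn, "ticketing", [ticket])
archive(conn, "nms", [alarm])  # a repeated export of the same alarm is ignored
print(count(conn))  # 2
```

The point of the composite key is that exports can run on a schedule, before the source tool clears its own logs, without anyone worrying about duplicates.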
💡 How to elevate your incident logging process:
If investing in a solution like that is out of reach at the moment, this may be a situation where partnering with a third-party NOC support provider like INOC can be beneficial.
At INOC, we have a data storage vault and mature ITIL incident management system, including a robust, “single pane of glass” solution like the one described above. We also help teams extract the valuable insights stored within their historical incident data to drive continual improvement.
Schedule a free NOC consultation to start the conversation with our Solutions Engineers.
3. Incident Categorization
When categorizing incidents for reporting purposes, it can be difficult to find a happy medium between too much and not enough detail.
It can also be difficult to appropriately label incidents by category since you often run the risk of getting too granular and creating incident categories that are so specific that they only occur, say, once every six months—and are therefore not very helpful for trending problems.
- Keep in mind that in addition to helping support staff sort and prioritize incidents when they occur, the other important purpose of categorization (the one teams often struggle the most to capitalize on) is to track incidents over time and see patterns that you can act on via problem management.
Incident tracking is one of the biggest components we see missing in modern support operations. Without a thoughtful approach to categorizing your incidents, it’s difficult or even impossible to know how often particular incidents arise so you can see the trends that require some work—whether via problem management, training, or another lever you can pull.
Take it from us—it’s much easier to get buy-in from leaders on new tools, better training, or other resources when the data makes the business case for you.
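Here is a small sketch of what category-based trending can look like once incidents carry a category label. The category names, the sample dates, and the `min_count` threshold are made up for illustration; the idea is simply to count incidents per category per month and to flag categories so rare that they are unlikely to help with trending.

```python
from collections import Counter


def monthly_trend(incidents):
    """Count incidents per (YYYY-MM, category) from (iso_date, category) pairs."""
    return Counter((opened[:7], category) for opened, category in incidents)


def too_rare(incidents, min_count=2):
    """Return categories seen fewer than min_count times (candidates to merge)."""
    totals = Counter(category for _, category in incidents)
    return sorted(c for c, n in totals.items() if n < min_count)


# Made-up sample data: (date opened, category).
sample = [
    ("2024-01-03", "fiber-cut"),
    ("2024-01-19", "power"),
    ("2024-02-02", "fiber-cut"),
    ("2024-02-11", "fiber-cut"),
    ("2024-02-27", "vendor-x-firmware-bug-on-shelf-3"),
]

print(monthly_trend(sample)[("2024-02", "fiber-cut")])  # 2
print(too_rare(sample))  # ['power', 'vendor-x-firmware-bug-on-shelf-3']
```

A category that keeps showing up in the `too_rare` list over a long window is a hint that it is too granular and could be folded into a broader one.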
💡 How to elevate your incident categorization process:
4. Incident Response
ITIL breaks incident response into a few steps: initial diagnosis, escalation if necessary, investigation and diagnosis, resolution and recovery, and closure.
Here are some quick tips in these areas:
Initial diagnosis
Effective incident diagnosis requires visibility into a variety of metrics and past incident data to determine the cause and quickly restore service. It’s essentially a triage function.
Service Desk staff have to rely on the information immediately in front of them to come up with an informed hypothesis about what’s likely happening. Based on this, they make decisions: start working to resolve it themselves or follow the right procedure to gather information, notify the appropriate parties, and take whatever action is necessary to get it resolved.
💡 How to elevate your initial incident diagnosis process:
Incident escalation
If your frontline (or Tier 1) support team is properly trained and operationalized, it should be able to perform about 65% to 75% (or more) of all support activities on its own (thereby increasing productivity and enabling advanced staff to focus on strategic business initiatives).
But even the best frontline teams encounter incidents that need to be escalated. For incidents that require escalation, the goal is to collect and log as much information as is needed for Tier 2 or 3 engineers to quickly understand, diagnose, and resolve them.
- For escalated cases, IT specialist engineers should be able to respond to and resolve issues in a systematic and timely manner. Every minute saved compounds into significant cost savings from improved staff efficiency, reduced mean time to resolution, and a much better end-user experience.
- The biggest challenge we see here is that frontline staff aren’t trained or operationalized as well as they ought to be. As a result, they escalate way more incidents than they ought to, burdening advanced engineers with what’s often simple break-fix work.
- A similar challenge is that there simply is no frontline. Support isn’t organized, so incidents fall into everyone’s lap. This can exact a massive toll on productivity and job satisfaction, not to mention the business itself.
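One way to see whether your frontline is escalating more than it should is to measure the share of closed incidents resolved at Tier 1 and compare it with the 65% to 75% range mentioned above. The sketch below assumes a hypothetical ticket export with `status` and `resolved_tier` fields; your ticketing system will name these differently.

```python
def tier1_resolution_rate(tickets):
    """Fraction of closed tickets resolved at Tier 1, or None if none are closed."""
    closed = [t for t in tickets if t["status"] == "closed"]
    if not closed:
        return None
    at_tier1 = sum(1 for t in closed if t["resolved_tier"] == 1)
    return at_tier1 / len(closed)


# Made-up sample export; open tickets are excluded from the calculation.
tickets = [
    {"status": "closed", "resolved_tier": 1},
    {"status": "closed", "resolved_tier": 1},
    {"status": "closed", "resolved_tier": 2},
    {"status": "closed", "resolved_tier": 1},
    {"status": "open", "resolved_tier": None},
]

rate = tier1_resolution_rate(tickets)
print(f"{rate:.0%}")  # 75%
```

Tracked month over month, a falling rate can point to a training gap or missing runbooks before it shows up as burnout among advanced engineers.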
Our other white paper, Empowering the IT Support Manager, clearly explains the purpose and value of having a Tier 1 NOC to handle incidents.
Here’s the key takeaway:
“By utilizing a skilled internal or outsourced 24x7 Tier 1 NOC service that consistently monitors, records, and manages events and incidents, IT Support Managers can ensure that 60% or more of their support issues are resolved at the front line. For escalated cases, the IT specialist engineers of an organization are able to respond to and resolve issues in a systematic and timely manner. This results in significant cost savings from improved staff efficiency, reduced mean time to resolution, and a much better end-user experience.”
Talk to us about outsourcing this vital frontline of support so your team can get back to revenue-generating projects.
💡 How to elevate your incident escalation process:
Having such a structure for properly managing your workflow can prevent your NOC from being overwhelmed by the “wall of red” NOC teams strive to avoid at all costs. In most NOCs, issues should be prioritized and organized into a set of queues, so each of them can be handled by the appropriate group.
Download our white paper—Top 10 Challenges to Running a Successful NOC—for a set of example workflow queues you can use to break up issues and assign them to groups based on skillset.
Investigation and diagnosis
ITIL formally separates this as a discrete step, but in real-world operations, it’s happening across the entire incident lifecycle.
By collecting information, your frontline support staff are already investigating to a certain extent and may even successfully diagnose and resolve the incident right then and there. If not, staff will be investigating and diagnosing the issue as it’s escalated and worked until it’s resolved.
Resolution and recovery
Once a diagnosis has been reached (ideally within your SLAs), you proceed to resolve the issue.
The term “recovery” here simply refers to the time it takes to fully restore operations after the proper fix has been identified, since some fixes require testing and deployment before service is back to normal.
Incident closure
At the end of its lifecycle, an incident is passed back to the service desk (if it was escalated) to be closed. To maintain quality, incidents should only be closed by service desk employees. To close the incident, the incident owner should verify with the person who reported the incident that the resolution is satisfactory.
One of the challenges with closure is ensuring that a given issue was completely resolved. A prime example here is a fiber cut on an optical network. Just because a fiber cut has been repaired and the fiber is back online doesn't necessarily mean the issue is resolved. You could have created other problems during the fix that need to be addressed.
“One of the things that we see a lot here at INOC is something like a fiber cut getting fixed—and everything going back online with alarms cleared. But before we can resolve it, we have to do a health check on that circuit or that particular fiber.
And what we find a lot of times is that the splice, even though it was performed in the circuit as a backup line, what we're seeing now is some degradation across the line.
That degradation could be enough that even though it's working, it's potentially going to cause future problems. So we have to be cognizant of the second-order effects of a given fix—and quick enough to be able to say—while crews are still there in the field—“this looks like it’s back up, but that splice isn't up to standard. Let’s re-splice that to avoid loss across that span.”
— Peter Prosen, INOC’s VP of NOC Operations
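The health check described in the quote can be thought of as a gate in front of the “close” button. The sketch below is one hedged way to express that gate in Python. The field names and the 0.5 dB allowance for added loss are placeholders we invented for illustration, not an optical engineering standard; use the thresholds your circuit specifications call for.

```python
from dataclasses import dataclass


@dataclass
class ClosureCheck:
    """Hypothetical facts gathered before closing a fiber-cut incident."""
    alarms_cleared: bool
    baseline_loss_db: float   # span loss recorded before the cut
    measured_loss_db: float   # span loss measured after the repair
    reporter_confirmed: bool  # the person who reported it is satisfied


def closure_blockers(check: ClosureCheck, max_added_loss_db: float = 0.5):
    """Return the reasons an incident should stay open (empty list = OK to close).

    max_added_loss_db is an assumed placeholder threshold.
    """
    blockers = []
    if not check.alarms_cleared:
        blockers.append("alarms still active")
    added_loss = check.measured_loss_db - check.baseline_loss_db
    if added_loss > max_added_loss_db:
        blockers.append("span loss degraded after repair")
    if not check.reporter_confirmed:
        blockers.append("reporter has not confirmed resolution")
    return blockers


# Alarms are clear, but the new splice added 1.5 dB and nobody has confirmed yet.
check = ClosureCheck(
    alarms_cleared=True,
    baseline_loss_db=4.0,
    measured_loss_db=5.5,
    reporter_confirmed=False,
)
print(closure_blockers(check))
# ['span loss degraded after repair', 'reporter has not confirmed resolution']
```

Returning the list of blockers, instead of a bare yes or no, gives the service desk something specific to relay to the field crew while they are still on site.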
💡 How to elevate your incident closure process:
5. Communication With the User Community Throughout the Life of the Incident
Determining how much to communicate with the user community throughout the life of an incident is something of an art, but it’s a critical component of incident management.
If you communicate too much, you risk taxing a carrier, overwhelming a client or end-user, or burying important messages in the noise of endless notifications. But if you don’t provide updates frequently enough, you can stress those stakeholders and force them to operate uninformed.
💡 How to elevate your incident communication process:
Greater context surrounding the issue, such as how parties are involved in the resolution process, is another factor in client communication. If the client is lacking the necessary information to understand why you are making the decisions you are making regarding urgency and impact, it would be in your best interest to fill them in.
Going back to our fiber cut example, if your client is experiencing issues due to a cut fiber, and you know that the fiber carrier is fixing it and cannot fix it any faster, you may only check in with the carrier every hour to get a status report.
At the same time, you may not have a higher-tier team working on it, because it’s not necessary. In these situations, the client may need to be told why you aren’t escalating the issue and why updating them more than once an hour would add nothing, given what you know about how the third-party provider works.
📋 A simple ITIL incident management checklist: For IT leaders wondering whether, and to what degree, their incident management workflow is ripe for some enhancement or would be better served by a third-party NOC support provider, the following questions may be helpful to consider:
Why INOC for Incident Management?
20+ years of experience in NOC services have helped us deliver outsourced incident management and expert consulting that helps teams hit their SLAs by bringing next-level efficiencies into their workflows.
We realize incident management relies heavily on processes documented in the runbook and delivered through tools like ticketing systems. These processes are essential to unlocking the value of a tiered NOC organization and its resources. Effectively receiving, diagnosing, and then responding to an incident requires a thoughtfully operationalized support structure and visibility into a variety of metrics and past incident data to determine true root causes, restore service rapidly, and prevent incidents from recurring.
Here at INOC, we apply the very latest in machine learning and automation capabilities (AIOps) to radically improve the speed and quality of service to hit service-level targets while reducing human effort.
Teams that entrust their incident management to us free themselves from the bulk of their break-fix work while achieving things like:
Want to learn more about effective incident management? Contact us to see how we can help you improve your IT service strategy and NOC support or download our free white paper below.
FREE WHITE PAPER
A Practical Guide to Running an Effective NOC
Download our free white paper and learn how to build, optimize, and manage your NOC to maximize performance and uptime.
*Originally developed by the UK government’s Office of Government Commerce (OGC), which has since been absorbed into the Cabinet Office, and currently managed and developed by AXELOS, ITIL is a framework of best practices for delivering efficient and effective support services.