As IT support technology and software platforms continue to evolve, a well-conceived incident management process incorporating a proven framework and abiding by long-standing best practices has become essential in designing a support operation that can adapt to change and keep up with innovation.
The Information Technology Infrastructure Library’s (ITIL’s)* well-established best practices can help you set up a solid incident management framework for your organization. ITIL’s incident management life cycle provides steps to process incidents from beginning to end.
Service Desk teams, often working within a network operations center (NOC), are typically the frontline of support for managing incidents. Their role in incident management is to identify, diagnose, and work incidents to restore the defined service levels as quickly as possible.
The best incident management teams work through a dedicated process flow with a formal ticketing system. They rely on a clear, documented process with defined steps to work through each incident. The approach varies between organizations and teams, as does how rigidly they follow the ITIL framework.
Despite being a central, daily task for service desk teams, incident management processes are often fraught with inefficiencies. Over time, these inefficiencies can compound into a significant drag on performance and satisfaction, both for customers or end-users and for support staff.
We sat down with two of INOC’s resident incident management experts to understand how IT support teams can take ITIL incident management to the next level through each step within ITIL’s incident management process.
Eric Idler, INOC’s Director of Shared NOC
Eric has worked at INOC for six years, starting with the Service Desk team before progressing into numerous leadership roles, including Manager of Advanced Incident Management and Senior NOC Manager. With hands-on experience in Operations and Service Desk Management, he currently manages the Shared NOC. He ensures that incidents are handled efficiently and effectively through their life cycle to support INOC clients.
Peter Prosen, INOC’s VP of NOC Operations
Pete has 30 years of experience in engineering and sales, including systems engineering, software development, OEM and channel sales, product management, process design and improvement, and system quality assurance. He has worked for a range of companies in the cable, telephony, cellular data, and wireless markets, including E-Band Communications, Motorola, FlexLight Networks, and ADC Telecommunications. The high-level wireless expertise Pete brought to INOC enabled the development of new and innovative solutions for clients. As manager of both INOC’s NOC and field services teams, he ensures quality and efficiency across the operation.
The key to identifying and prioritizing incidents is discerning critical, actionable alarms from less severe or unactionable ones.
A lack of operational maturity is a common problem support teams encounter here. In support environments like NOCs, teams with immature processes can struggle to identify which incidents are worthy of attention and to prioritize them so that attention goes where it is needed. Teams need to acknowledge that not every incident is equally urgent. And when human attention and bandwidth are limited, it’s important to prioritize incidents accordingly.
Equipment can also get in the way when determining whether an incident is actionable, how severe it is, and how to prioritize it accordingly.
At INOC, we help teams separate noise from meaningful incidents, determine which alarms are actionable, and use a robust database of port utilization to identify and prioritize each alarm.
To us, incident prioritization largely boils down to two elements: impact and urgency.
Together, the urgency and impact will determine an appropriate priority designation. This in turn will help you use your resources efficiently so that your engineers aren’t scrambling to resolve every incident with the same level of fervor.
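As a rough illustration of how impact and urgency can combine into a priority, here is a minimal sketch of a lookup matrix. The three-level scale, the `P1` to `P5` labels, and the specific mapping are assumptions made for this example; each organization defines its own.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical matrix: (impact, urgency) -> priority label, P1 being the most pressing.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "P1",
    (Level.HIGH, Level.MEDIUM): "P2",
    (Level.MEDIUM, Level.HIGH): "P2",
    (Level.HIGH, Level.LOW): "P3",
    (Level.MEDIUM, Level.MEDIUM): "P3",
    (Level.LOW, Level.HIGH): "P3",
    (Level.MEDIUM, Level.LOW): "P4",
    (Level.LOW, Level.MEDIUM): "P4",
    (Level.LOW, Level.LOW): "P5",
}


def priority(impact: Level, urgency: Level) -> str:
    """Look up the priority designation for a given impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]
```

With a table like this, a high-impact, high-urgency outage lands in `P1`, while a low-impact, low-urgency alarm lands in `P5` and can wait its turn.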
📄 Download our free white paper—The Role of AIOps in Enhancing NOC Support—for a detailed look at the ways we’re applying bleeding-edge machine learning and automation to reduce human effort and better identify and prioritize incidents simultaneously. And schedule a free NOC consultation to see how we can bring these capabilities to improve your support workflow.
Many teams narrowly define the incident logging process as capturing information in a ticket. But for most teams—especially larger ones—logging should extend to how incident data is stored for later analysis. Simply put, if a team isn’t storing its incident data, it can’t use it later.
NOCs that can store a significant amount of historical incident data, such as alarms and ticket history, can squeeze insights from this data to conduct problem management: preventing problems from occurring in the first place by identifying incidents’ root causes, ultimately resulting in greater stability and better support.
Unfortunately, many of the tools support teams use to manage incidents automatically clear their logs once they reach a set number of entries or a set file size.
An external data storage solution is often necessary here. Moreover, data often must be consolidated from multiple sources, transferred from devices, and stored in such a way that it won’t be cumbersome to access in the future. This can be a challenge in and of itself if you don’t have a centralized Manager of Managers capable of processing data from multiple sources simultaneously (and actually configured to do so).
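To make the consolidation idea concrete, here is a minimal sketch that maps records from two sources onto one schema and writes them to a durable store. The source names (`monitoring`, `ticketing`), the field names, and the use of SQLite are all assumptions for illustration, not a description of any particular tool.

```python
import sqlite3


def normalize(source: str, raw: dict) -> tuple:
    """Map a source-specific record onto one shared schema."""
    if source == "monitoring":   # hypothetical monitoring-tool export
        return (source, raw["device"], raw["severity"], raw["raised_at"], raw["message"])
    if source == "ticketing":    # hypothetical ticketing-system export
        return (source, raw["ci_name"], raw["priority"], raw["opened"], raw["summary"])
    raise ValueError(f"unknown source: {source}")


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the archive; pass a file path to persist beyond the process."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS incident_events (
               source TEXT, device TEXT, severity TEXT,
               occurred_at TEXT, detail TEXT)"""
    )
    return conn


def archive(conn: sqlite3.Connection, source: str, records: list) -> int:
    """Normalize and store a batch of records; returns how many were written."""
    rows = [normalize(source, r) for r in records]
    conn.executemany("INSERT INTO incident_events VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Because every record lands in the same table regardless of origin, later analysis does not depend on whatever retention limits the originating tool applies.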
If investing in a solution like that is out of reach at the moment, partnering with a third-party NOC support provider like INOC may be beneficial.
At INOC, we have a data storage vault and mature ITIL incident management system, including a robust, “single pane of glass” solution like the one described above. We also help teams extract the valuable insights stored within their historical incident data to drive continual improvement. Schedule a free NOC consultation to start the conversation with our Solutions Engineers.
Finding a happy medium between too much and not enough detail when categorizing incidents for reporting purposes can be difficult.
It can also be difficult to appropriately label incidents by category since you often run the risk of getting too granular and creating incident categories that are so specific that they only occur, say, once every six months—and are, therefore, not very helpful for trending analyses.
Keep in mind that in addition to helping support staff sort and prioritize incidents when they occur, the other important purpose of categorization (the one teams often struggle the most to capitalize on) is to track incidents over time and see patterns that you can act on via problem management.
Incident tracking is one of the biggest components we see missing in support operations. Without a thoughtful approach to categorizing your incidents, it’s difficult or even impossible to know how often particular incidents arise so you can see the trends that require some work—whether via problem management, training, or another lever you can pull.
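As a small illustration of trending by category, the sketch below counts incidents per month and category and flags categories that occur too rarely to trend. The `(date, category)` tuple format and the `min_count` threshold are assumptions for this example.

```python
from collections import Counter


def category_trend(incidents):
    """Count incidents per (month, category).

    Each incident is an (iso_date, category) pair, e.g. ("2024-01-03", "fiber cut").
    """
    counts = Counter()
    for opened, category in incidents:
        counts[(opened[:7], category)] += 1   # "YYYY-MM" prefix is the month
    return counts


def rare_categories(incidents, min_count=2):
    """Return categories seen fewer than min_count times: candidates for merging."""
    totals = Counter(category for _, category in incidents)
    return sorted(c for c, n in totals.items() if n < min_count)
```

Categories that keep showing up in the rare list are a hint that the taxonomy has become too granular to support trend analysis.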
Take it from us—it’s much easier to get buy-in from leaders on new tools, better training, or other resources when the data makes the business case for you.
ITIL breaks incident response into a few steps: initial diagnosis, escalation if necessary, investigation and diagnosis, resolution and recovery, and closure.
Here are some quick tips for these areas:
Effective incident diagnosis requires visibility into a variety of metrics and past incident data to determine the cause and quickly restore service. It’s essentially a triage function.
Service Desk staff have to rely on the information immediately in front of them to come up with an informed hypothesis about what’s likely happening. Based on this, they make decisions: start working to resolve it themselves or follow the right procedure to gather information, notify the appropriate parties, and take whatever action is necessary to get it resolved.
If your frontline (or Tier 1) support team is properly trained and operationalized, it should be able to perform about 65% to 75% (or more) of all support activities on its own (thereby increasing productivity and enabling advanced staff to focus on strategic business initiatives).
But even the best frontline teams encounter incidents that need to be escalated. For incidents that require escalation, the goal is to collect and log as much information as is needed for Tier 2 or 3 engineers to quickly understand, diagnose, and resolve them.
For escalated cases, IT specialist engineers should be able to respond to and resolve issues in a systematic and timely manner. Every minute saved compounds into significant cost savings from improved staff efficiency, reduced mean time to resolution, and a much better end-user experience.
The biggest challenge we see here is that frontline staff aren’t trained or operationalized as well as they ought to be. As a result, they escalate way more incidents than they ought to, burdening advanced engineers with what’s often simple break-fix work.
A similar challenge is that there simply is no frontline. Support isn’t organized, so incidents fall into everyone’s lap. This can exact a massive toll on productivity and job satisfaction, not to mention the business itself.
Our other white paper, Empowering the IT Support Manager, clearly explains the purpose and value of having a Tier 1 NOC to handle incidents.
Here’s the key takeaway:
“By utilizing a skilled internal or outsourced 24x7 Tier 1 NOC service that consistently monitors, records, and manages events and incidents, IT Support Managers can ensure that 60% or more of their support issues are resolved at the front line. For escalated cases, the IT specialist engineers of an organization are able to respond to and resolve issues in a systematic and timely manner. This results in significant cost savings from improved staff efficiency, reduced mean time to resolution, and a much better end-user experience.”
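One way to see where a team stands against front-line resolution targets like the ones cited above is to measure the share of closed tickets that never left Tier 1. This is a minimal sketch; the ticket fields `status` and `escalated` are hypothetical names, not those of any specific ticketing system.

```python
def first_line_resolution_rate(tickets):
    """Share of closed tickets resolved without escalation, from 0.0 to 1.0."""
    closed = [t for t in tickets if t["status"] == "closed"]
    if not closed:
        return 0.0   # nothing closed yet, so there is no rate to report
    resolved_at_tier1 = sum(1 for t in closed if not t["escalated"])
    return resolved_at_tier1 / len(closed)
```

Tracking this number over time makes it easier to tell whether training or runbook changes are actually reducing unnecessary escalations.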
Talk to us about outsourcing this vital frontline of support so your team can get back to revenue-generating projects.
Having such a structure for properly managing your workflow can prevent your NOC from being overwhelmed by the “wall of red” NOC teams strive to avoid at all costs. In most NOCs, issues should be prioritized and organized into a set of queues so each of them can be handled by the appropriate group.
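A simple sketch of routing issues into queues might look like the following. The queue names, the ticket fields, and the routing rules are invented for illustration and are not the example queues from the white paper.

```python
from collections import defaultdict

# Hypothetical routing rules, checked in order: (predicate, queue name).
ROUTES = [
    (lambda t: t["priority"] == "P1", "critical"),
    (lambda t: t["category"] == "optical", "transport-engineering"),
    (lambda t: t["category"] == "access", "tier1-service-desk"),
]
DEFAULT_QUEUE = "tier1-service-desk"


def route(ticket):
    """Return the first queue whose rule matches, else the default queue."""
    for matches, queue in ROUTES:
        if matches(ticket):
            return queue
    return DEFAULT_QUEUE


def build_queues(tickets):
    """Group ticket IDs by queue, most pressing priority first within each queue."""
    queues = defaultdict(list)
    # Labels "P1".."P5" sort correctly as plain strings.
    for ticket in sorted(tickets, key=lambda t: t["priority"]):
        queues[route(ticket)].append(ticket["id"])
    return dict(queues)
```

Keeping the rules in an ordered list means the most important rule (here, anything `P1`) wins when a ticket could match more than one queue.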
Download our white paper—Top 11 Challenges to Running a Successful NOC—for a set of example workflow queues you can use to break up issues and assign them to groups based on skill set.
ITIL formally separates this as a discrete step, but in real-world operations, it’s happening across the entire incident lifecycle.
By collecting information, your frontline support staff are already investigating to a certain extent and may even successfully diagnose and resolve the incident right then and there. If not, staff will be investigating and diagnosing the issue as it’s escalated and worked until it’s resolved.
After a diagnosis has been reached (hopefully within your SLAs), you will proceed to resolve the issue.
The term "recovery" in this case simply refers to how long it takes to fully recover operations after the proper fix has been identified. (Some fixes may require testing and deployment before operations are fully restored.)
At the end of its lifecycle, an incident is passed back to the service desk (if it was escalated) to be closed. To maintain quality, incidents should only be closed by service desk employees. To close the incident, the incident owner should verify with the person who reported the incident that the resolution is satisfactory.
One of the challenges with closure is ensuring that a given issue is completely resolved. A prime example here is a fiber cut on an optical network. Just because a fiber cut has been repaired and the fiber is back online doesn't necessarily mean the issue is resolved. You could have created other problems during the fix that need to be addressed.
“One of the things that we see a lot here at INOC is something like a fiber cut getting fixed, and everything going back online with alarms cleared. But before we can resolve it, we have to do a health check on that circuit or that particular fiber. And what we find a lot of times is that the splice, even though it was performed in the circuit as a backup line, what we're seeing now is some degradation across the line. That degradation could be enough that even though it's working, it's potentially going to cause future problems. So we have to be cognizant of the second-order effects of a given fix, and quick enough to be able to say, while crews are still there in the field, ‘this looks like it’s back up, but that splice isn't up to standard. Let’s re-splice that to avoid loss across that span.’” - Peter Prosen, INOC’s VP of NOC Operations
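The closure checks described above can be sketched as a simple gate that lists what still blocks closure. The field names and the 0.5 dB tolerance are assumptions for illustration only; real acceptance limits come from the operator's or carrier's own standards.

```python
from dataclasses import dataclass

# Assumed tolerance for illustration; not a recommended engineering value.
MAX_SPAN_LOSS_INCREASE_DB = 0.5


@dataclass
class Incident:
    alarms_cleared: bool
    baseline_loss_db: float    # span loss recorded before the fault
    post_fix_loss_db: float    # span loss measured after the repair
    reporter_confirmed: bool   # reporter agreed the resolution is satisfactory


def closure_blockers(incident: Incident) -> list:
    """Return the reasons an incident should stay open; empty means safe to close."""
    blockers = []
    if not incident.alarms_cleared:
        blockers.append("alarms still active")
    added_loss = incident.post_fix_loss_db - incident.baseline_loss_db
    if added_loss > MAX_SPAN_LOSS_INCREASE_DB:
        blockers.append("span loss degraded after repair")
    if not incident.reporter_confirmed:
        blockers.append("reporter has not confirmed resolution")
    return blockers
```

In the fiber cut scenario, cleared alarms alone would not pass this gate if the post-repair loss had risen beyond the tolerance, which is the cue to ask for a re-splice while crews are still on site.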
Determining how much to communicate with the user community throughout the life of an incident is something of an art, but it’s a critical component of incident management.
If you communicate too much, you risk taxing a carrier, overwhelming a client or end-user, or burying important messages in the noise of endless notifications. But if you don’t provide updates frequently enough, you could stress those stakeholders and force them to operate uninformed.
Another factor in client communication is the greater context surrounding the issue, such as who and what parties are involved in the resolution process. If the client lacks the necessary information to understand why you are making the decisions you are making regarding urgency and impact, it would be in your best interest to fill them in.
Going back to our fiber cut example, if your client is experiencing issues due to a cut fiber, and you know that the fiber carrier is fixing it and cannot fix it any faster, you may only check in with the carrier every hour to get a status report.
At the same time, you may not have a higher-level team working on it, because it’s not necessary. In these situations, the client may need to be informed why you aren’t escalating the issue and why it’s pointless to update them more often than every hour, since you understand how the third-party provider works.
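One way to keep update frequency deliberate rather than ad hoc is to derive the next update time from a documented cadence. The sketch below is illustrative only: the per-priority intervals and the eight-hour fallback are assumptions, and the hourly third-party interval simply mirrors the fiber cut example.

```python
from datetime import datetime, timedelta

# Hypothetical cadences; agree on real values with each client.
UPDATE_INTERVALS = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
}
DEFAULT_INTERVAL = timedelta(hours=8)
THIRD_PARTY_INTERVAL = timedelta(hours=1)   # e.g. waiting on a fiber carrier


def next_update_due(last_update: datetime, priority: str,
                    waiting_on_third_party: bool) -> datetime:
    """Return when the client should next hear from the service desk."""
    if waiting_on_third_party:
        # Updating faster than the carrier reports adds noise, not information.
        interval = THIRD_PARTY_INTERVAL
    else:
        interval = UPDATE_INTERVALS.get(priority, DEFAULT_INTERVAL)
    return last_update + interval
```

Writing the cadence down also gives you something concrete to share with the client when explaining why updates arrive at the pace they do.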
For IT leaders wondering if, and to what degree, their incident management workflow is ripe for enhancement or would be better served by a third-party NOC support provider, the following questions may be helpful to consider:
20+ years of experience in NOC services have helped us deliver outsourced incident management and expert consulting that helps teams hit their SLAs by bringing next-level efficiencies into their workflows.
We realize incident management relies heavily on processes documented in the runbook and delivered through tools like ticketing systems. These processes are essential to unlocking the value of a tiered NOC organization and its resources. Effectively receiving, diagnosing, and then responding to an incident requires a thoughtfully operationalized support structure and visibility into a variety of metrics and past incident data to determine true root causes, restore service rapidly, and prevent recurrence.
Here at INOC, we apply the very latest in machine learning and automation capabilities (AIOps) to radically improve the speed and quality of service to hit service-level targets while reducing human effort.
Get an introduction to our Ops 3.0 platform for NOC service delivery from our own VP of Technology, Jim Martin:
Teams that entrust their incident management to us free themselves from the bulk of their break-fix work while achieving things like:
Want to learn more about effective incident management? Contact us to see how we can help you improve your IT service strategy and NOC support or download our free white paper below.
*Originally developed by the UK government’s Office of Government Commerce (OGC), which has since been absorbed into the Cabinet Office, and currently managed and developed by AXELOS, ITIL is a framework of best practices for delivering efficient and effective support services.