Effective incident management is the foundation of a successful Network Operations Center (NOC) and helps ensure that critical infrastructure issues are handled in a timely manner. Establishing an incident management framework at your organization will help your operations run smoothly, transparently, and efficiently.
Incident management relies heavily on workflows documented in the NOC runbook and delivered through tools such as ticketing systems. These workflows are essential to unlocking the value of a tiered NOC organization and its resources.
Definition and Purpose of Incident Management
The Information Technology Infrastructure Library (ITIL)* service framework defines an incident as:
- An unplanned interruption to a service,
- A reduction in the quality of a service, or
- A failure, such as the failure of a configuration item, that has not yet impacted the service to the customer or user.
ITIL states that incident management aims “to minimize the negative impact of incidents by restoring normal service operations as quickly as possible.” Effective incident management enables your NOC staff to fix what is broken as quickly as possible.
Benefits of Incident Management
The benefits of incident management include:
- Increased transparency and efficient communications for stakeholders regarding incident status and timelines;
- Documented records of past incidents;
- Ability to track, analyze, and report trends in incident data;
- Ability to document solutions for repeat incidents;
- Lower risk of serious outages;
- Quicker resolution times; and
- Increased customer satisfaction.
Incident Management Lifecycle
ITIL’s well-established best practices can help you set up a solid incident management framework for your organization. The incident management lifecycle provides steps to process incidents from beginning to end:
1. Detect and record the incident
Someone, or something, must identify that an incident is happening and log it so it can be tracked. Make sure you have the appropriate tools to report and document incidents. Incidents may be identified by technical staff, detected and reported automatically by monitoring tools, or communicated by end users. Organizations should offer multiple ways for end users to report incidents, including email, phone, and a self-service portal.
2. Categorize and prioritize the incident
After an incident is logged, it needs to be categorized and prioritized to determine how it should be handled and who should perform the next steps. Categorization and prioritization allow NOC support staff to make more informed decisions and quickly understand whether an incident can be easily resolved or requires escalation. Categories and priorities also reduce redundancy and speed up time to resolution.
Every incident should be assigned a logical category and, if necessary, a subcategory based on the types of incidents your organization is likely to encounter. Common examples of incident categories are network, cloud or virtual infrastructure, database, and application. Potential network subcategories include optical layer, switching, routing, and circuit.
Categorization will help you analyze incident data effectively and look for trends and patterns, which is a key part of effective problem management to prevent future incidents. It also helps you build your knowledge base and look for opportunities to automate processes, such as log data collection.
In addition to categories, incidents should be assigned priorities, such as P1, P2, P3, and P4, or High, Medium, and Low, based on the business impact and urgency of the incident. Prioritization helps determine the order in which incidents are sorted and worked on by technical staff.
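As an illustration, the impact-and-urgency approach to prioritization can be sketched as a small lookup. The three-level scale, the mapping to P1 through P4, and the names `Level`, `priority_for`, and `sort_queue` are assumptions made for this example; they are not defined by ITIL or by any particular ticketing system, and your NOC's policy may weigh impact and urgency differently.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority_for(impact: Level, urgency: Level) -> str:
    """Map business impact and urgency to a priority label (assumed mapping)."""
    score = impact + urgency  # ranges from 2 (low/low) to 6 (high/high)
    if score >= 6:
        return "P1"
    if score == 5:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"


def sort_queue(incidents: list[dict]) -> list[dict]:
    """Order incidents so the highest priority is worked first."""
    return sorted(
        incidents,
        key=lambda inc: priority_for(inc["impact"], inc["urgency"]),
    )
```

Under these assumptions, an incident with high impact and high urgency is labeled P1 and sorts ahead of a low-impact, low-urgency incident labeled P4.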
3. Investigate the incident
Once an incident is categorized and prioritized, engineers can investigate the incident to find a resolution. This step can involve time-consuming research that drains your NOC’s resources. A key piece of this step is having well-trained staff who can investigate incidents efficiently and find the quickest path to resolution, along with a strong knowledge base that staff can reference for guidance. (See the “Best Practices for NOC Incident Management” section below for more on building a knowledge base.)
In most cases, the first-level team should be able to resolve incidents. Incidents that cannot be resolved in this initial investigation need to be escalated. See the “Best Practices for NOC Incident Management” section below for more on how to minimize escalations.
4. Escalate the incident (if necessary)
Incidents that require escalation are assigned to the appropriate specialized technical groups, who will use their expertise or additional resources to determine how to resolve each incident.
5. Resolve the incident
The appropriate technical staff working on the incident should focus on resolving it or finding a workaround to restore the impacted service as quickly as possible. The technical staff should then communicate with the end users and/or impacted stakeholders to verify that they are satisfied and that the expected service has resumed.
6. Close the incident
Once the resolution is verified, the incident can be closed and the resolution documented in the knowledge base.
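The six lifecycle steps above can be sketched as a simple state machine that rejects out-of-order transitions. This is a minimal illustration, not a feature of any specific ticketing tool: the state names, the `TRANSITIONS` table, the `Incident` class, and the reopen path from resolved back to investigating are all assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    DETECTED = "detected"
    CATEGORIZED = "categorized"
    INVESTIGATING = "investigating"
    ESCALATED = "escalated"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Allowed transitions, modeled on the six steps above (assumed, not from ITIL).
TRANSITIONS = {
    State.DETECTED: {State.CATEGORIZED},
    State.CATEGORIZED: {State.INVESTIGATING},
    State.INVESTIGATING: {State.ESCALATED, State.RESOLVED},
    State.ESCALATED: {State.RESOLVED},
    # Assumption: reopen the investigation if the user does not confirm the fix.
    State.RESOLVED: {State.CLOSED, State.INVESTIGATING},
    State.CLOSED: set(),
}


class Incident:
    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.state = State.DETECTED
        self.history = [State.DETECTED]  # audit trail of every state visited

    def advance(self, new_state: State) -> None:
        """Move to new_state, or raise ValueError if the step is not allowed."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot move from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)
```

Keeping the allowed transitions in one table makes it straightforward to adapt the flow, for example by adding a separate verification state, without changing the `Incident` class.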
Best Practices for NOC Incident Management
Here are a few best practices to bolster your NOC’s incident lifecycle efficiency and effectiveness:
- Communicate with stakeholders throughout the incident lifecycle: Communicating the status of an incident throughout its lifecycle assures users and stakeholders that the incident is being properly handled. It also manages stakeholder expectations and gives them an opening to follow up with additional questions or comments. A mature NOC standardizes these communications as much as possible through automation and templates.
- Minimize escalations whenever possible: Incidents should be resolved by the lowest-tier team capable of handling them, so that higher-level specialized teams can focus on more complex issues and impacted users receive prompt resolution. A documented process that defines how and when to escalate an incident, and who may do so, ensures that incoming incidents reach the team best equipped to resolve them as soon as possible.
- Build a robust knowledge base: Another key piece of successful incident management within your NOC is having a robust and well-maintained knowledge base for staff to reference for troubleshooting, which aids in limiting the number of escalations. This knowledge base should include supplemental support documentation, such as runbooks and flowcharts. This helps technical staff quickly identify next steps and probable causes, avoiding unnecessary rework and research. It also ensures that first-level technical staff have the proper resources to resolve incidents, rather than needlessly escalating incidents to more specialized staff.
- Integrate your NOC’s tools for maximum efficiency: An effective NOC receives notifications (like alarms) and information from multiple sources and presents them to staff in a single, consolidated view. The NOC also needs to incorporate input from calls, email, text, customer portals, knowledge bases, documentation, and workflow management tools, each potentially with its own platform.
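To make the idea of a single, consolidated view concrete, here is a minimal sketch that merges alarms arriving from several sources and groups those describing the same condition on the same device. The `Alarm` and `ConsolidatedAlarm` records, their fields, and the grouping key are hypothetical; real monitoring and ticketing platforms expose their own schemas and correlation rules.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    source: str       # illustrative labels, e.g. "snmp", "email", "portal"
    device: str
    condition: str
    timestamp: float  # seconds since epoch


@dataclass
class ConsolidatedAlarm:
    device: str
    condition: str
    first_seen: float
    sources: set = field(default_factory=set)
    count: int = 0


def consolidate(alarms):
    """Group alarms that report the same condition on the same device."""
    view = {}
    for alarm in alarms:
        key = (alarm.device, alarm.condition)
        entry = view.get(key)
        if entry is None:
            entry = ConsolidatedAlarm(alarm.device, alarm.condition, alarm.timestamp)
            view[key] = entry
        entry.first_seen = min(entry.first_seen, alarm.timestamp)
        entry.sources.add(alarm.source)
        entry.count += 1
    # Oldest conditions first, so long-running issues stay visible.
    return sorted(view.values(), key=lambda e: e.first_seen)
```

In this sketch, a link failure reported by both a monitoring trap and a user email appears as one row with two sources, rather than as two unrelated items for staff to reconcile by hand.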
This incident management framework and these best practices can help ensure that your NOC resolves incidents quickly while keeping stakeholders informed. Aligning your NOC’s incident management lifecycle with ITIL best practices provides peace of mind and allows you to focus on your business.
*ITIL is a framework of best practices for delivering efficient and effective support services. It was originally developed by the UK government’s Central Computer and Telecommunications Agency (CCTA), which was later merged into the Office of Government Commerce (OGC), itself since absorbed into the Cabinet Office. ITIL is currently managed and developed by AXELOS.