Effective problem management is the IT service strategy that can provide stability and optimize your NOC support. Although problem management is less visible than incident management, it is just as critical to your NOC.
Here, we’ll discuss the purpose of problem management, the activities it involves, and the benefits it provides for your infrastructure availability and performance.
Definition and Purpose of Problem Management
The Information Technology Infrastructure Library (ITIL)* defines a problem as “a cause, or potential cause, of one or more incidents.” (For more on incidents, see our post on Incident Management.) Problem management provides a framework to control problems through a series of specific actions.
The goal of problem management is to find the root cause of a problem, provide a solution, and prevent recurrence. In a highly optimized NOC, the goal extends to preventing problems from occurring in the first place. Problem management staff need to have strong analytical skills and technical expertise.
Problem Management vs. Incident Management
Incident management and problem management are different. The goal of incident management is to restore services as quickly as possible. Problem management aims to determine and address the root cause of an incident or a series of incidents by identifying, tracking, and resolving the underlying problems.
Problem management tends to be less visible than incident management. Whereas users feel the direct impact of incidents, they are unlikely to be aware of problem management work, because the ultimate objective is to stop incidents before they happen. This lack of visibility can cause organizations to spend the bulk of their NOC resources resolving incidents, rather than investing in problem management to prevent incidents from occurring.
Problem Management Lifecycle
You can follow the well-established ITIL lifecycle framework to set up a problem management strategy for your organization. The main activities in the problem management lifecycle are as follows:
1. Detect the problem
Problems can be detected via an incident report, uncovered in an ongoing incident analysis, or detected by an automated tool. A problem is commonly discovered when an incident is resolved and then recurs. If technical staff are unsure of the root cause, they create a problem record. If an incident is clearly associated with a problem that is already recorded (referred to as a “known problem”), the new incident can instead be linked to the existing problem record.
2. Record the problem
If the problem does not already have a record, one needs to be created. The record should contain information such as the date/time the problem was detected, user information, description of the problem, affected services and users, and associated incidents.
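As a rough illustration of what such a record might hold, here is a minimal Python sketch. The class name, field names, and ID formats (`ProblemRecord`, `PRB-0001`, `INC-1001`) are hypothetical and not taken from ITIL or any particular ticketing tool; a real system would define its own schema.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ProblemRecord:
    """Hypothetical problem record; field names are illustrative only."""
    problem_id: str
    detected_at: datetime          # date/time the problem was detected
    reported_by: str               # user information
    description: str
    affected_services: list[str] = field(default_factory=list)
    affected_users: list[str] = field(default_factory=list)
    incident_ids: list[str] = field(default_factory=list)

    def link_incident(self, incident_id: str) -> None:
        """Associate an incident with this problem, ignoring duplicates."""
        if incident_id not in self.incident_ids:
            self.incident_ids.append(incident_id)


record = ProblemRecord(
    problem_id="PRB-0001",
    detected_at=datetime(2024, 1, 15, 9, 30),
    reported_by="noc-tier1",
    description="Intermittent packet loss on core router uplink",
    affected_services=["VPN", "VoIP"],
)
record.link_incident("INC-1001")
record.link_incident("INC-1002")
record.link_incident("INC-1001")  # already linked, so nothing changes
print(record.incident_ids)        # ['INC-1001', 'INC-1002']
```

Keeping the associated incidents on the record itself is one simple way to let a new incident be attached to a known problem instead of opening a duplicate record.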
3. Categorize and prioritize the problem
The problem must be categorized and prioritized so it can be properly triaged. Every problem should be assigned a logical category and, if necessary, a subcategory based on the types of problems your organization is likely to encounter. Common examples of problem categories are network, cloud or virtual infrastructure, database, and application. Potential subcategories of the network category are optical layer, switching, routing, and circuit. Your problem and incident categorization schemes should be the same since problems and incidents are often directly related to one another.
Assigning priority is critical in determining how and when technical staff will handle the problem. Priorities, such as P1, P2, P3, and P4 or High, Medium, and Low, should be based on the business impact and urgency of the problem. Problems that pose the greatest risk to services should be prioritized.
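One common way to derive a priority from business impact and urgency is a small lookup matrix. The sketch below is an assumption for illustration: the level names, the scoring rule, and the resulting P1 to P4 mapping are hypothetical, and your organization's matrix may weigh impact and urgency differently.

```python
# Hypothetical scale: a lower number means a more severe level.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def assign_priority(impact: str, urgency: str) -> str:
    """Map business impact and urgency to a P1-P4 priority label."""
    score = LEVELS[impact.lower()] + LEVELS[urgency.lower()]  # ranges 2..6
    if score == 2:
        return "P1"   # high impact and high urgency
    if score == 3:
        return "P2"   # one high, one medium
    if score == 4:
        return "P3"   # medium/medium or high/low
    return "P4"       # everything else


problems = [
    ("PRB-0003", "low", "medium"),
    ("PRB-0001", "high", "high"),
    ("PRB-0002", "medium", "high"),
]

# Sort the queue so the highest-priority problems come first.
queue = sorted(problems, key=lambda p: assign_priority(p[1], p[2]))
for problem_id, impact, urgency in queue:
    print(problem_id, assign_priority(impact, urgency))
```

Encoding the matrix in one function keeps prioritization consistent across analysts, which matters most when several problems compete for the same engineering time.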
4. Investigate and diagnose the problem
This step is the investigation into the root cause of the problem. Common investigation techniques include:
- Analyzing the problem record, including its history;
- Reviewing the Known Error database to find matching problems and resolutions;
- Re-creating the failure to determine the cause; and
- Analyzing application and network logs.
Once the root cause of the problem is diagnosed, a resolution or workaround can be established.
5. Identify and document workarounds (if necessary)
When a problem cannot be resolved quickly, an attempt should be made to find and document a workaround that reduces or eliminates the impact of future incidents. Workarounds can be discovered at any point in the problem lifecycle and should be documented in problem records.
6. Create a Known Error record
Once the investigation is complete, a Known Error record should be created. This record will enable analysts to identify and provide a quicker resolution or workaround in the event of future incidents or problems.
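To show how a Known Error record can speed up later triage, here is a minimal Python sketch that matches a new incident against stored records by category and symptom keywords. The `KnownError` structure, the sample records, and the keyword-overlap matching are all hypothetical simplifications; production tools typically offer richer search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    """Hypothetical Known Error record; fields are illustrative only."""
    error_id: str
    category: str
    symptoms: frozenset[str]   # lowercase keywords describing the failure
    workaround: str


KNOWN_ERRORS = [
    KnownError("KE-1", "network", frozenset({"packet", "loss", "uplink"}),
               "Fail over to the secondary uplink"),
    KnownError("KE-2", "network", frozenset({"bgp", "flap"}),
               "Dampen the flapping peer"),
    KnownError("KE-3", "database", frozenset({"replication", "lag"}),
               "Route reads to the primary"),
]


def find_known_errors(category, description, known_errors):
    """Return same-category records sharing keywords, best match first."""
    # Naive tokenization: splits on whitespace only, so punctuation
    # attached to a word will prevent a match.
    words = set(description.lower().split())
    matches = []
    for ke in known_errors:
        if ke.category != category:
            continue
        overlap = len(ke.symptoms & words)
        if overlap:
            matches.append((overlap, ke))
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in matches]


hits = find_known_errors(
    "network", "Users report packet loss on uplink to site B", KNOWN_ERRORS
)
for ke in hits:
    print(ke.error_id, "->", ke.workaround)
```

Because the workaround is stored on the record, an analyst who finds a match can apply it right away while the permanent resolution is still pending.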
7. Resolve the problem
Once the solution has been discovered, it can be implemented using your organization’s change management procedure.
8. Close the problem
After the resolution has been tested and confirmed, the problem record and any associated incident records can be updated and closed. The NOC engineer handling the closure should ensure that the details of the problem and its resolution are recorded accurately for future reference.
Reactive vs. Proactive Problem Management
Problem management can be reactive or proactive. The same problem lifecycle applies in both approaches.
Reactive problem management occurs when a problem is solved as a direct result of an incident or a series of incidents. This is the approach people often think of when they hear the term “problem management.” Reactive problem management involves identifying the causes that contributed to an incident or incidents, after the incident management process has been followed to restore the impacted services. Reactive problem management might be initiated because of severe incident(s) or in response to a stakeholder’s request for a root cause analysis.
Proactive problem management means identifying and solving problems before incidents occur. Specific proactive problem management activities include risk assessments, trend analysis, analysis of errors from application logs, and proactive searches of Known Error databases and product field notices to determine if any fixes are necessary. Investing in proactive problem management can give your NOC a leading edge, as many businesses fail to invest adequate resources in this approach.
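Trend analysis can start very simply: count closed incidents by service and category, then flag combinations that recur often enough to suggest an underlying problem. The sketch below assumes a hypothetical incident export as a list of dictionaries and an arbitrary threshold of three; both are illustrative choices, not recommendations.

```python
from collections import Counter

# Hypothetical export of recently closed incidents.
incidents = [
    {"id": "INC-1", "service": "VPN", "category": "network"},
    {"id": "INC-2", "service": "VPN", "category": "network"},
    {"id": "INC-3", "service": "Billing", "category": "database"},
    {"id": "INC-4", "service": "VPN", "category": "network"},
    {"id": "INC-5", "service": "Billing", "category": "application"},
]


def recurring_candidates(incidents, threshold=3):
    """Return (service, category) pairs seen at least `threshold` times."""
    counts = Counter((i["service"], i["category"]) for i in incidents)
    return {key: n for key, n in counts.items() if n >= threshold}


candidates = recurring_candidates(incidents)
for (service, category), count in candidates.items():
    print(f"{service}/{category}: {count} incidents, consider a problem record")
```

A pair that crosses the threshold is only a candidate; an analyst still needs to confirm that the incidents share a cause before opening a problem record.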
Best Practices for NOC Problem Management
Here are a few best practices to make your NOC’s problem management more efficient and effective:
- Leverage workarounds when needed: Workarounds may be necessary when finding a resolution is difficult or time-consuming. In certain cases, a workaround can become permanent if the resolution would not be viable or cost-effective. In that case, the problem should remain in the Known Error database so the documented workaround can be used when related incidents occur. Workarounds can sometimes be made more efficient by finding opportunities to automate them.
- Ensure your NOC structure can efficiently support problem management: Most NOC activities, including problem and incident management, are 24×7 activities that require dedicated resources. Make sure your operational support structure enables managers to assign routine activities to lower-cost first-level teams and frees higher-level technical teams to focus on more advanced issues, like diagnosing and resolving problems.
- Prioritize proactive problem management: Practicing proactive problem management on a regular basis can help reveal new opportunities for improvement. Of course, it can be difficult to get out of a reactive mode, but deliberate investment in proactive problem management has the potential to shift your organization’s focus from putting out fires to optimizing your technologies, processes, and talents.
- Invest in your NOC staff: Make sure you’re providing your NOC staff with sufficient technical training and career progression opportunities. The objective should be to build your staff’s technical knowledge so they have the experience and expertise to dig into complex problems and find solutions. Your NOC training program should include both initial onboarding and ongoing training. Also, identify a clear path for employees to advance from one level to the next or move into other departments within the organization. Current members of your first-level team can become complex problem solvers with the proper investment.
Benefits of Problem Management
This problem management framework can be adapted to meet your organization’s specific NOC support needs and constraints.
Effective problem management, using both the reactive and proactive approaches, can help your organization demonstrate value, seek new efficiencies, and overcome infrastructure issues. Additional benefits of an effective problem management strategy include:
- Decreased number of incidents,
- Streamlined NOC support processes,
- Optimized technologies,
- More productive and knowledgeable staff,
- Higher service quality,
- Increased service availability or uptime,
- Reduced costs, and
- Improved customer satisfaction.
*Originally developed by the UK government’s Office of Government Commerce (OGC), which has since been absorbed into the Cabinet Office, and currently managed and developed by AXELOS, ITIL is a framework of best practices for delivering efficient and effective support services.