Incident management

This research project focuses on the modeling and performance optimization of IT support organizations. Its main objective is to develop software that supports strategic business-driven decisions in the incident management domain. All the work on this project is done in collaboration with the Service Automation and Integration Lab of HP Labs, the Hewlett-Packard research division, in Palo Alto, CA (USA).

Motivation

The IT Infrastructure Library (ITIL) is a comprehensive set of concepts and techniques for managing IT infrastructure, development, and operations. Developed by the UK Office of Government Commerce, ITIL is today the de facto best practice standard for IT service management. Among the processes that ITIL defines, Incident Management is the process for "… restoring normal service operation after a disruption, as quickly as possible and with minimum impact on the business".

IT support organizations consist of a network of support groups, each with its own team of operators. Support groups are organized into support levels, with lower-level groups dealing with generic issues and higher-level groups handling technical and time-consuming tasks. Real-life IT support organizations implement complex organizational, structural, and behavioral processes according to the strategic objectives defined at the business management level.

This project tackles the problem of optimizing the performance of an IT organization, with particular regard to its help desk function and incident management process. The objective of optimizing the incident management process is to resolve customer incident reports as quickly as possible by promptly forwarding them to the support group(s) best equipped (in terms of, e.g., skill set, cost effectiveness, or workforce availability) for their definitive resolution.

In order to tune the performance of the IT support organization, it is necessary to evaluate the possible improvements brought by realignments of the current incident management strategy, or by the adoption of alternative strategies. This is a very challenging task, as it requires considering a large set of possible operations, such as re-staffing (restructuring the support organization by increasing or cutting staffing levels, or transferring operators between support groups, possibly after retraining) and implementing different policies for incident assignment and prioritization. In addition, implementing the actual corrective measures is very expensive and time-consuming, so alternative strategies should be carefully considered before putting them into practice. This calls for what-if scenario analysis, a technique that enables the behavioral analysis of complex real-life systems under alternative working conditions. More specifically, what-if scenario analysis is based on the definition of an accurate model of the system under evaluation and on the use of that model to reenact the system behavior with modified parameters.

The SYMIAN Decision Support Tool

We have designed SYMIAN (SYMulation for Incident ANalysis), a decision support tool that enables what-if scenario analysis for the performance analysis and optimization of the incident management function in IT support organizations. SYMIAN enables its users to build an accurate model of real-life IT support organizations, to evaluate their performance, and to assess likely improvements brought by organizational, structural, and behavioral changes.

SYMIAN models the IT support organization as an open queueing network. This approach is particularly well suited for modeling the incident management process, as it describes the dynamics of IT support organizations in terms of throughput, queue lengths, response times, and utilization, both at the system level and at the single support group level. The model is also able to distinguish between the time spent by operators working on service restoration and the time spent waiting for operator availability, all the way down to the single incident level. The fine-grained model of the IT support organization implemented in SYMIAN allows users to play out what-if scenarios, such as adding technicians to a given support group or merging support groups together.
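To give a feel for the quantities mentioned above, the following sketch computes utilization, waiting time, and queue length for a single support group treated in isolation as an M/M/c queue (Poisson arrivals, exponentially distributed service times, c operators), using the standard Erlang C formula. This is only an illustration under strong simplifying assumptions and is not how SYMIAN works: the tool relies on simulation rather than closed-form results. The function name `mmc_metrics` and all parameter values are hypothetical.

```python
from math import factorial


def mmc_metrics(arrival_rate, service_rate, operators):
    """Steady-state metrics of an M/M/c queue (one support group in isolation).

    arrival_rate: incidents per hour reaching the group (assumed Poisson)
    service_rate: incidents per hour one operator can resolve (assumed exponential)
    operators:    number of operators in the group
    """
    offered_load = arrival_rate / service_rate       # a = lambda / mu
    utilization = offered_load / operators           # rho = a / c
    if utilization >= 1:
        raise ValueError("unstable group: arrivals exceed service capacity")

    # Erlang C: probability that an incoming incident finds all operators busy.
    tail = offered_load ** operators / (factorial(operators) * (1 - utilization))
    head = sum(offered_load ** k / factorial(k) for k in range(operators))
    p_wait = tail / (head + tail)

    mean_wait = p_wait / (operators * service_rate - arrival_rate)
    return {
        "utilization": utilization,
        "p_wait": p_wait,
        "mean_wait": mean_wait,                          # time waiting for an operator
        "mean_response": mean_wait + 1 / service_rate,   # waiting plus working time
        "mean_queue_length": arrival_rate * mean_wait,   # Little's law
    }


if __name__ == "__main__":
    # Hypothetical what-if: add one technician to a group of ten.
    baseline = mmc_metrics(arrival_rate=8.0, service_rate=1.0, operators=10)
    what_if = mmc_metrics(arrival_rate=8.0, service_rate=1.0, operators=11)
    print("baseline mean wait:", baseline["mean_wait"])
    print("what-if mean wait: ", what_if["mean_wait"])
```

Note that the result separates the time spent waiting for an operator (`mean_wait`) from the total response time, mirroring the distinction the model makes between waiting and working time.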

SYMIAN uses a discrete event simulator to reproduce in detail the behavior of IT organizations and to evaluate their performance in managing incidents. In fact, the scale and complexity of real-life organizations tackling incident management make it extremely difficult to devise an analytical model and call for simulation-based approaches.
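As a rough illustration of the discrete event simulation approach, the sketch below simulates incidents flowing through a chain of support levels, with each group modeled as a multi-operator FIFO queue and a fixed probability of escalating an incident to the next level. It is a minimal example written for this page, not SYMIAN's actual implementation: the function names, the dictionary keys, the exponential timing assumptions, and the escalation rule are all hypothetical simplifications.

```python
import heapq
import random
from collections import deque


def simulate(arrival_rate, groups, horizon, seed=0):
    """Simulate incidents through support levels (index 0 is the first level).

    Each entry of `groups` is a dict with the hypothetical keys
    "operators", "service_rate", and "escalate_prob".
    Returns {incident_id: {"wait": time queued, "work": time being worked on}}.
    """
    rng = random.Random(seed)
    events = []                                # heap of (time, seq, kind, incident, level)
    seq = 0                                    # tie-breaker for simultaneous events
    free = [g["operators"] for g in groups]    # idle operators per group
    queues = [deque() for _ in groups]         # FIFO of (incident, enqueue_time)
    stats = {}

    def schedule(time, kind, incident, level):
        nonlocal seq
        heapq.heappush(events, (time, seq, kind, incident, level))
        seq += 1

    def start_work(now, incident, level):
        free[level] -= 1
        duration = rng.expovariate(groups[level]["service_rate"])
        stats[incident]["work"] += duration
        schedule(now + duration, "done", incident, level)

    def enter_group(now, incident, level):
        if free[level] > 0:
            start_work(now, incident, level)
        else:
            queues[level].append((incident, now))

    # Pre-generate Poisson arrivals up to the time horizon.
    t, next_id = rng.expovariate(arrival_rate), 0
    while t < horizon:
        schedule(t, "arrival", next_id, 0)
        t += rng.expovariate(arrival_rate)
        next_id += 1

    # Main loop: always process the earliest pending event.
    while events:
        now, _, kind, incident, level = heapq.heappop(events)
        if kind == "arrival":
            stats[incident] = {"wait": 0.0, "work": 0.0}
            enter_group(now, incident, 0)
            continue
        # An operator finished: free them and serve the next queued incident.
        free[level] += 1
        if queues[level]:
            waiting, since = queues[level].popleft()
            stats[waiting]["wait"] += now - since
            start_work(now, waiting, level)
        # Either escalate the finished incident or consider it resolved.
        is_last = level == len(groups) - 1
        if not is_last and rng.random() < groups[level]["escalate_prob"]:
            enter_group(now, incident, level + 1)
    return stats


def mean(stats, key):
    """Average of one per-incident metric ("wait" or "work")."""
    return sum(s[key] for s in stats.values()) / len(stats)


if __name__ == "__main__":
    # Hypothetical two-level organization; rates are incidents per hour.
    org = [
        {"operators": 3, "service_rate": 2.0, "escalate_prob": 0.3},
        {"operators": 4, "service_rate": 0.5, "escalate_prob": 0.0},
    ]
    result = simulate(arrival_rate=5.0, groups=org, horizon=200.0, seed=42)
    print("incidents:", len(result), "mean wait:", mean(result, "wait"))
```

A what-if scenario then amounts to rerunning `simulate` with modified parameters, for example one more operator in the first-level group, and comparing the resulting statistics. Because the per-incident records keep waiting and working time separate, the comparison can be made at the level of single incidents as well as in aggregate.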

The SYMIAN tool has been applied to assessing and improving the performance of several real-life IT support organizations. The results demonstrated the effectiveness of the SYMIAN-based performance analysis and tuning process.

Publications

[1] M. Tortonesi, C. Stefanelli, C. Bartolini, G. Barash, L. Fradin, "Optimizing the IT incident management process: a simulation-based tool", in Proceedings of 2008 Workshop of HP Software University Association (HP-SUA 2008), 22-25 May 2008, Marrakech, Morocco.

[2] C. Bartolini, C. Stefanelli, M. Tortonesi, "SYMIAN: a Simulation Tool for the Optimization of the IT Incident Management Process", in Proceedings of 19th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management (DSOM 2008), September 2008, Pythagorion, Samos Island, Greece.

[3] C. Bartolini, C. Stefanelli, M. Tortonesi, "Business-impact analysis and simulation of critical incidents in IT service management", in Proceedings of the 11th IFIP/IEEE International Symposium on Integrated Network Management (IM 2009), 1-5 June 2009, New York, NY, USA.

[4] C. Bartolini, C. Stefanelli, M. Tortonesi, "Modeling IT Support Organizations from Transactional Logs", accepted for publication in Proceedings of the 12th IEEE/IFIP Network Operations and Management Symposium (NOMS 2010), 19-23 April 2010, Osaka, Japan.

[5] C. Bartolini, C. Stefanelli, M. Tortonesi, "SYMIAN: Analysis and Performance Improvement of the IT Incident Management Process", accepted for publication in IEEE Transactions on Network and Service Management.