I recently went looking for introductory material on incident and outage management principles and practices. An ASP-like client of mine wanted to educate some of their management team, who are pretty sharp in their own fields but new to Operations and IT management. I'm not talking about security incidents in particular, though those are certainly one type; I'm talking about the more general incidents and outages that a service provider (particularly an ASP) might run into: network outages, hardware failures, system overloads, cooling/power failures, software meltdowns, database debacles, etc.

I found a couple of interesting papers...

First, in the June 2005 issue of the USENIX magazine ;login:, there's an article called "When Disaster Strikes: Cailin and Roland Discuss Crisis Management", by Thomas Sluyter and Roland van Maarschalkerweerd (USENIX membership is required to download the article PDF from the web site). It's a good, high-level, four-page intro to crisis management, and it does a pretty decent job of outlining a workable incident management process.

Second, I found a free white paper from INS called "A Framework for Incident and Problem Management", by Victor Kapella (April 2003). This is longer (20 pages), more formal, more abstract, and more management-oriented, but very useful and very interesting. One thing to keep in mind when reading this paper is that it draws a distinction (as I understand ITIL does too) between "incidents" (single occurrences) and "problems" (ongoing sets of individual issues which have a common cause). The paper does a particularly good job of describing how Operations, Engineering, and other functions within an organization need to work together to resolve Incidents and Problems.
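To make the incident/problem distinction concrete, here's a minimal Python sketch of my own; nothing like it appears in the paper. The class names, the fields, and the rule that two or more incidents sharing a diagnosed cause make a "problem" are all assumptions for illustration, not definitions taken from INS or ITIL.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """A single occurrence, e.g. one outage or one failed job (hypothetical model)."""
    incident_id: str
    summary: str
    root_cause: Optional[str] = None  # stays None until someone diagnoses it


@dataclass
class Problem:
    """An ongoing set of incidents that share a common cause (hypothetical model)."""
    root_cause: str
    incidents: List[Incident] = field(default_factory=list)


def group_into_problems(incidents, min_incidents=2):
    """Group diagnosed incidents by cause; keep causes seen at least min_incidents times.

    The threshold of 2 is an illustrative assumption, not a rule from the paper.
    """
    by_cause = defaultdict(list)
    for incident in incidents:
        if incident.root_cause is not None:
            by_cause[incident.root_cause].append(incident)
    return [
        Problem(cause, group)
        for cause, group in by_cause.items()
        if len(group) >= min_incidents
    ]


# Made-up sample data for illustration only.
sample_incidents = [
    Incident("INC-1", "Database server unresponsive", "failing disk controller"),
    Incident("INC-2", "Nightly backup job failed", "failing disk controller"),
    Incident("INC-3", "Web tier overloaded during a promotion", "undersized web tier"),
    Incident("INC-4", "Brief network blip"),  # not yet diagnosed
]

problems = group_into_problems(sample_incidents)
```

In this toy data, the two disk-controller incidents roll up into one problem, while the one-off overload and the undiagnosed blip stay plain incidents. Operations would typically own the individual incidents, and the problem is the thing that gets handed to Engineering for a permanent fix.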

I recently evaluated a consulting client's IT infrastructure and operational capabilities using COBIT, an IT governance and control framework published by ISACA and now used worldwide. I found COBIT to be very useful for this task at the management/process level, although it doesn't really get into the technical details. Here's how I used it, and how you might find it useful too...

