
LOPSA board member Trey Harris has posted an excellent message outlining his thoughts on effective organization and scheduling for groups of sysadmins in a high-interrupt, high-profile, high-availability environment (Amazon, Google, etc.).

Trey's message was part of a very interesting discussion taking place on the LOPSA "discuss" mailing list regarding "interruptions coverage" for sysadmins. The basic question under discussion is, given that much of system administration work is by nature interrupt-driven, how can an organization best shield some of its sysadmins' time from these interrupts, so that the sysadmins can get long-term work done (and maintain their own sanity!)? To read the whole thread, search for "Interruptions coverage" in the list's archive.

I think that this discussion (and Trey's contribution in particular) is an excellent example of the thoughtful exchanges among experienced professionals that you can expect from LOPSA, which is why I'm encouraging everyone involved with system administration to join and support this important new organization.

USENIX has made available an audio recording (in MP3 format) of the invited talk I gave at the 2005 LISA conference a couple of weeks ago, "Incident Command for IT: What We Can Learn from the Fire Department" (the talk is also available in Adobe Acrobat PDF format). You'll want to skip approximately the first 3 minutes (2 minutes, 56 seconds, to be exact) of the recording, which consist of silence and administrivia announcements from before the start of the presentation; it would have been nice if USENIX had edited that out, but they didn't.

Besides my professional work in the networking field, I do a lot of volunteer emergency services work. For example, I'm one of only about 40 fully qualified air search and rescue Incident Commanders in the California Wing of the Civil Air Patrol, and I help teach community disaster preparedness classes for the Mountain View Fire Department. So, I have a fair understanding of the tools (methods, structures, and principles) that such agencies use to organize themselves to deal with emergencies, and I've long pondered how some of those tools could be applied to emergencies in information technology.

I've been invited to give a 90-minute talk on the topic at the USENIX/SAGE LISA Conference in San Diego on Thursday, 8 December 2005, and I'll be giving a preview of the talk at the BayLISA meeting on Thursday, 20 October 2005:

Incident Command for IT: What We Can Learn from the Fire Department

Have you ever wondered how fire departments organize themselves on the fly to deal with a major incident? How they quickly and effectively coordinate the efforts of multiple agencies? How they evolve the organization as the incident changes in scope, scale, or focus? They accomplish all this by using the Incident Command System (ICS), a standardized organizational structure and set of operating principles adopted by most emergency agencies nationwide. In this talk, Brent will introduce the concepts and principles of ICS, and discuss how these can be applied to IT events, such as security incidents and service outages.

Please join me for one or both of these talks!

Slides from the BayLISA talk on Thursday, 20 October 2005:

I recently went looking for introductory material on incident and outage management principles and practices, to help an ASP-like client of mine educate some members of their management team who are pretty sharp in their own fields but new to Operations and IT management. I'm not talking about security incidents in particular, though those are certainly one type; I'm talking about the more general kinds of incidents and outages that a service provider (particularly an ASP) might run into: network outages, hardware failures, system overloads, cooling/power failures, software meltdowns, database debacles, etc.

I found a couple of interesting papers...

First, in the June 2005 issue of the USENIX magazine ;login:, there's an article called "When Disaster Strikes: Cailin and Roland Discuss Crisis Management", by Thomas Sluyter and Roland van Maarschalkerweerd (USENIX membership is required to download the article PDF from the web site). It's a good, high-level, 4-page intro to crisis management, and does a pretty decent job of providing an overview of a workable incident management process.

Second, I found a free white paper written by INS called "A Framework for Incident and Problem Management", by Victor Kapella, April 2003. This is longer (20 pages), more formal, more abstract, and more management-oriented, but very useful and very interesting. One thing to keep in mind in reading this paper is that they draw a distinction (as does ITIL, so I understand) between "incidents" (single occurrences) and "problems" (ongoing sets of individual issues which have a common cause). This paper does a particularly good job of talking about the ways in which Operations, Engineering, and other functions within an organization need to work together to resolve Incidents and Problems.
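To make the incident/problem distinction concrete, here is a minimal Python sketch. It is only an illustration and is not taken from the INS paper or from ITIL: the `Incident` record, the `group_into_problems` helper, the two-incident threshold, and the sample tickets are all hypothetical. It assumes each incident has already been tagged with a root cause, and treats any cause shared by several incidents as a "problem".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """A single occurrence, such as one outage ticket (hypothetical record shape)."""
    ticket_id: str
    summary: str
    root_cause: str  # assumed to be filled in after investigation


def group_into_problems(incidents, min_incidents=2):
    """Group incidents by shared root cause.

    A cause seen in at least `min_incidents` incidents is treated as a
    problem; the threshold is an arbitrary choice for this illustration.
    """
    by_cause = defaultdict(list)
    for incident in incidents:
        by_cause[incident.root_cause].append(incident)
    return {
        cause: group
        for cause, group in by_cause.items()
        if len(group) >= min_incidents
    }


# Made-up sample tickets, for illustration only.
incidents = [
    Incident("T-1", "Web tier unreachable", "failing core switch"),
    Incident("T-2", "Database replication lag", "failing core switch"),
    Incident("T-3", "Disk full on mail server", "log rotation disabled"),
]

problems = group_into_problems(incidents)
for cause, group in problems.items():
    print(cause, [incident.ticket_id for incident in group])
```

In this sketch, T-1 and T-2 are separate incidents that roll up into one problem (the failing switch), while T-3 remains a lone incident. In practice the hard part is the root-cause analysis itself, which is where the cooperation between Operations and Engineering described in the paper comes in.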

I recently evaluated a consulting client's IT infrastructure and operational capabilities using COBIT, an IT governance and control framework developed by ISACA and now used worldwide. I found COBIT to be very useful for this task at the management/process level, although it doesn't really get into the technical details. Here's how I used it, and how you might find it useful too...