Wisdom: March 2005 Archives

One of the concerns folks have about automated network management systems is that they'll become "automated network destruction systems" if things go wrong; in particular, it's a challenge to figure out what to do when the automation system discovers that the way something is currently configured doesn't match the way the system thinks it ought to be configured.

In a comment on another thread (Reluctance to trust automated network management tools), Kirby Files shares an interesting approach to fixing discrepancies found by automated systems (emphasis mine, and edited slightly to highlight Kirby's two key principles):

I agree that it's a bad thing(tm) to have automated tools "fixing" problems. In our home-grown configuration automation system, we take a different approach for service activation changes vs. auditing errors.

User-requested service activation add/modify/delete actions will identify the set of affected equipment from our service management database, dynamically create the configuration by combining templates with user- and datamodel-derived values, then deploy the changes on each piece of equipment, rolling back if one has an error.

By contrast, our nightly network auditing processes generate a list of reports of inconsistencies between the service management / network inventory database and network device configs. These reports do not in and of themselves cause changes to the network; an Ops user goes through them and decides whether to fix the database or update the network.

This follows from two personal principles of configuration management:


  • The database is always right
  • Don't fix what you don't understand

Under this process model, manual entry for service activation is avoided, but there's no automated "fixing" of unexpected configurations that might break the network.

--kirby
NMS Software Lead
Masergy Communications

I think that these are very powerful principles, good advice, and a good way to approach real-world deployments of automated systems. Thanks, Kirby!
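To make the split concrete, here is a minimal sketch of the two paths Kirby describes: an activation step that deploys a templated change and rolls back on error, and an audit step that only reports. It is not Kirby's system. The `Device` class, the `activate_service` and `audit` functions, the config lines, and the choice to undo the change on every device already updated when one fails are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class Device:
    """Hypothetical stand-in for a managed network device."""
    name: str
    config: list = field(default_factory=list)
    fail_on_apply: bool = False  # lets the example simulate a failed push

    def apply(self, lines):
        if self.fail_on_apply:
            raise RuntimeError(f"{self.name}: push failed")
        self.config.extend(lines)

    def remove(self, lines):
        for line in lines:
            if line in self.config:
                self.config.remove(line)


def activate_service(devices, template, values):
    """Render a config line from a template and deploy it to each device.

    Assumption: if any device reports an error, the change is undone on
    the devices that already accepted it, and the error is re-raised.
    """
    lines = [Template(template).substitute(values)]
    done = []
    try:
        for dev in devices:
            dev.apply(lines)
            done.append(dev)
    except Exception:
        for dev in reversed(done):
            dev.remove(lines)
        raise
    return lines


def audit(database, devices):
    """Compare the database's intended config with each device's config.

    Returns a list of report strings and changes nothing on the devices;
    deciding what to fix is left to a person.
    """
    reports = []
    for dev in devices:
        intended = set(database.get(dev.name, []))
        actual = set(dev.config)
        for line in sorted(intended - actual):
            reports.append(f"{dev.name}: missing from device: {line}")
        for line in sorted(actual - intended):
            reports.append(f"{dev.name}: not in database: {line}")
    return reports
```

The point of the design is that only `activate_service` ever writes to a device, and only in response to a user request. `audit` has no write path at all, so an unexpected configuration can produce a report line but never an automatic "fix".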

In a comment on another thread (Reluctance to trust automated network management tools), Landon Noll makes some very astute observations about how management can inadvertently strengthen and perpetuate a culture of manual (as opposed to automated) network administration by rewarding "network heroes" (emphasis mine):

Reluctance to trust automated network management tools can also be rooted in the way management encourages heroism.

I have seen clients where their network was maintained on a completely ad hoc / by hand basis. Audits revealed many mistakes and inconsistencies in their network setup. The network admins said they were "too busy" keeping their network running to automate. When a problem arose, the network admins performed heroic duty to bring the network back from disaster. Management was too grateful for service restoration to ask about the root cause. Management would praise the "skill and dedication" of their network staff instead of being critical of the way their network was managed.

...

... There is a strong desire on behalf of these so-called "network admin heroes" to have a direct personal control over the company's network assets. They feel they need this direct control so that when they are called on, they can perform a heroic rescue and reap their reward.

Network heroes fear that network automation will reduce their level of control. They fear that when an automated network breaks, they won't be able to fulfill the role of network hero. This ad hoc non-automated condition is likely to remain unless some external pressure (e.g., merger/acquisition, major security breach, regulatory compliance) forces things to change.

Excellent observation. I've seen this, and even unwittingly indulged in it myself, both as a "hero" (saving the day, and reaping the rewards) and as a manager (rewarding folks for being heroes rather than asking the hard questions about why the situation reached the point where heroics were necessary).

To counter this, obviously, management needs to ask those hard questions, and figure out a way to reward folks for preventing problems (by automation, for example) as well as "heroically" responding to them. We've got to ask questions like:

  • Why were heroic measures necessary in this circumstance?
  • What could we have done to prevent this situation, so that such heroics wouldn't have been necessary?
  • Are the folks who do good, solid work on preventing problems getting properly recognized for their work? Or are we inadvertently creating an incentive to let problems fester until heroic measures are required (and rewarded)?


This page is an archive of entries in the Wisdom category from March 2005.

Creative Commons License
This weblog is licensed under a Creative Commons License.