The database is always right; don't fix what you don't understand

One of the concerns folks have about automated network management systems is that they'll become "automated network destruction systems" if things go wrong; in particular, it's a challenge to figure out what to do when the automation system discovers that the way something is currently configured doesn't match the way the system thinks it ought to be configured.

In a comment on another thread (Reluctance to trust automated network management tools), Kirby Files shares an interesting approach to fixing discrepancies found by automated systems (emphasis mine, and edited slightly to highlight Kirby's two key principles):

I agree that it's a bad thing(tm) to have automated tools "fixing" problems. In our home-grown configuration automation system, we take a different approach for service activation changes vs. auditing errors.

User-requested service activation add/modify/delete actions will identify the set of affected equipment from our service management database, dynamically create the configuration by combining templates with user- and datamodel-derived values, then deploy the changes on each piece of equipment, rolling back if one has an error.

By contrast, our nightly network auditing processes generate a list of reports of inconsistencies between the service management / network inventory database and network device configs. These reports do not in and of themselves cause changes to the network; an Ops user goes through them and decides whether to fix the database or update the network.

This follows from two personal principles of configuration management:


  • The database is always right
  • Don't fix what you don't understand

Under this process model, manual entry for service activation is avoided, but there's no automated "fixing" of unexpected configurations that might break the network.

--kirby
NMS Software Lead
Masergy Communications

I think that these are very powerful principles, good advice, and a good way to approach real-world deployments of automated systems. Thanks, Kirby!
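To make the service-activation half of Kirby's two-track approach concrete, here's a rough Python sketch of that kind of workflow: derive the affected devices from the database, render the change from a template, deploy, and roll everything back if any device fails. The database and device methods (affected_equipment, push_config, rollback) and the template are hypothetical placeholders I made up for illustration, not Masergy's actual system.

    # Hypothetical sketch of template-driven service activation with rollback.
    # The db/device methods and the template are illustrative, not a real API.
    from string import Template

    VLAN_TEMPLATE = Template(
        "interface $port\n"
        " switchport access vlan $vlan_id\n"
        " description $customer\n"
    )

    def activate_service(db, request):
        """Apply one user-requested change to every affected device,
        rolling everything back if any single device fails."""
        devices = db.affected_equipment(request)     # targets come from the database
        values = {**db.datamodel_values(request), **request["user_values"]}
        config = VLAN_TEMPLATE.substitute(values)

        applied = []
        try:
            for dev in devices:
                dev.push_config(config)              # deploy to each device
                applied.append(dev)
        except Exception:
            for dev in reversed(applied):            # undo in reverse order
                dev.rollback()
            raise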
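The auditing half might look something like the following sketch. The intended_config and running_config helpers are again assumptions of mine; the point is that the output is a report for a human, never a change pushed to the network.

    # Hypothetical sketch of the nightly audit: compare what the database says
    # each device should look like with what is actually configured, and only
    # report -- an Ops user decides whether to fix the database or the network.
    import difflib

    def audit(db, devices):
        reports = []
        for dev in devices:
            intended = db.intended_config(dev.name)   # derived from the database
            running = dev.running_config()            # fetched from the device
            diff = list(difflib.unified_diff(
                intended.splitlines(), running.splitlines(),
                fromfile=dev.name + " (database)",
                tofile=dev.name + " (network)",
                lineterm=""))
            if diff:
                reports.append("\n".join(diff))
        return reports                                # handed to Ops, never auto-applied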

3 Comments

The places ripe for automation are the validation of changes before the change is applied, the application of changes to the network, and the process of taking a component in and out of service.

I will argue that if your system for applying changes to the network is all-encompassing, then the process of auditing the configuration is not necessary. If configuration changes can ONLY be made through the database, and the application of database changes is automated and guaranteed, then there is no way the network can ever be out of sync with the database. The problem occurs when there are multiple ways to modify the configuration of the network.

If the database supports versioning, then rollback is the equivalent of upgrade.
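(As a rough sketch of that equivalence, assuming a database that stores every configuration version: rolling back is just deploying an older version through the same path used for an upgrade. The version-store calls below are placeholders, not a real system.)

    # Hypothetical sketch: with a versioned database, rollback is just
    # re-deploying an earlier stored version through the normal deploy path.
    def deploy_version(db, device, version):
        config = db.config_for(device, version)   # any stored version, old or new
        device.push_config(config)

    def rollback(db, device):
        previous = db.current_version(device) - 1
        deploy_version(db, device, previous)      # same code path as an upgrade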

The most important automation, however, is the process for taking a component out of service. If taking a malfunctioning component out of service is complex or difficult, then the network is fundamentally not manageable.

If a component can be taken out of service, then with enough redundancy, you can afford to wait for a human to diagnose, fix, and place the component back in service.
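(A minimal sketch of that idea, assuming the platform offers some way to drain traffic away from a device; the drain and ticketing calls here are placeholders, not any particular vendor's API.)

    # Hypothetical sketch: automate only the removal from service, then leave
    # diagnosis and repair to a human while redundancy carries the load.
    def take_out_of_service(device, alerting):
        device.drain()                    # shift traffic onto redundant paths
        device.mark_out_of_service()      # keep automation from touching it
        alerting.open_ticket(device.name +
            " drained and out of service; human diagnosis needed before return")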

Brad Porter wrote on Sun 20 Mar 2005:

I will argue that if your system for applying changes to the network is all-encompassing, then the process of auditing the configuration is not necessary.

Yeah, have you ever tried to convince NetOps that they don't need logins, or "conf t" access to the network? I agree that the only official process for change should be via the automated systems, but practically, auditing is still necessary.

The most important automation, however, is the process for taking a component out of service. If taking a malfunctioning component out of service is complex or difficult, then the network is fundamentally not manageable.

I agree that this is a really important function. It's a difficult one to achieve in many network topologies, though, particularly on PE elements that maintain most of the customer config.

This is the area we are focusing most of our efforts on now, as very few networks can gracefully survive PE failures at the moment, and technologies like VRRP are just not meant to solve this in a provider network.

--kirby

Oh yes, I did go through the process of convincing NetOps that they didn't need logins or most of the management tools they were used to. Let's just say it was incredibly painful, but in the end exactly the right thing to do.

My argument is that too much emphasis is placed on the auditing and not enough on eliminating the root cause. If you eliminate multiple change points, then the auditing really isn't necessary.

On the second point, a network topology that doesn't allow you to take a component out of service easily is a fundamentally flawed network topology.

Again, I still see way too much emphasis placed on special processes, steps, and work-around tools and not enough on eliminating the root cause.

Give me a network with centralized change management and the ability to take any component out of service easily at any time and I'll happily run that network. Otherwise, good luck to all those who thrive on the firefighting!

This page contains a single entry by Brent Chapman published on March 11, 2005 1:49 PM.