Brent Chapman: March 2005 Archives

Joel Spolsky publishes a great blog on software management, Joel on Software. One of my favorite articles there is The Guerilla Guide to Interviewing, which is full of great advice on what to look for and how to look for it, particularly for technical staff.

His key points are that you're looking for someone with two characteristics (and he offers advice on how to evaluate those characteristics):

  • Smart
  • Gets Things Done

and that the key thing you need to do after an interview is make a decision. The key thing here is being unafraid to say "no hire". If you're not sure, it's "no hire"; if you think "not in my group, but maybe in yours", it's "no hire"; if you think "maybe, but I can't tell", it's "no hire". As Joel points out:

An important thing to remember about interviewing is this: it is much better to reject a good candidate than to accept a bad candidate. A bad candidate will cost a lot of money and effort and waste other people's time fixing all their bugs. If you have any doubts whatsoever, No Hire.

Anyway, it's a great piece on interviewing, and I strongly recommend it.

On 7 Mar 05, Opsware (the company formerly known as Loudcloud, which Marc Andreessen founded after he left Netscape) announced that its Opsware Network Automation System 4.0 would be available beginning 21 Mar 05.

Interesting tidbits from the press release:

  • "The Opsware Network Automation System is based on Rendition Networks' award-winning TrueControl product, which Opsware acquired in February 2005."
  • "Opsware NAS 4.0 moves beyond simple network change and configuration management to offer complete network automation including change automation, compliance management, process automation, security administration and reporting around all these operational activities."
  • "Opsware NAS 4.0 includes key automation capabilities ... such as the ability to automate processes that span different IT groups and systems and an advanced Compliance Center for compliance management. ... The Compliance Center includes automated auditing and reporting for Sarbanes-Oxley, ITIL, HIPAA and COBIT."

OPENXTRA is a UK VAR that offers a variety of network management and server room monitoring tools and information. They publish a set of related newsletters, including one on Network Management, although there don't appear to have been any issues published in the last few months (since December 2004).

I haven't read all the articles on their web site yet, but the ones I've looked at so far all seem fairly introductory and high-level; for example, An Introduction to Network Configuration Management is a good very high-level overview of what network configuration management is and why it's useful, but it's fairly short on details. Regardless, I'm glad to see them making these articles available.

One of the concerns folks have about automated network management systems is that they'll become "automated network destruction systems" if things go wrong; in particular, it's a challenge to figure out what to do when the automation system discovers that the way something is currently configured doesn't match the way the system thinks it ought to be configured.

In a comment on another thread (Reluctance to trust automated network management tools), Kirby Files shares an interesting approach to handling discrepancies found by automated systems (emphasis mine, and edited slightly to highlight Kirby's two key principles):

I agree that it's a bad thing(tm) to have automated tools "fixing" problems. In our home-grown configuration automation system, we take a different approach for service activation changes vs. auditing errors.

User-requested service activation add/modify/delete actions will identify the set of affected equipment from our service management database, dynamically create the configuration by combining templates with user- and datamodel-derived values, then deploy the changes on each piece of equipment, rolling back if one has an error.

By contrast, our nightly network auditing processes generate a list of reports of inconsistencies between the service management / network inventory database and network device configs. These reports do not in and of themselves cause changes to the network; an Ops user goes through them and decides whether to fix the database or update the network.

This follows from two personal principles of configuration management:

  • The database is always right
  • Don't fix what you don't understand

Under this process model, manual entry for service activation is avoided, but there's no automated "fixing" of unexpected configurations that might break the network.

NMS Software Lead
Masergy Communications

I think that these are very powerful principles, good advice, and a good way to approach real-world deployments of automated systems. Thanks, Kirby!
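Kirby's report-don't-fix approach can be sketched in a few lines of Python. This is a hypothetical illustration (the device name and config fragments are invented): the nightly audit compares the config the database says a device should have against what the device actually runs, and emits a diff for a human to review, rather than pushing any changes itself.

```python
import difflib

def audit_device(name, intended_config, running_config):
    """Compare the intended (database-derived) config against the device's
    running config.  Returns a diff report; never touches the network."""
    return list(difflib.unified_diff(
        intended_config.splitlines(),
        running_config.splitlines(),
        fromfile=f"database:{name}",
        tofile=f"device:{name}",
        lineterm="",
    ))  # an empty list means the device matches the database

# An Ops user reviews this output and decides whether to fix the
# database or update the network -- no automated "fixing".
report = audit_device(
    "edge-router-1",
    "hostname edge-router-1\nsnmp-server community public RO\n",
    "hostname edge-router-1\nsnmp-server community secret RO\n",
)
for line in report:
    print(line)
```

Note that "the database is always right" shows up here only as which side of the diff is labeled the reference; the actual decision stays with a person.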

In a comment on another thread (Reluctance to trust automated network management tools), Landon Noll makes some very astute observations about how management can inadvertently strengthen and perpetuate a culture of manual (as opposed to automated) network administration by rewarding "network heroes" (emphasis mine):

Reluctance to trust automated network management tools can also be rooted in the way management encourages heroism.

I have seen clients whose networks were maintained on a completely ad hoc, by-hand basis. Audits revealed many mistakes and inconsistencies in their network setup. The network admins said they were "too busy" keeping their network running to automate. When a problem arose, the network admins performed heroic duty to bring the network back from disaster. Management was too grateful for service restoration to ask about the root cause. Management would praise the "skill and dedication" of their network staff instead of being critical of the way their network was managed.


... There is a strong desire on the part of these so-called "network admin heroes" to have direct personal control over the company's network assets. They feel they need this direct control so that when they are called on, they can perform a heroic rescue and reap their reward.

Network heroes fear that network automation will reduce their level of control. They fear that when an automated network breaks, they won't be able to fulfill the role of network hero. This ad hoc, non-automated condition is likely to remain unless some external pressure (e.g., merger/acquisition, major security breach, regulatory compliance) forces things to change.

Excellent observation. I've seen this myself, and even unwittingly indulged in it, both as a "hero" (saving the day, and reaping the rewards) and as a manager (rewarding folks for being a hero rather than asking the hard questions about why the situation reached the point where heroics were necessary).

To counter this, obviously, management needs to ask those hard questions, and figure out a way to reward folks for preventing problems (by automation, for example) as well as "heroically" responding to them. We've got to ask questions like:

  • Why were heroic measures necessary in this circumstance?
  • What could we have done to prevent this situation, so that such heroics wouldn't have been necessary?
  • Are the folks who do good, solid work on preventing problems getting properly recognized for their work? Or are we inadvertently creating an incentive to let problems fester until heroic measures are required (and rewarded)?

Steve Lodin of Roche Diagnostics North America was kind enough to tell me about a newly published paper in the Feb 2005 issue of the International Journal of Information Security entitled "Rigorous Automated Network Security Management", by Joshua D. Guttman and Amy L. Herzog of The MITRE Corporation.

The paper's abstract:

Achieving a security goal in a networked system requires the cooperation of a variety of devices, each device potentially requiring a different configuration. Many information security problems may be solved with appropriate models of these devices and their interactions, and giving a systematic way to handle the complexity of real situations.

We present an approach, rigorous automated network security management, which front-loads formal modeling and analysis before problem solving, thereby providing easy-to-run tools with rigorously justified results. With this approach, we model the network and a class of practically important security goals. The models derived suggest algorithms which, given system configuration information, determine the security goals satisfied by the system. The modeling provides rigorous justification for the algorithms, which may then be implemented as ordinary computer programs requiring no formal methods training to operate.

We have applied this approach to several problems. In this paper we describe two: distributed packet filtering and the use of IP security (IPsec) gateways. We also describe how to piece together the two separate solutions to these problems, jointly enforcing packet filtering as well as IPsec authentication and confidentiality on a single network.



Steve Traugott at Infrastructures.ORG says:

Most IT organizations still install and maintain computers the same way the automotive industry built cars in the early 1900's: An individual craftsman manually manipulates a machine into being, and manually maintains it afterward. This is expensive. The automotive industry discovered first mass production, then mass customization using standard tooling.

Indeed... Most network devices are still configured by hand and manually maintained, with all of the attendant problems (typos, inconsistency of configuration, difficulty making common changes to many systems in parallel, etc.). I'm very interested in taking the same principles that Steve has been codifying and espousing for systems, and applying them to networks.

For the last several years, Steve has been driving this effort, including creating and hosting the Infrastructures mailing list. Their goal is to develop and discuss the

... standards and practices [that] are the standardized tooling needed for mass customization within IT. This tooling enables:
  • Scalable, flexible, and rapid deployments and changes
  • Cost effective, timely return on IT investment
  • Low labor headcount
  • Secure, trustworthy computing environments
  • Reliable enterprise infrastructures
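One concrete form of that "standard tooling" for networks is generating device configurations from a shared template plus per-device data, rather than hand-editing each box. Here's a minimal sketch of the idea; the hostnames, VLAN numbers, and addresses are all invented for illustration.

```python
# Minimal sketch of "mass customization": one template, many devices.
# All hostnames, VLANs, and addresses below are invented examples.
TEMPLATE = """hostname {hostname}
interface Vlan{mgmt_vlan}
 ip address {mgmt_ip} 255.255.255.0
ntp server {ntp_server}
"""

SITE_DEFAULTS = {"mgmt_vlan": 100, "ntp_server": "10.0.0.1"}

DEVICES = [
    {"hostname": "sw-bldg1-01", "mgmt_ip": "10.0.100.11"},
    {"hostname": "sw-bldg1-02", "mgmt_ip": "10.0.100.12"},
]

def render(device):
    # Per-device values override site-wide defaults.
    params = {**SITE_DEFAULTS, **device}
    return TEMPLATE.format(**params)

configs = {d["hostname"]: render(d) for d in DEVICES}
print(configs["sw-bldg1-01"])
```

A change to the template (say, a new NTP server) then propagates to every device on the next generation pass, instead of requiring a hand edit on each one.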

Uplogix offers a product named the Envoy, which is a device to help automate management of network devices such as routers and switches. You attach Envoy units to the serial consoles of your network devices (each Envoy can manage up to 4 devices), and use in-band or out-of-band access to manage those devices through the Envoy.

From a discussion today with someone who wishes to remain anonymous (emphasis mine):

I think you'll find most of these [network management tools] are sort of a RANCID outgrowth - config monitoring systems + other functions which differ between all the vendors, although there is growth towards an approach of establishing a baseline and then creating and enforcing compliance rules/templates across the network. I think we're a bit cautious of using software written by someone else that writes to a device (all of the [network management tools we were discussing] do, but those functions aren't widely used), opting instead for tell me what's different and I'll change it myself. As more of these tools become well known and stable, and with more people using automated provisioning tools which do network device writes, that attitude will gradually ease off. But I believe many people are a bit scared of auto-enforcing features when it comes to routers/switches/etc., and maybe that explains a bit of what's lacking in comparison to sysadmin tools.

I agree with this assessment, but personally, I'm more worried about somebody fat-fingering a manual configuration. Another concern is that the configurations are just getting too complex to maintain manually, particularly things like packet filtering ACLs, BGP policy statements, and so forth. In a lot of ways, it's like the old arguments about programming in assembly language versus higher-level languages.
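The "higher-level language" analogy suggests compiling tedious constructs like ACLs from a compact policy description rather than typing (and fat-fingering) each entry. A hypothetical sketch, with an invented rule syntax and invented addresses:

```python
# Hypothetical: compile a compact policy into Cisco-style ACL lines.
# The policy tuple format and all addresses are invented for illustration.
POLICY = [
    ("permit", "tcp", "10.1.0.0/16", "any", 80),
    ("permit", "tcp", "10.1.0.0/16", "any", 443),
    ("deny",   "ip",  "any",         "any", None),
]

def cidr_to_wildcard(net):
    """Convert '10.1.0.0/16' to 'address wildcard-mask' form."""
    if net == "any":
        return "any"
    addr, bits = net.split("/")
    mask = (1 << (32 - int(bits))) - 1  # inverse (wildcard) mask
    octets = [(mask >> s) & 0xFF for s in (24, 16, 8, 0)]
    return f"{addr} {'.'.join(str(o) for o in octets)}"

def compile_acl(name, policy):
    lines = [f"ip access-list extended {name}"]
    for action, proto, src, dst, port in policy:
        entry = f" {action} {proto} {cidr_to_wildcard(src)} {cidr_to_wildcard(dst)}"
        if port is not None:
            entry += f" eq {port}"
        lines.append(entry)
    return lines

for line in compile_acl("INBOUND-WEB", POLICY):
    print(line)
```

As with any compiler, the win is that the source stays small and reviewable while the generated output can be as long and repetitive as the device demands.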

Network World Fusion did a review of network configuration tools back in April 2004.

Their choice for the best product evaluated was Rendition's TrueControl.

Elsewhere on their web site, they also have a more up-to-date list (but not review) of configuration management products.

IETF has chartered a Network Configuration Working Group (NETCONF) to "produce a protocol for network configuration". (More details in the full blog entry; click the "Continue reading..." link below.)

Their focus seems to be on defining a protocol to supplant SNMP (a worthy goal, in my opinion; SNMP has proven largely unworkable for network configuration, although it has been useful for network monitoring), but they're intentionally punting on the underlying data model to use to describe how the network ought to be configured (which I think is at least as challenging a problem).
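For a flavor of what the working group is defining, here's a sketch in Python of a NETCONF-style <get-config> request. The XML shape follows the NETCONF drafts; the transport details (SSH session setup and message framing) are omitted, so this only builds the RPC document.

```python
import xml.etree.ElementTree as ET

# NETCONF base namespace, per the working group's drafts.
NS = "urn:ietf:params:xml:ns:netconf:base:1.0"

def build_get_config(message_id, source="running"):
    """Build a NETCONF-style <get-config> RPC as an XML string.
    Transport (SSH framing, capability exchange) is omitted."""
    ET.register_namespace("", NS)
    rpc = ET.Element(f"{{{NS}}}rpc", {"message-id": str(message_id)})
    get_config = ET.SubElement(rpc, f"{{{NS}}}get-config")
    src = ET.SubElement(get_config, f"{{{NS}}}source")
    ET.SubElement(src, f"{{{NS}}}{source}")  # e.g. <running/>
    return ET.tostring(rpc, encoding="unicode")

print(build_get_config(101))
```

The interesting part is what this replaces: instead of screen-scraping a CLI session, a management system gets a structured, parseable request/response exchange.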

I posted a message to the NANOG mailing list earlier this morning, hoping to stimulate discussion:

Date: Fri, 4 Mar 2005 09:15:19 -0800
From: Brent Chapman <Brent@GreatCircle.COM>
Subject: Network automation?

What's the state of the art for automated network configuration and management? What systems and tools are available, either freely or commercially? Where are these issues being considered and discussed?

I'm not simply talking about network status monitoring systems like HP OpenView, or device configuration monitoring systems like RANCID, although those are certainly useful. Instead, I'm talking about systems that will start from a description of how a network ought to be configured, and then interact with the various devices on that network to make it so; something like cfengine for network devices.

Over the last 15 years or so, much of the research in the system administration field has focused on automation. It's now well accepted that a well-run operation doesn't manage 10,000 servers individually, but rather uses tools like cfengine to manage definitions of those servers and then create instances of those servers as needed. In the networking world, though, most of us seem to be still manually configuring (and reconfiguring) every device.
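The "cfengine for network devices" idea is essentially convergence toward a declared desired state. A toy sketch of the concept follows; the Device class here is a stand-in for real device access (which would go over SSH, telnet, or a protocol like NETCONF), and configs are modeled as simple sets of lines.

```python
# Toy model of desired-state convergence for network devices.
# Device is a stand-in for real hardware access; configs are
# modeled (unrealistically simply) as unordered sets of lines.
class Device:
    def __init__(self, name, config_lines):
        self.name = name
        self.config = set(config_lines)

    def apply(self, line):
        self.config.add(line)

    def remove(self, line):
        self.config.discard(line)

def converge(device, desired_lines):
    """Make the device's config match the desired description,
    returning the changes made so they can be logged and reviewed."""
    desired = set(desired_lines)
    to_add = desired - device.config
    to_remove = device.config - desired
    for line in to_add:
        device.apply(line)
    for line in to_remove:
        device.remove(line)
    return {"added": sorted(to_add), "removed": sorted(to_remove)}

router = Device("gw-1", ["hostname gw-1", "snmp community public"])
changes = converge(router, ["hostname gw-1", "snmp community secret"])
print(changes)
```

The description of the desired state is the input; the tool's job is to compute and apply the delta, which is exactly the inversion of today's practice of typing the delta by hand and hoping the resulting state is what we wanted.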