On Mon, Apr 18, 2005 at 09:21:12PM -0400, albert@research.att.com wrote:
> Hi Juergen, Other than the catastrophic cases you called attention to
> (and others we can think of where you lose the router or the ability
> to reset the router), could you elaborate on the cases you worry about
> where the simple policy of rollback to the last known good config does
> not converge to an acceptable state?

Networked devices such as switches and routers usually maintain a fair
amount of state that is dynamically learned (e.g. link state information
or spanning tree information). Some people also call this operational
state.

The most simplistic solution to rollback is to reinitialize the device
with the last saved good configuration. This approach often implies
losing dynamically learned operational state and is thus not really
desirable, especially if you do such a reload on a network-wide scale.
Suppose you regularly add or remove ACLs on your boxes and the
reaction to a failed transaction is that 80% of the boxes involved
need to be reloaded, so that they lose all their operational state
(which usually takes time to build up again). So this approach only
works in environments where the number of failed transactions per
unit of time is tolerable and the impact of reloading boxes more or
less simultaneously is acceptable.

The more advanced solution is to figure out what needs to be done to
the machine in order to return from the current state (which might be
the original transaction carried out halfway) to the original state.
My experience is that this is not trivial to get right. Suppose you
want to configure a new VLAN across your campus. This requires
sending sequences of SNMP sets or CLI commands to the boxes involved.
If your network-wide transaction fails, you have to instruct the
devices to revert that change by sending another sequence of SNMP
sets or CLI commands. On the box that caused the transaction to
abort, you may have to roll back a half-complete execution of the
SNMP sets or CLI commands. This in particular adds to the complexity
of the rollback code that you have to write. If you do not get it
right, you may leak incomplete VLAN configurations, and your
network-wide configuration system may be hit by this leakage at
some later time, causing another transaction to fail. In some
sense, you would need a smart garbage collector that removes the
things that should not be there without losing any operational
state.
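
To make the bookkeeping concrete, here is a minimal sketch in Python of
the "undo what was done, in reverse order" idea. Everything in it is an
assumption made for illustration: the names run_with_rollback and
vlan_step are hypothetical, the boxes are plain in-memory sets rather
than real devices, and each step is treated as atomic (it either
completes or changes nothing). That last assumption is exactly the one
that tends not to hold on real boxes, which is where the hard part
described above begins.

```python
from typing import Callable, Dict, List, Set, Tuple

# A step is a pair (apply, undo); both are hypothetical callables.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> bool:
    """Apply steps in order; on failure, undo completed steps in reverse."""
    undo_log: List[Callable[[], None]] = []
    for apply, undo in steps:
        try:
            apply()
        except Exception:
            # Assumes the failing step left nothing behind (atomic step).
            for compensate in reversed(undo_log):
                compensate()
            return False
        undo_log.append(undo)
    return True


def vlan_step(boxes: Dict[str, Set[int]], box: str, vid: int,
              fail: bool = False) -> Step:
    """Build an (apply, undo) pair that adds VLAN `vid` to an in-memory box."""
    def apply() -> None:
        if fail:
            raise RuntimeError(f"{box} rejected VLAN {vid}")
        boxes[box].add(vid)

    def undo() -> None:
        boxes[box].discard(vid)

    return apply, undo


if __name__ == "__main__":
    boxes: Dict[str, Set[int]] = {"sw1": set(), "sw2": set()}
    ok = run_with_rollback([
        vlan_step(boxes, "sw1", 42),
        vlan_step(boxes, "sw2", 42, fail=True),  # simulated failure
    ])
    print(ok, boxes)
```

In this toy version nothing leaks, because every step has an exact
inverse. The leakage problem shows up as soon as a step can fail after
it has already changed part of the box.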

Putting the rollback mechanism on the box pushes the problem to
the box vendor. The good news, however, is that the box vendor knows
the internals of how the box works and thus has the means to
support rollbacks more efficiently and in the least disruptive manner.
Pushing this to the vendors also solves the issue with versioning,
since rollback code you write yourself will end up being not only box
specific but actually box and version specific. Admittedly, pushing
the problem to the vendors requires serious work on their side,
especially if a rollback capability was not designed into the system.

Bottom line: Someone has to pay a price to support robust
network-wide configuration transactions. Vendors have an opportunity
to differentiate themselves here. From the system design point of
view, I would love to make the assumption that rollback support is on
the devices. For devices that do not have this capability, introduce
proxies that help to provide rollback capability, possibly even
by reloading. (TMN people would call these proxies element managers
and I am basically proposing to integrate the element manager into
the boxes themselves.) The network-wide configuration manager should
be isolated from these rollback details as much as possible; it
should only have to worry about generating transactions, running
transactions, and reporting and handling failed logical
configuration change transactions.
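
A rough Python sketch of that split, purely as an illustration of the
design and not of any real product or protocol: the class and function
names (TransactionalDevice, NativeDevice, LegacyDevice, ReloadProxy,
run_transaction) are all hypothetical, and configurations are modelled
as plain dictionaries. The manager only knows "apply" and "rollback";
whether a rollback is done natively or by a proxy that reloads the last
known good configuration is hidden from it.

```python
from typing import Dict, List, Protocol

Config = Dict[str, str]


class TransactionalDevice(Protocol):
    """What the network-wide manager assumes every device offers."""
    def apply(self, change: Config) -> None: ...
    def rollback(self) -> None: ...


class NativeDevice:
    """Stand-in for a box with vendor-provided rollback."""
    def __init__(self, config: Config, fail: bool = False) -> None:
        self.config = dict(config)
        self._previous = dict(config)
        self.fail = fail  # simulate a box that rejects the change

    def apply(self, change: Config) -> None:
        self._previous = dict(self.config)
        if self.fail:
            raise RuntimeError("device rejected change")
        self.config.update(change)

    def rollback(self) -> None:
        self.config = dict(self._previous)


class LegacyDevice:
    """Stand-in for a box without rollback: set values or reload only."""
    def __init__(self, config: Config) -> None:
        self.config = dict(config)
        self.reloads = 0

    def set(self, key: str, value: str) -> None:
        self.config[key] = value

    def reload(self, config: Config) -> None:
        self.config = dict(config)
        self.reloads += 1  # on a real box, operational state is lost here


class ReloadProxy:
    """Element-manager-style proxy: rollback by reloading last good config."""
    def __init__(self, device: LegacyDevice) -> None:
        self.device = device
        self.last_good = dict(device.config)

    def apply(self, change: Config) -> None:
        self.last_good = dict(self.device.config)
        for key, value in change.items():
            self.device.set(key, value)

    def rollback(self) -> None:
        self.device.reload(self.last_good)


def run_transaction(devices: List[TransactionalDevice], change: Config) -> bool:
    """Apply `change` everywhere; on any failure roll back touched devices."""
    touched: List[TransactionalDevice] = []
    for dev in devices:
        touched.append(dev)  # include the failing box in the rollback
        try:
            dev.apply(change)
        except Exception:
            for done in reversed(touched):
                done.rollback()
            return False
    return True
```

The point of the sketch is only the shape of the interface:
run_transaction contains no box-specific or version-specific rollback
logic, and the cost of a reload is confined to the devices that sit
behind a ReloadProxy.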
/js
--
Juergen Schoenwaelder International University Bremen
<http://www.eecs.iu-bremen.de/> P.O. Box 750 561, 28725 Bremen, Germany