> Networked devices such as switches and routers usually maintain quite some
> state that is dynamically learned (e.g. link state information or spanning
> tree information). Some people also call this operational state.
Also, with link state, spanning tree, and similar protocols, it's possible
that a configuration change modifies not only the local state information
but also, for example, which switch is the STP root. That is not necessarily
changed back to the original when you roll back the offending switch's
configuration. After a rollback, the state of your network has changed
even though the configuration files are identical to the original.

This is one issue I have with configuration change management tools that
deal only with the configuration file: not all relevant data is captured.
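
To make the point concrete, here is a toy sketch in Python. It is not real
spanning tree: the Switch and Network classes, and the rule that the
current root keeps its role on a priority tie, are assumptions made up for
illustration. They stand in for any protocol whose learned state depends on
the history of changes and not only on the current configuration.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    priority: int  # configured value; lower wins an election


class Network:
    """Toy model: the elected root is operational state, not configuration."""

    def __init__(self, switches):
        self.switches = {s.name: s for s in switches}
        self.root = None
        self.elect()

    def config(self):
        # What a tool that only looks at configuration files would capture.
        return {name: s.priority for name, s in self.switches.items()}

    def elect(self):
        # Hypothetical rule (NOT real STP): the incumbent root is replaced
        # only by a strictly better priority, so it keeps the role on a tie.
        best = min(self.switches.values(), key=lambda s: (s.priority, s.name))
        if self.root is None or best.priority < self.switches[self.root].priority:
            self.root = best.name

    def set_priority(self, name, priority):
        self.switches[name].priority = priority
        self.elect()


net = Network([Switch("sw1", 100), Switch("sw2", 100)])
config_before, root_before = net.config(), net.root

net.set_priority("sw1", 300)  # the offending change: sw2 becomes root
net.set_priority("sw1", 100)  # roll sw1 back to its original configuration

print("config identical:", net.config() == config_before)
print("root before/after:", root_before, net.root)
```

In this model the configuration comparison succeeds while the root has
moved from sw1 to sw2, so a check that only diffs configuration files would
report the rollback as complete.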
On Tue, 19 Apr 2005, Juergen Schoenwaelder wrote:
> On Mon, Apr 18, 2005 at 09:21:12PM -0400, albert@research.att.com wrote:
>
> > Hi Juergen, other than the catastrophic cases you called attention to
> > (and others we can think of where you lose the router or the ability
> > to reset the router), could you elaborate on the cases you worry about
> > where the simple policy of rollback to the last known good config does
> > not converge to an acceptable state?
>
> Networked devices such as switches and routers usually maintain quite some
> state that is dynamically learned (e.g. link state information or spanning
> tree information). Some people also call this operational state.
>
> The most simplistic solution to rollback is to reinitialize the device
> with the last saved good configuration. This approach often implies
> losing dynamically learned operational state and is thus not really
> desirable, especially if you do such a reload on a network-wide scale.
> Suppose you do something like adding/removing ACLs on your boxes
> regularly and the reaction to such a failed transaction is that 80%
> of the boxes involved need to be reloaded and that you lose all the
> operational state (which usually takes time to build up again). So
> this approach only works in environments where the number of failed
> transactions per unit of time is tolerable and the impact of reloading
> boxes more or less simultaneously is acceptable.
>
> The more advanced solution is to figure out what needs to be done to
> the machine in order to return from the current state (which might be
> the original transaction carried out halfway) to the original state.
> My experience is that this is not trivial to get right. Suppose you
> want to configure a new VLAN across your campus. This requires
> sending sequences of SNMP sets or CLI commands to the boxes involved.
> If your network-wide transaction fails, you have to instruct the
> devices to revert that change by sending another sequence of SNMP
> sets or CLI commands. On the box that causes the transaction to
> abort, you may have to roll back a half-complete execution of the
> SNMP sets or CLI commands. This in particular impacts the complexity
> of the rollback code that you have to write. If you do not get it
> right, you may leak incomplete VLAN configurations and your
> network-wide configuration system may be hit by this leakage at
> some later time, causing another transaction to fail. In some
> sense, you would need a smart garbage collector to remove things
> that ideally should not be there, without losing any operational
> state.
>
> Putting the rollback mechanism on the box pushes the problem to
> the box vendor. The good news, however, is that the box vendor knows
> the internals of how the box works and thus has the means to
> support rollbacks more efficiently and in the least disruptive manner.
> Pushing this to the vendors also solves the issue with versioning,
> since the rollback code you write will end up being not only box
> specific but actually box and version specific. Sure, this approach
> of pushing the problem to the vendors requires serious work on the
> side of the vendor, especially if a rollback capability was not
> designed into the system.
>
> Bottom line: Someone has to pay a price to support robust
> network-wide configuration transactions. Vendors have an opportunity
> to differentiate themselves here. From the system design point of view,
> I love to make the assumption that rollback support is on the
> devices. For devices that do not have this capability, introduce
> proxies that help to provide rollback capability, possibly even
> by reloading. (TMN people would call these proxies element managers
> and I am basically proposing to integrate the element manager into
> the boxes themselves.) The network-wide configuration manager should
> be isolated from these rollback details as much as possible, as
> it should only worry about generating transactions, running
> transactions and dealing with the reporting/handling of failed
> logical configuration change transactions.
>
> /js
>
> --
> Juergen Schoenwaelder International University Bremen
> <http://www.eecs.iu-bremen.de/> P.O. Box 750 561, 28725 Bremen, Germany
>
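
The "more advanced" approach in the quoted message, undoing exactly what
was applied when a network-wide transaction aborts, could be sketched
roughly as follows. This is a minimal Python sketch under stated
assumptions: FakeDevice, run_transaction and the command strings are
hypothetical placeholders, not a real SNMP or CLI interface, and every
command is assumed to have a clean inverse that always succeeds.

```python
class CommandFailed(Exception):
    pass


class FakeDevice:
    """Stand-in for a box reached via SNMP sets or CLI (hypothetical)."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on  # command this box will reject, if any
        self.applied = []       # commands currently in effect on the box

    def run(self, command):
        if command == self.fail_on:
            raise CommandFailed(f"{self.name}: {command}")
        self.applied.append(command)

    def undo(self, command):
        # Assumption: undoing a command never fails.
        self.applied.remove(command)


def run_transaction(plan):
    """plan is a list of (device, [commands]); returns True on commit.

    Each successful command is recorded individually, so a failure in the
    middle of one box's sequence still undoes exactly what was applied.
    """
    done = []
    try:
        for device, commands in plan:
            for command in commands:
                device.run(command)
                done.append((device, command))
    except CommandFailed:
        # Compensate in reverse order, including the half-complete box.
        for device, command in reversed(done):
            device.undo(command)
        return False
    return True


# Placeholder command strings, not real vendor syntax.
steps = ["create vlan 42", "allow vlan 42 on uplink"]
a = FakeDevice("a")
b = FakeDevice("b", fail_on="allow vlan 42 on uplink")
ok = run_transaction([(a, steps), (b, steps)])
print(ok, a.applied, b.applied)
```

Here box b aborts halfway, and both boxes end up with nothing applied. The
hard part described in the quoted message is exactly what this sketch
assumes away: real undo sequences are box and version specific, can fail
themselves, and do not restore operational state.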