Great Circle Associates Majordomo-Workers
(September 1996)
 


Subject: Patch for 1.94 to avoid starvation when looping for lock
From: John Gilmore <gnu @ toad . com>
Date: Wed, 25 Sep 1996 18:35:39 -0700
To: majordomo-workers @ greatcircle . com, gnu @ toad . com

I have been maintaining the cypherpunks@toad.com mailing list for some
months, using a majordomo originally set up by Hugh Daniel.  I
encountered and solved a problem in shlock.pl last month, which never
got forwarded to you-all.  (I also just looked at the 1.94b4 release
and it does not have this fix.  1.94b4 sleeps a random time up to ten
seconds each time, which will simply not help when you have 80
swapped-out majordomo processes fighting for a lock.  It takes each of
them more than ten seconds to swap in and look -- and only ONE of them
actually has the lock, so the other 79 are just getting in that one's
way.  They should all back off, rather than delaying the same amount
after each successive failure to grab the lock.)

Exponential backoff is the preferred approach -- e.g. back off 1
second the first time, 2 the second, 4 the third, 8 the fourth, etc.
I didn't implement that, largely because I didn't know how to
exponentiate in Perl!  You could simply multiply the sleep time by 2
each time around the loop to get the same effect, but as a Perl neophyte
I wanted the *minimal* change, with my system balanced on the head of a
pin and the load average at 100.
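
For reference, here is a minimal sketch of the doubling approach described
above. It is written in modern Perl 5 rather than the Perl 4 style of
shlock.pl, and it is not part of Majordomo: the names backoff_delays and
lock_with_backoff, the cap on the delay, and the random jitter are all
assumptions made for illustration. (Perl also has a `**` operator, so
`2 ** $tries` would give the same sequence without the running multiply.)

```perl
use strict;
use warnings;

# Hypothetical helper (not in shlock.pl): return the delays, in seconds,
# for $tries failed attempts, doubling each time and capping at $max.
sub backoff_delays {
    my ($tries, $initial, $max) = @_;
    my @delays;
    my $delay = $initial;
    for (my $i = 0; $i < $tries; $i++) {
        push @delays, $delay;
        $delay *= 2;
        $delay = $max if $delay > $max;
    }
    return @delays;
}

# Hypothetical retry loop. $try_lock is a code ref standing in for the
# real lock attempt; $sleeper defaults to sleep() and exists only so the
# loop can be exercised without actually waiting.
sub lock_with_backoff {
    my ($try_lock, $tries, $initial, $max, $sleeper) = @_;
    $sleeper ||= sub { sleep($_[0]) };
    for my $delay (backoff_delays($tries, $initial, $max)) {
        return 1 if $try_lock->();
        # Assumed refinement: add random jitter so that many waiting
        # processes do not all wake up at the same moment.
        $sleeper->($delay + int(rand($delay)));
    }
    return 0;    # ran out of tries
}
```

With an initial delay of 1 and a cap of 16, the first six delays before
jitter are 1, 2, 4, 8, 16, 16. The cap is a design choice: without one,
600 doublings would produce absurd sleep times.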

Here's my message from the day of the problem, including the patch.

	John Gilmore

Date: Tue, 06 Aug 1996 15:11:43 -0700
From: John Gilmore <gnu@toad.com>

The shlock.pl code was looping at one-second intervals to see if it
could get an exclusive lock on the cypherpunks list. 

I updated this to 100 seconds plus the number of tries (100, 101, 102, ...).
It's not as good as exponential backoff, but it probably keeps the
load average from hitting 100, which is where it was today on toad
while we were being mail-bombed with bogus subscriptions.  Most of
the load was swapped-out majordomo processes, which were trying to swap in to
see if they had the lock file yet.  Of course they didn't yet, because
only one of them had the lock, and there were eighty of them fighting
for the RAM and CPU time to try for it.

Completely crazy.  It should be using file-locking!
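
As a rough illustration of what file-locking could look like, here is a
sketch using Perl's built-in flock(). It is an assumption about one possible
approach, not Majordomo's code: open_locked and close_locked are hypothetical
names, and the sketch uses Perl 5 syntax. A process blocked in flock() sleeps
until the lock is released instead of waking up to poll for a lock file. One
caveat: flock() is advisory and may not work on network filesystems.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: open $path for appending and block until an
# exclusive advisory lock is granted. Returns the locked filehandle.
sub open_locked {
    my ($path) = @_;
    open(my $fh, '>>', $path) or die "cannot open $path: $!";
    flock($fh, LOCK_EX)       or die "cannot lock $path: $!";
    return $fh;
}

# Hypothetical helper: release the lock and close the file.
sub close_locked {
    my ($fh) = @_;
    flock($fh, LOCK_UN);
    return close($fh);
}
```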

Corwin or Hugh, do you know who maintains majordomo?  Can you forward
them back the change?  My changes are based on 1.93.  The only line
changed, besides comments, is the sleep() call.

	John

diff -ruw majordomo-1.93/shlock.pl /u/majordom/majordomo-1.93/shlock.pl
--- majordomo-1.93/shlock.pl	Sat Dec 31 15:00:29 1994
+++ /u/majordom/majordomo-1.93/shlock.pl	Tue Aug  6 12:20:57 1996
@@ -167,7 +167,7 @@
     $FH =~ s/^[^']+$/$package'$&/;
 
     for ($tries = 0 ; $tries < 600 ; $tries++) {
-	# Try to obtain the lock 600 times, waiting 1 second after each try
+	# Try to obtain the lock 600 times, with linear backoff after each try
 	if (&main'shlock("$lockfile")) {
 	    # Got the lock; now try to open the file
 	    $status = open($FH, $fm);
@@ -181,8 +181,8 @@
 	    # return the success or failure of the open
 	    return($status);
 	} else {
-	    # didn't get the lock; wait 1 second and try again.
-	    sleep(1);
+	    # didn't get the lock; wait with linear backoff, and try again.
+	    sleep(100+$tries);
 	}
     }
     # If we get this far, we ran out of tries on the lock.

