[LON-CAPA-dev] lonc/lond

Fri, 08 Mar 2002 09:29:21 -0500

Ron,

Thanks!!!

There is actually only one question: why did this ever work at all?

- Gerd.

Ron Fox wrote:

> Hi all,
>
> I noticed there was not an open bugzilla entry for the lonc/lond issues
> I've been stalking, so I created one and handed it to myself.
>
> I've committed a new version of lonc which fixes one of the many sources
> of instability in lonc, and simplifies some of the logic.
>
> Lonc is a network daemon with a parent process (called PARENT below)
> which manages a pool of child processes, each connected to a specific
> lond representing a persistent connection to a remote lon-capa system
> (called either SPECIFIC CHILD or CHILDREN below).
>
> In order to improve reliability, PARENT monitors the CHILDREN and
> attempts a limited number of restarts should a SPECIFIC CHILD die. It did
> this using a main loop which had (simplified) pseudo code like this:
>
>   while forever
>     sleep until signal wakes me up
>     for all entries in peer to pid hash with no entry in the pid to peer hash
>       make_new_child for associated connection
>     end for
>   endwhile
>
> make_new_child maintained two hashes: a pid to peer hash and a peer to
> pid hash. A handler for the SIGCHLD signal (sent to a process when a
> child process exits) was established. This handler executed pseudo code
> like:
>
>   deadpid = wait    # wait returns pid of dead child, and frees it
>                     # from zombie state.
>   delete deadpid from pid to peer hash
>
> 1. This logic is overly complex for my taste and I've
>    a. removed the signal handler altogether.
>    b. replaced the parent main loop with logic shown in the pseudo code
>       below:
>
>   while forever
>     deadpid = wait     # Note other signals can interrupt this wait hence:
>     if there's an entry for deadpid in the pid to peer hash
>       make_new_child for the peer associated with deadpid
>     endif
>   endwhile
>
> 2. Discovered and fixed a defect in make_new_child which essentially
>    threw away all entries in the peer to pid hash (wrong variable used as
>    the hash key at insertion time).
>
> The defect in make_new_child was causing the following problem: if a
> connection was lost or could not be formed, and the associated child
> exited, lonc would not only attempt to restart the child (ok) but would
> believe that >all< children had died, and start new children for all
> connections. The duplicate children would soon discover that they could
> not create a server socket (the existing process already owned it), and
> exit, starting the dance all over again, until the retry count was
> exhausted and lonc would give up on recreating all connections.
> Subsequent to that, any child exit was total and unrecoverable (there's
> no current logic for resetting the retry count back to zero).
>
> There are more problems which have probably not been seen, or are seen
> at a lower probability than this. I was lucky enough to be debugging this
> on a system whose hosts.tab was 'too big' and therefore immediately and
> reproducibly triggered this collapse.
>
> Stay tuned for more news about lonc/lond as it develops.
>
> Cheers,
> RF.
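
For anyone following the thread, here is a rough sketch of the simplified
parent loop Ron describes, with both hashes keyed by the right variable. It
is an illustration only and not the lonc code: it is written in Python, and
the names supervise, run_child and max_restarts are made up for the example.
Unlike the daemon, it returns once every child has used up its restarts, so
it can be run to completion.

```python
import os


def make_new_child(peer, pid_to_peer, peer_to_pid, run_child):
    """Fork a child for `peer` and record it in both maps (hypothetical helper)."""
    pid = os.fork()
    if pid == 0:
        # Child: run the per-peer work, then leave without ever returning
        # into the parent's loop.
        code = 1
        try:
            code = int(run_child(peer) or 0)
        finally:
            os._exit(code)
    # Parent: each map is keyed by its own variable. The defect described in
    # the quoted message was a wrong key at this kind of insertion.
    pid_to_peer[pid] = peer
    peer_to_pid[peer] = pid
    return pid


def supervise(peers, run_child, max_restarts=3):
    """Restart each peer's child up to `max_restarts` times; return the counts."""
    pid_to_peer, peer_to_pid = {}, {}
    restarts = {peer: 0 for peer in peers}
    for peer in peers:
        make_new_child(peer, pid_to_peer, peer_to_pid, run_child)

    # The daemon loops forever; this sketch stops when no children are left.
    while pid_to_peer:
        try:
            # Blocks until some child exits and reaps it. Python 3.5+ retries
            # the call itself if an unrelated signal interrupts it.
            deadpid, _status = os.wait()
        except ChildProcessError:
            break  # nothing left to wait for
        peer = pid_to_peer.pop(deadpid, None)
        if peer is None:
            continue  # not a child this loop is tracking
        peer_to_pid.pop(peer, None)
        # As in the quoted message, the count is never reset to zero.
        if restarts[peer] < max_restarts:
            restarts[peer] += 1
            make_new_child(peer, pid_to_peer, peer_to_pid, run_child)
    return restarts
```

Calling supervise(["a", "b"], lambda peer: 1, max_restarts=2) forks one
child per peer, and each child exits at once, so only the child that died
is restarted until its own count runs out.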
