[LON-CAPA-dev] lonc/lond

Ron Fox lon-capa-dev@mail.lon-capa.org
Thu, 7 Mar 2002 23:23:34 -0500


Hi all,
  I noticed there was no open Bugzilla entry for the lonc/lond issues
I've been stalking, so I created one and assigned it to myself.

  I've committed a new version of lonc which fixes one of the many
sources of instability in lonc and simplifies some of the logic:

Lonc is a network daemon with a parent process (called PARENT below)
which manages a pool of child processes. Each child is connected to a
specific lond and represents a persistent connection to a remote
lon-capa system (called either SPECIFIC CHILD or CHILDREN below).
    To improve reliability, PARENT monitors the CHILDREN and attempts a
limited number of restarts should a SPECIFIC CHILD die. It did this
using a main loop which had (simplified) pseudo code like this:

while forever
   sleep until signal wakes me up
   for all entries in peer to pid hash with no entry in the pid to peer hash
      make_new_child for associated connection
   end for
endwhile

make_new_child maintained two hashes: a pid to peer hash and a peer to
pid hash.

A handler for the SIGCHLD signal (sent to a process when a child
process exits) was established. This handler executed pseudo code like:

deadpid = wait    # wait returns the pid of the dead child and frees it
                  # from zombie state
delete deadpid from pid to peer hash

1. This logic is overly complex for my taste, so I've
   a. removed the signal handler altogether.
   b. replaced the parent main loop with the logic shown in the pseudo
      code below:

while forever
   deadpid = wait     # Note: other signals can interrupt this wait, hence:
   if there's an entry for deadpid in the pid to peer hash
       make_new_child for the peer associated with deadpid
   endif
endwhile
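
For anyone who wants to play with the idea outside of lonc, here is a
minimal Python sketch of a wait-driven parent loop. It is not the lonc
code: the names supervise, work and max_restarts are made up for
illustration, the restart limit of 2 is an arbitrary assumption, and
unlike the real loop it returns once no children remain so that it can
be run to completion.

```python
import os


def make_new_child(peer, pid_to_peer, work):
    """Fork a child for `peer` and record it in the pid-to-peer table."""
    pid = os.fork()
    if pid == 0:
        # Child: `work` is a hypothetical stand-in for servicing the connection.
        status = 1
        try:
            status = work(peer)
        finally:
            os._exit(status)  # never fall back into the parent's loop
    pid_to_peer[pid] = peer
    return pid


def supervise(peers, work, max_restarts=2):
    """Restart dead children from a blocking wait; no signal handler needed."""
    pid_to_peer = {}
    restarts = {peer: 0 for peer in peers}
    for peer in peers:
        make_new_child(peer, pid_to_peer, work)

    while pid_to_peer:  # the real daemon loops forever
        try:
            deadpid, _status = os.wait()  # reaps the child, clearing the zombie
        except InterruptedError:
            continue  # some other signal interrupted the wait
        except ChildProcessError:
            break  # no children left to wait for
        peer = pid_to_peer.pop(deadpid, None)
        if peer is None:
            continue  # not a pid we are tracking
        if restarts[peer] < max_restarts:
            restarts[peer] += 1
            make_new_child(peer, pid_to_peer, work)
    return restarts
```

Calling supervise(["hostA", "hostB"], lambda peer: 0) forks one child
per peer; each child exits at once, is reaped by the wait, and is
restarted until its limit is reached. Only the pid to peer table is
needed to decide what to restart.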

2. Discovered and fixed a defect in make_new_child which essentially
threw away all entries in the peer to pid hash (the wrong variable was
used as the hash key at insertion time).

The defect in make_new_child was causing the following problem:
  If a connection was lost or could not be formed, and the associated
child exited, lonc would not only attempt to restart that child (ok) but
also believe that >all< children had died, and start new children for
all connections. The duplicate children would soon discover that they
could not create a server socket (the existing process already owned
it) and exit, starting the dance all over again until the retry count
was exhausted and lonc gave up on recreating all connections. After
that, any child exit was total and unrecoverable (there's currently no
logic for resetting the retry count back to zero).
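
To make this class of bug concrete, here is a small Python sketch of how
a wrong key at insertion time can make every peer look dead. It is a
hypothetical illustration, not the lonc code: the stale variable, the
function names and the simplified liveness check are all assumptions,
since the actual variable names are not given above.

```python
def record_children_buggy(children):
    """Insert with a stale variable as the key (illustrative bug)."""
    peer_to_pid = {}
    key = None  # hypothetical stale variable, never set to the peer name
    for peer, pid in children:
        peer_to_pid[key] = pid  # BUG: every insert overwrites the same slot
    return peer_to_pid


def record_children_fixed(children):
    """Insert with the peer name as the key."""
    peer_to_pid = {}
    for peer, pid in children:
        peer_to_pid[peer] = pid
    return peer_to_pid


def peers_needing_restart(peers, peer_to_pid, live_pids):
    """Simplified check: a peer needs a restart if it has no live pid."""
    return [p for p in peers if peer_to_pid.get(p) not in live_pids]


peers = ["hostA", "hostB", "hostC"]
children = [("hostA", 101), ("hostB", 102), ("hostC", 103)]
live_pids = {101, 102}  # only hostC's child has exited

fixed = peers_needing_restart(peers, record_children_fixed(children), live_pids)
buggy = peers_needing_restart(peers, record_children_buggy(children), live_pids)
```

With the correct key only hostC is selected for restart. With the stale
key the table holds a single useless entry, no peer can be matched to a
live pid, and all three peers are selected.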

There are more problems which have probably not been seen yet, or are
seen with lower probability than this one.
  I was lucky enough to be debugging this on a system whose hosts.tab
was 'too big' and therefore immediately and reproducibly triggered this
collapse.

Stay tuned for more news about lonc/lond as it develops.

Cheers,
RF.
