[LON-CAPA-admin] server goes unresponsive?

Tue Oct 23 12:59:07 EDT 2012

Todd,

These log items in lonnet.log are quite normal:

> Sun Oct  7 12:00:23 2012 (21138): Starting Shut down
> Sun Oct  7 12:00:23 2012 (21138): %badServerCache      is 7
> Sun Oct  7 12:00:23 2012 (21138): %homecache           is 13510
> Sun Oct  7 12:00:23 2012 (21138): %remembered          is 7
> Sun Oct  7 12:00:23 2012 (21138): kicks                is 0
> Sun Oct  7 12:00:23 2012 (21138): hits                 is 451259
> Sun Oct  7 12:00:23 2012 (21138): Flushing log buffers
> Sun Oct  7 12:00:23 2012 (21138): Shutting down

They are the result of this entry:
PerlChildExitHandler Apache::lonacc::goodbye

in /etc/httpd/conf/loncapa_apache.conf

The PerlChildExitHandler exits whenever an Apache child exits.
An Apache child will exit in the following circumstances:

(a) Apache is stopped
(b) Apache is restarted
(c) The maximum number of requests per Apache child has been reached  
-- the default in /etc/httpd/conf/httpd.conf is: MaxRequestsPerChild   
1000

The numbers logged provide information about cache usage.

%badServerCache is a lonnet.pm global which stores cases where a  
server for a particular domain has responded with no_host. The keys  
are the server lonHostIDs.

%homecache is a lonnet.pm global which stores the number of users  
(username:domain) for whom the home server has been determined (and is  
cached; i.e., found in memcache).

%remembered is a lonnet.pm global which stores items remembered after  
completion of an Apache request.

In /etc/httpd/conf/loncapa-apache.conf you will find this:
PerlCleanupHandler      Apache::lonacc::cleanup

lonacc::cleanup() calls lonnet::save_cache() which in turn calls  
lonnet::purge_remembered() to clean up the %remembered hash after  
processing of the request.

hits is the number of times items in %remembered which were used  
during processing.

kicks is the number of items removed from the %remembered hash during  
processing of the request (to make room for others). It will be zero,  
since the code to kick out items is not currently used.

> I strikes me as very much not good that my library server has been marked
> "DEAD".  Any ideas on what can cause a host to be so marked?

According to the documentation in loncnew:

If a socket timeout is detected the connection retries left is  
decremented. Once the number of retries left is zero, the host is  
marked as DEAD and no further attempts will be made by that child.

Is the situation and the logging you describe from your access server  
(i.e., the access server is unable to connect to your library server),  
or is this the library server trying to talk to itself via lonc/lond,  
and failing? I assume the former, but just checking.

Stuart Raeburn
LON-CAPA Academic Consortium

Quoting Todd Ruskell <todd.ruskell at gmail.com>:

> Hi,
>
> We've been experiencing a situation in which our library server seems to
> suddenly go unresponsive, without much warning.  In looking at logs, I see
> a couple things.
>
> First, in lonnet.log I see some entries like the following distributed
> throughout the log file.  I don't know how to interpret these entries, but
> they seem to indicate something isn't quite right:
>
> Sun Oct  7 12:00:23 2012 (21138): Starting Shut down
> Sun Oct  7 12:00:23 2012 (21138): %badServerCache      is 7
> Sun Oct  7 12:00:23 2012 (21138): %homecache           is 13510
> Sun Oct  7 12:00:23 2012 (21138): %remembered          is 7
> Sun Oct  7 12:00:23 2012 (21138): kicks                is 0
> Sun Oct  7 12:00:23 2012 (21138): hits                 is 451259
> Sun Oct  7 12:00:23 2012 (21138): Flushing log buffers
> Sun Oct  7 12:00:23 2012 (21138): Shutting down
>
> When the system seems to go unresponsive, lonnet.log has the following
> entries:
>
> Sun Oct  7 12:07:17 2012 (21568): <font color="blue">WARNING: Trying to get
> resource data for smarkoe at csm: con_lost</font>
> Sun Oct  7 12:07:38 2012 (21871): <font color="blue">WARNING: Trying to get
> resource data for gajohnso at csm: con_lost</font>
>
> ...above entry repeated several times and then several messages like ...
>
> Sun Oct  7 12:07:40 2012 (21871): Could not devalidate spreadsheet esease
> at csm
>  for
> uploaded/csm/6925421bc619b4f6bcsml1/default_1316619888.sequence___3___csm/c
> smphyslib/P200_Materials/StudioActivities/Block2-Circuits/EnergyStoredInCapacito
> r/readingQuestions.problem: no_such_host con_lost
> Sun Oct  7 12:07:41 2012 (21498): Could not devalidate spreadsheet jsingh
> at csm
>  for
> uploaded/csm/6925421bc619b4f6bcsml1/default_1317053874.sequence___15___csm/csmphyslib/P200_Materials/TestBank/Current_Resistance/RCPowerUpdated.problem:
> error: 100 tie(GDBM) Failed while attempting del con_lost
\
>
>
> I strikes me as very much not good that my library server has been marked
> "DEAD".  Any ideas on what can cause a host to be so marked?  It doesn't
> appear to be load based, as it seems to have happened at a variety of load
> levels, including pretty low.  Any help you can provide would be greatly
> appreciated.
>
> Thanks,
> Todd