[LON-CAPA-admin] server goes unresponsive?
Stuart Raeburn
raeburn at msu.edu
Tue Oct 23 12:59:07 EDT 2012
Todd,
These log items in lonnet.log are quite normal:
> Sun Oct 7 12:00:23 2012 (21138): Starting Shut down
> Sun Oct 7 12:00:23 2012 (21138): %badServerCache is 7
> Sun Oct 7 12:00:23 2012 (21138): %homecache is 13510
> Sun Oct 7 12:00:23 2012 (21138): %remembered is 7
> Sun Oct 7 12:00:23 2012 (21138): kicks is 0
> Sun Oct 7 12:00:23 2012 (21138): hits is 451259
> Sun Oct 7 12:00:23 2012 (21138): Flushing log buffers
> Sun Oct 7 12:00:23 2012 (21138): Shutting down
They are the result of this entry:
PerlChildExitHandler Apache::lonacc::goodbye
in /etc/httpd/conf/loncapa_apache.conf
The PerlChildExitHandler exits whenever an Apache child exits.
An Apache child will exit in the following circumstances:
(a) Apache is stopped
(b) Apache is restarted
(c) The maximum number of requests per Apache child has been reached
-- the default in /etc/httpd/conf/httpd.conf is: MaxRequestsPerChild
1000
The numbers logged provide information about cache usage.
%badServerCache is a lonnet.pm global which stores cases where a
server for a particular domain has responded with no_host. The keys
are the server lonHostIDs.
%homecache is a lonnet.pm global which stores the number of users
(username:domain) for whom the home server has been determined (and is
cached; i.e., found in memcache).
%remembered is a lonnet.pm global which stores items remembered after
completion of an Apache request.
In /etc/httpd/conf/loncapa-apache.conf you will find this:
PerlCleanupHandler Apache::lonacc::cleanup
lonacc::cleanup() calls lonnet::save_cache() which in turn calls
lonnet::purge_remembered() to clean up the %remembered hash after
processing of the request.
hits is the number of times items in %remembered which were used
during processing.
kicks is the number of items removed from the %remembered hash during
processing of the request (to make room for others). It will be zero,
since the code to kick out items is not currently used.
> I strikes me as very much not good that my library server has been marked
> "DEAD". Any ideas on what can cause a host to be so marked?
According to the documentation in loncnew:
If a socket timeout is detected the connection retries left is
decremented. Once the number of retries left is zero, the host is
marked as DEAD and no further attempts will be made by that child.
Is the situation and the logging you describe from your access server
(i.e., the access server is unable to connect to your library server),
or is this the library server trying to talk to itself via lonc/lond,
and failing? I assume the former, but just checking.
Stuart Raeburn
LON-CAPA Academic Consortium
Quoting Todd Ruskell <todd.ruskell at gmail.com>:
> Hi,
>
> We've been experiencing a situation in which our library server seems to
> suddenly go unresponsive, without much warning. In looking at logs, I see
> a couple things.
>
> First, in lonnet.log I see some entries like the following distributed
> throughout the log file. I don't know how to interpret these entries, but
> they seem to indicate something isn't quite right:
>
> Sun Oct 7 12:00:23 2012 (21138): Starting Shut down
> Sun Oct 7 12:00:23 2012 (21138): %badServerCache is 7
> Sun Oct 7 12:00:23 2012 (21138): %homecache is 13510
> Sun Oct 7 12:00:23 2012 (21138): %remembered is 7
> Sun Oct 7 12:00:23 2012 (21138): kicks is 0
> Sun Oct 7 12:00:23 2012 (21138): hits is 451259
> Sun Oct 7 12:00:23 2012 (21138): Flushing log buffers
> Sun Oct 7 12:00:23 2012 (21138): Shutting down
>
> When the system seems to go unresponsive, lonnet.log has the following
> entries:
>
> Sun Oct 7 12:07:17 2012 (21568): <font color="blue">WARNING: Trying to get
> resource data for smarkoe at csm: con_lost</font>
> Sun Oct 7 12:07:38 2012 (21871): <font color="blue">WARNING: Trying to get
> resource data for gajohnso at csm: con_lost</font>
>
> ...above entry repeated several times and then several messages like ...
>
> Sun Oct 7 12:07:40 2012 (21871): Could not devalidate spreadsheet esease
> at csm
> for
> uploaded/csm/6925421bc619b4f6bcsml1/default_1316619888.sequence___3___csm/c
> smphyslib/P200_Materials/StudioActivities/Block2-Circuits/EnergyStoredInCapacito
> r/readingQuestions.problem: no_such_host con_lost
> Sun Oct 7 12:07:41 2012 (21498): Could not devalidate spreadsheet jsingh
> at csm
> for
> uploaded/csm/6925421bc619b4f6bcsml1/default_1317053874.sequence___15___csm/csmphyslib/P200_Materials/TestBank/Current_Resistance/RCPowerUpdated.problem:
> error: 100 tie(GDBM) Failed while attempting del con_lost
\
>
>
> I strikes me as very much not good that my library server has been marked
> "DEAD". Any ideas on what can cause a host to be so marked? It doesn't
> appear to be load based, as it seems to have happened at a variety of load
> levels, including pretty low. Any help you can provide would be greatly
> appreciated.
>
> Thanks,
> Todd
More information about the LON-CAPA-admin
mailing list