Hi Stuart,<br><br><br><div class="gmail_quote">On Tue, Oct 23, 2012 at 10:59 AM, Stuart Raeburn <span dir="ltr"><<a href="mailto:raeburn@msu.edu" target="_blank">raeburn@msu.edu</a>></span> wrote:<br><br>&lt;snip&gt;<br>
<br>Thanks for the information about those variables and for deciphering the log.<br><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
It strikes me as very much not good that my library server has been marked<br>
"DEAD". Any ideas on what can cause a host to be so marked?<br>
</blockquote>
<br></div><br>
According to the documentation in loncnew:<br>
<br>
If a socket timeout is detected the connection retries left is decremented. Once the number of retries left is zero, the host is marked as DEAD and no further attempts will be made by that child.<br>
<br>
Is the situation and the logging you describe from your access server (i.e., the access server is unable to connect to your library server), or is this the library server trying to talk to itself via lonc/lond, and failing? I assume the former, but just checking.<br>
</blockquote><div><br>That's the unfortunate thing. Everything reported here is on the library server. This is the library server trying to talk to itself. <br><br>Does that mean this effect is caused by some overloading of the server? That's possible, but as best I can tell, it has happened at a range of mid-to-high server load levels, some of which we've experienced before without this behavior.<br>
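<br>The loncnew behavior quoted above (decrement the retries left on each socket timeout, mark the host DEAD at zero, no further attempts by that child) can be sketched as follows. This is a minimal Python model written only to make the described logic concrete; it is not the actual loncnew code, and the class name, method names, host label, and starting retry count are all assumptions made for illustration.<br>
<pre>
```python
from dataclasses import dataclass


@dataclass
class HostConnection:
    """Hypothetical model of one child's view of a remote host."""

    host: str
    retries_left: int = 5  # assumed starting value, not taken from loncnew
    dead: bool = False

    def on_socket_timeout(self) -> None:
        """Decrement the retry budget; mark the host DEAD when it hits zero."""
        if self.dead:
            return
        self.retries_left -= 1
        if self.retries_left == 0:
            self.dead = True

    def may_attempt(self) -> bool:
        """Once the host is DEAD, this child makes no further attempts."""
        return not self.dead


# "library-server" is a placeholder label, not a real host ID.
conn = HostConnection(host="library-server", retries_left=3)
for _ in range(3):
    conn.on_socket_timeout()
print(conn.dead, conn.may_attempt())
```
</pre>
If this reading is right, a short burst of timeouts is enough to exhaust the budget, and nothing in the description suggests the child ever resets it, which would fit with a restart being what clears the condition.<br>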
<br>As you probably suspect, when this happens the usual login page is replaced by the "LON-CAPA is temporarily unavailable" page. And a simple restart of loncontrol does get things up and running again. I suppose I could set up a cron job to restart loncontrol every 5 minutes. But I'm also open to other suggestions.<br>
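<br>Rather than restarting unconditionally every 5 minutes, a cron job could first check whether lonnet.log shows recent con_lost entries. The following is a minimal Python sketch of that check, assuming each log entry sits on a single line in the "timestamp (pid): message" form shown in the excerpts below; the function names, the 5-minute window, and the threshold of 2 are assumptions, and the restart step itself is deliberately left out.<br>
<pre>
```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   Sun Oct 7 12:07:17 2012 (21568): WARNING: ... con_lost
LINE_RE = re.compile(
    r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) \((\d+)\): (.*)$"
)


def con_lost_times(lines):
    """Return the timestamps of all entries whose message mentions con_lost."""
    times = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "con_lost" in m.group(3):
            times.append(datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y"))
    return times


def should_restart(lines, now, window=timedelta(minutes=5), threshold=2):
    """True if at least `threshold` con_lost entries fall inside the window."""
    cutoff = now - window
    recent = [t for t in con_lost_times(lines) if t >= cutoff]
    return len(recent) >= threshold


# Sample entries copied from the log excerpts in this thread, markup removed.
sample = [
    "Sun Oct 7 12:00:23 2012 (21138): Shutting down",
    "Sun Oct 7 12:07:17 2012 (21568): WARNING: Trying to get resource data for smarkoe at csm: con_lost",
    "Sun Oct 7 12:07:38 2012 (21871): WARNING: Trying to get resource data for gajohnso at csm: con_lost",
]
print(should_restart(sample, now=datetime(2012, 10, 7, 12, 8, 0)))
```
</pre>
A wrapper run from cron would read the tail of the real log, pass the current time as `now`, and restart loncontrol only when the function returns True. That would also leave a record of how often the condition actually occurs.<br>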
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Stuart Raeburn<br>
LON-CAPA Academic Consortium<div><div class="h5"><br>
<br>
<br>
Quoting Todd Ruskell <<a href="mailto:todd.ruskell@gmail.com" target="_blank">todd.ruskell@gmail.com</a>>:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi,<br>
<br>
We've been experiencing a situation in which our library server seems to<br>
suddenly go unresponsive, without much warning. In looking at logs, I see<br>
a couple things.<br>
<br>
First, in lonnet.log I see some entries like the following distributed<br>
throughout the log file. I don't know how to interpret these entries, but<br>
they seem to indicate something isn't quite right:<br>
<br>
Sun Oct 7 12:00:23 2012 (21138): Starting Shut down<br>
Sun Oct 7 12:00:23 2012 (21138): %badServerCache is 7<br>
Sun Oct 7 12:00:23 2012 (21138): %homecache is 13510<br>
Sun Oct 7 12:00:23 2012 (21138): %remembered is 7<br>
Sun Oct 7 12:00:23 2012 (21138): kicks is 0<br>
Sun Oct 7 12:00:23 2012 (21138): hits is 451259<br>
Sun Oct 7 12:00:23 2012 (21138): Flushing log buffers<br>
Sun Oct 7 12:00:23 2012 (21138): Shutting down<br>
<br>
When the system seems to go unresponsive, lonnet.log has the following<br>
entries:<br>
<br>
Sun Oct 7 12:07:17 2012 (21568): <font color="blue">WARNING: Trying to get<br>
resource data for smarkoe at csm: con_lost</font><br>
Sun Oct 7 12:07:38 2012 (21871): <font color="blue">WARNING: Trying to get<br>
resource data for gajohnso at csm: con_lost</font><br>
<br>
...above entry repeated several times and then several messages like ...<br>
<br>
Sun Oct 7 12:07:40 2012 (21871): Could not devalidate spreadsheet esease<br>
at csm<br>
for<br>
uploaded/csm/6925421bc619b4f6bcsml1/default_1316619888.sequence___3___csm/csmphyslib/P200_Materials/StudioActivities/Block2-Circuits/EnergyStoredInCapacitor/readingQuestions.problem: no_such_host con_lost<br>
Sun Oct 7 12:07:41 2012 (21498): Could not devalidate spreadsheet jsingh<br>
at csm<br>
for<br>
uploaded/csm/6925421bc619b4f6bcsml1/default_1317053874.sequence___15___csm/csmphyslib/P200_Materials/TestBank/Current_Resistance/RCPowerUpdated.problem:<br>
error: 100 tie(GDBM) Failed while attempting del con_lost<br>
</blockquote></div></div>
<div class="im"><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
<br>
It strikes me as very much not good that my library server has been marked<br>
"DEAD". Any ideas on what can cause a host to be so marked? It doesn't<br>
appear to be load based, as it seems to have happened at a variety of load<br>
levels, including pretty low. Any help you can provide would be greatly<br>
appreciated.<br>
<br>
Thanks,<br>
Todd<br>
</blockquote>
<br></div>
_______________________________________________<br>
LON-CAPA-admin mailing list<br>
<a href="mailto:LON-CAPA-admin@mail.lon-capa.org" target="_blank">LON-CAPA-admin@mail.lon-capa.org</a><br>
<a href="http://mail.lon-capa.org/mailman/listinfo/lon-capa-admin" target="_blank">http://mail.lon-capa.org/mailman/listinfo/lon-capa-admin</a><br>
</blockquote></div><br>