[LON-CAPA-dev] Crash Investigation

Gabriel Friedmann lon-capa-dev@mail.lon-capa.org
Mon, 16 Feb 2004 16:48:48 -0500 (EST)


Hello developers.

How can i investigate the cause of the following log entry? 
(attempting to trace the cause of a crash last night)

lonnet.log

--snip--

Sun Feb 15 14:40:02 2004 (31528): Starting Shut down
Sun Feb 15 14:40:02 2004 (31528): %badServerCache      is 0
Sun Feb 15 14:40:02 2004 (31528): %metacache           is 439/1024
Sun Feb 15 14:40:02 2004 (31528): %homecache           is 68/128
Sun Feb 15 14:40:02 2004 (31528): %titlecache          is 412/1024
Sun Feb 15 14:40:02 2004 (31528): %courseresdatacache  is 2/8
Sun Feb 15 14:40:02 2004 (31528): %userresdatacache    is 17/32
Sun Feb 15 14:40:02 2004 (31528): %usectioncache       is 0
Sun Feb 15 14:40:02 2004 (31528): %courseresversioncache is 38/64
Sun Feb 15 14:40:02 2004 (31528): %resversioncache     is 0
Sun Feb 15 14:40:02 2004 (31528): Flushing log buffers
Sun Feb 15 14:40:03 2004 (31528): Shutting down
Sun Feb 15 14:40:12 2004 (31531): Flushing log buffers

--snip--


Similar "Starting Shut down" messages appeared with increasing 
frequency over the next 10 minutes, at some points reaching a 
frequency of three or four "Shutdowns" per minute.

This is just before the server started thrashing!  Page Swaps went 
out of control and it was out of swap space soon after and the 
machine was almost totally unresponsive.


lond.log started complaining (after hours of silence)

--snip--

Sun Feb 15 15:05:07 2004 (5843): <font color="green"> Attempting 
to start child (IO::Socket::INET=GLOB(0x84a4f64))</font>
Sun Feb 15 15:05:42 2004 (5919): <font color=ref>CRITICAL: TIME 
OUT 5919</font>
Sun Feb 15 15:05:46 2004 (5919): <font color=red>CRITICAL: 
ABNORMAL EXIT. Child 5919 for server lon-capa.phy.cmich.edu died 
through a crash wi
th this error msg->[Timeout]</font>
Sun Feb 15 15:05:47 2004 (5919): Famous last words: Sun Feb 15 
15:05:46 2004: Catching exception - Sun Feb 15 15:05:46 2004: 
<font color=red>C
RITICAL: ABNORMAL EXIT. Child 5919 for server 
lon-capa.phy.cmich.edu died through a crash with this error 
msg->[Timeout]</font>
Sun Feb 15 15:05:54 2004 (5843): Child 5919 died

--snip--


Soon after, it seems that i lost the connection with myself.

lonc.log
--snip--

Sun Feb 15 15:13:02 2004 (16700): <font color=blue>WARNING: Trying 
to get resource data for paja1me at cmich: con_lost</font>
Sun Feb 15 15:13:04 2004 (12206): <font color=blue>WARNING: Trying 
to get resource data for mello1am at cmich: con_lost</font>

--snip--


I am led to believe that whatever caused the entries in lonnet.log 
may be the source of my thrashing crash.  About the same time that 
i started getting "Starting Shut down" entries, the count of the 
server's inactive dirty pages in memory started to soar, which 
signaled the start of my thrash-fest.

System Activity report:
/var/log/sar/sa15:

             pgpgin/s pgpgout/s  activepg  inadtypg  inaclnpg  
inatarpg

14:00:00         4.40     99.97     95872        20      2658     
23547
14:10:00         3.58     77.31     96189         0      2551     
23572
14:20:00         5.03     56.29     96289       129      2589     
23605
14:30:00         2.78     54.36     96185         0      2661     
23617
14:40:00         5.43     82.15     96347         5      2377     
23597
14:50:00        21.05     58.11     70347       873      1788     
16687
15:00:00         4.00     80.11    103570      1438      1779     
23346
15:11:14       327.18   1137.50     84224     16844      1660     
23705
15:20:41       297.91    506.64     84426     16572      2559     
23700

17:13:43       626.90    151.17    112834      2043      2483     
23540


I am led to believe that httpd was significantly involved in the 
server's downfall.  I think it required about half an hour for 
httpd to stop after commanded to!

I thought it would be worthwhile to document this on the 
mailing-list, as i don't have a specific bug to file in bugzilla.

Any pointers regarding how to further investigate this would be 
appreciated.


-- Gabriel Friedmann
   Central Michigan University
   Department of Physics