[LON-CAPA-dev] Crash Investigation
Gabriel Friedmann
lon-capa-dev@mail.lon-capa.org
Mon, 16 Feb 2004 16:48:48 -0500 (EST)
Hello developers.
How can i investigate the cause of the following log entry?
(attempting to trace the cause of a crash last night)
lonnet.log
--snip--
Sun Feb 15 14:40:02 2004 (31528): Starting Shut down
Sun Feb 15 14:40:02 2004 (31528): %badServerCache is 0
Sun Feb 15 14:40:02 2004 (31528): %metacache is 439/1024
Sun Feb 15 14:40:02 2004 (31528): %homecache is 68/128
Sun Feb 15 14:40:02 2004 (31528): %titlecache is 412/1024
Sun Feb 15 14:40:02 2004 (31528): %courseresdatacache is 2/8
Sun Feb 15 14:40:02 2004 (31528): %userresdatacache is 17/32
Sun Feb 15 14:40:02 2004 (31528): %usectioncache is 0
Sun Feb 15 14:40:02 2004 (31528): %courseresversioncache is 38/64
Sun Feb 15 14:40:02 2004 (31528): %resversioncache is 0
Sun Feb 15 14:40:02 2004 (31528): Flushing log buffers
Sun Feb 15 14:40:03 2004 (31528): Shutting down
Sun Feb 15 14:40:12 2004 (31531): Flushing log buffers
--snip--
Similar "Starting Shut down" messages appeared with increasing
frequency over the next 10 minutes, at some points reaching a
frequency of three or four "Shutdowns" per minute.
This is just before the server started thrashing! Page Swaps went
out of control and it was out of swap space soon after and the
machine was almost totally unresponsive.
lond.log started complaining (after hours of silence)
--snip--
Sun Feb 15 15:05:07 2004 (5843): <font color="green"> Attempting
to start child (IO::Socket::INET=GLOB(0x84a4f64))</font>
Sun Feb 15 15:05:42 2004 (5919): <font color=ref>CRITICAL: TIME
OUT 5919</font>
Sun Feb 15 15:05:46 2004 (5919): <font color=red>CRITICAL:
ABNORMAL EXIT. Child 5919 for server lon-capa.phy.cmich.edu died
through a crash wi
th this error msg->[Timeout]</font>
Sun Feb 15 15:05:47 2004 (5919): Famous last words: Sun Feb 15
15:05:46 2004: Catching exception - Sun Feb 15 15:05:46 2004:
<font color=red>C
RITICAL: ABNORMAL EXIT. Child 5919 for server
lon-capa.phy.cmich.edu died through a crash with this error
msg->[Timeout]</font>
Sun Feb 15 15:05:54 2004 (5843): Child 5919 died
--snip--
Soon after, it seems that i lost the connection with myself.
lonc.log
--snip--
Sun Feb 15 15:13:02 2004 (16700): <font color=blue>WARNING: Trying
to get resource data for paja1me at cmich: con_lost</font>
Sun Feb 15 15:13:04 2004 (12206): <font color=blue>WARNING: Trying
to get resource data for mello1am at cmich: con_lost</font>
--snip--
I am led to believe that whatever caused the entries in lonnet.log
may be the source of my thrashing crash. About the same time that
i started getting "Starting Shut down" entries, the count of the
server's inactive dirty pages in memory started to soar, which
signaled the start of my thrash-fest.
System Activity report:
/var/log/sar/sa15:
pgpgin/s pgpgout/s activepg inadtypg inaclnpg
inatarpg
14:00:00 4.40 99.97 95872 20 2658
23547
14:10:00 3.58 77.31 96189 0 2551
23572
14:20:00 5.03 56.29 96289 129 2589
23605
14:30:00 2.78 54.36 96185 0 2661
23617
14:40:00 5.43 82.15 96347 5 2377
23597
14:50:00 21.05 58.11 70347 873 1788
16687
15:00:00 4.00 80.11 103570 1438 1779
23346
15:11:14 327.18 1137.50 84224 16844 1660
23705
15:20:41 297.91 506.64 84426 16572 2559
23700
17:13:43 626.90 151.17 112834 2043 2483
23540
I am led to believe that httpd was significantly involved in the
server's downfall. I think it required about half an hour for
httpd to stop after commanded to!
I thought it would be worthwhile to document this on the
mailing-list, as i don't have a specific bug to file in bugzilla.
Any pointers regarding how to further investigate this would be
appreciated.
-- Gabriel Friedmann
Central Michigan University
Department of Physics