[LON-CAPA-dev] lonc connections dying

Gerd Kortemeyer lon-capa-dev@mail.lon-capa.org
Sat, 5 Jun 2010 07:23:44 -0400


Hi,

I believe we are seeing this too, and we are even able to reproduce it. It seems to involve the transfer of large datasets.

Stay tuned (i.e., don't drop the connection).

- Gerd.

On Jun 2, 2010, at 11:11 PM, Mark Lucas wrote:

> Hi,
> 
> I've been fighting lonc connections dying all quarter. With a change in textbook, we've
> been using lots of problems from other domains in several of our courses.
> 
> We get the "Unable to find ......." error popping up a lot, and particularly a lot over
> the last week.
> 
> I'm finally diving into this and will be checking the logs over the next couple of days.
> 
> In the meantime, can anyone tell me what can cause a "DEAD" lonc connection?
> I do a ps aux and find lonc DEAD for the offending connections. I also find some strange
> error messages in /var/log/httpd/errors.
> 
> Right now, I get in and do a loncontrol reload when I find a dead connection. What would
> happen if I just killed the dead lonc process - would it then try to restart?
> 
> 
> Here are some samples:
> from httpd/errors
> 
> [Wed Jun 02 22:05:11 2010] [error] access to /res/msu/physicslib/msuphysicslib/70_CircAC2_LRC_Power/msuprob04b.problem failed for 184.57.76.249, reason: Invalid symb for /res/msu/physicslib/msuphysicslib/70_CircAC2_LRC_Power/msuprob04b.problem: uploaded/ohiou/8j176084101734b7coucapa2/default_1236609481.sequence___19___msu/physicslib/msuphysicslib/70_CircAC2_LRC_Power/msuprob04b.problem
> [Wed Jun 02 22:05:11 2010] [error] access to /res/msu/physicslib/msuphysicslib/70_CircAC2_LRC_Power/msuprob04b.problem failed for 184.57.76.249, reason: Invalid Access for zm216307 domain ohiou access bre
> [Wed Jun 02 22:05:17 2010] [error] [client 132.235.42.74] Apache2::RequestIO::print: (103) Software caused connection abort at /home/httpd/lib/perl//Apache/lonhomework.pm line 1010, referer: http://capa10.phy.ohiou.edu/res/ohiou/serwaylib/Chap29/Radioisotope.problem
> [Wed Jun 02 22:05:17 2010] [error] [client 132.235.42.74] Apache2::RequestIO::print: (103) Software caused connection abort at /home/httpd/lib/perl//Apache/lonerrorhandler.pm line 53, referer: http://capa10.phy.ohiou.edu/res/ohiou/serwaylib/Chap29/Radioisotope.problem
> [Wed Jun 02 22:06:22 2010] [error] access to /res/msu/physicslib/msuphysicslib/70_CircAC2_LRC_Power/msuprob04b.problem failed for 184.57.76.249, reason: Invalid symb for /res/msu/physicslib/msuphysicslib/70_CircAC2_LRC_Power/msuprob04b.problem: uploaded/ohiou/8j176084101734b7coucapa2/default_1236609481.sequence___19___msu/physicslib/msuphysicslib/70_CircAC2_LRC_Power/msuprob04b.problem
> [Wed Jun 02 22:06:22 2010] [error] access to /res/msu/physicslib/msuphysicslib/70_CircAC2_LRC_Power/msuprob04b.problem failed for 184.57.76.249, reason: Invalid Access for zm216307 domain ohiou access bre
> 
> 
> I also have a whole bunch of 
> Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client 0': GLOB(0xc7eb030) isn't a valid IO at /home/httpd/perl/loncnew line 645
> 
> and a few 
> 
> Event: trapped error in `Connection to lonc client 137': Event 'Connection to lonc client 0': GLOB(0xc7eb030) isn't a valid IO at /home/httpd/perl/loncnew line 645
> 
> showing up in lonc_error, though there aren't time stamps here.
> 
> 
> in lonc.log, this is the latest episode with s10 dropping out on capa10 (ohioua6)
> 
> Wed Jun  2 22:04:56 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:04:56 2010: s10.lite.msu.edu Connection count: 6 Retries remaining: 3 (insecure)] <font color='blue'>WARNING: A socket timeout was detected</font>
> Wed Jun  2 22:04:56 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:04:56 2010: s10.lite.msu.edu Connection count: 6 Retries remaining: 3 (insecure)] <font color='blue'>WARNING: Failing transaction sethost</font>
> Wed Jun  2 22:04:56 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:04:56 2010: s10.lite.msu.edu Connection count: 6 Retries remaining: 3 (insecure)] <font color='blue'>WARNING: Shutting down a socket</font>
> Wed Jun  2 22:04:56 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:04:56 2010: s10.lite.msu.edu Connection count: 5 Retries remaining: 2 (insecure)] <font color='blue'>WARNING: Lond connection lost.</font>
> font color='blue'>WARNING: Shutting down a socket</font>
> Wed Jun  2 22:05:11 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:05:11 2010: s10.lite.msu.edu Connection count: 5 Retries remaining: 1 (insecure)] <font color='blue'>WARNING: A socket timeout was detected</font>
> Wed Jun  2 22:05:11 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:05:11 2010: s10.lite.msu.edu Connection count: 5 Retries remaining: 1 (insecure)] <font color='blue'>WARNING: Failing transaction sethost</font>
> Wed Jun  2 22:05:11 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:05:11 2010: s10.lite.msu.edu Connection count: 5 Retries remaining: 1 (insecure)] <font color='blue'>WARNING: Shutting down a socket</font>
> Wed Jun  2 22:05:11 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:05:11 2010: s10.lite.msu.edu Connection count: 5 Retries remaining: 1 (insecure)] <font color='red'>CRITICAL: Host marked DEAD: s10.lite.msu.edu</font>
> Wed Jun  2 22:05:11 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:05:11 2010: s10.lite.msu.edu >> DEAD <<] <font color='blue'>WARNING: Lond connection lost.</font>
> Wed Jun  2 22:05:11 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:05:11 2010: s10.lite.msu.edu >> DEAD <<] <font color='blue'>WARNING: Shutting down a socket</font>
> Wed Jun  2 22:05:12 2010 (20029) [s10.lite.msu.edu] [Wed Jun  2 22:05:12 2010: s10.lite.msu.edu >> DEAD <<] <font color='blue'>WARNING: A socket timeout was detected</font>
> 
> and then DEAD warnings every second until I reset things at 22:16:37
> 
> Any insights welcome.
> 
> Mark
> 
> -- 
> Mark Lucas 								email: lucasm@ohiou.edu
> 252D Clippinger Lab						phone: (740)597-2984
> Department of Physics and Astronomy		fax: (740)593-0433
> Ohio University
> Athens, OH 45701
>