[LON-CAPA-admin] high memory usage by long lived lonc processes

Budzik, Michael J. mikeb at purdue.edu
Tue Oct 18 11:33:18 EDT 2016


Yes, purduel1 is our library node and load balancer node.  That's where that top output was from.

> ... unlike the RES value, the VIRT value is dependent on the Linux distro -- it's much lower for CentOS 5 than for
> CentOS 6 or 7, even though RES is about the same for all.

RES only includes what is in physical memory. VIRT includes the size of shared libraries but, more significantly for this issue, it also includes pages of that process that have been swapped out.  You can see that we are using about 1 GB of swap for each of the lonc processes in question:

for file in /proc/*/status ; do awk '/VmSwap|Name/{printf "%s %s", $2, $3}END{ print ""}' "$file"; done | sort -k 2 -n -r | less
loncnew 1065732 kB
loncnew 1059156 kB
loncnew 1056384 kB
loncnew 1050452 kB
loncnew 1049516 kB
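
To put VIRT, RES and swap side by side for each process, the loop above can be extended to read all three fields from /proc/PID/status (VmSize corresponds to top's VIRT and VmRSS to RES). This is only a minimal sketch: the function name mem_by_swap and its optional directory argument are made up for illustration, the argument existing so the function can be pointed at a test directory instead of /proc.

```shell
# Print "name pid VmSize VmRSS VmSwap" (all in kB) for every process that
# reports a VmSwap line, sorted with the largest swap usage first.
# The optional first argument is a hypothetical override of /proc.
mem_by_swap() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        # A process may exit between the glob and the read; skip it.
        [ -r "$status" ] || continue
        awk '
            /^Name:/   { name = $2 }
            /^Pid:/    { pid  = $2 }
            /^VmSize:/ { size = $2 }
            /^VmRSS:/  { rss  = $2 }
            /^VmSwap:/ { swap = $2 }
            # Kernel threads have no Vm* lines, so they are skipped here.
            END { if (swap != "") printf "%s %s %s %s %s\n", name, pid, size, rss, swap }
        ' "$status" 2>/dev/null
    done | sort -k5,5nr
}
```

Running "mem_by_swap | grep loncnew | head" would then list the lonc processes with the most swap at the top.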

Is there a way to safely kill a lonc process without causing an error for the users?  I can't endlessly throw swap at it for the next month until I have an approved time for maintenance.

Could there be a memory leak hidden by the 5 minute idle timeout in lonc?  Since our lonc connections almost never sit idle for 5 minutes, are we hitting it?
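
One way to check the leak idea would be to sample the memory of each lonc process at intervals and look at the growth rate. The following is a rough sketch, not something taken from LON-CAPA: the function names, the optional /proc override and the log path in the usage note are all hypothetical, and the process name loncnew is taken from the VmSwap listing above. It also assumes PIDs are not reused during the sampling period.

```shell
# Print one line per loncnew process: "epoch_seconds pid rss_kB swap_kB".
# The optional first argument is a hypothetical override of /proc.
sample_lonc_mem() {
    proc_root="${1:-/proc}"
    now=$(date +%s)
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        awk -v now="$now" '
            /^Name:/   { name = $2 }
            /^Pid:/    { pid  = $2 }
            /^VmRSS:/  { rss  = $2 }
            /^VmSwap:/ { swap = $2 }
            END { if (name == "loncnew" && rss != "") print now, pid, rss, swap + 0 }
        ' "$status" 2>/dev/null
    done
}

# Read accumulated samples and print "pid rate kB/hour", where rate is the
# change in (rss + swap) between the first and last sample for that pid.
growth_by_pid() {
    awk '
        !($2 in first_t) { first_t[$2] = $1; first_m[$2] = $3 + $4 }
        { last_t[$2] = $1; last_m[$2] = $3 + $4 }
        END {
            for (pid in first_t) {
                hours = (last_t[pid] - first_t[pid]) / 3600
                if (hours > 0)
                    printf "%s %.1f kB/hour\n", pid, (last_m[pid] - first_m[pid]) / hours
            }
        }
    ' "$@" | sort -k2,2nr
}
```

For example, "sample_lonc_mem >> /tmp/lonc_mem.log" could be run from cron every few minutes, followed later by "growth_by_pid /tmp/lonc_mem.log". A steady positive rate on the connections to the access servers, and roughly zero elsewhere, would be consistent with a leak in the busy processes.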

> Do you find anything meaningful in /home/httpd/perl/logs/lonc_errors on your library server?

There are a LOT of lines like this:
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client 0': GLOB(0x1c51740) isn't a valid IO at /home/httpd/perl/loncnew line 647

There are several handfuls of lines like this:
Event: trapped error in `Connection to lonc client 194': Event 'Connection to lonc client 0': GLOB(0x1c77c50) isn't a valid IO at /home/httpd/perl/loncnew line 647

Those all seem to be in 2 or 3 clusters near the top of the log. There are no timestamps, so I'm not sure if those are related to something like the 05:10 daily reloads.

There are a few lines like this:
Event: trapped error in `Connection to lonc client 14': Can't locate object method "Shutdown" via package "LondConnection=HASH(0x1c50ca8)" (perhaps you forgot to load "LondConnection=HASH(0x1c50ca8)"?) at /home/httpd/perl/loncnew line 754.
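
Since the log has no timestamps, a tally by loncnew source line at least shows which error dominates. A minimal sketch, assuming a grep that supports -o (GNU grep does); the function name summarize_lonc_errors is made up, and the default path is the log file mentioned in this thread.

```shell
# Count trapped errors per loncnew source line, most frequent first.
# Output format: "<count> line <source line number>".
summarize_lonc_errors() {
    log="${1:-/home/httpd/perl/logs/lonc_errors}"
    # Matches the trailing "loncnew line NNN" part of each message, not
    # the "loncnew:444" label that appears earlier in some lines.
    grep -o 'loncnew line [0-9][0-9]*' "$log" |
        awk '{ n[$3]++ } END { for (l in n) print n[l], "line", l }' |
        sort -k1,1nr -k3,3n
}
```

To see where in the file each kind of error sits (for example, whether the line 647 errors really form clusters), "grep -n 'loncnew line 647' /home/httpd/perl/logs/lonc_errors" prints the log line number next to each match.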

Thanks!
Mike B


-----Original Message-----
From: lon-capa-admin-bounces at mail.lon-capa.org [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of Stuart Raeburn
Sent: Tuesday, October 18, 2016 10:50 AM
To: lon-capa-admin at mail.lon-capa.org
Subject: Re: [LON-CAPA-admin] high memory usage by long lived lonc processes

Mike,

>
> Are we the only ones seeing lonc processes last well over 20 days   
> and continue to allocate more and more RAM?
>

Yes, I suspect you may be the only one.

I'm not seeing long-lived lonc processes with high memory values reported for VIRT or RES in top on any of the LON-CAPA instances I manage (msu.edu, educog.com, loncapa.net).

The RES (Resident memory) value is the one I am typically concerned about, and that is around 10 MB for each lonc process.  (I also see around 185 MB for VIRT for each lonc process, but unlike the RES value, the VIRT value is dependent on the Linux distro -- it's much lower for CentOS 5 than for CentOS 6 or 7, even though RES is about the same for all).

In any case, the RES values are also anomalous for the lonc processes for connections to your access servers (250 MB instead of 9 MB).

In the msu domain I expect to consistently see lonc connections between the LON-CAPA load balancer server and the access servers, when I check top, but on the library server I typically expect to see a lond connection to each access server.

If the top output is for your library server, is purduel1 also configured as a LON-CAPA load balancer?

If not, then you'd typically only see lonc connections initiated to your access servers when published resources are republished on the library server, and an "update" notification is sent to each access server which is subscribed to the resource.

Do you find anything meaningful in /home/httpd/perl/logs/lonc_errors on your library server?


Stuart Raeburn
LON-CAPA Academic Consortium

Quoting "Budzik, Michael J." <mikeb at purdue.edu>:

> Are we the only ones seeing lonc processes last well over 20 days   
> and continue to allocate more and more RAM?  We now have 5 lonc   
> processes that are each using over 1.5GB RAM.
> Mike B
>
> From: lon-capa-admin-bounces at mail.lon-capa.org   
> [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of   
> Budzik, Michael J.
> Sent: Friday, October 14, 2016 1:04 PM
> To: 'lon-capa-admin at mail.lon-capa.org' 
> <lon-capa-admin at mail.lon-capa.org>
> Subject: [LON-CAPA-admin] high memory usage by long lived lonc 
> processes
>
>
> Our lonc processes that live a long time end up using a lot of RAM.   
>  Here are a few rows of output from top.  Check out the lonc   
> processes in the middle of the list that are each using 1.3GB ram   
> compared to the others that are around 180 MB.
>
> # top -cbn1 -u www | grep lonc
>  5058 www       20   0  184m 7604 1176 S  0.0  0.1   0:01.29 lonc: capa9.phy.ohio.edu Connection count: 0 Retries remaining: 5 () Fri Oct 14 11:35:09 2016
>  5103 www       20   0  184m 7564 1176 S  0.0  0.1   0:00.95 lonc: meitner.physics.hope.edu Connection count: 0 Retries remaining: 5 () Fri Oct 14 11:35:09 2016
> 18053 www       20   0  180m 7384  920 S  0.0  0.1   0:10.15 lonc: Parent keeping the flock Fri Oct 14 12:46:50 2016
> 18063 www       20   0 1321m 251m 1224 S  0.0  3.2  20:24.92 lonc: loncapa02.purdue.edu Connection count: 2 Retries remaining: 5 (insecure) Fri Oct 14 12:49:42 2016
> 18067 www       20   0 1321m 250m 1224 S  0.0  3.2  21:45.86 lonc: loncapa05.purdue.edu Connection count: 2 Retries remaining: 5 (insecure) Fri Oct 14 12:49:41 2016
> 21139 www       20   0 1321m 250m 1224 S  0.0  3.2  21:57.04 lonc: loncapa07.purdue.edu Connection count: 2 Retries remaining: 5 (insecure) Fri Oct 14 12:49:41 2016
> 21150 www       20   0 1321m 248m 1224 S  0.0  3.2  22:11.91 lonc: loncapa04.purdue.edu Connection count: 2 Retries remaining: 5 (insecure) Fri Oct 14 12:49:42 2016
> 21151 www       20   0 1321m 253m 1224 S  0.0  3.2  21:48.87 lonc: loncapa06.purdue.edu Connection count: 2 Retries remaining: 5 (insecure) Fri Oct 14 12:49:42 2016
> 22900 www       20   0  182m 8756 1972 S  0.0  0.1   0:00.93 lonc: loncapa03.purdue.edu Connection count: 1 Retries remaining: 5 (insecure) Fri Oct 14 12:49:41 2016
> 29226 www       20   0  184m 8900 2060 S  0.0  0.1   0:00.04 lonc: capa4.phy.ohio.edu Connection count: 3 Retries remaining: 5 (insecure) Fri Oct 14 12:49:42 2016
> 29419 www       20   0  182m 8776 1972 S  0.0  0.1   0:00.11 lonc: loncapa.purdue.edu Connection count: 1 Retries remaining: 5 (insecure) Fri Oct 14 12:49:41 2016
>
> Anyone else see this?
>
> Thanks,
> Mike Budzik
> Interim Manager, Student Systems and Web Services Admin
> IT Infrastructure - Purdue University

_______________________________________________
LON-CAPA-admin mailing list
LON-CAPA-admin at mail.lon-capa.org
http://mail.lon-capa.org/mailman/listinfo/lon-capa-admin

