[LON-CAPA-admin] high memory usage by long lived lonc processes

Budzik, Michael J. mikeb at purdue.edu
Tue Oct 18 18:10:39 EDT 2016


Excellent. Thank you.

Mike B

________________________________
From: Stuart Raeburn <raeburn at msu.edu>
Sent: Oct 18, 2016 5:38 PM
To: lon-capa-admin at mail.lon-capa.org
Subject: Re: [LON-CAPA-admin] high memory usage by long lived lonc processes

Mike,

>
> Is there a way to safely kill a lonc process without causing an
> error for the users?  I can't endlessly throw swap at it for the
> next month until I have an approved time for maintenance.
>

You can kill those lonc processes.

Since your library node is also a load-balancer, whenever a log-in
occurs the access servers will be contacted to determine current load,
so that the session can be sent to the one with the lowest load.
Making a request for load information requires a lonc connection to
the access server.

After you have killed the long-running lonc process for a particular
access node, the parent lonc process on the library node should spawn
a new child lonc connection to that access node when the next user
logs in.

However, if that doesn't work out you can also kill the parent lonc
process and then run /etc/init.d/loncontrol start to spawn a new
lonc parent.
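
For example (a sketch only -- the hostname below is just one of your
access nodes from the earlier top output, and the PIDs will differ on
your system):

  # find the PID of the long-running lonc child for one access node
  top -cbn1 -u www | grep 'lonc: loncapa02.purdue.edu'

  # kill that child; the lonc parent should respawn it at the next log-in
  kill <pid of that lonc child>

  # if needed, also kill the parent ("lonc: Parent keeping the flock")
  # and restart loncontrol to spawn a new lonc parent
  kill <pid of lonc parent>
  /etc/init.d/loncontrol start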

>
> Could there be a memory leak hidden by the 5 minute idle timeout in
> lonc?  Since we almost never go 5 minutes of idle lonc, are we
> hitting it?
>

The item repeated many times in lonc_errors:

>
> There are a LOT of lines like this:
> Event: trapped error in `?? loncnew:444': Event 'Connection to lonc
> client 0': GLOB(0x1c51740) isn't a valid IO at
> /home/httpd/perl/loncnew line 647
>

suggests that the lonc process got into an error state from which it
has not recovered cleanly. I suspect that post-error state is the
reason why memory usage keeps climbing.  Killing the lonc process
would be a good solution.
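
If you want a rough sense of how often that error has been trapped,
something like:

  grep -c "isn't a valid IO" /home/httpd/perl/logs/lonc_errors

should give a count of those events in the current log.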

Given the frequency of log-ins to the msu load-balancer node from the
9000 LON-CAPA users at MSU, and the fact that a monitoring service
completes a LON-CAPA log-in to that node every 5 minutes, the 5 minute
idle timeout is probably rarely reached on the msu load-balancer
either.

For that node,

top -cbn1 -u www | grep lonc |grep msu

reports:

184m 10m 3172 S  0.0  0.3   0:30.16 lonc: s1.lite.msu.edu Connection
count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016

185m 10m 3172 S  0.0  0.3   0:31.48 lonc: s2.lite.msu.edu Connection
count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:09 2016

184m  10m 3172 S  0.0  0.3   0:04.91 lonc: s3.lite.msu.edu Connection
count: 2 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016

184m  10m 3172 S  0.0  0.3   0:28.85 lonc: s4.lite.msu.edu Connection
count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016


Stuart Raeburn
LON-CAPA Academic Consortium

Quoting "Budzik, Michael J." <mikeb at purdue.edu>:

> Yes, purduel1 is our library node and load balancer node.  That's
> where that top output was from.
>
>> unlike the RES value, the VIRT value is dependent on the Linux
>> distro -- it's much lower for CentOS 5 than for
>> CentOS 6 or 7, even though RES is about the same for all).
>
> RES only includes what is in physical memory. VIRT does include the
> size of shared libraries, but, more significantly for this issue, it
>  also includes swap used by that process.  You can see that we are
> using about 1GB of SWAP for each of the lonc processes in question:
>
> for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " "
> $3}END{ print ""}' $file; done | sort -k 2 -n -r | less
> loncnew 1065732 kB
> loncnew 1059156 kB
> loncnew 1056384 kB
> loncnew 1050452 kB
> loncnew 1049516 kB
>
> Is there a way to safely kill a lonc process without causing an
> error for the users?  I can't endlessly throw swap at it for the
> next month until I have an approved time for maintenance.
>
> Could there be a memory leak hidden by the 5 minute idle timeout in
> lonc?  Since we almost never go 5 minutes of idle lonc, are we
> hitting it?
>
>> Do you find anything meaningful in
>> /home/httpd/perl/logs/lonc_errors on your library server?
>
> There are a LOT of lines like this:
> Event: trapped error in `?? loncnew:444': Event 'Connection to lonc
> client 0': GLOB(0x1c51740) isn't a valid IO at
> /home/httpd/perl/loncnew line 647
>
> There are several handfuls of lines like this:
> Event: trapped error in `Connection to lonc client 194': Event
> 'Connection to lonc client 0': GLOB(0x1c77c50) isn't a valid IO at
> /home/httpd/perl/loncnew line 647
>
> Those all seem to be in 2 or 3 clusters sort of near the top of the
> log. There are no timestamps, so I'm not sure if those are related
> to something like the 05:10 daily reloads.
>
> There are a few lines like this:
> Event: trapped error in `Connection to lonc client 14': Can't locate
>  object method "Shutdown" via package
> "LondConnection=HASH(0x1c50ca8)" (perhaps you forgot to load
> "LondConnection=HASH(0x1c50ca8)"?) at /home/httpd/perl/loncnew line
> 754.
>
> Thanks!
> Mike B
>
>
> -----Original Message-----
> From: lon-capa-admin-bounces at mail.lon-capa.org
> [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of
> Stuart Raeburn
> Sent: Tuesday, October 18, 2016 10:50 AM
> To: lon-capa-admin at mail.lon-capa.org
> Subject: Re: [LON-CAPA-admin] high memory usage by long lived lonc processes
>
> Mike,
>
>>
>> Are we the only ones seeing lonc processes last well over 20 days
>> and continue to allocate more and more RAM?
>>
>
> Yes, I suspect you may be the only one.
>
> I'm not seeing long-lived lonc processes with high memory values
> reported for VIRT or RES in top on any of the LON-CAPA instances I
> manage (msu.edu, educog.com, loncapa.net).
>
> The RES (Resident memory) value is the one I am typically concerned
>  about, and that is around 10 MB for each lonc process.  (I also see
>  around 185 MB for VIRT for each lonc process, but unlike the RES
> value, the VIRT value is dependent on the Linux distro -- it's much
> lower for CentOS 5 than for CentOS 6 or 7, even though RES is about
> the same for all).
>
> In any case, the RES values are also anomalous for the lonc
> processes for connections to your access servers (250 MB instead of
> 9 MB).
>
> In the msu domain I expect to consistently see lonc connections
> between the LON-CAPA load balancer server and the access servers,
> when I check top, but on the library server I typically expect to
> see a lond connection to each access server.
>
> If the top output is for your library server, is purduel1 also
> configured as a LON-CAPA load balancer?
>
> If not, then you'd typically only see lonc connections initiated to
> your access servers when published resources are republished on the
> library server, and an "update" notification is sent to each access
> server which is subscribed to the resource.
>
> Do you find anything meaningful in /home/httpd/perl/logs/lonc_errors
>  on your library server?
>
>
> Stuart Raeburn
> LON-CAPA Academic Consortium
>
> Quoting "Budzik, Michael J." <mikeb at purdue.edu>:
>
>> Are we the only ones seeing lonc processes last well over 20 days
>> and continue to allocate more and more RAM?  We now have 5 lonc
>> processes that are each using over 1.5GB RAM.
>> Mike B
>>
>> From: lon-capa-admin-bounces at mail.lon-capa.org
>> [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of
>> Budzik, Michael J.
>> Sent: Friday, October 14, 2016 1:04 PM
>> To: 'lon-capa-admin at mail.lon-capa.org'
>> <lon-capa-admin at mail.lon-capa.org>
>> Subject: [LON-CAPA-admin] high memory usage by long lived lonc
>> processes
>>
>>
>> Our lonc processes that live a long time end up using a lot of RAM.
>>  Here are a few rows of output from top.  Check out the lonc
>> processes in the middle of the list that are each using 1.3GB ram
>> compared to the others that are around 180 MB.
>>
>>
>>
>> # top -cbn1 -u www | grep lonc
>>
>> 5058 www       20   0  184m 7604 1176 S  0.0  0.1   0:01.29 lonc:
>> capa9.phy.ohio.edu Connection count: 0 Retries remaining: 5 () Fri
>> Oct 14 11:35:09 2016
>>
>>  5103 www       20   0  184m 7564 1176 S  0.0  0.1   0:00.95 lonc:
>> meitner.physics.hope.edu Connection count: 0 Retries remaining: 5 ()
>> Fri Oct 14 11:35:09 2016
>>
>> 18053 www       20   0  180m 7384  920 S  0.0  0.1   0:10.15 lonc:
>> Parent keeping the flock Fri Oct 14 12:46:50 2016
>>
>> 18063 www       20   0 1321m 251m 1224 S  0.0  3.2  20:24.92 lonc:
>> loncapa02.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:42 2016
>>
>> 18067 www       20   0 1321m 250m 1224 S  0.0  3.2  21:45.86 lonc:
>> loncapa05.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:41 2016
>>
>> 21139 www       20   0 1321m 250m 1224 S  0.0  3.2  21:57.04 lonc:
>> loncapa07.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:41 2016
>>
>> 21150 www       20   0 1321m 248m 1224 S  0.0  3.2  22:11.91 lonc:
>> loncapa04.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:42 2016
>>
>> 21151 www       20   0 1321m 253m 1224 S  0.0  3.2  21:48.87 lonc:
>> loncapa06.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:42 2016
>>
>> 22900 www       20   0  182m 8756 1972 S  0.0  0.1   0:00.93 lonc:
>> loncapa03.purdue.edu Connection count: 1 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:41 2016
>>
>> 29226 www       20   0  184m 8900 2060 S  0.0  0.1   0:00.04 lonc:
>> capa4.phy.ohio.edu Connection count: 3 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:42 2016
>>
>> 29419 www       20   0  182m 8776 1972 S  0.0  0.1   0:00.11 lonc:
>> loncapa.purdue.edu Connection count: 1 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:41 2016
>>
>>
>>
>>
>>
>> Anyone else see this?
>>
>>
>>
>> Thanks,
>>
>> Mike Budzik
>>
>> Interim Manager, Student Systems and Web Services Admin
>>
>> IT Infrastructure - Purdue University

_______________________________________________
LON-CAPA-admin mailing list
LON-CAPA-admin at mail.lon-capa.org
http://mail.lon-capa.org/mailman/listinfo/lon-capa-admin