[LON-CAPA-admin] high memory usage by long lived lonc processes

Budzik, Michael J. mikeb at purdue.edu
Fri Oct 21 15:00:14 EDT 2016


Thanks for sharing.  When we were having trouble, we were getting those log messages at the rate of a few per second. Since we killed the lonc processes that were in an atypical state, we have not seen that message in our logs and the processes are behaving as expected.

This happened on our load balancer node. The lonc processes for 5 of our 6 access nodes went into this state on the same date.  I have an unproven theory that our maintenance that day was a contributing factor.

We were doing kernel patching and had to reboot to pick up the new kernel.  We started with the library node. Once it was up and running (including the LON-CAPA processes, since they are started by init), we rebooted all of our access nodes.  That may have put the lonc processes into an unstable state, as the access nodes disappeared suddenly.

Mike B

From: lon-capa-admin-bounces at mail.lon-capa.org [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of Bob Gonzales
Sent: Friday, October 21, 2016 10:20 AM
To: list about administration and system updating <lon-capa-admin at mail.lon-capa.org>
Subject: Re: [LON-CAPA-admin] high memory usage by long lived lonc processes

Mike,

I've been following this thread because I do get entries in lonc_errors similar to those that you posted, but I don't get large memory usage on the long-lived processes; they remain around 10 MB, like Stuart's.  I see these lonc_errors entries about once or twice a month on different servers.  I've always associated them with the server being really busy, but this morning there were fresh entries in lonc_errors for one of my servers, and it wasn't that busy yesterday.  The server's worst load was:

Active=16
Moderately Active=76
Inactive=191
loadavg=0.42 0.21 0.16 1/487 57695

Looking at the log entries, I noticed that the "client number" in subsequent entries increases.  I don't know if that indicates anything useful in figuring out which process (the parent?) is having trouble.  Here is my log:

Event: trapped error in `Connection to lonc client 5': Can't locate object method "Shutdown" via package "LondConnection=HASH(0x27ba858)" (perhaps you forgot to load "LondConnection=HASH(0x27ba858)"?) at /home/httpd/perl/loncnew line 754.
Event: trapped error in `Connection to lonc client 6': Can't locate object method "Shutdown" via package "LondConnection=HASH(0x27ba858)" (perhaps you forgot to load "LondConnection=HASH(0x27ba858)"?) at /home/httpd/perl/loncnew line 754.
Event: trapped error in `Connection to lonc client 7': Event 'Connection to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `Connection to lonc client 8': Event 'Connection to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `Connection to lonc client 9': Event 'Connection to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `Connection to lonc client 10': Event 'Connection to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
[the preceding line appears 21 times in a row; repeats trimmed]
Event: trapped error in `Connection to lonc client 11': Event 'Connection to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `Connection to lonc client 12': Event 'Connection to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
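
In case it helps with comparisons, here's a quick way to tally the two error types (just a sketch; the log path is the standard location mentioned elsewhere in this thread):

# count occurrences of each error type in lonc_errors
grep -c "isn't a valid IO" /home/httpd/perl/logs/lonc_errors
grep -c "Can't locate object method" /home/httpd/perl/logs/lonc_errors
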
Bob Gonzales
Chemistry Dept
Binghamton University


On Wed, Oct 19, 2016 at 1:46 PM, Budzik, Michael J. <mikeb at purdue.edu> wrote:
Stuart,
We killed the lonc processes that were behaving unexpectedly - just the individual ones, not the parent "flock keeper".  The replacement ones spawned as expected and now the log looks good.
Thanks for your help with this.

Mike B


-----Original Message-----
From: lon-capa-admin-bounces at mail.lon-capa.org [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of Stuart Raeburn
Sent: Tuesday, October 18, 2016 5:22 PM
To: lon-capa-admin at mail.lon-capa.org
Subject: Re: [LON-CAPA-admin] high memory usage by long lived lonc processes

Mike,

>
> Is there a way to safely kill a lonc process without causing an
> error for the users?  I can't endlessly throw swap at it for the
> next month until I have an approved time for maintenance.
>

You can kill those lonc processes.

Since your library node is also a load balancer, the access servers are contacted whenever a log-in occurs, to determine their current load so that the session can be sent to the one with the lowest load. Each request for load information requires a lonc connection to the access server.

After you have killed the long-running lonc process for a particular access node, the parent lonc process on the library node should spawn a new child lonc connection to that access node when the next user logs in.

However, if that doesn't work out, you could kill the parent lonc process too and then run /etc/init.d/loncontrol start to start a new lonc parent.
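
For example, something along these lines should do it (a rough sketch; I've just taken loncapa02.purdue.edu from your earlier top output, so substitute whichever access node is affected):

# kill only the child lonc process for one access node
pkill -u www -f 'lonc: loncapa02.purdue.edu'

# if no replacement child appears after the next log-in, kill the
# parent "flock keeper" as well and restart lonc
pkill -u www -f 'lonc: Parent keeping the flock'
/etc/init.d/loncontrol start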

>
> Could there be a memory leak hidden by the 5 minute idle timeout in
> lonc?  Since we almost never go 5 minutes with lonc idle, are we
> hitting it?
>

The item repeated many times in lonc_errors:

>
> There are a LOT of lines like this:
> Event: trapped error in `?? loncnew:444': Event 'Connection to lonc
> client 0': GLOB(0x1c51740) isn't a valid IO at
> /home/httpd/perl/loncnew line 647
>

suggests that the lonc process got into an error state from which it has not recovered cleanly. I suspect that post-error state is the reason why memory usage keeps climbing.  Killing the lonc process would be a good solution.

Given the frequency of log-ins to the msu load-balancer node from 9000 LON-CAPA users at MSU, and the fact that a monitoring service completes a LON-CAPA log-in to that node every 5 minutes, it seems likely that the 5-minute idle timeout is rarely encountered on the msu load-balancer either.
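
If you want to check whether the memory growth lines up with anything in particular, one option is to sample the lonc processes periodically for a day or two (just a sketch; write the log wherever is convenient):

# record lonc memory usage every 5 minutes
while sleep 300; do
    date
    top -cbn1 -u www | grep 'lonc:'
done >> /tmp/lonc-memory.log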

For that node,

top -cbn1 -u www | grep lonc | grep msu

reports:

184m 10m 3172 S  0.0  0.3   0:30.16 lonc: s1.lite.msu.edu Connection
count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016

185m 10m 3172 S  0.0  0.3   0:31.48 lonc: s2.lite.msu.edu Connection
count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:09 2016

184m 10m 3172 S  0.0  0.3   0:04.91 lonc: s3.lite.msu.edu Connection
count: 2 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016

184m 10m 3172 S  0.0  0.3   0:28.85 lonc: s4.lite.msu.edu Connection
count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016


Stuart Raeburn
LON-CAPA Academic Consortium

Quoting "Budzik, Michael J." <mikeb at purdue.edu<mailto:mikeb at purdue.edu>>:

> Yes, purduel1 is our library node and load balancer node.  That's
> where that top output was from.
>
>> unlike the RES value, the VIRT value is dependent on the Linux
>> distro -- it's much lower for CentOS 5 than for CentOS 6 or 7, even
>> though RES is about the same for all).
>
> RES only includes what is in physical memory. VIRT does include the
> size of shared libraries, but, more significantly for this issue, it
> also includes swap used by that process.  You can see that we are
> using about 1 GB of swap for each of the lonc processes in question:
>
> for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | sort -k 2 -n -r | less
> loncnew 1065732 kB
> loncnew 1059156 kB
> loncnew 1056384 kB
> loncnew 1050452 kB
> loncnew 1049516 kB
>
> Is there a way to safely kill a lonc process without causing an
> error for the users?  I can't endlessly throw swap at it for the
> next month until I have an approved time for maintenance.
>
> Could there be a memory leak hidden by the 5 minute idle timeout in
> lonc?  Since we almost never go 5 minutes with lonc idle, are we
> hitting it?
>
>> Do you find anything meaningful in
>> /home/httpd/perl/logs/lonc_errors on your library server?
>
> There are a LOT of lines like this:
> Event: trapped error in `?? loncnew:444': Event 'Connection to lonc
> client 0': GLOB(0x1c51740) isn't a valid IO at
> /home/httpd/perl/loncnew line 647
>
> There are several handfuls of lines like this:
> Event: trapped error in `Connection to lonc client 194': Event
> 'Connection to lonc client 0': GLOB(0x1c77c50) isn't a valid IO at
> /home/httpd/perl/loncnew line 647
>
> Those all seem to be in 2 or 3 clusters sort of near the top of the
> log. There are no timestamps, so I'm not sure if those are related
> to something like the 05:10 daily reloads.
>
> There are a few lines like this:
> Event: trapped error in `Connection to lonc client 14': Can't locate
>  object method "Shutdown" via package
> "LondConnection=HASH(0x1c50ca8)" (perhaps you forgot to load
> "LondConnection=HASH(0x1c50ca8)"?) at /home/httpd/perl/loncnew line
> 754.
>
> Thanks!
> Mike B
>
>
> -----Original Message-----
> From: lon-capa-admin-bounces at mail.lon-capa.org
> [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of
> Stuart Raeburn
> Sent: Tuesday, October 18, 2016 10:50 AM
> To: lon-capa-admin at mail.lon-capa.org
> Subject: Re: [LON-CAPA-admin] high memory usage by long lived lonc
> processes
>
> Mike,
>
>>
>> Are we the only ones seeing lonc processes last well over 20 days and
>> continue to allocate more and more RAM?
>>
>
> Yes, I suspect you may be the only one.
>
> I'm not seeing long-lived lonc processes with high memory values
> reported for VIRT or RES in top on any of the LON-CAPA instances I
> manage (msu.edu, educog.com, loncapa.net).
>
> The RES (Resident memory) value is the one I am typically concerned
> about, and that is around 10 MB for each lonc process.  (I also see
>  around 185 MB for VIRT for each lonc process, but unlike the RES
> value, the VIRT value is dependent on the Linux distro -- it's much
> lower for CentOS 5 than for CentOS 6 or 7, even though RES is about
> the same for all).
>
> In any case, the RES values are also anomalous for the lonc
> processes for connections to your access servers (250 MB instead of
> 9 MB).
>
> In the msu domain I expect to consistently see lonc connections
> between the LON-CAPA load balancer server and the access servers,
> when I check top, but on the library server I typically expect to
> see a lond connection to each access server.
>
> If the top output is for your library server, is purduel1 also
> configured as a LON-CAPA load balancer?
>
> If not, then you'd typically only see lonc connections initiated to
> your access servers when published resources are republished on the
> library server, and an "update" notification is sent to each access
> server which is subscribed to the resource.
>
> Do you find anything meaningful in /home/httpd/perl/logs/lonc_errors
> on your library server?
>
>
> Stuart Raeburn
> LON-CAPA Academic Consortium
>
> Quoting "Budzik, Michael J." <mikeb at purdue.edu>:
>
>> Are we the only ones seeing lonc processes last well over 20 days and
>> continue to allocate more and more RAM?  We now have 5 lonc processes
>> that are each using over 1.5 GB of RAM.
>> Mike B
>>
>> From: lon-capa-admin-bounces at mail.lon-capa.org
>> [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of
>> Budzik, Michael J.
>> Sent: Friday, October 14, 2016 1:04 PM
>> To: 'lon-capa-admin at mail.lon-capa.org'
>> <lon-capa-admin at mail.lon-capa.org>
>> Subject: [LON-CAPA-admin] high memory usage by long lived lonc
>> processes
>>
>>
>> Our lonc processes that live a long time end up using a lot of RAM.
>> Here are a few rows of output from top.  Check out the lonc
>> processes in the middle of the list that are each using 1.3 GB of RAM
>> compared to the others that are around 180 MB.
>>
>>
>>
>> # top -cbn1 -u www | grep lonc
>>
>> 5058 www       20   0  184m 7604 1176 S  0.0  0.1   0:01.29 lonc:
>> capa9.phy.ohio.edu Connection count: 0 Retries remaining: 5 () Fri
>> Oct 14 11:35:09 2016
>>
>>  5103 www       20   0  184m 7564 1176 S  0.0  0.1   0:00.95 lonc:
>> meitner.physics.hope.edu Connection count: 0 Retries remaining: 5 ()
>> Fri Oct 14 11:35:09 2016
>>
>> 18053 www       20   0  180m 7384  920 S  0.0  0.1   0:10.15 lonc:
>> Parent keeping the flock Fri Oct 14 12:46:50 2016
>>
>> 18063 www       20   0 1321m 251m 1224 S  0.0  3.2  20:24.92 lonc:
>> loncapa02.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:42 2016
>>
>> 18067 www       20   0 1321m 250m 1224 S  0.0  3.2  21:45.86 lonc:
>> loncapa05.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:41 2016
>>
>> 21139 www       20   0 1321m 250m 1224 S  0.0  3.2  21:57.04 lonc:
>> loncapa07.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:41 2016
>>
>> 21150 www       20   0 1321m 248m 1224 S  0.0  3.2  22:11.91 lonc:
>> loncapa04.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:42 2016
>>
>> 21151 www       20   0 1321m 253m 1224 S  0.0  3.2  21:48.87 lonc:
>> loncapa06.purdue.edu Connection count: 2 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:42 2016
>>
>> 22900 www       20   0  182m 8756 1972 S  0.0  0.1   0:00.93 lonc:
>> loncapa03.purdue.edu Connection count: 1 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:41 2016
>>
>> 29226 www       20   0  184m 8900 2060 S  0.0  0.1   0:00.04 lonc:
>> capa4.phy.ohio.edu Connection count: 3 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:42 2016
>>
>> 29419 www       20   0  182m 8776 1972 S  0.0  0.1   0:00.11 lonc:
>> loncapa.purdue.edu Connection count: 1 Retries remaining: 5
>> (insecure) Fri Oct 14 12:49:41 2016
>>
>>
>>
>>
>>
>> Anyone else see this?
>>
>>
>>
>> Thanks,
>>
>> Mike Budzik
>>
>> Interim Manager, Student Systems and Web Services Admin
>>
>> IT Infrastructure - Purdue University

_______________________________________________
LON-CAPA-admin mailing list
LON-CAPA-admin at mail.lon-capa.org
http://mail.lon-capa.org/mailman/listinfo/lon-capa-admin


