[LON-CAPA-admin] high memory usage by long lived lonc processes

Fri Oct 21 10:19:46 EDT 2016

Mike,

I've been following this thread because I do get entries in lonc_errors
similar to those that you posted but I don't get large memory usage on the
long lived processes, they remain about 10MB, like Stuart's.  I will see
these lonc_error entries about once or twice a month on different servers.
I've always associate them with the server being really busy but this
morning there were fresh entries in lonc_errors for one of my servers and
it wasn't that busy yesterday.  The server's worst load was:

Active=16
Moderately Active=76
Inactive=191
loadavg=0.42 0.21 0.16 1/487 57695

Looking the the log entries, I noticed that the "client number" in
subsequent log entries increases.  I don't know if that indicates anything
useful in figuring out which process (the parent?) is having trouble.  Here
is my log:

Event: trapped error in `Connection to lonc client 5': Can't locate object
method "Shutdown" via package "LondConnection=HASH(0x27ba858)" (perhaps you
forgot to load "LondConnection=HASH(
0x27ba858)"?) at /home/httpd/perl/loncnew line 754.
Event: trapped error in `Connection to lonc client 6': Can't locate object
method "Shutdown" via package "LondConnection=HASH(0x27ba858)" (perhaps you
forgot to load "LondConnection=HASH(
0x27ba858)"?) at /home/httpd/perl/loncnew line 754.
Event: trapped error in `Connection to lonc client 7': Event 'Connection to
lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at
/home/httpd/perl/loncnew line 647.
Event: trapped error in `Connection to lonc client 8': Event 'Connection to
lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at
/home/httpd/perl/loncnew line 647.
Event: trapped error in `Connection to lonc client 9': Event 'Connection to
lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at
/home/httpd/perl/loncnew line 647.
Event: trapped error in `Connection to lonc client 10': Event 'Connection
to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at
/home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `?? loncnew:444': Event 'Connection to lonc client
0': GLOB(0x27bcfc8) isn't a valid IO at /home/httpd/perl/loncnew line 647.
Event: trapped error in `Connection to lonc client 11': Event 'Connection
to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at
/home/httpd/perl/loncnew line 647.
Event: trapped error in `Connection to lonc client 12': Event 'Connection
to lonc client 0': GLOB(0x27bcfc8) isn't a valid IO at
/home/httpd/perl/loncnew line 647.

Bob Gonzales
Chemistry Dept
Binghamton University

On Wed, Oct 19, 2016 at 1:46 PM, Budzik, Michael J. <mikeb at purdue.edu>
wrote:

> Stuart,
> We killed the lonc processes that were behaving unexpectedly - just the
> individual ones, not the parent "flock keeper".  The replacement ones
> spawned as expected and now the log looks good.
> Thanks for your help with this.
>
> Mike B
>
>
> -----Original Message-----
> From: lon-capa-admin-bounces at mail.lon-capa.org [mailto:lon-capa-admin-
> bounces at mail.lon-capa.org] On Behalf Of Stuart Raeburn
> Sent: Tuesday, October 18, 2016 5:22 PM
> To: lon-capa-admin at mail.lon-capa.org
> Subject: Re: [LON-CAPA-admin] high memory usage by long lived lonc
> processes
>
> Mike,
>
> >
> > Is there a way to safely kill a lonc process without causing an
> > error for the users?  I can't endlessly throw swap at it for the
> > next month until I have an approved time for maintenance.
> >
>
> You can kill those lonc processes.
>
> Since your library node is also a load-balancer then whenever a log-in
> occurs, the access servers will be contacted to determine current load so
> that sessions can be sent to the one with the lowest load. Making a request
> for load information will require a lonc connection to the access server.
>
> After you have killed the long-running lonc process for a particular
> access node, the parent lonc process on the library node should spawn a new
> child lonc connection to the access node when the next user logs-in.
>
> However, if that doesn't work out you could kill the parent lonc process
> too, and then do: /etc/init.d/loncontrol start to start a new lonc parent
> etc.
>
> >
> > Could there be a memory leak hidden by the 5 minute idle timeout in
> > lonc?  Since we almost never go 5 minutes of idle lonc, are we
> > hitting it?
> >
>
> The item repeated many times in lonc_errors:
>
> >
> > There are a LOT of lines like this:
> > Event: trapped error in `?? loncnew:444': Event 'Connection to lonc
> > client 0': GLOB(0x1c51740) isn't a valid IO at
> > /home/httpd/perl/loncnew line 647
> >
>
> suggests that the lonc process got into an error state from which it has
> not recovered cleanly. I suspect that post-error state is the reason why
> memory usage keeps climbing.  Killing the lonc process would be a good
> solution.
>
> Given the frequency of log-ins to the msu load-balancer node from 9000
> LON-CAOA users at MSU, and also the fact that a monitoring service
> completes a LON-CAPA log-in to that node every 5 minutes, it seems likely
> that the 5 minute idle timeout is not encountered on the msu load-balancer
> very frequently either.
>
> For that node,
>
> top -cbn1 -u www | grep lonc |grep msu
>
> reports:
>
> 184m 10m 3172 S  0.0  0.3   0:30.16 lonc: s1.lite.msu.edu Connection
> count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016
>
> 185m 10m 3172 S  0.0  0.3   0:31.48 lonc: s2.lite.msu.edu Connection
> count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:09 2016
>
> 184m  10m 3172 S  0.0  0.3   0:04.91 lonc: s3.lite.msu.edu Connection
> count: 2 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016
>
> 184m  10m 3172 S  0.0  0.3   0:28.85 lonc: s4.lite.msu.edu Connection
> count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016
>
>
> Stuart Raeburn
> LON-CAPA Academic Consortium
>
> Quoting "Budzik, Michael J." <mikeb at purdue.edu>:
>
> > Yes, purduel1 is our library node and load balancer node.  That's
> > where that top output was from.
> >
> >> unlike the RES value, the VIRT value is dependent on the Linux
> >> distro -- it's much lower for CentOS 5 than for CentOS 6 or 7, even
> >> though RES is about the same for all).
> >
> > RES only includes what is in physical memory. VIRT does include the
> > size of shared libraries, but, more significantly for this issue, it
> >  also includes swap used by that process.  You can see that we are
> > using about 1GB of SWAP for each of the lonc processes in question:
> >
> > for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " "
> > $3}END{ print ""}' $file; done | sort -k 2 -n -r | less loncnew
> > 1065732 kB loncnew 1059156 kB loncnew 1056384 kB loncnew 1050452 kB
> > loncnew 1049516 kB
> >
> > Is there a way to safely kill a lonc process without causing an
> > error for the users?  I can't endlessly throw swap at it for the
> > next month until I have an approved time for maintenance.
> >
> > Could there be a memory leak hidden by the 5 minute idle timeout in
> > lonc?  Since we almost never go 5 minutes of idle lonc, are we
> > hitting it?
> >
> >> Do you find anything meaningful in
> >> /home/httpd/perl/logs/lonc_errors on your library server?
> >
> > There are a LOT of lines like this:
> > Event: trapped error in `?? loncnew:444': Event 'Connection to lonc
> > client 0': GLOB(0x1c51740) isn't a valid IO at
> > /home/httpd/perl/loncnew line 647
> >
> > There are several handfuls of lines like this:
> > Event: trapped error in `Connection to lonc client 194': Event
> > 'Connection to lonc client 0': GLOB(0x1c77c50) isn't a valid IO at
> > /home/httpd/perl/loncnew line 647
> >
> > Those all seem to be in 2 or 3 clusters sort of near the top of the
> > log. There are no timestamps, so I'm not sure of those are related
> > to something like the 05:10 daily reloads.
> >
> > There are a few lines like this:
> > Event: trapped error in `Connection to lonc client 14': Can't locate
> >  object method "Shutdown" via package
> > "LondConnection=HASH(0x1c50ca8)" (perhaps you forgot to load
> > "LondConnection=HASH(0x1c50ca8)"?) at /home/httpd/perl/loncnew line
> > 754.
> >
> > Thanks!
> > Mike B
> >
> >
> > -----Original Message-----
> > From: lon-capa-admin-bounces at mail.lon-capa.org
> > [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of
> > Stuart Raeburn
> > Sent: Tuesday, October 18, 2016 10:50 AM
> > To: lon-capa-admin at mail.lon-capa.org
> > Subject: Re: [LON-CAPA-admin] high memory usage by long lived lonc
> > processes
> >
> > Mike,
> >
> >>
> >> Are we the only ones seeing lonc processes last well over 20 days and
> >> continue to allocate more and more RAM?
> >>
> >
> > Yes, I suspect you may be the only one.
> >
> > I'm not seeing long-lived lonc processes with high memory values
> > reported for VIRT or RES in top on any of the LON-CAPA instances I
> > manage (msu.edu, educog.com, loncapa.net).
> >
> > The RES (Resident memeory) value is the one I am typically concerned
> > about, and that is around 10 MB for each lonc process.  (I also see
> >  around 185 MB for VIRT for each lonc process, but unlike the RES
> > value, the VIRT value is dependent on the Linux distro -- it's much
> > lower for CentOS 5 than for CentOS 6 or 7, even though RES is about
> > the same for all).
> >
> > In any case, the RES values are also anomalous for the lonc
> > processes for connections to your access servers (250 MB instead of
> > 9 MB).
> >
> > In the msu domain I expect to consistently see lonc connections
> > between the LON-CAPA load balancer server and the access servers,
> > when I check top, but on the library server I typically expect to
> > see a lond connection to each access server.
> >
> > If the top output is for your library server, is purduel1 also
> > configured as a LON-CAPA load balancer?
> >
> > If not, then you'd typically only see lonc connections initiated to
> > your access servers when published resources are republished on the
> > library server, and an "update" notification is sent to each access
> > server which is subscribed to the resource.
> >
> > Do you find anything meaningful in /home/httpd/perl/logs/lonc_errors
> > on your library server?
> >
> >
> > Stuart Raeburn
> > LON-CAPA Academic Consortium
> >
> > Quoting "Budzik, Michael J." <mikeb at purdue.edu>:
> >
> >> Are we the only ones seeing lonc processes last well over 20 days and
> >> continue to allocate more and more RAM?  We now have 5 lonc processes
> >> that are each using over 1.5GB RAM.
> >> Mike B
> >>
> >> From: lon-capa-admin-bounces at mail.lon-capa.org
> >> [mailto:lon-capa-admin-bounces at mail.lon-capa.org] On Behalf Of
> >> Budzik, Michael J.
> >> Sent: Friday, October 14, 2016 1:04 PM
> >> To: 'lon-capa-admin at mail.lon-capa.org'
> >> <lon-capa-admin at mail.lon-capa.org>
> >> Subject: [LON-CAPA-admin] high memory usage by long lived lonc
> >> processes
> >>
> >>
> >> Our lonc processes that live a long time end up using a lot of RAM.
> >>  Here are a few rows of output from top.  Check out the lonc
> >> processes in the middle of the list that are each using 1.3GB ram
> >> compared to the others that are around 180 MB.
> >>
> >>
> >>
> >> # top -cbn1 -u www | grep lonc
> >>
> >> 5058 www       20   0  184m 7604 1176 S  0.0  0.1   0:01.29 lonc:
> >> capa9.phy.ohio.edu Connection count: 0 Retries remaining: 5 () Fri
> >> Oct 14 11:35:09 2016
> >>
> >>  5103 www       20   0  184m 7564 1176 S  0.0  0.1   0:00.95 lonc:
> >> meitner.physics.hope.edu Connection count: 0 Retries remaining: 5 ()
> >> Fri Oct 14 11:35:09 2016
> >>
> >> 18053 www       20   0  180m 7384  920 S  0.0  0.1   0:10.15 lonc:
> >> Parent keeping the flock Fri Oct 14 12:46:50 2016
> >>
> >> 18063 www       20   0 1321m 251m 1224 S  0.0  3.2  20:24.92 lonc:
> >> loncapa02.purdue.edu Connection count: 2 Retries remaining: 5
> >> (insecure) Fri Oct 14 12:49:42 2016
> >>
> >> 18067 www       20   0 1321m 250m 1224 S  0.0  3.2  21:45.86 lonc:
> >> loncapa05.purdue.edu Connection count: 2 Retries remaining: 5
> >> (insecure) Fri Oct 14 12:49:41 2016
> >>
> >> 21139 www       20   0 1321m 250m 1224 S  0.0  3.2  21:57.04 lonc:
> >> loncapa07.purdue.edu Connection count: 2 Retries remaining: 5
> >> (insecure) Fri Oct 14 12:49:41 2016
> >>
> >> 21150 www       20   0 1321m 248m 1224 S  0.0  3.2  22:11.91 lonc:
> >> loncapa04.purdue.edu Connection count: 2 Retries remaining: 5
> >> (insecure) Fri Oct 14 12:49:42 2016
> >>
> >> 21151 www       20   0 1321m 253m 1224 S  0.0  3.2  21:48.87 lonc:
> >> loncapa06.purdue.edu Connection count: 2 Retries remaining: 5
> >> (insecure) Fri Oct 14 12:49:42 2016
> >>
> >> 22900 www       20   0  182m 8756 1972 S  0.0  0.1   0:00.93 lonc:
> >> loncapa03.purdue.edu Connection count: 1 Retries remaining: 5
> >> (insecure) Fri Oct 14 12:49:41 2016
> >>
> >> 29226 www       20   0  184m 8900 2060 S  0.0  0.1   0:00.04 lonc:
> >> capa4.phy.ohio.edu Connection count: 3 Retries remaining: 5
> >> (insecure) Fri Oct 14 12:49:42 2016
> >>
> >> 29419 www       20   0  182m 8776 1972 S  0.0  0.1   0:00.11 lonc:
> >> loncapa.purdue.edu Connection count: 1 Retries remaining: 5
> >> (insecure) Fri Oct 14 12:49:41 2016
> >>
> >>
> >>
> >>
> >>
> >> Anyone else see this?
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Mike Budzik
> >>
> >> Interim Manager, Student Systems and Web Services Admin
> >>
> >> IT Infrastructure - Purdue University
>
> _______________________________________________
> LON-CAPA-admin mailing list
> LON-CAPA-admin at mail.lon-capa.org
> http://mail.lon-capa.org/mailman/listinfo/lon-capa-admin
> _______________________________________________
> LON-CAPA-admin mailing list
> LON-CAPA-admin at mail.lon-capa.org
> http://mail.lon-capa.org/mailman/listinfo/lon-capa-admin
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.lon-capa.org/pipermail/lon-capa-admin/attachments/20161021/efa51353/attachment-0001.html>