<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from text --><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
</head>
<body>
<div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12.0pt; line-height:1.3; color:#385623">
<div>Excellent. Thank you.<br>
</div>
<div><br>
</div>
<div id="x_signature-x" class="x_signature_editor" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12.0pt; color:#385623">
<div>
<div>Mike B<br>
</div>
<br>
</div>
</div>
</div>
<div id="x_quoted_header" style="clear:both">
<hr style="border:none; height:1px; color:#E1E1E1; background-color:#E1E1E1">
<div style="border:none; padding:3.0pt 0cm 0cm 0cm"><span style="font-size:11.0pt; font-family:'Calibri','sans-serif'"><b>From:</b> Stuart Raeburn <raeburn@msu.edu><br>
<b>Sent:</b> Oct 18, 2016 5:38 PM<br>
<b>To:</b> lon-capa-admin@mail.lon-capa.org<br>
<b>Subject:</b> Re: [LON-CAPA-admin] high memory usage by long lived lonc processes<br>
</span></div>
</div>
<br type="attribution">
</div>
<font size="2"><span style="font-size:10pt;">
<div class="PlainText">Mike,<br>
<br>
><br>
> Is there a way to safely kill a lonc process without causing an <br>
> error for the users? I can't endlessly throw swap at it for the <br>
> next month until I have an approved time for maintenance.<br>
><br>
<br>
You can kill those lonc processes.<br>
<br>
Since your library node is also a load-balancer, whenever a log-in <br>
occurs the access servers will be contacted to determine current load, <br>
so that sessions can be sent to the one with the lowest load. Making a <br>
request for load information will require a lonc connection to the <br>
access server.<br>
<br>
After you have killed the long-running lonc process for a particular <br>
access node, the parent lonc process on the library node should spawn <br>
a new child lonc connection to the access node when the next user <br>
logs in.<br>
<br>
However, if that doesn't work out, you could kill the parent lonc <br>
process too, and then run /etc/init.d/loncontrol start to start a new <br>
lonc parent.<br>
<br>
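For example, using the PIDs from the top output quoted below (a sketch <br>
of the sequence described above; on your system, substitute the PIDs <br>
that top currently reports):<br>
<br>
# locate the bloated child lonc process for one access node<br>
top -cbn1 -u www | grep 'lonc: loncapa02.purdue.edu'<br>
<br>
# kill just that child (SIGTERM by default); the parent lonc should<br>
# spawn a new connection when the next user logs in<br>
kill 18063<br>
<br>
# only if no new child appears: kill the parent as well and restart<br>
kill 18053<br>
/etc/init.d/loncontrol start<br>
<br>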
><br>
> Could there be a memory leak hidden by the 5 minute idle timeout in <br>
> lonc? Since we almost never go 5 minutes with lonc idle, are we <br>
> hitting it?<br>
><br>
<br>
The item repeated many times in lonc_errors:<br>
<br>
><br>
> There are a LOT of lines like this:<br>
> Event: trapped error in `?? loncnew:444': Event 'Connection to lonc <br>
> client 0': GLOB(0x1c51740) isn't a valid IO at <br>
> /home/httpd/perl/loncnew line 647<br>
><br>
<br>
suggests that the lonc process got into an error state from which it <br>
has not recovered cleanly. I suspect that post-error state is the <br>
reason why memory usage keeps climbing. Killing the lonc process <br>
would be a good solution.<br>
<br>
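To gauge how often lonc is ending up in that state, you could count the <br>
repeated entries in the error log (a rough sketch; adjust the patterns <br>
to match what you see in your lonc_errors file):<br>
<br>
# occurrences of the "isn't a valid IO" event trap<br>
grep -c "isn't a valid IO" /home/httpd/perl/logs/lonc_errors<br>
<br>
# occurrences of the failed Shutdown calls<br>
grep -c 'method "Shutdown"' /home/httpd/perl/logs/lonc_errors<br>
<br>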
Given the frequency of log-ins to the msu load-balancer node from 9000 <br>
LON-CAPA users at MSU, and also the fact that a monitoring service <br>
completes a LON-CAPA log-in to that node every 5 minutes, it seems <br>
likely that the 5 minute idle timeout is rarely encountered on the msu <br>
load-balancer either.<br>
<br>
For that node,<br>
<br>
top -cbn1 -u www | grep lonc |grep msu<br>
<br>
reports:<br>
<br>
184m 10m 3172 S 0.0 0.3 0:30.16 lonc: s1.lite.msu.edu Connection <br>
count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016<br>
<br>
185m 10m 3172 S 0.0 0.3 0:31.48 lonc: s2.lite.msu.edu Connection <br>
count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:09 2016<br>
<br>
184m 10m 3172 S 0.0 0.3 0:04.91 lonc: s3.lite.msu.edu Connection <br>
count: 2 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016<br>
<br>
184m 10m 3172 S 0.0 0.3 0:28.85 lonc: s4.lite.msu.edu Connection <br>
count: 1 Retries remaining: 5 (ssl) Tue Oct 18 16:33:08 2016<br>
<br>
<br>
Stuart Raeburn<br>
LON-CAPA Academic Consortium<br>
<br>
Quoting "Budzik, Michael J." <mikeb@purdue.edu>:<br>
<br>
> Yes, purduel1 is our library node and load balancer node. That's <br>
> where that top output was from.<br>
><br>
>> unlike the RES value, the VIRT value is dependent on the Linux <br>
>> distro -- it's much lower for CentOS 5 than for<br>
>> CentOS 6 or 7, even though RES is about the same for all).<br>
><br>
> RES only includes what is in physical memory. VIRT does include the <br>
> size of shared libraries, but, more significantly for this issue, it <br>
> also includes swap used by that process. You can see that we are <br>
> using about 1GB of SWAP for each of the lonc processes in question:<br>
><br>
> for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " <br>
> $3}END{ print ""}' $file; done | sort -k 2 -n -r | less<br>
> loncnew 1065732 kB<br>
> loncnew 1059156 kB<br>
> loncnew 1056384 kB<br>
> loncnew 1050452 kB<br>
> loncnew 1049516 kB<br>
><br>
> Is there a way to safely kill a lonc process without causing an <br>
> error for the users? I can't endlessly throw swap at it for the <br>
> next month until I have an approved time for maintenance.<br>
><br>
> Could there be a memory leak hidden by the 5 minute idle timeout in <br>
> lonc? Since we almost never go 5 minutes with lonc idle, are we <br>
> hitting it?<br>
><br>
>> Do you find anything meaningful in <br>
>> /home/httpd/perl/logs/lonc_errors on your library server?<br>
><br>
> There are a LOT of lines like this:<br>
> Event: trapped error in `?? loncnew:444': Event 'Connection to lonc <br>
> client 0': GLOB(0x1c51740) isn't a valid IO at <br>
> /home/httpd/perl/loncnew line 647<br>
><br>
> There are several handfuls of lines like this:<br>
> Event: trapped error in `Connection to lonc client 194': Event <br>
> 'Connection to lonc client 0': GLOB(0x1c77c50) isn't a valid IO at <br>
> /home/httpd/perl/loncnew line 647<br>
><br>
> Those all seem to be in 2 or 3 clusters sort of near the top of the <br>
> log. There are no timestamps, so I'm not sure if those are related <br>
> to something like the 05:10 daily reloads.<br>
><br>
> There are a few lines like this:<br>
> Event: trapped error in `Connection to lonc client 14': Can't locate <br>
> object method "Shutdown" via package <br>
> "LondConnection=HASH(0x1c50ca8)" (perhaps you forgot to load <br>
> "LondConnection=HASH(0x1c50ca8)"?) at /home/httpd/perl/loncnew line <br>
> 754.<br>
><br>
> Thanks!<br>
> Mike B<br>
><br>
><br>
> -----Original Message-----<br>
> From: lon-capa-admin-bounces@mail.lon-capa.org <br>
> [<a href="mailto:lon-capa-admin-bounces@mail.lon-capa.org">mailto:lon-capa-admin-bounces@mail.lon-capa.org</a>] On Behalf Of <br>
> Stuart Raeburn<br>
> Sent: Tuesday, October 18, 2016 10:50 AM<br>
> To: lon-capa-admin@mail.lon-capa.org<br>
> Subject: Re: [LON-CAPA-admin] high memory usage by long lived lonc processes<br>
><br>
> Mike,<br>
><br>
>><br>
>> Are we the only ones seeing lonc processes last well over 20 days<br>
>> and continue to allocate more and more RAM?<br>
>><br>
><br>
> Yes, I suspect you may be the only one.<br>
><br>
> I'm not seeing long-lived lonc processes with high memory values <br>
> reported for VIRT or RES in top on any of the LON-CAPA instances I <br>
> manage (msu.edu, educog.com, loncapa.net).<br>
><br>
> The RES (Resident memory) value is the one I am typically concerned <br>
> about, and that is around 10 MB for each lonc process. (I also see <br>
> around 185 MB for VIRT for each lonc process, but unlike the RES <br>
> value, the VIRT value is dependent on the Linux distro -- it's much <br>
> lower for CentOS 5 than for CentOS 6 or 7, even though RES is about <br>
> the same for all).<br>
><br>
> In any case, the RES values are also anomalous for the lonc <br>
> processes for connections to your access servers (250 MB instead of <br>
> 9 MB).<br>
><br>
> In the msu domain I expect to consistently see lonc connections <br>
> between the LON-CAPA load balancer server and the access servers, <br>
> when I check top, but on the library server I typically expect to <br>
> see a lond connection to each access server.<br>
><br>
> If the top output is for your library server, is purduel1 also <br>
> configured as a LON-CAPA load balancer?<br>
><br>
> If not, then you'd typically only see lonc connections initiated to <br>
> your access servers when published resources are republished on the <br>
> library server, and an "update" notification is sent to each access <br>
> server which is subscribed to the resource.<br>
><br>
> Do you find anything meaningful in /home/httpd/perl/logs/lonc_errors <br>
> on your library server?<br>
><br>
><br>
> Stuart Raeburn<br>
> LON-CAPA Academic Consortium<br>
><br>
> Quoting "Budzik, Michael J." <mikeb@purdue.edu>:<br>
><br>
>> Are we the only ones seeing lonc processes last well over 20 days<br>
>> and continue to allocate more and more RAM? We now have 5 lonc<br>
>> processes that are each using over 1.5GB RAM.<br>
>> Mike B<br>
>><br>
>> From: lon-capa-admin-bounces@mail.lon-capa.org<br>
>> [<a href="mailto:lon-capa-admin-bounces@mail.lon-capa.org">mailto:lon-capa-admin-bounces@mail.lon-capa.org</a>] On Behalf Of<br>
>> Budzik, Michael J.<br>
>> Sent: Friday, October 14, 2016 1:04 PM<br>
>> To: 'lon-capa-admin@mail.lon-capa.org'<br>
>> &lt;lon-capa-admin@mail.lon-capa.org&gt;<br>
>> Subject: [LON-CAPA-admin] high memory usage by long lived lonc<br>
>> processes<br>
>><br>
>><br>
>> Our lonc processes that live a long time end up using a lot of RAM.<br>
>> Here are a few rows of output from top. Check out the lonc<br>
>> processes in the middle of the list that are each using 1.3GB RAM<br>
>> compared to the others that are around 180 MB.<br>
>><br>
>><br>
>><br>
>> # top -cbn1 -u www | grep lonc<br>
>><br>
>> 5058 www 20 0 184m 7604 1176 S 0.0 0.1 0:01.29 lonc:<br>
>> capa9.phy.ohio.edu Connection count: 0 Retries remaining: 5 () Fri<br>
>> Oct 14 11:35:09 2016<br>
>><br>
>> 5103 www 20 0 184m 7564 1176 S 0.0 0.1 0:00.95 lonc:<br>
>> meitner.physics.hope.edu Connection count: 0 Retries remaining: 5 ()<br>
>> Fri Oct 14 11:35:09 2016<br>
>><br>
>> 18053 www 20 0 180m 7384 920 S 0.0 0.1 0:10.15 lonc:<br>
>> Parent keeping the flock Fri Oct 14 12:46:50 2016<br>
>><br>
>> 18063 www 20 0 1321m 251m 1224 S 0.0 3.2 20:24.92 lonc:<br>
>> loncapa02.purdue.edu Connection count: 2 Retries remaining: 5<br>
>> (insecure) Fri Oct 14 12:49:42 2016<br>
>><br>
>> 18067 www 20 0 1321m 250m 1224 S 0.0 3.2 21:45.86 lonc:<br>
>> loncapa05.purdue.edu Connection count: 2 Retries remaining: 5<br>
>> (insecure) Fri Oct 14 12:49:41 2016<br>
>><br>
>> 21139 www 20 0 1321m 250m 1224 S 0.0 3.2 21:57.04 lonc:<br>
>> loncapa07.purdue.edu Connection count: 2 Retries remaining: 5<br>
>> (insecure) Fri Oct 14 12:49:41 2016<br>
>><br>
>> 21150 www 20 0 1321m 248m 1224 S 0.0 3.2 22:11.91 lonc:<br>
>> loncapa04.purdue.edu Connection count: 2 Retries remaining: 5<br>
>> (insecure) Fri Oct 14 12:49:42 2016<br>
>><br>
>> 21151 www 20 0 1321m 253m 1224 S 0.0 3.2 21:48.87 lonc:<br>
>> loncapa06.purdue.edu Connection count: 2 Retries remaining: 5<br>
>> (insecure) Fri Oct 14 12:49:42 2016<br>
>><br>
>> 22900 www 20 0 182m 8756 1972 S 0.0 0.1 0:00.93 lonc:<br>
>> loncapa03.purdue.edu Connection count: 1 Retries remaining: 5<br>
>> (insecure) Fri Oct 14 12:49:41 2016<br>
>><br>
>> 29226 www 20 0 184m 8900 2060 S 0.0 0.1 0:00.04 lonc:<br>
>> capa4.phy.ohio.edu Connection count: 3 Retries remaining: 5<br>
>> (insecure) Fri Oct 14 12:49:42 2016<br>
>><br>
>> 29419 www 20 0 182m 8776 1972 S 0.0 0.1 0:00.11 lonc:<br>
>> loncapa.purdue.edu Connection count: 1 Retries remaining: 5<br>
>> (insecure) Fri Oct 14 12:49:41 2016<br>
>><br>
>><br>
>><br>
>><br>
>><br>
>> Anyone else see this?<br>
>><br>
>><br>
>><br>
>> Thanks,<br>
>><br>
>> Mike Budzik<br>
>><br>
>> Interim Manager, Student Systems and Web Services Admin<br>
>><br>
>> IT Infrastructure - Purdue University<br>
<br>
_______________________________________________<br>
LON-CAPA-admin mailing list<br>
LON-CAPA-admin@mail.lon-capa.org<br>
<a href="http://mail.lon-capa.org/mailman/listinfo/lon-capa-admin">http://mail.lon-capa.org/mailman/listinfo/lon-capa-admin</a><br>
</div>
</span></font>
</body>
</html>