[LON-CAPA-admin] lon-balancer

Thu Oct 8 09:09:17 EDT 2015

Maged,

The &check_loadbalancing() routine in  
/home/httpd/lib/perl/Apache/lonnet.pm is where the decision is made  
about which server will be sent a newly logged-in user.  That routine  
loops over the designated servers to determine which has  
the lowest load. The &compare_server_load() routine is the one which  
actually retrieves load data from each server.

There are two load types which may be considered when determining a  
server's current load: the load average and the user load. The load  
average is the first value in /proc/loadavg (i.e., the 1 minute  
average), whereas the user load is the number of users with last  
modification times for their files in /home/httpd/lonIDs within the  
last 30 minutes (excluding "public" users).
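
As a rough illustration of those two measurements (this is not the code in lonnet.pm, just a sketch in Python), something like the following would gather them on one server. The "publicuser_" filename prefix used to skip public users, and counting one user per session file, are assumptions made for the example; the function and parameter names are hypothetical.

```python
import os
import time

def load_average(path="/proc/loadavg"):
    """Return the first field of /proc/loadavg as a float."""
    with open(path) as fh:
        return float(fh.read().split()[0])

def user_load(session_dir="/home/httpd/lonIDs", window_seconds=30 * 60,
              public_prefix="publicuser_", now=None):
    """Count session files modified within the window, skipping public users.

    public_prefix is an assumed naming convention for public sessions.
    """
    now = time.time() if now is None else now
    count = 0
    for name in os.listdir(session_dir):
        if name.startswith(public_prefix):
            continue  # public users are excluded from the user load
        full = os.path.join(session_dir, name)
        if not os.path.isfile(full):
            continue
        if now - os.path.getmtime(full) <= window_seconds:
            count += 1
    return count
```

Calling load_average() and user_load() on an access server should give numbers comparable to the raw values described above, provided the assumptions hold on your installation.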

These load average and user load values are converted to load percent  
and userload percent respectively by multiplying by 100 and dividing  
by the perl vars set for lonLoadLim and lonUserLoadLim (unless zero)  
in /etc/httpd/conf/loncapa.conf (CentOS/RHEL/Scientific Linux) or  
/etc/apache2/loncapa.conf (Ubuntu).  If lonUserLoadLim is zero then  
the userload percent is not considered.
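
The conversion can be sketched as below (Python, for illustration only; the real calculation is in Perl). The 2.0 default for lonLoadLim is the one mentioned in this message; the 0 default for lonUserLoadLim, and returning None when a limit is zero, are assumptions of the sketch.

```python
def load_percents(loadavg, userload, load_lim=2.0, user_load_lim=0):
    """Convert raw load values to percentages of the configured limits.

    A limit of zero means that percentage is not computed (None here).
    """
    load_pct = 100.0 * loadavg / load_lim if load_lim else None
    user_pct = 100.0 * userload / user_load_lim if user_load_lim else None
    return load_pct, user_pct
```

For example, a load average of 1.0 with lonLoadLim at 2.0 is a 50% load, and 30 recent users with lonUserLoadLim at 60 is a 50% userload.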

The determination of least loaded server is made by looping over the  
servers defined for "primary" destinations, and then, if, and only if,  
none of those are at less than 100% load, looping over servers defined  
for "default" destinations.

The server with the lowest load percent will be the one selected. (If  
both load percent and userload percent are being considered, the load  
value compared with values from other servers is the larger of the two).
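
Put together, the selection described above might look roughly like this in Python. It is a sketch of the logic as described in this message, not a port of &check_loadbalancing(). The function names are hypothetical, and returning None when every server is at or above 100% is an assumption, since that case is not covered here.

```python
def effective_load(load_pct, user_pct=None):
    """Use the larger of the two percentages when both are considered."""
    return load_pct if user_pct is None else max(load_pct, user_pct)

def pick_server(primary, default):
    """Pick the least loaded server, trying "primary" before "default".

    primary and default map a hostname to (load_pct, user_pct or None).
    The "default" group is only examined if no "primary" server is
    below 100% load.
    """
    for group in (primary, default):
        loads = {host: effective_load(*vals) for host, vals in group.items()}
        under = {host: pct for host, pct in loads.items() if pct < 100.0}
        if under:
            return min(under, key=under.get)
    return None  # assumption: no candidate when everything is overloaded
```

With primary = {"a": (120.0, None), "b": (80.0, 95.0)} and default = {"c": (10.0, None)}, this picks "b": it is the only primary server under 100%, so the lightly loaded default server "c" is never considered.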

The simplest setup is to have all access servers in your domain  
(excluding the lonbalancer itself) identified as "primary", and have  
the same lonLoadLim defined for each (the default is 2.0).

One way to test this from the command line would be to ssh into each  
of your access servers, and then use:

cat /proc/loadavg

on each server, immediately before logging into the load-balancer.  
You could then verify that your session was transferred to the access  
server with the lowest load percent. This would also be the access  
server with the lowest first value (the 1 minute average) from  
/proc/loadavg if all access servers have the same lonLoadLim, and  
lonUserLoadLim is zero.
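
If you paste the output of that command from each server into a small script, you can work out which server you would expect to be chosen. The Python sketch below assumes lonUserLoadLim is zero everywhere, and that any server not listed in load_lims uses a lonLoadLim of 2.0; the hostnames and the function name are made up for the example.

```python
def expected_server(loadavg_lines, load_lims=None, default_lim=2.0):
    """Predict the least loaded server from raw /proc/loadavg output.

    loadavg_lines maps a hostname to the text printed by
    `cat /proc/loadavg` on that host; load_lims optionally maps a
    hostname to its lonLoadLim.
    """
    load_lims = load_lims or {}
    percents = {}
    for host, line in loadavg_lines.items():
        first = float(line.split()[0])  # first field of /proc/loadavg
        percents[host] = 100.0 * first / load_lims.get(host, default_lim)
    return min(percents, key=percents.get), percents

# Hypothetical output collected from two access servers.
samples = {
    "access1": "0.80 0.60 0.50 2/300 1234",
    "access2": "0.20 0.90 0.70 1/280 2345",
}
host, percents = expected_server(samples)
print(host, percents)
```

If your session ends up somewhere other than the predicted host, bear in mind that the load can change between running cat and logging in.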

If you wanted to see how load balancing is behaving over a period of  
time you could use the monitoring script -- monitoring.pl -- available  
from the modules/raeburn directory in a CVS checkout (Lee Bynum at  
Illinois has this script).  If you wanted to use that with SSO, you  
might need to customize the robotic log-in to work with Shibboleth;  
the file as it exists currently works with MSU's CAS-like Sentinel  
SSO.  However, you can also set it to use /adm/login (i.e., not use  
SSO).

That script has been used at MSU for over a decade, with a robotic  
log-in to the load balancer every 5 minutes. One piece of data  
recorded by that script is the lonHostID of the access server to which  
the robotic user's session is transferred.

The percentages of all offloads to each of the four access servers  
recorded for 2015 are:

30%
29%
24%
17%

Somewhat surprisingly, these are not all 25%.

However, real users can currently log in directly to each of the MSU  
access servers, i.e., nothing is set in the "Login page requests  
redirected" configuration for the msu domain in "Set domain  
configuration" > "Log-in page options", so it is possible that users  
are logging in directly to a particular machine (and raising the load  
on that machine, on average).

I also use MRTG to display data on user load and load average gathered  
from all msu access servers using scripts which are run every 5  
minutes.  Those load values show some small (~ 5%) variations in load  
between servers.

If you wanted to record the load values that  
lonnet::compare_server_load() is working with each time it is called,  
you could add either &logthis() calls (to log to  
/home/httpd/perl/logs/lonnet.log) or print statements to STDERR (to  
log to your Apache error log) to the compare_server_load() routine to  
capture values for $load and $try_server.

> Is there a way of checking if the balancer working right? And fixing  
>  it, if not?

You can manipulate the behavior of the balancer by setting different  
values for lonLoadLim on each of the access servers, and/or by setting  
different values for lonUserLoadLim.

Stuart Raeburn
LON-CAPA Academic Consortium

Quoting "Abdel Messeh, Maged" <mmesseh at illinois.edu>:

> Hi All,
>
> I have recently noticed that we have 4 or 5 times more requests   
> going to one of our access nodes than the other 3.
>
> Is there a way of checking if the balancer working right? And fixing  
>  it, if not?
>
> Many thanks,
>
> Maged
>
>
> --------------------------
> Maged Messeh, Ph.D.
> College of Liberal Arts & Sciences Infrastructure
> University of Illinois at Urbana-Champaign