
Weird load balancing and paging: What is going on?

Former Member
0 Kudos

Hi,

We have EP 7 installed in our environment. The Portal has a Central Instance and two dialog instances. We are using NetScaler 7 as a load balancer. Our load balancer is currently configured with a weighted round-robin of 3:3:1. That is, 3 connections are sent to each of the dialog instances for every 1 connection sent to the central instance.
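For reference, the request split that a 3:3:1 weighting should produce can be sketched as follows. This is a minimal illustration, not NetScaler's actual scheduler: the function name `weighted_round_robin` and the instance labels are assumptions, and a real load balancer typically interleaves the servers rather than sending three requests in a row to the same one. The long-run shares are the same either way.

```python
from collections import Counter

def weighted_round_robin(weights, n_requests):
    """Count how many of n_requests each server gets under naive weighted round-robin."""
    # Expand the weights into one scheduling cycle: DI1, DI1, DI1, DI2, DI2, DI2, CI
    cycle = [name for name, weight in weights.items() for _ in range(weight)]
    return Counter(cycle[i % len(cycle)] for i in range(n_requests))

# Hypothetical labels for the two dialog instances and the central instance
weights = {"DI1": 3, "DI2": 3, "CI": 1}
counts = weighted_round_robin(weights, 7000)
shares = {name: counts[name] / 7000 for name in weights}
print(counts)
print(shares)
```

With these weights each dialog instance should see about 3/7 (roughly 43%) of the requests and the central instance about 1/7 (roughly 14%), which is the baseline to compare any monitoring report against.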

We have recently received a monitoring report which indicates that, despite the load balancing configuration, one of the dialog instances (DI1) has a higher CPU load. We assume this means that more requests are being sent to that particular instance. Comparing this report with past reports, this instance has consistently shown higher CPU load, which is puzzling. The dialog instances in our Portal setup do nothing other than serve Portal requests. In other words, no third-party applications are installed on these servers. Is there anything we can do to find out why this instance keeps having a higher CPU load? Is there something we should be doing at the load balancer's end?

On a different note, the other dialog instance (DI2) is showing high paging activity. One of the posts in the forums suggests that high paging might be an indication that the Portal is performing garbage collection. However, it does not make sense that only ONE instance has high paging if garbage collection is carried out on all instances. Why are the rest of the instances not showing high paging? Has anyone come across this situation? What should we do to find out what is going on behind the scenes?

Any suggestions/advice would be much appreciated.

Thank You.

Accepted Solutions (0)

Answers (2)


Former Member
0 Kudos

Hello!

What was said above is definitely true. Additionally, it might be worth checking the portal activity reports to see what is actually going on in your server nodes (number and type of requests etc.). It could be that some specific action (like TREX crawling or some scheduled task) is set up to always go to the same instance. This might not be directly related to requests coming from the users through the load balancer, but it could still add significant load to your system. If you find a situation like this, it might be an idea to use your CI (which has only one server node anyway) for such tasks and take it out of the load balancing completely.

Regards,

Jörg

Former Member
0 Kudos

Thanks again to all who have replied/advised.

Well, I have finally received a copy of the NetScaler statistics my colleague collected. Unless I am interpreting the report wrongly, it seems that the weighted round-robin is not working. All instances are getting more or less the same number of requests.
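One way to check this against the configuration is to compare each instance's observed share of requests with the share its weight implies. The sketch below assumes the per-instance request counts can be read off the NetScaler statistics; the function name `compare_to_weights` is made up for illustration, and the counts are placeholders for an even split, not our real numbers.

```python
def compare_to_weights(observed, weights):
    """Return, per server, observed share minus expected share (as fractions of 1)."""
    total_requests = sum(observed.values())
    total_weight = sum(weights.values())
    report = {}
    for name, weight in weights.items():
        expected = weight / total_weight           # share implied by the configured weight
        actual = observed[name] / total_requests   # share actually seen in the statistics
        report[name] = round(actual - expected, 3)
    return report

weights = {"DI1": 3, "DI2": 3, "CI": 1}
# Placeholder counts representing "all instances get about the same number of requests"
observed = {"DI1": 1000, "DI2": 1000, "CI": 1000}
report = compare_to_weights(observed, weights)
print(report)
```

A positive value means the instance receives more than its configured share. With an even split under 3:3:1 weights, the central instance comes out well above its share and both dialog instances below theirs.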

I have asked my colleague to do more digging and clarify my doubts. If all instances really are getting the same number of requests, that still does not explain why only ONE dialog instance has a much higher CPU load and the other has a HIGHER paging load.

I tried to use Activity Reports on our Portal to find out what each instance is doing, but the iView aggregates the data. Does anyone know how we can obtain statistics on each dialog instance's activities? I was hoping the report would give me details on

1. how many requests the instance served during a specified time frame

2. what activities the instance performed during a specified time frame (which could help me pinpoint the cause of the higher CPU load and high paging)

Many thanks in advance.

Former Member
0 Kudos

Well, apparently we have been having a higher CPU load and higher Paging utilization because our servers were not configured with the correct paging file size.

After creating a 10GB paging file, we have neither seen any high CPU loads nor have we seen any signs of high paging utilization.

Our BASIS team is currently setting up CCMS monitoring on J2EE instances. Hopefully, the monitors in CCMS will help us debug any related performance issues better.

Many thanks again to those who have responded.

Former Member
0 Kudos

Hi,

I assume NetScaler keeps user sessions sticky to a server in order to preserve the session. Usually, this stickiness is based either on IP (all requests from an IP are sent to a certain server until the NetScaler session times out) or on cookies.

If it is done on IP, this can give a wrong distribution if your clients are behind a proxy server, since they then have the same IP.
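To illustrate the proxy effect, here is a small sketch of source-IP persistence. It is a simplified model, not how NetScaler is implemented: new source IPs are assigned by plain round-robin and then stick to that server, and the IP addresses, request counts, and the `distribute` function are all invented for the example.

```python
from collections import Counter
from itertools import cycle

def distribute(request_ips, servers):
    """Assign each request to a server using source-IP persistence.

    A source IP seen for the first time gets the next server in round-robin
    order; every later request from that IP goes to the same server.
    """
    next_server = cycle(servers)
    persistence = {}                      # source IP -> assigned server
    load = Counter({name: 0 for name in servers})
    for ip in request_ips:
        if ip not in persistence:
            persistence[ip] = next(next_server)
        load[persistence[ip]] += 1
    return load

servers = ["DI1", "DI2", "CI"]
# Hypothetical traffic: many users behind one proxy IP, plus two direct clients
proxied = ["10.0.0.1"] * 900
direct = ["192.168.1.10"] * 10 + ["192.168.1.11"] * 10
skewed = distribute(proxied + direct, servers)
print(skewed)
```

In this model the load balancer sees only three "clients", so the server that happens to receive the proxy's IP handles 900 of the 920 requests regardless of the configured distribution.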

Question 1: How is NetScaler configured with regard to stickiness?

The first thing you need to know is if the high CPU on one dialog instance can be explained by a skewed distribution on NetScaler.

Question 2: Can you get some data on how NetScaler is distributing the load in practice (not in theory, i.e. the configuration)? (There is usually a monitoring tool for this in such a product.)

Note that garbage collection should be almost directly proportional to CPU load. This is because a full garbage collection occurs when the heap is full of objects that need to be cleaned up, and these objects come from references dropped in the code (the more code you are running, the more objects you dereference).

Therefore, high paging ought to be a bigger problem on the server with the highest load.

Question 3: How many server nodes do you have for the central instance and for the two dialog instances?

Regards

Dagfinn

Btw: you could probably do better than a weighted round-robin (for example some test which polls the backend server and measures the response time and gives a score based on that).
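As a rough idea of what such a score could look like, the sketch below picks the backend with the lowest average of its recent response times. The sample timings and the `pick_backend` function are purely illustrative assumptions; in practice the load balancer's own health monitors would collect these measurements.

```python
from statistics import mean

def pick_backend(samples):
    """Return the backend whose recent response times have the lowest mean.

    samples maps a backend name to a list of response times in seconds.
    """
    return min(samples, key=lambda name: mean(samples[name]))

# Hypothetical probe results, in seconds
recent = {
    "DI1": [0.80, 0.95, 1.10],
    "DI2": [0.30, 0.35, 0.40],
    "CI": [0.50, 0.55],
}
choice = pick_backend(recent)
print(choice)
```

A method like this reacts to how busy each server actually is, which a fixed weighting cannot do.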

Former Member
0 Kudos

Hi Dagfinn,

Many thanks for your reply.

Question 1: How is NetScaler configured with regard to stickiness?

We have NetScaler Persistence turned on, so that a user's requests are served by the same dialog instance from beginning to end. We found that if Persistence is not enforced, a user's request could be sent to a different dialog instance, which causes data inconsistencies. Does this count as stickiness? I am not sure how Persistence is maintained. It is possible that it is done via IP. I will find out more from my colleague, who maintains NetScaler.

I guess we will need monitoring tools in place to determine the actual load distribution, as you suggested in Question 2, and then decide what action to take based on what we observe.

Question 2: Can you get some data on how NetScaler is distributing the load in practice (not in theory, i.e. the configuration)? (There is usually a monitoring tool for this in such a product.)

I believe my colleague has such tools. I will try to obtain the relevant data and make the appropriate analysis.

Question 3: How many server nodes do you have for the central instance and for the two dialog instances?

We have 1 server node in the central instance and 3 server nodes in each of the dialog instances.

Many thanks again for your suggestion to use a better method than weighted round-robin. We'll do the necessary analysis and decide on a better method to use on NetScaler.

Edited by: Voon Siong Lum on Jan 17, 2008 1:51 AM