Background jobs running 40% slower on app servers vs. on the CI/DB server

Former Member
0 Kudos

ESX 4 update 1 on HP c7000 enclosure, 8 ESX Hosts / SQL server 2005 / Windows 2003

We have a CI/DB VM on one host and 2 application (dialog) VM instances on another, separate host.

When we run an MRP or other background jobs on the CI/DB, the runtimes are at least 40% faster than when we run the same jobs on the app servers. I/O stays about the same, but the runtime increases significantly.

An interesting point: if we move the dialog instance onto the same host as the CI/DB, then we achieve the same great runtimes as when we run the background jobs on the CI/DB.

Any ideas are appreciated!

Ld

Accepted Solutions (0)

Answers (1)

Former Member
0 Kudos

Hello,

This is probably due to the network overhead between both hosts.

Are both hosts located in the same datacenter, and how are they connected? Did you check the network throughput between the hosts?

Did you also analyze your response times in detail via ST03N to see where the problem is (DB, CPU, ...)?

Wim Van den Wyngaert

Former Member
0 Kudos

All of the physical ESX hosts are located within the same enclosure, and yes, we have checked the network connection and bandwidth between all the ESX hosts, which is great.

At this point I am running traces on the jobs to help determine where the bottleneck is.

thank you !

Ld

Former Member
0 Kudos

Hello,

The fact that you achieve better results when both virtual machines are on the same ESX host points to a network issue. The virtual network adapter can make a significant difference. See the following article:

[SAP Batch Job Performance on vSphere|http://blogs.vmware.com/performance/2010/02/sap-batch-job-performance-on-vsphere.html]

A quick suggestion: use the latest (vmxnet3) Virtual Network Adapter and update the network driver on your ESX host.

If this does not help, take a "performance snapshot" as described in SAP Note 1158363 on both ESX hosts at the same time and open an SAP ticket.

Kind regards,

Matthias

Former Member
0 Kudos

Hi Matthias,

Yes, we are using the latest vmxnet3 driver on all ESX hosts, and vmtools are also installed.

I will do a performance snapshot as you mentioned and open an SAP ticket today.

I will also post back here once we find the issue.

thanks!

Linwood

Former Member
0 Kudos

Hi Linwood

I have an ECC 5 system with 4 app servers plus the CI, and I'm having the same issue with 3 of the app servers. Let's call them 1, 2, 3 and 4.

If I run a job on server 4 I get a runtime of 7000 seconds, but if I run it on any of the other 3 it jumps to 14000. Basically, whatever the runtime is on server 4, it doubles on the other 3.

I took server 2 to do a direct comparison with server 4.

The parameters are the same on both, except for a couple of timeouts that are larger on server 4.

Workload can be ruled out: while running the test job I took server 2 out of the logon groups, so the test job was running there alone, while server 4 carried its usual workload. Server 4 still behaved the same way, finishing in 7000 seconds.

Have you solved your issue?

Could you please give me any other lead?

I already opened a ticket with SAP, but they replied that it is a consulting topic.

BTW, one fix we already tried was upgrading the OS kernel patch on server 2, because it was older than the one on server 4.

I'll take a look with the OS team today at the NICs.

Thanks in advance for any feedback

brian_walker
Active Participant
0 Kudos

I noticed something odd a while back when we got a similar complaint from a user running a batch job to create a bunch of deliveries in a sandbox system. When I ran an ST12 trace, one thing that showed up was some time in function ENQUEUE_READ for the call to ENQUE_READ2.

Looking more closely at the ENQUEUE_READ function I noticed that it has an IF statement around the call to ENQUE_READ2. More specifically, in a non-HA system it will make the function call directly when running on the same instance as enque OR it will make an RFC call to the enque instance when running on the instance WITHOUT enque. This introduces some RFC overhead into each enque read.

What is even more interesting is that in an HA system where enque runs stand-alone in the ASCS instance, the function call is also made directly and not through RFC (because the ASCS isn't running a disp+work, so it can't receive the RFC). The result is that the calls are always made directly in our HA systems, regardless of which app server the background job runs on. In our non-HA systems, by contrast, the background job always runs faster on the instance with enque because it doesn't incur the overhead of an RFC call to the other instance.

I am not sure this is the same as your case, but some STAD records should at least tell you how many total enques are being done. If there are quite a few, you might do an ST12 trace to determine if your case is also spending a lot of time doing enque reads when it is run on the instance without the enque. I don't have the STAD records now, but I do recall wondering why there were so many and that got me on to the path of finding ENQUEUE_READ.

Brian
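
A rough way to judge whether the RFC detour described above could explain a gap of this size is to multiply the number of enqueue reads (from STAD) by the per-call RFC overhead (from an ST12 trace). The sketch below does that arithmetic in Python; the function names and all numbers in it are hypothetical placeholders, not measurements from this thread.

```python
def added_runtime_seconds(enqueue_calls: int, rfc_overhead_ms: float) -> float:
    """Extra wall-clock time if every enqueue read pays a fixed RFC overhead."""
    if enqueue_calls < 0 or rfc_overhead_ms < 0:
        raise ValueError("inputs must be non-negative")
    return enqueue_calls * rfc_overhead_ms / 1000.0


def slowdown_percent(local_runtime_s: float, enqueue_calls: int,
                     rfc_overhead_ms: float) -> float:
    """Slowdown relative to the runtime on the instance that hosts enque."""
    if local_runtime_s <= 0:
        raise ValueError("local_runtime_s must be positive")
    extra = added_runtime_seconds(enqueue_calls, rfc_overhead_ms)
    return 100.0 * extra / local_runtime_s


if __name__ == "__main__":
    # Hypothetical inputs: replace with the call count from STAD and the
    # per-call overhead observed in your own ST12 trace.
    calls = 400_000
    overhead_ms = 1.0
    local_s = 1000.0
    print(f"extra runtime: {added_runtime_seconds(calls, overhead_ms):.0f} s")
    print(f"slowdown: {slowdown_percent(local_s, calls, overhead_ms):.1f} %")
```

If the estimated slowdown is far below what you observe, the enqueue path is probably not the main cause and the network layer deserves a closer look.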

Former Member
0 Kudos

Hi Gabriel,

I wanted to update this thread to let everyone know we are still actively working on the problem and will update once we have our findings. Thanks for your interest and input. Stay tuned!

Linwood

0 Kudos

Hi Folks,

I'm running into the same issue as you. Is there any solution yet?

Kind regards

Joe G.

Former Member
0 Kudos

Hi,

There is no simple solution. It's often a mix of misconfiguration, reduced performance per single computing unit, and other issues in network and storage. The best approach is to open an SAP support ticket under the support component BC-OP-NT-ESX (Windows on VMware) or BC-OP-LNX-ESX (Linux on VMware).

Matthias

0 Kudos

We've identified the problem and solved the issue.

Deactivating interrupt coalescing on the VMware host did it for us.

Former Member
0 Kudos

That's great news!

We agreed to add the following technical paper to the SAP on VMware Best Practices:

[Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs|http://www.vmware.com/resources/techresources/10220]

This paper describes the impact of coalescing features such as interrupt throttling on the hardware NIC and large receive offload on the virtual NIC. We strongly recommend that every customer evaluate the recommendations in this document.
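
Coalescing delays each small packet slightly, so a bandwidth test can look fine while a job that makes many small, sequential round trips to the database slows down. A minimal sketch for checking small-message round-trip latency between two VMs is shown below. It is plain Python, not an SAP or VMware tool; the function names, the 64-byte payload and the sample count are assumptions chosen for illustration, and the demo at the bottom only runs over loopback.

```python
import socket
import statistics
import threading
import time


def _recv_exact(conn, size):
    """Read exactly `size` bytes, or return b"" if the peer closed."""
    buf = bytearray()
    while len(buf) < size:
        chunk = conn.recv(size - len(buf))
        if not chunk:
            return b""
        buf.extend(chunk)
    return bytes(buf)


def start_echo_server(host="127.0.0.1", port=0, payload_size=64):
    """Start a one-connection echo server in a thread; return its port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    bound_port = srv.getsockname()[1]

    def serve():
        conn, _ = srv.accept()
        with conn:
            # Send replies immediately instead of batching small writes.
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            while True:
                data = _recv_exact(conn, payload_size)
                if not data:
                    break
                conn.sendall(data)
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return bound_port


def measure_round_trips(host, port, count=200, payload_size=64):
    """Return a list of round-trip times in milliseconds."""
    payload = b"x" * payload_size
    samples = []
    with socket.create_connection((host, port)) as conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            conn.sendall(payload)
            _recv_exact(conn, payload_size)
            samples.append((time.perf_counter() - start) * 1000.0)
    return samples


if __name__ == "__main__":
    # Loopback demo only; for a real check, run the server on the DB VM
    # and the client on the app-server VM.
    port = start_echo_server()
    rtts = measure_round_trips("127.0.0.1", port)
    print(f"median {statistics.median(rtts):.3f} ms, max {max(rtts):.3f} ms")
```

Comparing the median with both VMs on the same host against the median across hosts, before and after changing coalescing settings, gives a quick indication of whether per-packet latency is what changed.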

0 Kudos

Hello,

My system is experiencing a similar issue. Did you deactivate interrupt coalescing at the virtual or physical layer?

Thanks,

Noe Hoyos