stefan_koehler
Active Contributor

Introduction

Have you ever used ASH based samples or AWR reports for identifying and fixing a database performance issue in a specific business process like a batch process? Have you ever requested a SAP Technical Performance Optimization (TPO) because of specific performance problems? Have you ever looked at "silver-bullet" ratio values like the buffer cache quality / hit ratio and assumed that these values need to be improved to fix your performance problem?

If you answered any of these questions with yes, then maybe you have implemented all of your self-developed performance suggestions or the changes recommended by a SAP TPO in the past, just to notice that the response time for the end user did not improve at all.

As a consequence, you and your end users are frustrated and maybe not satisfied with the services provided on the market. Is this end result caused by bad service quality, or maybe just by the approach used and not well defined tuning targets? From time to time I am hired for Oracle / SAP performance analysis after all of these previous steps were already taken, and I try to explain to my clients why the previously used approaches have failed to improve the performance.

This blog post is about thinking clearly about Oracle / SAP performance (what is performance?) and why it is so important to keep that in mind by using the right analysis and tuning approach.

Footnote: Performance analysis with ASH / AWR based data can be useful as well, but it all depends on the tuning targets. Pretty good examples for using AWR reports are comparing I/O response times after an I/O subsystem replacement or reducing the CPU load on the database server, but due to their design AWR reports are pretty bad for identifying the root cause of a performance issue in a specific business process.

Thinking clearly about performance

Before we start with examples, analysis or tuning approaches, let's think about performance and what it is all about for the end user.

So the question is: what do your end users really care about? Your end users do not get frustrated because of some bad buffer cache hit ratio or something like that. They get frustrated because their click, button, link, query, report or batch job does not finish quickly. That is all performance is about.

"The click, button, link, query, report or batch job has to finish quickly."

All of us are end users, even if we just use Google. So we all know that performance can be a good or a bad experience, but how do we define "system performance"?

"... Describe your system’s performance ...

Always either fast for everyone, or slow for everyone?

No.

Fast on some days, slow on others.

Fast for some people, slow for others.

Fast for some clicks, slow for others.

Performance is an attribute of individual experiences."

So let's keep in mind that each performance experience is an individual experience for each end user executing a specific task at a given time. In other words, performance is a feature of the end user experience for a specific combination of user, executed task and point in time.

At this point you can assess whether ASH or AWR based data provides you with the needed information for such individual user experiences. Does the buffer cache hit ratio or the top 5 wait events section (for example) provide you with any information about that? Does the ASH sampled data provide information about each wait event and how long the end user's task spent in it? The answer is pretty simple - no.

The approaches / methods

Let's take a step back and clarify the source and scope of ASH / AWR based data before we go on with the approaches and methods.

ASH (Active Session History) based data

ASH data is based on 1 second samples, or on samples of the samples (1 out of 10):

    • V$ACTIVE_SESSION_HISTORY - "It contains snapshots of active database sessions taken once a second. A database session is considered active if it was on the CPU or was waiting for an event that didn't belong to the Idle wait class."
    • DBA_HIST_ACTIVE_SESS_HISTORY - "DBA_HIST_ACTIVE_SESS_HISTORY displays the history of the contents of the in-memory active session history of recent system activity. This view contains snapshots of V$ACTIVE_SESSION_HISTORY."

AWR (Automatic Workload Repository) reports

AWR reports are based on a concept of database time (DB time) which is defined as:

    • Total time in database calls by foreground sessions
    • Includes CPU time, IO time and non-idle wait time
    • DB Time <> Response Time (that means that the DB time is not necessarily the time that the end user / client experiences as database response time)
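
You can look at these time model figures for a single session yourself. The following minimal sketch queries the standard view V$SESS_TIME_MODEL (values are in microseconds; the :sid bind is a placeholder for the session of interest):

SQL> select stat_name, round(value/1000000, 2) as seconds
     from V$SESS_TIME_MODEL
     where sid = :sid
       and stat_name in ('DB time', 'DB CPU');

Even here, any time the session spends waiting on "SQL*Net message from client" is not part of "DB time".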

The most important key points (for understanding performance) are mentioned above, but what does this mean for interpreting the provided data?

First of all, we need to remember at all times that the DB time can be totally different from the end user's database response time (keyword "SQL*Net message from client", as it is completely unaccounted for and defined as an idle event, which is true from a database perspective, but not from an end user perspective).
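
You do not have to take the idle classification on faith - Oracle exposes it in the standard view V$EVENT_NAME:

SQL> select name, wait_class from V$EVENT_NAME
     where name = 'SQL*Net message from client';

On 11g this returns the wait class "Idle", which is exactly why it is excluded from the DB time model.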

You cannot tell how long something took just by counting how many times it happened (e.g. counting the wait events in the view V$ACTIVE_SESSION_HISTORY for a specific session).
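
A small sketch shows what counting ASH samples actually gives you (the :sid bind is a placeholder for the session in question):

SQL> select session_state, event, count(*) as samples
     from V$ACTIVE_SESSION_HISTORY
     where session_id = :sid
     group by session_state, event
     order by samples desc;

Each returned row is the number of one-second snapshots in which the session was caught in that state - a statistical estimate of where time went, not a measurement of how long any call took.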

Performance analysis and diagnosing performance issues can be very misleading and frustrating if they are focused on system-wide statistics (e.g. in AWR report sections). How do you know that your performance issue in a specific business process is caused by "latch free" wait events (for example) - just because "latch free" is in the top 5 wait events section and supposedly should not be there? You simply cannot get that information from ASH or AWR with certainty.

ASH / AWR based data does not provide information about any skew in the data (e.g. you execute 100 database calls, but how much time is spent on the first database call and on every subsequent call, or how does the detailed response time histogram look for a specific wait event in the business process?). You cannot extrapolate details from an aggregate.
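
The same restriction applies to the cursor cache. A sketch against V$SQL (elapsed_time is in microseconds) only ever yields totals and averages over all executions:

SQL> select sql_id, executions, round(elapsed_time/1000000, 2) as total_seconds,
            round(elapsed_time/greatest(executions, 1)/1000000, 4) as avg_seconds
     from V$SQL
     where sql_id = '&sql_id';

If 1 of your 100 database calls consumed 99% of the time, this average hides it completely - only a trace of the individual calls reveals such a skew.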

So far, enough about ASH / AWR based data and why you need to be careful when using it for performance analysis. Let's go on with the approaches and methods.

** The screenshots are from the book "Optimizing Oracle Performance" - for more details check the reference section **

Do you recognize your previously used approach for performance analysis in one of these two methods? From my experience, most performance analyses are based on "Method C" (even SAP TPO in parts). Is it the best and most efficient way to improve the performance for the end user? In my opinion and from my experience it is not. Of course you can earn more money with "Method C" on the client's site, because you need much more time to reduce wait event by wait event until you maybe hit the root cause of the specific performance problem. However, you may have "tuned" all the wait events, and the end user may still not notice it at all. My consulting philosophy (based on Method R) is to provide my clients with the most effective and sustainable solution along with a comprehensible explanation, and this is best achieved with Method R.

"Method C" was also described with "Compulsive Tuning Disorder" in a different manner:

"The signs of this illness were the excessive checking of many performance- related statistics, most of them ratio-based, and the inability to focus on what was really important. They simply thought that by applying some “simple” rules, it was possible to tune their databases. History teaches us that results are not always as good as expected. Why was this the case? Well, all the rules used to check whether a given ratio (or value) was acceptable were defined independently of the user experience. In other words, false negatives or positives were the rule and not the exception. Even worse, an enormous amount of time was spent on these tasks."

So how does "Method R" look like in a SAP environment related to an Oracle database performance bottleneck (however the approach can be adapted to any kind of performance issue)?

  1. Define the business processes that need to be improved (e.g. a batch job or chain of batch jobs for a specific business process, reports, etc.)
  2. Define a tuning target for these business processes (e.g. the batch job needs to finish in 5 minutes, the report needs to be presented in 1 minute, etc., maybe with a fault tolerance)
  3. Crosscheck the response time (STAD / ST03n) or trace it (SE30) in SAP to identify the component that took the most time (let's assume that the bottleneck is the DB response time section in this case)
  4. Enable a dedicated Oracle SQL trace (not a SAP SQL trace) with wait events, binds and SQL execution plan dumps (see the sketch after this list)
  5. Profile the raw SQL trace and improve the performance based on the findings (which may well result in further SQL execution plan analysis)
  6. Measure the response time (STAD / ST03n) or trace it (SE30) in SAP again
  7. Crosscheck the response time with the defined tuning target and stop, or repeat steps 3 to 6 until it is reached
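
For step 4, such a dedicated SQL trace can be enabled per session with DBMS_MONITOR, for example like this (the session_id / serial_num values are placeholders that you would look up in V$SESSION for the work process in question):

SQL> exec DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 4711, serial_num => 4712,
          waits => TRUE, binds => TRUE, plan_stat => 'ALL_EXECUTIONS');
... run the business process ...
SQL> exec DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 4711, serial_num => 4712);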

I am often asked how to profile the raw SQL trace file (step 5) and which tool I use. I usually use the (free) Trivadis Extended Tracefile Analysis Tool (TVD$XTAT). Oracle's TKPROF tool also provides information about wait events, but it still has three major problems that are addressed by TVD$XTAT:

  • As soon as the argument sort is specified, the relationship between SQL statements is lost
  • Data is provided only in aggregated form. Consequently, useful information is lost
  • No information about bind variables is provided

In addition, TVD$XTAT creates an HTML report with links, which makes jumping between data and analysis much more comfortable, and it has some other nice enhancements as well. However, I can also highly recommend the Method R Profiler, which currently starts at $1,180 USD.

An example

The following simple example illustrates a pretty common case in which ASH / AWR based analysis will always fail, and why "Method R" should be the way to go (in my opinion). Let's assume the following scenario: we have an ABAP application that frequently calls an ABAP function module to get some detailed data about a bill. This is a pretty common scenario when custom code is written and SAP standard function modules are used to retrieve trivial dependent information. In my case I rebuilt that scenario in Java, as I have no SAP system right here and I needed the usual client / server architecture. I am not a J2EE developer, so please don't be cruel to me because of the bad Java coding style (the SQL part, however, is knowingly written that badly).

The following demo was run on an Oracle database (11.2.0.3.6) on OEL 6.4 (2.6.39-400.109.1.el6uek.x86_64), and the Java program was executed on Mac OS (Darwin kernel version 11.4.2) with Java 1.6.0_51.

SQL> create table BILL (BILLNR number, TEXT varchar(20));
SQL> create table BILLPOS (BILLNR number, BILLPOSNR number, BILLDTEXT varchar(20));
SQL> create unique index BILL_I on BILL(BILLNR);
SQL> create unique index BILLPOS_I on BILLPOS(BILLNR, BILLPOSNR);
SQL> begin
       for i in 1..20000 loop
         insert into BILL values(i,'TEST DESCRIPTION');
       end loop;
     end;
     /
SQL> begin
       for i in 1..20000 loop
         insert into BILLPOS values(i,1,'TEST DESCRIPTION');
         insert into BILLPOS values(i,2,'TEST DESCRIPTION');
       end loop;
     end;
     /
SQL> commit;
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(USER, 'BILL');
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(USER, 'BILLPOS');

These are the base Oracle database objects, and the Java application code looks like this.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;
public class OracleJdbcExample {
    public static void main(String args[]) throws SQLException {
        //URL of Oracle database server
        String url = "jdbc:oracle:thin:@OEL:1522:T11DB";
        //properties for creating connection to Oracle database
        Properties props = new Properties();
        props.setProperty("user", "SYS");
        props.setProperty("password", "<PASS>");
        props.setProperty("internal_logon","sysdba");
        //creating connection to Oracle database using JDBC
        Connection conn = DriverManager.getConnection(url,props);
        String sql  ="select * from BILL";
        String sql2 ="select BILLNR, BILLPOSNR, BILLDTEXT from BILLPOS where BILLNR = ?";
        //creating PreparedStatement object to execute query
        PreparedStatement preStatement = conn.prepareStatement(sql);
        PreparedStatement preStatement2 = conn.prepareStatement(sql2);
        ResultSet result = preStatement.executeQuery();
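        // Deliberately bad (nested query) pattern: for every single BILL row
        // fetched below, a separate BILLPOS query and round trip to the
        // database is executed instead of retrieving everything with one join.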
        while(result.next()){
            int billnr = result.getInt("BILLNR");
            preStatement2.setInt(1, billnr); 
            ResultSet result2 = preStatement2.executeQuery();
            while(result2.next()){ 
               String billnr2   = result2.getString("BILLNR");
               String billposnr = result2.getString("BILLPOSNR");
               String billdtext = result2.getString("BILLDTEXT");
               System.out.println(billnr2 + " " + billposnr + " " + billdtext);
            }
        }
    }
}

Let's run this Java program now, enable an Oracle SQL trace for these J2EE connections / modules, and create an AWR report for comparison of the measured performance data.

SQL> exec DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
SQL> exec DBMS_MONITOR.SERV_MOD_ACT_TRACE_ENABLE('SYS$USERS','JDBC Thin Client',
          waits => TRUE, binds => TRUE, plan_stat => 'ALL_EXECUTIONS');
shell> time java -cp /Users/Downloads/ojdbc6.jar:. OracleJdbcExample
real    0m29.658s
user    0m3.782s
sys    0m0.880s
SQL> exec DBMS_MONITOR.SERV_MOD_ACT_TRACE_DISABLE('SYS$USERS','JDBC Thin Client');
SQL> exec DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
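
In case you wonder where the raw trace file ends up: with the 11g diagnostic infrastructure, the trace directory can be queried like this (the trace file name contains the server process ID):

SQL> select value from V$DIAG_INFO where name = 'Diag Trace';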

We can see that the end user waited round about 29.7 seconds until the program finished. This is the performance target value I usually talk about at a client site, because this time is what the end users actually care about.

First, let's take a look at the AWR report for that measuring period.

The AWR report states that the SQLs 6r6dng606gx8p and 5206f148n2vdu took only 4.57 seconds ("DB Time <> Response Time" demonstrated clearly), but the end user experiences a response time of round about 29.7 seconds. These two SQL statements were the only logic that was processed by the Java code. So what do you want to optimize based on that information? Yes - the SQL 6r6dng606gx8p is running fully on CPU, but even if you were able to eliminate the CPU usage of this SQL statement completely, you would cut at most 4.44 seconds from the end user response time of round about 29.7 seconds.

Now let's take a look at the raw SQL trace file and the report generated with the SQL profiler TVD$XTAT.

Now you can see clearly that round about 24.283 seconds (of an end user response time of 29.7 seconds) are spent in the whole database layer (JDBC / OCI - OPI - DB engine) from an application point of view, but you only see round about 4.6 seconds as database time in the AWR report. You may wonder why? Well, it is based on how the database time is measured by Oracle. Remember that the DB time includes CPU time, IO time and non-idle wait time, but the main wait event in our example here is "SQL*Net message from client", which is defined as an idle event by Oracle and so is completely unaccounted for.

This example was designed to be that extreme, of course, but I wanted to make it very obvious that looking at ASH sampled data or an AWR report may not be sufficient to diagnose the root cause of a database performance issue in a specific business process. The solution for the performance issue above should be obvious too, of course. You can take the Java code from above and rewrite it yourself as a simple 2 table join and measure the performance again.
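
The rewrite itself is left to you, but the core idea is to replace the 20,000 nested single-row lookups with one statement. A minimal sketch of such a join (based on the tables created above) could look like this:

SQL> select b.BILLNR, b.TEXT, p.BILLPOSNR, p.BILLDTEXT
     from BILL b, BILLPOS p
     where p.BILLNR = b.BILLNR;

With one cursor and array fetches, nearly all of the per-row "SQL*Net message from client" round trips disappear from the end user response time.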

Summary

I hope you get the main idea behind "Method R" and why it is much more productive than a trial and error method ("Method C") like reducing the system-wide top 5 wait events, for example. Method R can also be adapted if you want to reduce the average database response time in a SAP system, for example: you just start with the most database time consuming SAP program (STAD / ST03n = end user performance experience) and go on step by step until you have reached your tuning target. No more guessing - just hard facts and measured performance data.

If you have any further questions, please feel free to ask, or get in contact directly if you need assistance with troubleshooting Oracle / SAP performance issues.

References

  • Cary Millsap, Jeff Holt: Optimizing Oracle Performance (O'Reilly)
  • Cary Millsap: Thinking Clearly About Performance
  • Christian Antognini: Troubleshooting Oracle Performance (Apress)
  • Trivadis Extended Tracefile Analysis Tool (TVD$XTAT)
  • Method R Profiler
