Performance: reading huge amount of master data in end routine

Former Member

In our 7.0 system, a full load runs each day from DSO X to DSO Y, in which master data from six characteristics of DSO X is read into about 15 fields of DSO Y. DSO X contains about 2 million records, all of which are transferred each day. The master data tables each contain between 2 and 4 million records. Before the load starts, DSO Y is emptied. DSO Y is write-optimized.

At first, we designed this with the standard "master data reads", but this resulted in load times of 4 hours, because all master data is read with single lookups. We redesigned it and now fill all master data attributes in the end routine, after filling internal tables with the master data values corresponding to the data package:


*   Read 0UCPREMISE into temp table
    SELECT ucpremise ucpremisty ucdele_ind
      FROM /BI0/PUCPREMISE
      INTO CORRESPONDING FIELDS OF TABLE lt_0ucpremise
      FOR ALL ENTRIES IN RESULT_PACKAGE
      WHERE ucpremise EQ RESULT_PACKAGE-ucpremise.

*   Sort so that the BINARY SEARCH reads below work correctly
    SORT lt_0ucpremise BY ucpremise.

And when we loop over the data package, we write something like:


    LOOP AT RESULT_PACKAGE ASSIGNING <fs_rp>.
      READ TABLE lt_0ucpremise INTO ls_0ucpremise
        WITH KEY ucpremise = <fs_rp>-ucpremise
        BINARY SEARCH.
      IF sy-subrc EQ 0.
        <fs_rp>-ucpremisty = ls_0ucpremise-ucpremisty.
        <fs_rp>-ucdele_ind = ls_0ucpremise-ucdele_ind.
      ENDIF.
*     ... all other MD reads ...
    ENDLOOP.

The above pattern is repeated for all the master data we need to read from. This method is considerably faster (1.5 hours), but we want to make it faster still. We noticed that reading the master data into the internal tables still takes a long time, and that this has to be repeated for each data package. We wanted to change this, so we have now tried a similar method in which we load all master data into the internal tables without filtering on the data package, and we do this only once:


*   Read 0UCPREMISE into temp table (all records, no filter)
    SELECT ucpremise ucpremisty ucdele_ind
      FROM /BI0/PUCPREMISE
      INTO CORRESPONDING FIELDS OF TABLE lt_0ucpremise.

*   Sort so that the BINARY SEARCH reads work correctly
    SORT lt_0ucpremise BY ucpremise.

So when the first data package starts, it fills all the master data values, about 95% of which we would need anyway. So that the following data packages can use the same tables and do not need to fill them again, we placed the definition of the internal tables in the global part of the end routine. In the global part we also write:


DATA: lv_data_loaded TYPE C LENGTH 1.
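
The buffer tables themselves are declared in the same global part; for 0UCPREMISE that could look roughly like this (a sketch only, using the delivered /BI0/P table structure as line type, with one such pair per characteristic):


* Sketch: buffer table and work area for 0UCPREMISE in the global part
* of the end routine (one such pair per characteristic that is looked up)
DATA: lt_0ucpremise TYPE STANDARD TABLE OF /bi0/pucpremise,
      ls_0ucpremise TYPE /bi0/pucpremise.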

And in the method we write:


IF lv_data_loaded IS INITIAL.
  lv_data_loaded = 'X'.   " mark that loading has started
* load all internal tables (the SELECTs shown above, one per characteristic)
  lv_data_loaded = 'Y'.   " mark that loading is finished
ENDIF.

WHILE lv_data_loaded NE 'Y'.
  CALL FUNCTION 'ENQUEUE_SLEEP'
    EXPORTING
      seconds = 1.
ENDWHILE.

LOOP AT RESULT_PACKAGE ASSIGNING <fs_rp>.
* assign all master data attributes as shown above
ENDLOOP.

This makes sure that another data package that has already started "sleeps" until the first data package is done filling the internal tables.

Well, this all seems to work: it now takes 10 minutes to load everything into DSO Y. But I'm wondering whether I'm missing anything. The system seems to handle loading all these records into internal tables just fine. Any improvements or critical remarks are very welcome.

Accepted Solutions (1)

esjewett
Active Contributor

This is a great question, and you've clearly done a good job of investigating this, but there are some additional things you should look at and perhaps a few things you have missed.

At first, we designed this with the standard "master data reads", but this resulted in load times of 4 hours, because all master data is read with single lookups.

This is not accurate. After SP14, BW does a prefetch and buffers the master data values used in the lookup. Note [1092539|https://service.sap.com/sap/support/notes/1092539] discusses this in detail. The important thing, and most likely the reason you are seeing individual master data lookups on the DB, is that you must manually maintain the MD_LOOKUP_MAX_BUFFER_SIZE parameter to be larger than the number of lines of master data (from all characteristics used in lookups) that will be read. If you are seeing one select statement per line, then something is going wrong.
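
If you want to see what is currently maintained, a quick check of the RSADMIN table will tell you. This is only a sketch; I'm assuming the standard OBJECT/VALUE columns of table RSADMIN here, and the parameter itself should be maintained as described in the note:


* Sketch: check whether the prefetch buffer parameter is maintained
* (assumes the standard OBJECT/VALUE columns of table RSADMIN)
DATA: lv_value TYPE rsadmin-value.

SELECT SINGLE value FROM rsadmin INTO lv_value
  WHERE object = 'MD_LOOKUP_MAX_BUFFER_SIZE'.
IF sy-subrc <> 0.
  WRITE: / 'MD_LOOKUP_MAX_BUFFER_SIZE not maintained - package size is used'.
ELSE.
  WRITE: / 'MD_LOOKUP_MAX_BUFFER_SIZE =', lv_value.
ENDIF.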

You might want to go back and test with master data lookups using this setting and see how fast it goes. If memory serves, the BW master data lookup uses an approach very similar to your second example (1.5 hrs), though I think it first loops through the source package and extracts the list of required master data keys, which is probably faster than your "FOR ALL ENTRIES IN RESULT_PACKAGE" statement if RESULT_PACKAGE contains many duplicate keys.
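
To illustrate, a sketch of that approach, reusing the names from your first example (the field symbol and the 0UCPREMISE fields are assumptions based on your posting), might look like this:


* Sketch: collect and deduplicate the lookup keys from the package first,
* then select the master data once (names taken from the first posting)
TYPES: BEGIN OF ty_key,
         ucpremise TYPE /bi0/pucpremise-ucpremise,
       END OF ty_key.
DATA: lt_keys       TYPE STANDARD TABLE OF ty_key,
      ls_key        TYPE ty_key,
      lt_0ucpremise TYPE STANDARD TABLE OF /bi0/pucpremise.
FIELD-SYMBOLS: <fs_rp> LIKE LINE OF RESULT_PACKAGE.

LOOP AT RESULT_PACKAGE ASSIGNING <fs_rp>.
  ls_key-ucpremise = <fs_rp>-ucpremise.
  APPEND ls_key TO lt_keys.
ENDLOOP.
SORT lt_keys BY ucpremise.
DELETE ADJACENT DUPLICATES FROM lt_keys COMPARING ucpremise.

* FOR ALL ENTRIES with an empty table would read the whole master data table
IF lt_keys IS NOT INITIAL.
  SELECT ucpremise ucpremisty ucdele_ind
    FROM /bi0/pucpremise
    INTO CORRESPONDING FIELDS OF TABLE lt_0ucpremise
    FOR ALL ENTRIES IN lt_keys
    WHERE ucpremise = lt_keys-ucpremise.
  SORT lt_0ucpremise BY ucpremise.
ENDIF.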

I'm guessing you'll get down to at least the 1.5 hrs that you saw in your second example, but it is possible that it will get down quite a bit further.

This makes sure that another data package that has already started "sleeps" until the first data package is done filling the internal tables.

This sleeping approach is not necessary, as only one data package will be running at a time in any given process. I believe that the "global" internal table is not shared between parallel processes, so if your DTP is running with three parallel processes, this table will just get filled three times. Within a process, all data packages are processed serially, so all you need to do is check whether it has already been filled. Or are you doing something additional to export the filled lookup table into a shared memory location?

Actually, you have your global data defined with the statement "DATA: lv_data_loaded TYPE C LENGTH 1.". I'm not completely sure, but I don't think that this data will persist from one data package to the next. Data defined in the global section using "DATA" is global to the package start, end, and field routines, but I believe it is discarded between packages. I think you need to use "CLASS-DATA: lv_data_loaded TYPE C LENGTH 1." to get the variables to persist between packages. Have you checked in the debugger that you are really only filling the table once per request and not once per package in your current setup? << This is incorrect - see next posting for correction.

Otherwise the third approach is fine as long as you are comfortable managing your process memory allocations and you know the maximum size that your master data tables can have. On the other hand, if your master data tables grow regularly, then you are eventually going to run out of memory and start seeing dumps.

Hopefully that helps out a little bit. This was a great question. If I'm off-base with my assumptions above and you can provide more information, I would be really interested in looking at it further.

Edited by: Ethan Jewett on Feb 13, 2011 1:47 PM

esjewett
Active Contributor

Just to correct myself so that you don't have to go through the pain I just did to verify this: the contents of all variables declared in either global section of the transformation routines appear to be available to all packages processed by a single DTP process, whether or not they are declared as "class-data". In fact, declaring a variable as class-data is only allowed in the first global section. I've crossed out the incorrect section in the previous post.

Discussion: What this means is that the DTP process instantiates the transformation class only once per process and then uses that instance to process all packages. This is different behavior from what is described in the documentation [here|http://help.sap.com/saphelp_nw70ehp2/helpdata/en/43/857adf7d452679e10000000a1553f7/frameset.htm], so I would still recommend using "class-data" in case SAP decides to change the implementation to match the documentation. But at the moment it doesn't make any difference, and your internal table declared in the global section will indeed only be filled once per DTP process.
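
For example, with the 0UCPREMISE table from the original posting, the declarations in the first global section could look roughly like this (just a sketch, with assumed names):


* Sketch: declare the buffer in the first global section with CLASS-DATA so
* it keeps working even if SAP aligns the implementation with the docs
CLASS-DATA: gt_0ucpremise  TYPE STANDARD TABLE OF /bi0/pucpremise,
            gv_data_loaded TYPE c LENGTH 1.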

Information is not shared between processes with either type of declaration.

Cheers,

Ethan

Former Member

Well, that's a really helpful answer, many thanks for that. I have now learned that SAP actually does support prefetching of master data for master data reads, but that system settings can prevent it from being applied.

I read the note you mentioned and looked up the settings. MD_LOOKUP_PREFETCH_LOGIC was set to 'X', as that is the standard setting. However, the MD_LOOKUP_MAX_BUFFER_SIZE parameter was not maintained. According to the note, this means:

If the parameter is undefined in the table RSADMIN or has a value which is SPACE, the program automatically uses the datapackage size (setting of the DTP) to limit the memory allocated to the prefetched data.

After a number of lines equal to the package size has been buffered, the process continues with direct reads. Since almost every record in the data package has its own master data record, and since master data is read from several characteristics, only a small part of the master data was prefetched. I have now increased the buffer size (to 5,000,000) and most of the records now seem to get buffered. The load now takes about one hour, which is perfectly fine.

However, it seems that at some point the direct reads kick in again. I am under the impression that the full master data tables are read into an internal table, without a filter on the characteristic values that actually occur in the data package, since one data package of 50,000 records should not correspond to more than 5,000,000 master data records.

Another noteworthy observation: at one point the first two data packages (with which the process starts executing the transformation) took about 20 minutes to process (of which 19 minutes consisted of sequential reads of master data), but all data packages after that were processed in only a few minutes. It seems the master data read for the previous packages was re-used. What I don't understand is that this happened only once. I am now executing the load for the second time, and it reloads the master data for each package.

Looking at another (simpler) transformation, I see the same pattern each time the load is executed: the first two data packages take about 3 minutes for the transformation start and rules, but the ones after that take only about 12 seconds. So the re-use of the buffer at a global level does seem to be there. I'll try setting the buffer limits a bit higher to see whether it happens again for the larger transformation as well.

I'm trying to figure out how the SAP functionality works exactly by reading through the generated program of the transformation and looking into the PREFETCH method of the class CL_RSDMD_LOOKUP_MASTER_DATA, but I haven't figured it out yet. The load takes quite some time, so it also takes time to test different settings.

Thanks a lot for your thorough answer, it has really given me new insight.

Edited by: Zephania Wilder on Feb 14, 2011 12:42 AM

esjewett
Active Contributor

I haven't dug into this as far as you have, but the behavior where all the master data seems to be loaded in the first package sounds like a bug. Theoretically, the master data reader should clear out the previous buffer table every time it buffers master data for a new package, so the buffer shouldn't really grow much from package to package. At least, that is how I read the code in the CL_RSDMD_LOOKUP_MASTER_DATA class (the PREFETCH method is the one called from the transformation program).

You could just make the buffer parameter enormous (bigger than all your master data tables combined) and see what happens. If it is still doing individual selects at some point, then this is definitely a bug and you could open a customer message.

If you can figure out exactly where the cutoff is and it is much larger than the master data required for any given package, then I think this is also a bug and SAP would probably be interested in it.

On the other hand, I think there have been some fixes to this functionality at least between SP14 and SP19. What support package are you on?
