Approach to loading multiple data sources sequentially within one job

Former Member
0 Kudos

Hi there,

I'm working on a job that needs to load four data sources in order; if any one of them fails, the job should still continue to the end.

To do that, I use a series of workflows connected sequentially. Within each workflow there is one data flow that loads one data source into the target table. The target table is composed of those four data sources, so in each data flow I have to use Table_Comparison to make sure the fields are filled based on the primary key. This approach satisfies the requirements; however, each workflow runs for quite a while and the whole chain takes too much time. Does anyone have an idea on how to improve the performance, or another strategy to handle this case? Thanks a lot!
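Conceptually, each data flow is doing an upsert against the target on the primary key. A rough SQL equivalent, with hypothetical table and column names, would be:

-- Rough equivalent of one DF's Table_Comparison + loader (hypothetical
-- names): update rows whose keys match, insert the rest.
MERGE INTO TARGET_TABLE tgt
USING SOURCE_1 src
  ON tgt.KEY1 = src.KEY1 AND tgt.KEY2 = src.KEY2
WHEN MATCHED THEN
  UPDATE SET tgt.ATTR_A = src.ATTR_A, tgt.ATTR_B = src.ATTR_B
WHEN NOT MATCHED THEN
  INSERT (KEY1, KEY2, ATTR_A, ATTR_B)
  VALUES (src.KEY1, src.KEY2, src.ATTR_A, src.ATTR_B);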

Screenshots of each layer are attached below: the WF chain, the inside of each WF, and the inside of each DF.

Accepted Solutions (1)

former_member187605
Active Contributor
0 Kudos

Check my document for a discussion on optimal settings for Table_Comparison.

Former Member
0 Kudos

Hi Dirk,

Actually, I have already checked the doc you mentioned above. I sorted the source on the primary key and then selected 'Sorted input' in the T_C. However, it did not help. My data set is not as huge as 1 million rows; in fact, it's only roughly 300,000. I have no idea why the T_C took so long.

Thanks for your help!

Cindy

former_member187605
Active Contributor
0 Kudos

I am not convinced the TC is the culprit.

  1. Is the sort pushed down to the source? If not, it is performed by DS, which may be time-consuming.
  2. How much faster is the data flow without the TC? The extraction process might be the bottleneck.
Former Member
0 Kudos

Hi Dirk,

How is the sort pushed down to the source? I checked the source with "View data" and found that the source data is already sorted by one of the PKs. The way I implemented it is Query -> Order by -> the two PKs. And in the T_C, I put the two PKs on the left and the other fields on the right as the compare columns. Is this the correct way?

former_member187605
Active Contributor
0 Kudos

That seems like the right approach indeed. In fact, that's how it always works for me with larger data volumes.

You mentioned your input is 300K records. Could you post some extra absolute figures?

  • execution time of every data flow
  • time it takes to load the first set of data without Table_Comparison

Also, can you remove the temp table from your data flow? That might be another bottleneck.

Former Member
0 Kudos

Hi Dirk,

I tested the load without the T_C; apparently it went much faster than with the T_C. I also removed the temp table, but with the T_C still in place that did not help a lot.

I checked some docs (actually a lot); many people mentioned indexing the fields, but I haven't found more details on how to do it. Would you please give me some hints on indexing? Thanks so much!

Former Member
0 Kudos

I have also attached a screenshot of the log; the job has been running for more than one hour and the second WF is still processing. It's like it's running forever.

former_member187605
Active Contributor
0 Kudos

From my document on Table_Comparison:

Sorted Input:

Often the most efficient solution when dealing with large data sources, because DS reads the comparison table only once. This option can only be selected when it is guaranteed that the incoming data are sorted in exactly the same order as the primary key in the comparison table. In most cases incoming data must be pre-sorted, e.g. using a Query transform with an Order-by (that may be pushed down to the underlying database), to take advantage of this functionality.
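In SQL terms, a correctly pre-sorted input just means the incoming SELECT carries an ORDER BY on exactly the comparison table's primary key columns, in the same order. A minimal sketch with hypothetical names:

-- The ORDER BY must list the comparison table's PK columns, in PK order,
-- so DS can stream both sides through the TC in a single pass.
SELECT KEY1, KEY2, ATTR_A, ATTR_B
FROM STAGE_TABLE
ORDER BY KEY1, KEY2;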

From the log, it seems:

  1. You don't have a primary key in the comparison table. Therefore DS has to sort it first, caching it all in memory, a very time-consuming process.
  2. Your source is a content extractor, isn't it? And it doesn't produce the data in the correct order, so DS has to perform another sort operation.

Stage your input data in a database table with the correct primary key (you can use the Data_Transfer transform to that end) and make sure your comparison table has the same primary key column(s). Your job will then finish in seconds.
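As a sketch, assuming hypothetical column types, the staging table would be keyed exactly like the comparison table:

-- Hypothetical staging table keyed like the comparison table, so the
-- database can return the rows pre-sorted on the key.
CREATE TABLE SAPSTG.STG_MATERIAL_PLANT (
  MATNR VARCHAR(18) NOT NULL,
  WERKS VARCHAR(4)  NOT NULL,
  PSTAT VARCHAR(15),
  PRIMARY KEY (MATNR, WERKS)
);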

Former Member
0 Kudos

Hi Dirk,

I really appreciate your help. The reason I did not put a primary key on the comparison table is that one of the data sources includes NULL values in one field of the PK. So I removed the PK from the target table and used 'Identify key' as the PK. For now, I have just taken that data source out and put the PK back on the comparison table.

In addition, I used a stage table to sort the data source and then used the sorted stage table for the T_C. However, I got an error message on it. It seems like the 'Order by' does not work during the first DF; the stage table is still not sorted. Please see below. To resolve this error, I had to put another Query in front of the T_C (Order by one more time). The error is gone, but there is no improvement in performance compared with the old solution. Would you know where I went wrong? Thanks again!
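As an aside, a generic workaround for NULLs in a key column, sketched here with hypothetical names, would be to substitute a sentinel value in a Query before the comparison, so the column can stay part of the primary key:

-- Hypothetical sketch: replace NULL key values with a sentinel so the
-- column can participate in the PK and the sorted comparison.
SELECT KEY1,
       COALESCE(KEY2, '#') AS KEY2,
       ATTR_A
FROM SOURCE_4;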

former_member187605
Active Contributor
0 Kudos

Sorting in the first data flow is useless. You need the Order by in the 2nd, before the TC.

Can you try and do so? And post the generated SQL (from DS Designer menu: Validation > Display Optimized SQL...)?

Set the Degree of Parallelism (in Dataflow Properties) to 1, run the job, and post the Monitor output again too, if you want.

Former Member
0 Kudos

Hi Dirk,

Thanks for your patience. I added the Order by in the second DF, before the TC, and also changed the Degree of Parallelism to 1. Unfortunately, the performance has not improved. I checked the generated SQL; it looks like this:

In the first DF (load to the stage table):

SELECT MATNR , WERKS , PSTAT , LVORM , BWTTY , XCHAR , MMSTA , MMSTD , MAABC , (... all remaining fields ...) ODQ_CHANGEMODE , ODQ_ENTITYCNTR

FROM PLANT_ATTR_PJ_ECC_DATA_SQL

In the second DF (two SQL statements):

SELECT  "TCRdr_4"."MATNR"  ,  "TCRdr_4"."WERKS"  ,  "TCRdr_4"."PSTAT"  ,  "TCRdr_4"."LVORM"  ,  "TCRdr_4"."BWTTY"  ,  "TCRdr_4"."XCHAR"  ,  "TCRdr_4"."MMSTA"  (xxxx..................................................................................................................................................................................................................................................................................................)

FROM "SAPSTG"."MaterialPlant" "TCRdr_4"

ORDER BY  "TCRdr_4"."MATNR" COLLATE Latin1_General_BIN ASC ,  "TCRdr_4"."WERKS" COLLATE Latin1_General_BIN ASC

SELECT  "StagingPlantAttr"."MATNR"  ,  "StagingPlantAttr"."WERKS"  ,  "StagingPlantAttr"."PSTAT"  ,  "StagingPlantAttr"."LVORM"  ,  "StagingPlantAttr"."BWTTY"  ,  "StagingPlantAttr"."XCHAR"  ,  "StagingPlantAttr"."MMSTA"  ,  (xxxxxx..................................................................................................................)

ORDER BY  "StagingPlantAttr"."MATNR" COLLATE Latin1_General_BIN ASC ,  "StagingPlantAttr"."WERKS" COLLATE Latin1_General_BIN ASC

Thanks again!

former_member187605
Active Contributor
0 Kudos

This seems 100% correct, unless there's something wrong or inconsistent in your code page settings. The sort is pushed down to your data sources as binary, and (maybe - I cannot see, because you haven't posted the monitor output) DS sorts again because the key columns are character.
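For illustration (hypothetical example, SQL Server syntax): the same column sorts differently under a binary collation than under a dictionary collation, which is why a collation or code page mismatch can force DS to sort again on its own:

-- Binary collation: code-point order, uppercase before lowercase ('B' < 'a').
SELECT MATNR FROM "SAPSTG"."MaterialPlant"
ORDER BY MATNR COLLATE Latin1_General_BIN;

-- Case-insensitive dictionary collation: 'a' sorts before 'B'.
SELECT MATNR FROM "SAPSTG"."MaterialPlant"
ORDER BY MATNR COLLATE Latin1_General_CI_AS;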

Former Member
0 Kudos

Dirk,

I have replaced the 'Query' in the first DF with 'Data_Transfer'. It looks a little better, but overall it still takes around 30 minutes for the two data sources. I have attached the monitor output; apparently the first DF, which loads the data into the staging table (using Data_Transfer), took a while. Would you please suggest any further improvements, if possible? Really appreciated!

former_member187605
Active Contributor
0 Kudos

Yes, from that log output it seems the extraction process is terribly slow. That probably means there are no issues with the TC and the sorts are OK.

It's your BCE that's causing the problem. There's nothing you can do in DS about this. You should check with your Basis team whether they can improve the extractor performance.

Former Member
0 Kudos

Hi Dirk,

Thank you so much! After all these attempts, it now looks okay and the load speed is acceptable. I really appreciate your help!

Answers (5)

Former Member
0 Kudos

Did you try the Auto Correct Load option?

former_member199543
Contributor
0 Kudos

Hi

1. Make two separate flows: the first flow is Source table > Query > target table, and the second one contains the TC. By doing so, the first flow will be pushed down to the DB, so no job server is involved in the calculations.

2. How do you optimize the TC transform? It cannot be pushed down to the DB, so local resources are used. It depends on the source and target tables: if both are properly indexed according to your comparison options, performance should be fine. The best bet here would be to avoid the TC transform and simply use a script instead that compares your two tables; since there is no SCD2 needed, this is very easy to maintain (see the sketch below).
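A minimal sketch of such a compare script in plain SQL (SQL Server syntax, hypothetical staging and target tables keyed on KEY1 and KEY2, no SCD2 handling):

-- Hypothetical compare script: first update rows whose attributes changed...
UPDATE tgt
SET tgt.ATTR_A = src.ATTR_A
FROM TGT_TABLE tgt
JOIN STG_TABLE src
  ON tgt.KEY1 = src.KEY1 AND tgt.KEY2 = src.KEY2
WHERE tgt.ATTR_A <> src.ATTR_A;

-- ...then insert rows that are not in the target yet.
INSERT INTO TGT_TABLE (KEY1, KEY2, ATTR_A)
SELECT src.KEY1, src.KEY2, src.ATTR_A
FROM STG_TABLE src
WHERE NOT EXISTS (
  SELECT 1 FROM TGT_TABLE tgt
  WHERE tgt.KEY1 = src.KEY1 AND tgt.KEY2 = src.KEY2
);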

Custom scripts are always faster and better than predefined transformations.

former_member187605
Active Contributor
0 Kudos

This is definitely not true! I have literally hundreds of examples of transforms largely outperforming any script.

former_member199543
Contributor
0 Kudos

It depends on who did the scripting, right? BODS is good in general, but when it comes to more complex stuff, as you well might know, we might hit the limitations of graphical tools - BODS, SSIS, Informatica...

former_member187605
Active Contributor
0 Kudos

IMHO, it rather depends on who does the DS design.

DS is not a graphical tool at all; only DS Designer is. At runtime, "real" code is generated and executed. DS has been optimised to generate very efficient code, which is not always so easy to produce manually.

0 Kudos

You can also try using sorted input in the TC (remember to sort the data before the TC).

Former Member
0 Kudos

Hi Allan,

Thanks for the reply. I have already applied sorted input, but it doesn't help :(

former_member198401
Active Contributor
0 Kudos

Can you let me know the record count of the target table for each of the data sources? If the amount of data is not too huge, try using 'Cached Comparison Table' in the Table_Comparison; its performance is fine as long as the data volume stays moderate.

Regards

Arun Sasi

Former Member
0 Kudos

Hi Arun,

Thanks for the response. There are more than 300K records. I understand 'Cached Comparison Table' is supposed to be used with a data set much smaller than that. Is that right?

former_member198401
Active Contributor
0 Kudos

Yeah, Cindy!!

Better not to use the cached comparison method in this case. Do you have any lookups in your job? Also try creating indexes on the target table if possible; if there are more updates than inserts, indexes play a very important role in performance tuning.
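For example (hypothetical index and table names, key columns as in this thread), indexing the comparison key columns of the target table:

-- Hypothetical index on the target's comparison key columns, to speed up
-- the per-row lookups and updates during the load.
CREATE INDEX IX_TARGET_MATNR_WERKS
ON TARGET_TABLE (MATNR, WERKS);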

Regards

Arun Sasi

Former Member
0 Kudos

Hi Cindy,

Please refer to the document below to optimize your BODS job.

Thanks,

Surya B.