
Join optimization

rajarshi_muhuri
Active Participant
0 Kudos

This is a new thread opened to continue an earlier discussion.

Currently a join is needed between two huge fact tables. If the join is performed on the fly in HANA, performance degrades and HANA sometimes crashes.

Thus a stored procedure chunks the data, performs the join on each smaller chunk, and dumps the results into a physical table. The chunking is inefficient, but the data inserts are even worse: the majority of the time is spent in the delta merges triggered by the inserts.

Here is the code.

Essentially, there are two tables, Table_A and Table_B, and an analytic view is created from each of them.

We pick data between two timestamps from the analytic view built on Table_A,

and within a loop we join data from Table_A and Table_B,

and dump the results into a physical table.

What can be done to optimize this?

Unfortunately I can't seem to hold the results of the while loop in memory, like

A = A UNION (results of the while loop)

CREATE PROCEDURE CPM_SAP.SP_ZZFSL_CE11_REV_JOIN_V4
(IN MIN_TIMESTAMP NVARCHAR(14), IN MAX_TIMESTAMP NVARCHAR(14),IN MIN_PERIOD NVARCHAR(6), IN MAX_PERIOD NVARCHAR(6))
LANGUAGE SQLSCRIPT AS

XPERIOD INT  := :MIN_PERIOD;
XYEAR   INT  := 0;
XMONTH  INT  := 0;

BEGIN

VAR_Z_NAME_OBSCURED = SELECT "RYEAR", "RTCUR", "RUNIT", "POPER", "RBUKRS", "RACCT", "RCNTR", "RPRCTR", "RZZSITE",
"RZZPOSID", "RZZFKBER", "DOCTY", "REFDOCNR", "AUFNR", "ZZLDGRP", "VBUND", "ZZBUZEI", "AWORG", "CPUDT", "CPUTM",
"ZZBUZEI_RPOSN", "PERIOD_ID", "TIMESTAMP",
"TSL" AS "TSL", "HSL" AS "HSL", "KSL" AS "KSL", "MSL" AS "MSL"
FROM "_SYS_BIC"."wlclose/AN_Z_NAME_OBSCURED_REV_KSLNN"
where "TIMESTAMP" > :MIN_TIMESTAMP  AND "TIMESTAMP" <= :MAX_TIMESTAMP ;

WHILE :XPERIOD <= :MAX_PERIOD DO

VAR1_Z_NAME_OBSCURED = SELECT "RYEAR", "RTCUR", "RUNIT", "POPER", "RBUKRS", "RACCT", "RCNTR", "RPRCTR", "RZZSITE",
"RZZPOSID", "RZZFKBER", "DOCTY", "REFDOCNR", "AUFNR", "ZZLDGRP", "VBUND", "ZZBUZEI", "AWORG", "CPUDT", "CPUTM",
"ZZBUZEI_RPOSN", "PERIOD_ID", "TIMESTAMP",
"TSL" AS "TSL", "HSL" AS "HSL", "KSL" AS "KSL", "MSL" AS "MSL"
FROM :VAR_Z_NAME_OBSCURED
where "PERIOD_ID" = CAST(:XPERIOD AS VARCHAR(6));

VAR_CE11000 = SELECT "PALEDGER","KSTAR", "PERIOD_ID","BUKRS", "COPA_AWORG",  "ZZBUZEI_RPOSN", "RBELN", "PERIO",
"GJAHR", "PERDE", "RPOSN", "PRCTR", "PPRCTR", "SKOST", "RKAUFNR", "KURSF", "VV005_ME", "VV006_ME", "WW004","WW011" , "WW012", "WW013",
"WW008", "WW023", "WW010", "WW020", "WW015", "WW006", "WW024", "WW026",
"VV005" AS "VV005", "VV006" AS "VV006" FROM "_SYS_BIC"."wlclose/AN_CE11000_REV_LDG01"
where "PERIOD_ID" = CAST(:XPERIOD AS VARCHAR(6));

VAR_JOIN =  SELECT Z.RACCT AS RACCT, Z.RBUKRS AS RBUKRS, Z."PERIOD_ID" AS ZPERIOD, Z."AWORG" AS AWORG, Z."ZZBUZEI",
Z."REFDOCNR" as REFDOCNR, Z."RCNTR" as RCNTR, Z."RPRCTR" as RPRCTR, Z."RZZSITE" AS RZZSITE, Z."RZZPOSID" AS RZZPOSID,
Z."RZZFKBER" AS RZZFKBER, Z."DOCTY" AS DOCTY, Z."AUFNR" AS AUFNR, Z."ZZLDGRP" AS ZZLDGRP, Z."VBUND"AS VBUND,
Z."TIMESTAMP" AS TIMESTAMP,Z."RYEAR" AS RYEAR, Z."POPER" AS POPER ,Z."TSL" AS TSL, Z."HSL" AS HSL,  Z."RTCUR" AS RTCUR,
Z."KSL" AS KSL, Z."MSL" AS MSL, Z."RUNIT" AS RUNIT, C."PALEDGER" AS PALEDGER ,C."KSTAR" AS KSTAR, C."BUKRS" AS BUKRS,
C."PERIOD_ID" AS CPERIOD,C."COPA_AWORG" AS COPA_AWORG,C."RBELN" AS RBELN,C."ZZBUZEI_RPOSN" AS RPOSN, 
C."WW004" AS WW004, C."WW011" as WW011,  C."WW012" AS WW012 , "WW013" AS WW013  , C."WW008" AS WW008,  C."WW023" AS WW023, C."WW010" AS WW010,
C."WW020" AS WW020, C."WW015" AS WW015, C."WW006" AS WW006, C."VV005" AS VV005,C."VV005_ME" AS VV005_ME, 
C."VV006" AS VV006, C."VV006_ME" AS VV006_ME
FROM :VAR1_Z_NAME_OBSCURED  Z
LEFT OUTER JOIN :VAR_CE11000 C ON
                        Z."ZZBUZEI_RPOSN"=C."ZZBUZEI_RPOSN" AND
                        Z."REFDOCNR"=C."RBELN" AND
                        Z."AWORG"=C."COPA_AWORG" AND
                        Z."RBUKRS"=C."BUKRS" AND
                        Z."PERIOD_ID"=C."PERIOD_ID" AND
                        Z.RACCT = C.KSTAR;


INSERT INTO CPM_SAP.ZZFSL_CE11_REV_JOIN (SELECT * FROM :VAR_JOIN );

XYEAR  := (:XPERIOD/100);
XMONTH :=  :XPERIOD -  (:XYEAR*100) ;

IF :XMONTH =12
THEN
XPERIOD := ((:XYEAR+ 1) * 100) +1;
ELSE
XPERIOD := :XPERIOD +1;
END IF;

END WHILE;


END;

Accepted Solutions (1)

Former Member
0 Kudos

My apologies as I don't care to rake through the code and figure out what exactly is going on. So, fully admitting my ignorance, I have to ask -

If you're wanting to combine two fact tables, why aren't you performing a "union with constant values" (UCV) on-the-fly? This approach is documented here (slide 15) by Werner if you're not familiar with it: http://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/6056911a-07cc-2e10-7a8a-ffa9b8cf5...

If both fact tables don't share the exact same dimensions, you can sometimes still build an intermediate table with the required dimensions, join that to the fact table - and then do the UCV.
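To make "union with constant values" concrete, here is a minimal generic sketch in plain SQL (the table and column names are invented for illustration; this is just the shape of what the graphical UCV does):

SELECT YEAR, REGION, SUM(SALES_AMOUNT) AS SALES_AMOUNT, SUM(SALES_QUANTITY) AS SALES_QUANTITY
FROM (
      -- each branch supplies a constant for the measure it does not own,
      -- so both branches share one common structure
      SELECT YEAR, REGION, SALES_AMOUNT, 0 AS SALES_QUANTITY FROM FACT_SALES
      UNION ALL
      SELECT YEAR, REGION, 0 AS SALES_AMOUNT, SALES_QUANTITY FROM FACT_QTY
     ) U
GROUP BY YEAR, REGION;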

Former Member
0 Kudos

Great point 🙂

I just suffered a similar problem with two 2.5bn row fact tables that I wanted to join. Performance was poor - 300 seconds or more for any query.

They didn't have a natural key, so I created a composite key which I then added as an additional field to both tables, along with a normalized dimension table containing the keys for both tables.

Then I created two Analytic Views, one for each fact table, which present the data in the same format.

Then I created a Calc view with UCV based on the analytic views.

This increased aggregation performance with no filter from 300 seconds to 60 seconds, but more importantly when you filter, the filter is pushed down into both Analytic Views. The moment you filter on country, or date, you get < 5 second response times.
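(For illustration only - these are not John's actual statements, and the names are made up - the composite key step could look roughly like this:)

-- concatenate the shared dimensions into one join key on each fact table
ALTER TABLE FACT_A ADD (JOIN_KEY NVARCHAR(60));
UPDATE FACT_A SET JOIN_KEY = TO_NVARCHAR(YEAR) || '#' || COUNTRY || '#' || PRODUCT;
-- repeat for FACT_B, then build the normalized dimension table keyed on JOIN_KEY
-- and join each fact table to it inside its own analytic view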

John

rajarshi_muhuri
Active Participant
0 Kudos

What is UCV?

former_member184768
Active Contributor
0 Kudos

Hi Rajarshi,

Ctrl + C, Ctrl + V from Jody's post above:

My apologies as I don't care to rake through the code and figure out what exactly is going on. So, fully admitting my ignorance, I have to ask -

If you need the KPIs from both tables along with the dimensions in a single row, I don't think the union will work. In my opinion, Union with Constants is good when you have an "either this or that" kind of situation. Since here you need the data from both tables in a single row, a join is the option.

And I fully agree with Lars: the smaller the data set to be operated upon, the better the join performance.

Did you try a performance trace of the above procedure to locate the bottlenecks? The trace may provide valuable input on data transfer and memory utilization while the procedure is executing.

Regards,

Ravi

Former Member
0 Kudos

Hi Ravi,

Curious why you don't recommend UCV in this case? So long as dimensions are shared, it's easy to get all required data in one row.

Granted, for "this or that" type situation, filter pushdown works well. But even without a filter, UCV exploits parallelism quite well, and should pretty much always perform better than a join (by orders of magnitude).

Former Member
0 Kudos

Hey Ravi,

I think you misunderstand how UCV works. If we have:

Table1

Year  Measure1
2001  10
2002  15

Table2

Year  Measure2
2001  15
2002  20

And if we do a UCV with "null", we get the following:

Year  Measure1  Measure2
2001  10        null
2002  15        null
2001  null      15
2002  null      20

Which is what you seem to be describing. But now we do an aggregation on SUM(Measure1) and SUM(Measure2) (manually, or using the output node if we don't need to reuse the dataset elsewhere). This returns:

Year  Measure1  Measure2
2001  10        15
2002  15        20

This is semantically equivalent to:

SELECT T1.YEAR, T1.MEASURE1, T2.MEASURE2 FROM TABLE1 T1 JOIN TABLE2 T2 ON T1.YEAR=T2.YEAR

But when you have large data volumes it is much faster - 5-10x in most cases.
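In plain SQL, the UCV-plus-aggregation described here would look roughly like this sketch against Table1/Table2 (the CASTs are only there to give the NULL constants a type):

SELECT YEAR,
       SUM(MEASURE1) AS MEASURE1,
       SUM(MEASURE2) AS MEASURE2
FROM (
      SELECT YEAR, MEASURE1, CAST(NULL AS DECIMAL) AS MEASURE2 FROM TABLE1
      UNION ALL
      SELECT YEAR, CAST(NULL AS DECIMAL) AS MEASURE1, MEASURE2 FROM TABLE2
     ) U
GROUP BY YEAR;
-- a HAVING filter on NULL sums would drop years present in only one table,
-- mimicking the inner join exactly (compare the IS NOT NULL trick later in this thread)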

Hope this helps!

John

former_member184768
Active Contributor
0 Kudos

Hi John,

Thanks for the clarification. I think I understand UCV - at least I did when I tried to put an example together in the document.

As I mentioned, I did not really go through the code posted by Rajarshi in detail, but I think his requirement is to get the data from both tables on characteristics which are NOT COMMON to the tables involved in the join. In that case, the characteristics which are not available in one table will have NULL values, leading to separate lines, and hence the KFs will also be split across two records.

But maybe Rajarshi should comment on whether his requirements can really be addressed by UCV.

Regards,

Ravi

former_member184768
Active Contributor
0 Kudos

Hi Jody,

I think Rajarshi's requirement is to get attributes from both tables, and not all the required attributes are available in both tables. So the records will end up with NULL values and the aggregation may not happen.

As I mentioned in my reply to John, I think Rajarshi is the right person to comment on the feasibility of implementing UCV in his scenario.

Regards,

Ravi

Former Member
0 Kudos

Agreed -

Rajarshi, you're quite likely to get the most helpful feedback if you can provide a very simplified version of your scenario - and only include the required logic.

SCN isn't the place for on-demand code review.

I've faced situations where one table has fields A, B, C and another table has only C. C is the lowest granularity (I would say the leaf in a hierarchy, but it's not always a strictly hierarchical relationship).

In cases like these, in SQLScript, one can create an intermediate table of A, B, C from the first table (which is efficient given HANA's columnar structure), join that to the second table - and then the UCV is easy.

I've often found that this approach is applicable to situations that at first glance make it look like a join between the two fact tables is the only option.

(It's been a long time since I've had to write up any kind of proof, but something tells me one should be able to mathematically prove that if you're combining two fact tables, you can join just the distinct dimensions of each table into an intermediate table, join this back to each fact table, and then UCV.)

Answers (4)

rajarshi_muhuri
Active Participant
0 Kudos

Thanks, everyone, for the effort and analysis you have put into this query of mine. I have not had the time yet, but I will now test out the ideas presented here and update you.

rajarshi_muhuri
Active Participant
0 Kudos

Everybody is right, but we are not on the same page; I agree more with Ravi. So I am taking a simpler example - and I agree with Jody, I was too lazy earlier. I knew UCV too, just not the acronym.

We all like CARS! So:

Table: Dealership Car Sales Amounts

Table: Dealership Car Quantity Sold

The classic inner join would be:

Union with Constants (focusing on the 3rd row of the SQL inner join):

I want them in one line, but with dissimilar dimensions they will come in two rows. So that is not acceptable.

However, if the output only needs the dimensions which are also the join columns, then I get the required result. (I can also simulate a left or right join with a flagging trick.)

So to conclude, and to return to my original problem: the nature of the requirements is that I can't do much filtering, but I need to do this massive left outer join.

This is why I was using the SP to lift the data in chunks, join it, and write the results to a physical table. It performs relatively well when taking data between, say, August 2013 and November 2013; when running from 2011 to 2013, it takes 50 minutes.

I had long since looked at John Appleby's blog and took two ideas:

1. partitioning the table

2. turning off delta merge and manually merging the data while exiting the SP (see the sketch below).
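(For reference, the statements behind idea 2 are roughly the following; the table name is the target table from the procedure above:)

-- switch off automatic delta merges on the target table before the mass insert
ALTER TABLE CPM_SAP.ZZFSL_CE11_REV_JOIN DISABLE AUTOMERGE;
-- ... run the chunked inserts ...
-- merge once at the end, then switch automerge back on
MERGE DELTA OF CPM_SAP.ZZFSL_CE11_REV_JOIN;
ALTER TABLE CPM_SAP.ZZFSL_CE11_REV_JOIN ENABLE AUTOMERGE;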

Unfortunately I tested in DEV, where we have a small data set. In DEV the SP without John's ideas took 2 minutes, and with John's ideas it took the same, or maybe a couple of nanoseconds longer.

I am sure the real advantage of his ideas would show if I were given a chance to implement them in PROD, where we have huge data volumes. But since I implemented this two years back and it works, I am not being allowed to tweak it lest something breaks.

Former Member
0 Kudos

I'm quite confident you can solve this much more quickly via UCV.

I'll try to get a solution together by EOD, unless someone beats me to it.

The trick is to create a COMBINED_DIM table as an intermediate table in SQLScript, by joining the DISTINCT fields YEAR, DEALER, CAR_COMPANY from each fact table, and also including SALES_DESCRIPTION and PRODUCT_DETAIL in the SELECT. HANA should be smart enough to query the dictionary - so this join, despite coming from two massive fact tables, should execute much faster. Fingers crossed.

Then, re-join this table to the fact tables.

Then, UCV.

A left or right outer join can be simulated by a WHERE clause against NULL values of dimensions from the right or left table.
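A rough illustration of that last point only (:UCV_RESULT is a hypothetical table variable holding the aggregated union; exact behaviour depends on how the combined dimension table is built):

-- filtering on NULLs of the columns owned by only one side controls which
-- unmatched rows survive; the worked example later in this thread uses
--   WHERE PRODUCT_DETAIL IS NOT NULL AND SALES_DESCRIPTION IS NOT NULL
-- to simulate an INNER join, while relaxing the filter to one side keeps
-- the unmatched rows of the other side
SELECT * FROM :UCV_RESULT WHERE SALES_DESCRIPTION IS NOT NULL;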

Former Member
0 Kudos

So my solution would be slightly different 🙂 I think we have slightly different development methodologies, and both approaches are right. Still, here is my way:

I'd create two analytic views to normalize each result set onto the same dimensions, each with the correct measures as you want them (you can also filter here if you need to).

Then I would do a UCV against those analytic views in a graphical calculation view. This will FLY, and you will get filter pushdown to the analytic views.

John

rajarshi_muhuri
Active Participant
0 Kudos

Hi John

From what I understand of your idea:

1. I add a calculated column in Analytical_View_A for the dimension that exists only in the other table and mark it null,

and I do the same in the other analytic view, and union them.

But that is the same as going to manage mapping and doing the same thing (without explicitly creating the normalizing columns), e.g.

But in either case, I don't get them in one line like a join. Given your experience, I think I am missing your point. Perhaps you could clarify.

rajarshi_muhuri
Active Participant
0 Kudos

I am really dumb... not sure what you meant, in case you can explain a bit more.

I am pasting the script for the tables, in case you would like to use them.

Finally, in the SP6 studio, in the UNION I mark the measures that do not exist in an analytic view as zero, but the measure gets converted into an attribute. I actually have to open SP5 Studio to mark it as zero and still retain it as a measure. Does anyone else have this problem, or is it a bug in my studio?

CREATE COLUMN TABLE SALES_DETAIL
(
  YEAR INT,
  DEALER NVARCHAR(10),
  CAR_COMPANY NVARCHAR(10),
  SALES_AMOUNT DECIMAL,
  PRODUCT_DETAIL NVARCHAR(10)
);

CREATE COLUMN TABLE SALES_QUANTITY
(
  YEAR INT,
  DEALER NVARCHAR(10),
  CAR_COMPANY NVARCHAR(10),
  SALES_QUANTITY DECIMAL,
  SALES_DESCRIPTION NVARCHAR(10)
);

INSERT INTO SALES_DETAIL VALUES (2010, '100', 'BMW', 10000, '321');
INSERT INTO SALES_DETAIL VALUES (2010, '100', 'MERCEDES', 11000, 'CLK');
INSERT INTO SALES_DETAIL VALUES (2010, '101', 'LOTUS', 5000, 'MASER');
INSERT INTO SALES_DETAIL VALUES (2011, '102', 'BMW', 1000, '321');
INSERT INTO SALES_DETAIL VALUES (2011, '103', 'LOTUS', 1000, '321');
INSERT INTO SALES_DETAIL VALUES (2010, '104', 'MERCEDES', 2000, '321');
INSERT INTO SALES_DETAIL VALUES (2012, '100', 'BMW', 35000, 'X1');

INSERT INTO SALES_QUANTITY VALUES (2010, '100', 'BMW', 2, '3 SERIES');
INSERT INTO SALES_QUANTITY VALUES (2010, '100', 'MERCEDES', 3, 'C CLASS');
INSERT INTO SALES_QUANTITY VALUES (2011, '102', 'BMW', 2, '3 SERIES');
INSERT INTO SALES_QUANTITY VALUES (2011, '104', 'MERCEDES', 2, '3 CLASS');
INSERT INTO SALES_QUANTITY VALUES (2012, '100', 'BMW', 1, 'X1 SUV');

SELECT * FROM SALES_DETAIL;
SELECT * FROM SALES_QUANTITY;

SELECT A.YEAR, A.DEALER, A.CAR_COMPANY, A.PRODUCT_DETAIL, B.SALES_DESCRIPTION, A.SALES_AMOUNT, B.SALES_QUANTITY
FROM SALES_DETAIL A INNER JOIN SALES_QUANTITY B ON
  A.YEAR = B.YEAR AND
  A.DEALER = B.DEALER AND
  A.CAR_COMPANY = B.CAR_COMPANY;

Former Member
0 Kudos

That's OK, check these screenshots. I have 3 analytic views for 4 different measures:

I then map them the same way that you describe:

Note that, as you say, the UNION interleaves the rows. By default that leaves lots of NULLs, which is exactly what we want - so we just aggregate! Note that in my example I'm also filtering on IS NOT NULL, which means I only return rows for which all four measures are not null. That's deliberate (I have a difference calculated column, so I never want to include rows with nulls), but you may not want that behavior. Note that the aggregation type is always SUM().

Finally in my example I re-aggregate by SUM() and COUNT() and have calculated columns for AVERAGE(), but that's specific to my use case. In your example you can probably just aggregate it once in the output node.

Hope it helps,

John

Former Member
0 Kudos

I see your code now and see why it's returning on more than one line. This is fixable, give me a few.

Former Member
0 Kudos

So this is an odd situation, because your join predicate is just weird: Year, Dealer and Car_Company. Assuming that your SQL query returns the right data is worrying - I'm not sure that would be the case. But let's assume it would.

The problem we then face is that you have measures in each fact table that aren't in the other fact table. This is best fixed by normalizing your data model, with proper master data objects and attributes, into a star schema. This will always perform best. If you can't do that, then you can create a composite dimension table that has primary keys back to each fact table.

If you don't want to do that, then you use the join predicate in the UCV (YEAR, DEALER, CAR_COMPANY) and then join back onto the fact tables to get the additional fields (PRODUCT_DETAIL, SALES_DESCRIPTION). This is much faster than the other join.

John

Former Member
0 Kudos

If normalization is not an option, then it looks like your solution is now the same as mine, John.

That being said - the data above looks very much like it should be massaged in an ETL process to give both tables the same structure. It sounds like your tables are rather large, but deltas on such an ETL shouldn't be too bad.

Now, if you're stuck with your structures (i.e. if it's SAP source data via SLT), then the following is sample SQL to show what John and I are talking about.

Please note there are plenty of further opportunities to optimize this code: using CE functions (or modeling graphically) is one example, and duplicating source nodes for the different branches of the data flow graph is another.

Nonetheless:

DROP PROCEDURE UCV_EX;

CREATE PROCEDURE UCV_EX READS SQL DATA AS
BEGIN
  -- build a single dimension table from the two fact tables
  SINGLE_DIM =
    SELECT DISTINCT
      T1.YEAR, T1.DEALER, T1.CAR_COMPANY, T1.PRODUCT_DETAIL, T2.SALES_DESCRIPTION
    FROM SALES_DETAIL T1
    INNER JOIN SALES_QUANTITY T2
      ON T1.YEAR = T2.YEAR AND
         T1.DEALER = T2.DEALER AND
         T1.CAR_COMPANY = T2.CAR_COMPANY;

  -- "star schema" #1
  F1 =
    SELECT
      SUM(SALES_AMOUNT) AS SALES_AMOUNT,
      0.0 AS SALES_QUANTITY, -- constant values
      F.YEAR, F.DEALER, F.CAR_COMPANY, D.PRODUCT_DETAIL, D.SALES_DESCRIPTION
    FROM SALES_DETAIL F
    LEFT OUTER JOIN :SINGLE_DIM D
      ON F.YEAR = D.YEAR AND
         F.DEALER = D.DEALER AND
         F.CAR_COMPANY = D.CAR_COMPANY
    GROUP BY F.YEAR, F.DEALER, F.CAR_COMPANY, D.PRODUCT_DETAIL, D.SALES_DESCRIPTION;

  -- "star schema" #2
  F2 =
    SELECT
      0.0 AS SALES_AMOUNT, -- constant values
      SUM(SALES_QUANTITY) AS SALES_QUANTITY,
      F.YEAR, F.DEALER, F.CAR_COMPANY, D.PRODUCT_DETAIL, D.SALES_DESCRIPTION
    FROM SALES_QUANTITY F
    LEFT OUTER JOIN :SINGLE_DIM D
      ON F.YEAR = D.YEAR AND
         F.DEALER = D.DEALER AND
         F.CAR_COMPANY = D.CAR_COMPANY
    GROUP BY F.YEAR, F.DEALER, F.CAR_COMPANY, D.PRODUCT_DETAIL, D.SALES_DESCRIPTION;

  -- Union with Constant Values (UCV)
  var_out =
    SELECT
      SUM(SALES_AMOUNT) AS SALES_AMOUNT,
      SUM(SALES_QUANTITY) AS SALES_QUANTITY,
      YEAR, DEALER, CAR_COMPANY, PRODUCT_DETAIL, SALES_DESCRIPTION
    FROM
    (
      SELECT * FROM :F1 UNION ALL
      SELECT * FROM :F2
    )
    WHERE
      PRODUCT_DETAIL IS NOT NULL AND -- adding these filters simulates INNER join
      SALES_DESCRIPTION IS NOT NULL
    GROUP BY YEAR, DEALER, CAR_COMPANY, PRODUCT_DETAIL, SALES_DESCRIPTION
    ORDER BY YEAR, DEALER, CAR_COMPANY;

  -- check the result
  SELECT 'UCV' AS MODEL_DESIGN, * FROM :var_out;

  -- compare against inner join approach
  SELECT
    'INNER JOIN' AS MODEL_DESIGN,
    SUM(SALES_AMOUNT) AS SALES_AMOUNT,
    SUM(SALES_QUANTITY) AS SALES_QUANTITY,
    T1.YEAR, T1.DEALER, T1.CAR_COMPANY, T1.PRODUCT_DETAIL, T2.SALES_DESCRIPTION
  FROM SALES_DETAIL T1
  INNER JOIN SALES_QUANTITY T2
    ON T1.YEAR = T2.YEAR AND
       T1.DEALER = T2.DEALER AND
       T1.CAR_COMPANY = T2.CAR_COMPANY
  GROUP BY T1.YEAR, T1.DEALER, T1.CAR_COMPANY, T1.PRODUCT_DETAIL, T2.SALES_DESCRIPTION
  ORDER BY YEAR, DEALER, CAR_COMPANY;

END;

CALL UCV_EX;

former_member184768
Active Contributor
0 Kudos

Hi All,

Trying to put forward a completely different point of view:

I am still not convinced that the UCV is the right approach. Here we are forcefully trying to fit a union by creating the single dimension table. That can be acceptable in the car dealer scenario since the example tables are small, but in real-life scenarios with large tables, such a single dimension table might be as big as the fact tables themselves.

From a purely database perspective, I would expect HANA to allow me to perform table joins with acceptable performance. If the performance is slow, then SAP should fix it. Unfortunately we end up creating workarounds for simple business requirements and spending a lot of time and effort maintaining those workarounds. At times it gets hard to convince business users why their simple requirements cannot be met due to limitations of the technology.

So I think Rajarshi should raise an OSS message, take this example to SAP, and let them figure out a way to optimize the HANA engine to allow large table joins. Maybe SAP will make it available in SPS08 or 09.

Regards,

Ravi

lbreddemann
Active Contributor
0 Kudos

Hi Ravi

I wholeheartedly agree that bugs should be raised with SAP support to get them fixed, and that product limitations which restrict the usability of the platform for business scenarios should be addressed as well.

No question about that.

And SAP HANA really is a prime example of this actually happening.

In fact, the adaptations, bug fixes and enhancements that are built into SAP HANA based on direct customer feedback are delivered so often that some customers cannot keep up and need to skip revisions rather regularly.

Sure enough, if you're working on one super-large project that by its very design challenges the boundaries of SAP HANA, you'll sometimes feel that your requests and requirements take too long to get into the platform.

Still, there's no other platform available that provides these feature turnaround times.

Also, it must always be checked and analyzed closely whether remodeling the solution isn't the better option.

Just taking a model (often with preconceptions about how to properly model on system abc still in mind) and implementing it 1:1 in SAP HANA will not deliver the desired performance.

Sometimes an adjustment in the modeler's point of view is required.

- Lars

lbreddemann
Active Contributor
0 Kudos

Rajarshi Muhuri wrote:

I had long since looked at John Appleby's blog and took two ideas:

1. partitioning the table

2. turning off delta merge and manually merging the data while exiting the SP.

Sorry, but point 2 is very likely not leading anywhere.

Your code does not wait for the merge to finish - merges don't block your queries/transactions.

They just don't.

What they could block is, well, other merges.

You wrote earlier that writing your data to the target table takes so long.

One way you could approach this would be to store the data in a temporary table first and copy the content into the final table afterwards in one go.
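A minimal sketch of that staging pattern against the procedure posted above (the staging table name is made up; untested):

-- empty staging table with the same structure as the final target
CREATE COLUMN TABLE CPM_SAP.ZZFSL_STAGING
  AS (SELECT * FROM CPM_SAP.ZZFSL_CE11_REV_JOIN) WITH NO DATA;

-- inside the chunking loop, write each chunk to the staging table instead
INSERT INTO CPM_SAP.ZZFSL_STAGING (SELECT * FROM :VAR_JOIN);

-- after the loop, copy everything into the final table in one go
INSERT INTO CPM_SAP.ZZFSL_CE11_REV_JOIN (SELECT * FROM CPM_SAP.ZZFSL_STAGING);
DROP TABLE CPM_SAP.ZZFSL_STAGING;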

If you run into memory issues during the merge  due to the size of the table, then partitioning it might be a viable approach to overcome this (however, you might see increased column unloading/loading then).

- Lars

Former Member
0 Kudos

I'm not with you here, Ravi. Joins work well, but they are mathematically expensive with very large data models. Go and try to do this join on Oracle - it will collapse in a heap! HANA makes things possible... but it doesn't mean they are a good idea!


UCV is mathematically much simpler, and so runs faster with larger volumes. That's life, and if you want super-fast aggregation and join performance then you need to normalize your model and use a UCV.

You choose your design - normalize your model and use a UCV, or deal with the performance hit of doing an expensive multi-column join.

HANA can't change the laws of physics!

John

Former Member
0 Kudos

Hey Lars,

We might be talking at cross purposes, but I find that controlling the partition and delta merge process can be very helpful.

If you are loading large volumes of data then you need to make sure that your delta store doesn't get too big, for sure. I've had problems where I tried to load tons of data into lots of small partitions and the delta store RAM grew larger than the available RAM, because I didn't do any interim merges.

So I try to set up a partition strategy where I load one partition at a time from CSV and do a manual delta merge between CSV file loads. So:

- Partition by RANGE(YEAR)

- LOAD 2011, MERGE

- LOAD 2012, MERGE

- LOAD 2013, MERGE

Using this strategy you get smoother loads and sometimes up to 2x performance increase.

In addition, you may find that if load performance is critical, increasing the number of partitions that can be written to in parallel helps. You get an increase with up to 10 partitions in a hash.

- Partition by HASH(MEASURE), RANGE(YEAR)

- LOAD 2011, MERGE

- LOAD 2012, MERGE

- LOAD 2013, MERGE

If you have multiple nodes, then HASH partitioning with 1-3 partitions per node can be beneficial.

Having more partitions that are always accessed can negatively impact query performance, because HANA has to reassemble the data from the partitions. Swings and roundabouts!
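To make the strategy concrete, the DDL could look roughly like this (using the SALES_DETAIL example table from earlier and an invented hash column; the exact ranges are up to you):

-- single-level range partitioning by year
ALTER TABLE SALES_DETAIL PARTITION BY RANGE (YEAR)
  (PARTITION 2011 <= VALUES < 2012,
   PARTITION 2012 <= VALUES < 2013,
   PARTITION 2013 <= VALUES < 2014,
   PARTITION OTHERS);

-- load one year's CSV, then merge that delta before loading the next year
MERGE DELTA OF SALES_DETAIL;

-- multi-level variant: hash first for parallel writes, then range by year
ALTER TABLE SALES_DETAIL PARTITION BY HASH (DEALER) PARTITIONS 4,
  RANGE (YEAR) (PARTITION 2011 <= VALUES < 2012, PARTITION OTHERS);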


John

former_member184768
Active Contributor
0 Kudos

Agreed with John. Although delta merge and partitioning were not the original topic, that doesn't mean the discussion should be restricted.

Recently we saw a significant improvement in the data loading process on BW on HANA by partitioning not only the active data tables, but also the change log and new data tables. The delta merges were comparatively smaller and we did not face out-of-memory situations, as we did before partitioning.

Regards,

Ravi

former_member184768
Active Contributor
0 Kudos

Well, here is the scenario.

We have one transactional application on SAP ERP which caters to most of the big markets in the world, and equivalent applications on non-SAP platforms for smaller / budget-conscious countries. Both applications capture KPIs at document level. The master data is maintained in ERP as well as in the non-SAP application, with a mapping maintained in the ERP application. These applications source data to the BW landscape. We calculate some KPIs in SAP BW which depend on the KPIs coming from the ERP and non-SAP applications. In the current Oracle-based BI landscape, the calculated KPIs are persisted by a weekly job executed on the weekend. Any change in any of the KPIs in the SAP or non-SAP application, missed or back-dated transactions, or clean-up of the error log results in re-validation of the calculated KPIs and hence re-generation of the persisted KPIs. This ends up as heavy data processing, impacting the data availability SLAs.

Now comes HANA with the promise of on-the-fly calculation, no need for redundant persistence, reduced application complexity, focus on business not technology, etc. This is how it was sold to senior IT management and this is how it was perceived by the IT architecture team. So we proposed an architecture with an on-the-fly join, since the KPIs are volatile and the business wants the calculated KPI information at document level. The data volume from both the SAP and non-SAP applications is high, hence the high-volume data-join scenario. It is practically NOT possible to model it as a UNION due to the complexity of deriving the KPIs on a single level. Any de-normalization of the underlying data model to get both data sets to the same attribute level for a union would result in significantly high effort and data redundancy.

But then we faced huge performance issues. We tried all possible combinations with help from HANA experts from SAP, but could not get the required performance from the join model. We had to go back to the previous persistence model with all the known issues of data discrepancy and the re-validation process.

I am not saying HANA is the solution to everything, but it surely has the potential to deliver what was promised, and plenty of opportunity to improve in the coming days. It is a very promising and exciting technology to work on, but at the same time the expectations that have been set are a bit high.

So finally, the point is: union is great, definitely improves performance, and is highly recommended for most business scenarios. At the same time, the expectation for the join is that it should be possible with acceptable performance. I'm not expecting it to happen tomorrow, but it should be made available in the near future.

Regards,

Ravi

Former Member
0 Kudos

Right, that's a great point: only those partitions where there are changes will need to be merged, which massively reduces the merge cost for large tables with a good partition strategy.

John

lbreddemann
Active Contributor
0 Kudos

Hey there,

I didn't want to imply that partitioning and synchronizing data loads with merge activities (like we do in SAP BW with the smart merge) don't help to improve overall system resource usage and performance.

But for the example given I highly doubt that the actual merge process was adding anything to the runtime.

- Lars

Former Member
0 Kudos

Got it, I thought we might be at cross purposes.

Hmm, if you include a merge and an automerge hasn't finished, then it would add to the runtime? Otherwise you could let the merge carry on in the background.

However I have found that doing LOAD -> MERGE -> BIG QUERY is usually faster than doing LOAD -> BIG QUERY because the query can be slow if there is a large delta store that hasn't completed a merge.

John

former_member184768
Active Contributor
0 Kudos

I agree with this too. Our experience was that even when HANA decides a merge is not required because the delta size has not reached the threshold set by the system (whatever formula HANA uses to decide the delta size for a merge), a forced delta merge still improved query performance.

Regards,

Ravi

lbreddemann
Active Contributor
0 Kudos

John, spot on with the observation.

Clearly, running over a huge delta store is less efficient than reading the same data from the highly compressed main store.

That's the reason for having the merge in the first place...

The thing with the manual merge is that, as opposed to the auto merge, your transaction does wait for it to return (also, it doesn't do

As there can usually only be 2 parallel merges going on in the system in total, this could mean that it will even have to wait for unrelated merges to finish.

As I see it, the idea here is that if you use MERGE or SMART MERGE you basically accept all that for the sake of reducing the number of unnecessary 'in-between' merges.

Clearly, in straightforward mass-load scenarios big improvements are possible.

You could tune even further by loading directly to the correct host in a multi-node system and/or setting up indexes/constraints only after data loading.

Hmm... so many knobs to turn and buttons to push, and so little time...

Cheers, Lars

lbreddemann
Active Contributor
0 Kudos

Hey Ravi,

that's the thing with one-size-fits-all heuristics:

The merge decision rules clearly can only aim at evening out the performance impact of performing the delta merge over the whole system.

It will always be possible to find cases where we will know better.

But it's very easy to go wrong when trying to outsmart the heuristics. E.g. putting forced merges into parallel loading threads for the same object doesn't scale and doesn't improve things.

And in my experience, people tend to overlook the impact of their optimizations on the whole system, and what happens when they are used in parallel.

So, call me conservative, but I prefer to be reluctant with these kinds of optimizations.

Also: they are so 'invisible', if you know what I mean.

Two months from now I won't remember what I coded this morning, and things like custom merge schemes really don't stick out when I look at the system.

Hmm... this discussion really branched out from the original thread. Can't say I dislike it, though.

Cheers, Lars

rajarshi_muhuri
Active Participant
0 Kudos

Individual optimization is always more efficient than generic whole-system optimization. For example, in the C language programmers can allocate and de-allocate a chunk of memory for a process, whereas more modern languages do automatic garbage collection and memory management; but nothing is tighter than optimized C code.

However, reading Lars' comment that the system waits for an unrelated merge to finish before it starts another merge thread (during a manual merge) makes me think that manual merges are pointless, as SLT is constantly merging different SAP tables.

On my SP's target table I had not put any constraints at the table level either, so that the data inserts don't have to waste effort doing integrity checks. When the data load is finished I add the constraints manually with an ALTER command.
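(Purely illustrative - the constraint name and key columns below are invented - but the deferred-constraint idea looks like this:)

-- load first, then add the key once, instead of checking it on every insert
ALTER TABLE CPM_SAP.ZZFSL_CE11_REV_JOIN
  ADD CONSTRAINT PK_ZZFSL PRIMARY KEY (REFDOCNR, ZZBUZEI, ZPERIOD);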

I know partitioning definitely helps performance; I am not sure now about exclusive locks or manual merges. But I guess John Appleby has done some comparison studies on really BIG data. Our DEV and QA have smaller data sets, so manual merges actually hinder performance there.

10muscat
Explorer
0 Kudos

What if you put an AGGREGATION after the union?

lbreddemann
Active Contributor
0 Kudos

Hi there,

Indeed, this is a typical situation with DB development in SAP HANA.

There are large tables that are joined with other large tables and/or with information models (in this case an analytic view) and they should produce a large joined result set.

One key point has been overlooked here: to get the super short response times, it's absolutely required to reduce the amount of data that is worked upon as early as possible.

Very likely the usage scenario for the result of this join will be further analysis steps - not the pure list output.

If this is the case, then the models need to be created appropriately. Most of the time, the 'one-model-fits-all-reports' approach won't lead to satisfying results.

To really gain performance for your design, let us know the reporting/analysis requirement(s) you want to solve. Based on that one or several models can be created to accommodate them.

- Lars

Former Member
0 Kudos

Can you paste the table DDL, along with some sample data that we could explode into a big dataset? How big are the tables? What is the performance of the join and the timing of this procedure?

This is a fairly typical challenge with HANA. A few comments:

- It surprises me that a view can't be made to perform well

- CE Functions will typically improve the performance of this sort of code

- Even so, your insert cost will be substantial if there are a lot of rows - you should be able to get 1m rows/sec at best

- Consider partitioning your target table by TIMESTAMP/PERIOD so that your MERGE DELTA doesn't affect the same data that is already stored. This will massively increase INSERT/MERGE DELTA performance

- You need to nest your SQLScript functions so there are no scalars in your SELECT statements. This code will all run in a single thread, which isn't helping

John

rajarshi_muhuri
Active Participant
0 Kudos
CE11000 table count = 365,501,068
Z_SP_FIN table count = 349,816,890
We have to join the two tables; we take some 10+ dimensions and 2 measures from table CE11000, and 15+ dimensions and 4 measures from the Z_SP_FIN table.
The join fields are:
  1.                         Z."ZZBUZEI_RPOSN"=C."ZZBUZEI_RPOSN" AND  
  2.                         Z."REFDOCNR"=C."RBELN" AND  
  3.                         Z."AWORG"=C."COPA_AWORG" AND  
  4.                         Z."RBUKRS"=C."BUKRS" AND 
  5.                         Z."PERIOD_ID"=C."PERIOD_ID" AND 
  6.                         Z.RACCT = C.KSTAR

We have a reconciliation process where I use UNION with constants, as we need company, period, RACCT and all 5 measures. Since the join fields are the only ones used in that final report, I was able to do a union (with a trick using an additional column holding 1 or 0) to get the same results as a left outer join - see the sketch below.
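Roughly, that flag trick looks like the following sketch (simplified to a few columns, with made-up table-variable names):

-- each branch carries a constant flag marking which table the row came from
FLAGGED =
  SELECT RBUKRS, PERIOD_ID, RACCT, HSL, 0.0 AS VV005, 1 AS LEFT_FLAG
  FROM   :LEFT_SIDE
  UNION ALL
  SELECT BUKRS, PERIOD_ID, KSTAR, 0.0 AS HSL, VV005, 0 AS LEFT_FLAG
  FROM   :RIGHT_SIDE;

-- after aggregation, keeping only groups with at least one left-side row
-- reproduces the left-outer-join result at this level of detail
RESULT =
  SELECT RBUKRS, PERIOD_ID, RACCT, SUM(HSL) AS HSL, SUM(VV005) AS VV005
  FROM   :FLAGGED
  GROUP BY RBUKRS, PERIOD_ID, RACCT
  HAVING SUM(LEFT_FLAG) > 0;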

But in the other process we need dimensions and measures on the same line from both tables, so a union with constants would not suffice.

John A: This is a process created in March 2012. Even though it's slow, and there have been many improvements in HANA since, I never got around to changing it, except for a slightly more optimized SP that is used when smaller increments are run; when we pick up data from 2011 to date, this SP is used.

I can share garbled data with you by email, along with the model metadata. Let me know. My mail is rajarshi dot muhuri at Accenture dot com