
HANA POC - Generating big data

Former Member
0 Kudos

Hi All,

We are in the process of developing a POC in the SAP HANA space. We need big data (large volumes of data), and we are planning to generate this data ourselves.

Are there any SAP standard programs for generating large volumes of data? If not, what approach could we follow to develop one?

Regards,

Ganga

Accepted Solutions (1)

Former Member
0 Kudos

Hi Ganga,

I'm not aware of any SAP standard programs to generate large data sets. I've done something similar for my own knowledge building in the past. The following scripts create some sample dimension and fact tables and then populate them either by direct insert, or through a stored procedure that takes an integer stating how large the sample data set should be. Note that this may not be the most optimal method: creating a fact table with 10m rows took about 50 minutes on our internal box, so creating records in the billions could take a day or more with the approach below. Another option would be to create Excel spreadsheets and then import them.

Regards,

Gary Elliott

CREATE COLUMN TABLE "SALES_F" ("SALES_ORDER_NBR" BIGINT CS_FIXED NOT NULL ,

       "CALENDAR_DAY" DAYDATE CS_DAYDATE,

       "BUSINESS_UNIT_ID" BIGINT CS_FIXED,

       "MATERIAL_ID" BIGINT CS_FIXED,

       "SUPPLIER_ID" BIGINT CS_FIXED,

       "UNIT_PRICE" DOUBLE CS_DOUBLE,

       "QUANTITY_SOLD" DOUBLE CS_DOUBLE,

       PRIMARY KEY ("SALES_ORDER_NBR"));


CREATE COLUMN TABLE "BUSINESS_UNIT_D" ("BUSINESS_UNIT_ID" BIGINT CS_FIXED NOT NULL ,

       "BUSINESS_UNIT_CODE" NVARCHAR(5),

       "BUSINESS_UNIT_DESC" NVARCHAR(256),

       "PARENT_BUSINESS_UNIT_ID" BIGINT CS_FIXED,

       "PARENT_BUSINESS_UNIT_CODE" NVARCHAR(5),

       PRIMARY KEY ("BUSINESS_UNIT_ID"));


CREATE COLUMN TABLE "SUPPLIER_D" ("SUPPLIER_ID" BIGINT CS_FIXED,

       "SUPPLIER_DESC" VARCHAR(60),

       PRIMARY KEY("SUPPLIER_ID"));


CREATE COLUMN TABLE "MATERIAL_D" ("MATERIAL_ID" BIGINT CS_FIXED,

       "SKU" VARCHAR(16),

       "MATERIAL_GROUP" VARCHAR(60),

       PRIMARY KEY("MATERIAL_ID"));

INSERT INTO "BUSINESS_UNIT_D"

VALUES(1,'BU1','Business Unit 1',0,'');

INSERT INTO "BUSINESS_UNIT_D"

VALUES(2,'BU2','Business Unit 2',1,'BU1');

INSERT INTO "BUSINESS_UNIT_D"

VALUES(3,'BU3','Business Unit 3',1,'BU1');

INSERT INTO "BUSINESS_UNIT_D"

VALUES(4,'BU4','Business Unit 4',2,'BU2');

INSERT INTO "BUSINESS_UNIT_D"

VALUES(5,'BU5','Business Unit 5',3,'BU3');

INSERT INTO "BUSINESS_UNIT_D"

VALUES(6,'BU6','Business Unit 6',3,'BU4');

INSERT INTO "BUSINESS_UNIT_D"

VALUES(7,'BU7','Business Unit 7',4,'BU4');

INSERT INTO "BUSINESS_UNIT_D"

VALUES(8,'BU8','Business Unit 6',4,'BU4');

CREATE COLUMN TABLE ADJECTIVE (ID INTEGER, WORD VARCHAR(60), PRIMARY KEY ("ID"));
CREATE COLUMN TABLE NOUN (ID INTEGER, WORD VARCHAR(60), PRIMARY KEY ("ID"));
CREATE COLUMN TABLE SUP_TYPE (ID INTEGER, WORD VARCHAR(60), PRIMARY KEY ("ID"));

INSERT INTO ADJECTIVE VALUES(1, 'Great');
INSERT INTO ADJECTIVE VALUES(2, 'Modern');
INSERT INTO ADJECTIVE VALUES(3, 'Fast');
INSERT INTO ADJECTIVE VALUES(4, 'Proud');
INSERT INTO ADJECTIVE VALUES(5, 'Solid');
INSERT INTO ADJECTIVE VALUES(6, 'Broad');
INSERT INTO ADJECTIVE VALUES(7, 'Elegant');
INSERT INTO ADJECTIVE VALUES(8, 'Fancy');
INSERT INTO ADJECTIVE VALUES(9, 'Mysterious');
INSERT INTO ADJECTIVE VALUES(10, 'Fantastic');

INSERT INTO NOUN VALUES(1, 'Factory');
INSERT INTO NOUN VALUES(2, 'Offices');
INSERT INTO NOUN VALUES(3, 'Industry');
INSERT INTO NOUN VALUES(4, 'Station');
INSERT INTO NOUN VALUES(5, 'Restaurant');
INSERT INTO NOUN VALUES(6, 'Buildings');
INSERT INTO NOUN VALUES(7, 'Mall');
INSERT INTO NOUN VALUES(8, 'Studio');
INSERT INTO NOUN VALUES(9, 'Stockbrokers');
INSERT INTO NOUN VALUES(10, 'Academy');

INSERT INTO SUP_TYPE VALUES(1, 'Limited');
INSERT INTO SUP_TYPE VALUES(2, 'Pty Ltd');
INSERT INTO SUP_TYPE VALUES(3, 'Partnership');
INSERT INTO SUP_TYPE VALUES(4, 'Group');
INSERT INTO SUP_TYPE VALUES(5, 'Trust');
INSERT INTO SUP_TYPE VALUES(6, 'Collective');
INSERT INTO SUP_TYPE VALUES(7, 'Consortium');
INSERT INTO SUP_TYPE VALUES(8, 'Inc.');
INSERT INTO SUP_TYPE VALUES(9, 'Traders');
INSERT INTO SUP_TYPE VALUES(10, 'Franchise');

CREATE SEQUENCE seq START WITH 1;

CREATE PROCEDURE BUILD_SUPPLIER_TABLE (IN NMBR INT) LANGUAGE SQLSCRIPT AS
CNTR INTEGER;
BEGIN
CNTR := 0;
WHILE CNTR < :NMBR DO
  -- build a random supplier name from one adjective, one noun and one company type
  INSERT INTO SUPPLIER_D
  SELECT seq.NEXTVAL,
         (SELECT TOP 1 WORD FROM ADJECTIVE WHERE ID = SUBSTR(ROUND(RAND() * 9, 0), 1, 1) + 1 ORDER BY WORD) || ' ' ||
         (SELECT TOP 1 WORD FROM NOUN WHERE ID = SUBSTR(ROUND(RAND() * 9, 0), 1, 1) + 1 ORDER BY WORD) || ' ' ||
         (SELECT TOP 1 WORD FROM SUP_TYPE WHERE ID = SUBSTR(ROUND(RAND() * 9, 0), 1, 1) + 1 ORDER BY WORD) AS SUPDESC
  FROM DUMMY;
  CNTR := CNTR + 1;
END WHILE;
END;

CALL BUILD_SUPPLIER_TABLE(1000);

CREATE COLUMN TABLE MAT_GROUP (ID INTEGER, WORD VARCHAR(60), PRIMARY KEY ("ID"));

INSERT INTO MAT_GROUP VALUES(1, 'Engine');
INSERT INTO MAT_GROUP VALUES(2, 'Exterior');
INSERT INTO MAT_GROUP VALUES(3, 'Interior');
INSERT INTO MAT_GROUP VALUES(4, 'Accessories');
INSERT INTO MAT_GROUP VALUES(5, 'Electrical');
INSERT INTO MAT_GROUP VALUES(6, 'Components');
INSERT INTO MAT_GROUP VALUES(7, 'Finishing');
INSERT INTO MAT_GROUP VALUES(8, 'Hydraulics');
INSERT INTO MAT_GROUP VALUES(9, 'Liquids');
INSERT INTO MAT_GROUP VALUES(10, 'Extras');


CREATE PROCEDURE BUILD_MAT_GROUP_TABLE (IN NMBR INT) LANGUAGE SQLSCRIPT AS
CNTR INTEGER;
BEGIN
CNTR := 0;
WHILE CNTR < :NMBR DO
  -- generate a material with a random SKU number and a random material group
  INSERT INTO MATERIAL_D
  SELECT :CNTR,
         'SKU' || LPAD(ROUND((RAND() * 1000000), 0), 7, '0000000') AS SKU,
         (SELECT TOP 1 WORD FROM MAT_GROUP WHERE ID = SUBSTR(ROUND(RAND() * 9, 0), 1, 1) + 1 ORDER BY WORD) AS MATERIAL
  FROM DUMMY;
  CNTR := CNTR + 1;
END WHILE;
END;

CALL BUILD_MAT_GROUP_TABLE(10000);

CREATE PROCEDURE BUILD_FACT_TABLE (IN NMBR INT) LANGUAGE SQLSCRIPT AS
CNTR INTEGER;
BEGIN
CNTR := 0;
WHILE CNTR < :NMBR DO
  -- generate a fact row with a random date in 2011/2012 and random dimension keys
  INSERT INTO SALES_F
  SELECT :CNTR,
         ADD_DAYS (TO_DATE ('2011-01-01', 'YYYY-MM-DD'), RAND() * 730),
         ROUND((RAND() * (SELECT COUNT(*) FROM BUSINESS_UNIT_D)), 0),
         ROUND((RAND() * (SELECT COUNT(*) FROM MATERIAL_D)), 0),
         ROUND((RAND() * (SELECT COUNT(*) FROM SUPPLIER_D)), 0),
         ROUND(RAND() * 1000, 2),
         ROUND(RAND() * 100, 0)
  FROM DUMMY;
  CNTR := CNTR + 1;
END WHILE;
END;

CALL BUILD_FACT_TABLE(10000000);

lbreddemann
Active Contributor
0 Kudos

Hi Gary,

This is a nice little tool to create mass data.

And your code actually works straight away with copy/paste - I so wish that were the case much more often here in the forums.

However, two remarks need to be made:

  • RAND() is available only as of revision 46 (documented with rev. 47), so this code won't work on older revisions.
  • These kinds of "looped" data sets create lots of data, but they don't create realistic data distributions.

The latter point can become very important once actual data analysis is done on the data.

Depending on the data distribution, the same model implementation can become slow, and e.g. partitioning schemes can turn out to be sub-optimal.

But keeping this in mind, the code posted delivers a nice set of play data.

Thanks for sharing,

Lars

Former Member
0 Kudos

Thanks Lars,

I should've added that RAND() will only be available from revision 46 onwards. Glad to see it's finally been included, as it's something I've missed for a while (along with window functions, which are now included, but that's a different story).

For the data distributions, I did play around with the idea of creating some fake data distributions - something along the lines of a weighting ratio in the dimension tables to skew the allocation and show something more meaningful when reporting. I might revisit the code, look at adding this in, and create a blog here. Sample data sets seem to be asked for a lot, so I think it will be worthwhile to expand on this.
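As a rough illustration of the weighting idea (a hypothetical sketch only - it assumes a WEIGHT column has been added to MATERIAL_D, which is not in the original DDL, and it relies on window functions):

-- Hypothetical sketch: pick a MATERIAL_ID with probability proportional to
-- an assumed WEIGHT column on MATERIAL_D (not part of the original scripts).
CREATE PROCEDURE PICK_WEIGHTED_MATERIAL (OUT MAT_ID BIGINT) LANGUAGE SQLSCRIPT AS
R DOUBLE;
BEGIN
  -- draw once so every row is compared against the same random value
  R := RAND();
  -- the cumulative weight share forms an inverse CDF; the first row whose
  -- share reaches R is chosen with probability proportional to its weight
  SELECT TOP 1 "MATERIAL_ID" INTO MAT_ID
  FROM (SELECT "MATERIAL_ID",
               SUM("WEIGHT") OVER (ORDER BY "MATERIAL_ID") /
               SUM("WEIGHT") OVER () AS CUM_SHARE
        FROM "MATERIAL_D")
  WHERE CUM_SHARE >= :R
  ORDER BY CUM_SHARE;
END;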

Regards,

Gary

lbreddemann
Active Contributor
0 Kudos

Just a little addition to the above:

When running the huge insert statement, one might wonder: "How far along is my insert loop by now?"

As we cannot just "look into" the running procedure, this question is not easy to answer.

Luckily, it's very easy to provide some status information via session variables!

CREATE PROCEDURE BUILD_FACT_TABLE (IN NMBR INT) LANGUAGE SQLSCRIPT AS
CNTR INTEGER;
BEGIN
CNTR := 0;
WHILE CNTR < :NMBR DO
  INSERT INTO SALES_F
  SELECT :CNTR,
         ADD_DAYS (TO_DATE ('2011-01-01', 'YYYY-MM-DD'), RAND() * 730),
         ROUND((RAND() * (SELECT COUNT(*) FROM BUSINESS_UNIT_D)), 0),
         ROUND((RAND() * (SELECT COUNT(*) FROM MATERIAL_D)), 0),
         ROUND((RAND() * (SELECT COUNT(*) FROM SUPPLIER_D)), 0),
         ROUND(RAND() * 1000, 2),
         ROUND(RAND() * 100, 0)
  FROM DUMMY;

  -- monitoring via m_session_context:
  -- publish the loop progress as a session variable every 1000 rows
  IF MOD(:CNTR, 1000) = 0 THEN
    EXEC 'set session ''BUILD_FACT_TABLE PROGRESS''= '''||:CNTR ||' of '|| :NMBR ||''' ';
  END IF;

  CNTR := CNTR + 1;
END WHILE;
END;

The little IF construct updates the session variable BUILD_FACT_TABLE PROGRESS after every 1000 rows that have been processed.

To check the current value one can use the following statement:

select * from "PUBLIC"."M_SESSION_CONTEXT" where key = 'BUILD_FACT_TABLE PROGRESS';


Pretty easy and handy.

One thing though: if you run the procedure in multiple sessions, you should filter the SELECT against the M_SESSION_CONTEXT table by the CONNECTION_ID of the loading session.
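For example (a sketch - 123456 is just a placeholder for the CONNECTION_ID of the session running the procedure, which you can look up in M_CONNECTIONS):

-- monitor only the loading session; replace 123456 with its connection id
SELECT * FROM "PUBLIC"."M_SESSION_CONTEXT"
 WHERE KEY = 'BUILD_FACT_TABLE PROGRESS'
   AND CONNECTION_ID = 123456;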

- Lars

lbreddemann
Active Contributor
0 Kudos

Yep - sample data sets are really important.

I guess it will be quite difficult to come up with a "general" solution that could take some data distribution information (maybe something like a 5-point summary plus skewness?) from present business data and create appropriate dummy data that matches it.

An alternative approach could be to set up a couple of rules for different domains (FLAG status, DATEs, TYPE data) and create data based on them - something like the sketch below...
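For instance, such rules might look like the following (a hypothetical sketch only; the 10% flag ratio and the multiplied-RAND() skew toward recent dates are made-up illustrations, not from this thread):

-- Hypothetical domain rules (illustrative values):
-- a FLAG set for roughly 10% of rows, and a DATE skewed toward the recent past
-- (the product of two uniform RAND() draws clusters near zero).
SELECT CASE WHEN RAND() < 0.1 THEN 'X' ELSE ' ' END AS FLAG,
       ADD_DAYS(TO_DATE('2012-12-31', 'YYYY-MM-DD'),
                - ROUND(RAND() * RAND() * 365, 0)) AS SKEWED_DATE
FROM DUMMY;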

In any case, your example is a really good starting point!

- Lars

Answers (4)

Former Member
0 Kudos

Don't forget about the 2-billion-row-per-table limitation.

Former Member
0 Kudos

Hi Mikhail,

For COLUMN tables, you can get around the 2 billion row limit by partitioning the table (see the sketch below). Generally, you'll do this across a multi-node cluster for the best performance. Regarding Gary's solution for generating the data this way: it is not likely to be representative of real data, as the fact table will have fewer unique values and will therefore achieve an unusually good compression ratio. This is why I like the TPC-DS benchmark over the TPC-H benchmark. TPC-H goes too far in generating unique values for the fact table; for TPC-DS, on the other hand, more thought by the committee went into making the DB design and data generation true to life. More importantly, you can really crank up the size of the data set. Check out these great reads on the making of the TPC-DS benchmark: http://www.vldb.org/conf/2006/p1049-othayoth.pdf and http://www.tpc.org/tpcds/presentations/tpcds_workload_analysis.pdf. You can find the generator at http://www.tpc.org/tpcds/default.asp - look for the DSGen link. Note that it's critical to use the scale factors designated for official benchmark runs, which create both the query files and the data files; if you don't, the DSGen program errors out. There is a scale factor of 1 available, producing 1GB of data, for getting the queries right. You can then crank it up to 100GB, 300GB, 1TB, 3TB, 30TB and 100TB.
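As a quick sketch of that partitioning (the hash column and the partition count are illustrative choices, not a recommendation), Gary's fact table could be split like this:

-- Illustrative sketch: hash-partition the fact table; the 2 billion row limit
-- then applies per partition rather than to the table as a whole.
ALTER TABLE "SALES_F" PARTITION BY HASH ("SALES_ORDER_NBR") PARTITIONS 4;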

TPC-DS is a brutal benchmark - notice that no vendors have published results for it yet. It's not possible for a pure columnar database to game this benchmark, so it requires an engine that can balance ROW and COLUMN tables for the best performance. I'm guessing the HANA dev team in Germany is looking hard at making this work. 🙂

Regards,

Bill

Former Member
0 Kudos

Hi Ganga,

We have started the process of developing a POC in the SAP HANA space. Can you please guide me? Please share any helpful documents you have.

thank you

Edit by Marcel Salein (moderator): Please share the documents on SCN, so that all other community members can benefit from them.

Former Member
0 Kudos

Hi,

If you want to do this much faster and without coding:

After Gary's initial setup of the tables:

CREATE COLUMN TABLE <mytable>...;

and after Gary's initial inserts:

INSERT INTO <mytable>...;

just execute a couple of times:

INSERT INTO <mytable> SELECT * FROM <mytable>;

and issue after each insert a

MERGE DELTA...;

and bear in mind that the data is highly compressed now!
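Applied to Gary's fact table, one round of that doubling could look like this (a sketch; the key offset is needed because SALES_F has a primary key on SALES_ORDER_NBR, so a plain SELECT * self-insert would fail with duplicate keys):

-- Double SALES_F: copy every row with the key shifted past the current maximum
-- so the PRIMARY KEY on SALES_ORDER_NBR is not violated.
INSERT INTO "SALES_F"
SELECT "SALES_ORDER_NBR" + (SELECT MAX("SALES_ORDER_NBR") + 1 FROM "SALES_F"),
       "CALENDAR_DAY", "BUSINESS_UNIT_ID", "MATERIAL_ID",
       "SUPPLIER_ID", "UNIT_PRICE", "QUANTITY_SOLD"
FROM "SALES_F";

-- Move the new rows from the delta store into the compressed main store.
MERGE DELTA OF "SALES_F";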


Martin

Former Member
0 Kudos

Hi Ganga,

If you want to generate multiple GB of data for query tests, check out the TPC.Org web site. They have data generators used for the TPC-H and the new TPC-DS benchmark tests, along with queries that can be used to test performance. You can find the data generator on the http://tpc.org/tpce/default.asp page. There is a simple form to fill out when you go to http://tpc.org/tpce/egen-download-request.asp to request the download. There are Windows and Linux versions of the data generator, written in C++. The Linux version is handy, as you can run it on your HANA instance on AWS without having to upload the resulting files. You can get the DSGen tool at this link as well: http://tpc.org/tpcds/dsgen-download-request.asp. The DS benchmark is a better data warehousing benchmark IMHO, as its schema is more real-world and it's designed for more mixed workloads. With TPC-E, pure column-oriented database vendors were able to game the benchmark to perform much better than traditional databases. The DS benchmark is harder to game - that could be the reason no vendors have published numbers.

Net-net: the generators and sample queries are good for getting more insight into how well HANA can perform with larger data sets.

Regards,

Bill