Currently Being Moderated

Data Hackery

 

Data Hackery is a passion of mine, and as long as I have been involved in SAP and OpenSource, I have delighted in finding new and interesting ways of getting more out of what we already have, or making connections between data in SAP and things that haven't been apparent to me before.

 

To me - data hackery is typified by curiosity, and the need/desire to iteratively reshape and query data to gain new insights, and to satisfy questions and problems that we are burning to resolve.

 

 

In-Memory Databases

 

This Christmas just passed, I started investigating the In-Memory database landscape - in particular in the OpenSource world, and found that there is already quite a lot to see out there eg: http://en.wikipedia.org/wiki/In-memory_database, but what I really wanted was to explore the paradigm of column oriented in-memory stores: http://en.wikipedia.org/wiki/Column-oriented_DBMS  - and in this I found one that really stuck out to me - MonetDB .

 

MonetDB is a column oriented in-memory database, that has a very simple implementation model and respects SQL'99 to SQL'03.  Performance is primarily tuned in the OS, so the main thing you have to look out for is that you have enough memory to do what you are attempting to do.  The second thing is to think carefully about your queries - don't use SELECT * when you don't need to - target the columns required.  Tuning based on indexes are mostly meaningless, but you do need to consider how Bitmap Indexes work.  To me,  this is a brilliant low-cost solution to a fast data manipulation and aggregation problem.

 

 

Mixing it all together

 

With this, I get to the point where I have a personal column oriented In-Memory database with scalability (mostly) limited by how much memory I can throw at it, R and SAP  - this is my idea of fun, and all I need to do is mix them together.

 

I want to be able to take data from SAP, optionally augment it from external sources, and then rapidly perform data analysis in an iterative fashion - the key thing about 'rapid' is that there is low impedance to the process of discovery - and this is where I think there is a good mix in:

 

  • SAP as the authoritative data store
  • R and RStudio are the workbench
  • MonetDB is the lightning fast query engine

 

We can achieve the coupling of this using two connectors - RSAP, and RMonet. (Note: this was updated from RMonetDB to RMonet as the package was renamed due to a naming clash - 9/02/2013).

 

I have covered the process for installing and using RSAP in a previous post here.

 

RMonetDB as a database connector is a new subject and the installation process is described here.  First, you must install MonetDB itself, along with the client tools.  These are available at the download page.  Installation for Ubuntu/Debian is very simple and described here.

 

Before you go much further, I recommend looking through (and trying) the Getting Started tutorial - this steps you through the basic administration process (quick and easy), and handling some simple queries.

 

 

Voilà

 

To demonstrate my point - my example is based on the generated test data in tables SFLIGHT2, and SBOOK.  The data is generated by running report SAPBC_DATA_GENERATOR.

 

As with all data analysis problems, I start with a question - "What were the worst flight to travel on based on predicted mood"?

 

I've equated mood of the flight to the effect that smokers are likely to have on the atmosphere - I'm assuming that the flights are non-smoking, and that the length of flight will have the effect of exacerbating withdrawal symptoms using a weighting metric - smokers are a function of time 0.5 * time * smokers (guessing that a smoker will want to light up every half hour when stressed, so their stress level goes up every 30 mins).

 

The basic process is to source the data in SAP, load it into MonetDB, let MonetDB do the heavy lifting in terms of processing and manipulation, and then use R to martial the process and take care of any presentation or visualisation requirements.

 

Extract the SFLIGHT2 and SBOOK tables from SAP:

 

# load my dependent libraries

library(RMonet)

library(RSAP)

 

# connect to SAP and pull the data I require into data.frames

conn <- RSAPConnect("sap.yml")

print(system.time((sflights <- RSAPReadTable(conn, "SFLIGHTS2"))))

print(system.time((sbook <- RSAPReadTable(conn, "SBOOK", fields=list('CARRID', 'CONNID', 'FLDATE', 'SMOKER', 'CANCELLED')))))

RSAPClose(conn)

 

We now have two data frames sflights, and sbook that contain 4,880, and 1,370,282 rows respectively.

Which took:

SFLIGHTS2

   user  system elapsed

  1.908   0.100   4.015

 

SBOOK

   user  system elapsed

106.559  19.661 347.879

 

These are actually shockingly bad times - almost 6 minutes for 1.3 million rows out of the SAP test drive NPL system?  Well, I only have an Dell XPS-17 with 8Gb ram to run everything on ....

 

 

Load the flight data.frames into MonetDB:

 

# connect to MonetDB and load my data.frames into tables

conn <- RMonetConnect(dbhost='localhost', dbuser='voc', dbpass='voc', dbname='voc')

print(system.time((res = dbwrite(conn, sflights, "sflights", drop=TRUE))))

print(system.time((res = dbwrite(conn, sbook, "sbook", drop=TRUE, tmp.file=TRUE))))

 

The two data frames (sflights, and sbook) are now loaded into MonetDB.

 

sflights loaded in:

   user  system elapsed

  0.388   0.008   0.999

 

sbook loaded in:

   user  system elapsed

16.177   0.432  20.204

 

Not too bad for the old XPS ...

 

 

Run the query (split over lines to make it readable):

 

# find the worst flights to travel - mood due to smokers

print(system.time((worst <- dbquery(conn, 'SELECT "bs"."CARRID", "bs"."CONNID", "bs"."FLDATE", "f"."FLTIME", "bs"."SMOKERS",

                                                                                            ("bs"."SMOKERS" * ("f"."FLTIME" / 30)) AS "STRESS_POINTS"

                                                                             FROM (SELECT "b"."CARRID", "b"."CONNID", "b"."FLDATE", COUNT(*) AS "SMOKERS"

                                                                                           FROM voc.sbook b WHERE "b"."SMOKER" <> \'\' AND "b"."CANCELLED" = \'\'

                                                                                          GROUP BY "b"."CARRID", "b"."CONNID", "b"."FLDATE") bs

                                                                             LEFT JOIN voc.sflights f ON "bs"."CARRID" = "f"."CARRID" AND

                                                                                                                                "bs"."CONNID" = "f"."CONNID"  AND

                                                                                                                                "bs"."FLDATE" = "f"."FLDATE"

                                                                             ORDER BY "STRESS_POINTS" DESC'))))

RMonetClose(conn)

 

We now have a data.frame (worst) that has the results of bookings summarised by flights with a calculation of stress points relative to journey length and number of smokers on board (completely fictional, I know).

 

Query run time:

   user  system elapsed

  0.036   0.000   0.363

 

Well - I'm impressed.  .36 of a second to JOIN, GROUP, SORT those two tables (plus an inner SELECT), on my limited resources.

 

The result set looks like:

> head(worst)

   CARRID CONNID   FLDATE FLTIME SMOKERS STRESS_POINTS

1      SQ     15 20120404   1125      49        1837.5

2      SQ     15 20130227   1125      49        1837.5

3      SQ     15 20120926   1125      49        1837.5

4      SQ     15 20120725   1125      49        1837.5

5      SQ     15 20120606   1125      49        1837.5

6      SQ     15 20110914   1125      48        1800.0

7      SQ     15 20111116   1125      48        1800.0

8      SQ     15 20121010   1125      48        1800.0

9      SQ     15 20120201   1125      48        1800.0

10     SQ     15 20111130   1125      48        1800.0

 

And the conclusion is flying long haul with lots of smokers going rabid is not great - especially on SQ-15 on the 4/4/2012.

 

 

In Summary

 

Exploration of data and the time it takes, is a key concern to developing insights (especially in a world of shrinking deadlines, and times to market, with a growing global attention deficit disorder).  This problem is magnified many fold when it comes to dealing with 'Big Data'.  One way of dealing with this is through breaking enormous problems into little pieces and farming it out to cheap parallel processing, which is what Hadoop and MapReduce give you.  However, this requires careful and elaborate planning which is costly, and I rather get the feeling that the winners of the big data race will be those who can take advantage of tool-sets like R and MonetDB (and others yet to come - and yes, SAP-HANA is right in there) that reduce that impedance to getting from question to answer in a cheap and accessible manner.

Comments

Actions

Filter Blog

By author: By date:
By tag: