An Integration of Apache Hadoop, SAP HANA, and SAP...

Last week was very busy for attendees at SAP TechEd Las Vegas, so fortunately SAP has made recordings of some sessions available. Today I watched session EA204, An Integration of Apache Hadoop, SAP HANA, and SAP BusinessObjects, with SAP's Anthony Waite.

Text analysis came up a few times last week, and I am familiar with the Text Analysis features in Data Services. First, a review of how text mining fits in.

Figure 1: Source: SAP

Figure 1 shows that we have "lots of unstructured data": 80% of data is unstructured. According to Anthony Waite, this is data that you cannot run through a business process.

Unstructured data gets messy; think of an MS Word document: can you run that through your system or process? Some say content management.

The hot topic is analyzing social networks for sentiment.

Customer preferences, for example, can be mined.

Figure 2: Source: SAP

Anthony said you typically don’t go into your BI tool and search for unstructured information.

Unstructured data is challenging and intensive to process and analyze. Text mining is CPU-intensive, and there is a lot of “noise” in the data.

Introduction to Hadoop

Figure 3: Source: SAP

Hadoop is a large open source framework for using inexpensive machines in a distributed manner

HDFS (Hadoop Distributed File System) is the essence

Figure 4: Source: SAP

HDFS distributes and replicates data across the machines. The programming model takes advantage of parallelization.
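To make the parallel programming model concrete, here is a minimal sketch of the map, shuffle, and reduce steps that Hadoop's MapReduce model is built on. It runs on one machine with a thread pool standing in for the cluster; the function names and the sample text are my own assumptions for illustration, not anything shown in the session.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all mapper outputs."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=4):
    # Each chunk is mapped independently, which is what makes the
    # work easy to spread across many machines in a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical input: three chunks, as if stored on three different nodes.
chunks = ["music basketball music", "fashion music", "basketball stocks"]
counts = word_count(chunks)
print(counts)
```

In a real cluster the chunks would be HDFS blocks and the mappers would run on the nodes that hold them; the sketch only shows the shape of the computation.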

HBase is a schemaless database

Hive is a SQL interface for data warehousing

It is harder to find experts in Hadoop; you cannot equate Hive to a regular relational database, as it offers only a subset

Pig is the scripting language for data flows

Mahout is for machine learning: feature vectors, training sets, and classification to run data through

Figure 5: Source: SAP

The advantage is distributed data storage, which allows for scalability: you just add a box to the cluster. It is also reliable and has libraries available.

The disadvantages are that it is not real time and it is slow; it is a batch-oriented environment. There are open source projects to make this more efficient.

Figure 6: Source: SAP

This shows where an end user can find a single interest

The data set is predominantly male; most users are 20 to 30 years old, have the lowest income, and are interested in music, basketball, and fashion.

The older market is interested in the stock market, but the younger market is interested in music and basketball

Figure 7: Source: SAP

Figure 7 suggests using HANA for performance and high-value data, Hadoop for high-volume data with noise, and BI for visualization.

Figure 8 shows you can use Data Services to load into SAP HANA.

The example they looked at was analyzing user behavior from web site visits

Use Hadoop at the front end for unstructured data, since it is considered low-value data at that point

When you pull data into HANA it is higher value; you could use Hadoop, but it depends on where you want to do the work

He suggested taking advantage of SQLScript and the Predictive Analysis Library (PAL)

Figure 9: Source: SAP

Figure 9 shows getting information about user behavior on a web site

Figure 9 shows using Hadoop to train and classify (machine learning)

Fetch text in Hadoop, create feature vectors from the segmented words, train via machine learning, and then run the feature vectors through the classifier

You tie that in with the URL of interest, such as “basketball”, “health”, or “books”

Then you transfer the data to HANA; all you use HANA for is analysis.
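The train-and-classify flow described for Figure 9 can be sketched in a few lines. The session did not say which algorithm was used, so this example swaps in a simple naive Bayes classifier over bag-of-words feature vectors as an assumption; the training texts and function names are hypothetical, and only the interest labels come from the session.

```python
import math
from collections import Counter, defaultdict


def featurize(text):
    """Segment text into lowercase words and count them (a bag-of-words vector)."""
    return Counter(text.lower().split())


def train(labeled_docs):
    """Build per-interest word counts from (text, interest) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        features = featurize(text)
        word_counts[label].update(features)
        doc_counts[label] += 1
        vocab.update(features)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the interest with the highest naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    features = featurize(text)
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word, count in features.items():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set; the labels match the interests named in the session.
training_set = [
    ("playoffs dunk rebound court", "basketball"),
    ("team coach court score", "basketball"),
    ("diet exercise sleep doctor", "health"),
    ("vitamins doctor exercise", "health"),
    ("novel author chapter library", "books"),
    ("author paperback chapter", "books"),
]
model = train(training_set)
print(classify(model, "coach court rebound"))
```

In the setup from the session this work would run in Hadoop (for example with Mahout) over far more text; the sketch only illustrates the featurize, train, and classify steps.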

Figure 10: Source: SAP

This solution uses Hadoop less: Hadoop creates the features and HANA does more of the work. You do the modeling and training in HANA.

Classification uses two methods, SQLScript and PAL, to show the difference in performance

Figure 11: Source: SAP

Figure 11 shows the performance comparison

Hadoop has 10 nodes, each with 8 cores and 16 GB

HANA was running on a single appliance with 32 cores

Hadoop has cheaper hardware

The amount of data was 180 million rows per day; it was analyzed by day, with 10 million unique users and 66 interest numbers

Solution 1 in Hadoop takes 285 seconds – Hadoop is doing training and classifying here

Solution 2 – training & classifying is in HANA – time is less – 130 seconds (SQLScript for classifying – URLs are mapped to interests and the calculation is run based on interest)

Solution 3 – training & classification time is less using PAL

Hadoop only took 24 hours
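The classification step in Solution 2, where URLs are mapped to interests and a calculation is run per interest, can be sketched as a lookup followed by a per-user tally. The session did this in SQLScript inside HANA; the Python below is only an illustration of the logic, and the URLs, user IDs, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical URL-to-interest lookup table (not from the session).
URL_INTEREST = {
    "example.com/nba": "basketball",
    "example.com/fitness": "health",
    "example.com/reviews": "books",
}


def interests_per_user(visits):
    """Tally visits per interest for each user; unmapped URLs are skipped."""
    tallies = defaultdict(Counter)
    for user_id, url in visits:
        interest = URL_INTEREST.get(url)
        if interest is not None:
            tallies[user_id][interest] += 1
    return tallies


def top_interest(tallies):
    """Pick the most frequently visited interest for each user."""
    return {user: counts.most_common(1)[0][0] for user, counts in tallies.items()}


# Hypothetical click data: (user_id, url) rows.
visits = [
    ("u1", "example.com/nba"),
    ("u1", "example.com/nba"),
    ("u1", "example.com/fitness"),
    ("u2", "example.com/reviews"),
    ("u2", "example.com/unknown"),
]
tallies = interests_per_user(visits)
top = top_interest(tallies)
print(top)
```

In SQL terms this is a join between the visit rows and the mapping table followed by a grouped count, which is the kind of set-based work an in-memory database is designed for.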

What do you want to do in Hadoop? What do you want to do in HANA? It depends on your environment