This blog post is intended as a primer, making the case for Hadoop and SAP integration with R.  You may just want to skip to the example at the bottom - I promise not to be offended.

Data Conduction

... I think is a reasonable description of what we do all the time with data - we pull it in from multiple sources, in multiple formats, and then manipulate it in order to make sense of it, or in the hope that, through this process of orchestration, it will reveal secrets about the places it comes from.

Hadoop epitomises this, where the orchestration spans as many nodes as you have in your cluster, and the data is shunted through various pipelines of processing, changing format, sorting, summarising, and eventually coming out the other end in the desired marshalled state.

But Hadoop isn't the management layer, or even a particularly good interface to the data for an analyst to work with - this is where R comes in, with its natural expressiveness for interpreting the data that we wield.

In this blog, I want to explore using SAP as a primary data authority, and Hadoop as another tool to add to the list that can be seamlessly integrated with R, giving us more options when we tackle our large scale data analytics problems.

Hadoop

In the last year or so, Hadoop got very easy to use with scripting languages when a relatively small change was added to Hadoop Streaming - something called Typed Bytes (first proposed in 2009, but only included in mainstream releases relatively recently), which enables the flexible definition of data passed in and out of streaming map or reduce jobs.  Suddenly scripting language interfaces could format data for Hadoop in ways that suited their internal structures - game on.  This has led to a number of frameworks for Hadoop, most notably Dumbo for Python and RHadoop (Antonio Piccolboni of Revolution Analytics) for R.

Hadoop takes a bit of setting up - to take advantage of the new features you need to ensure that you have version >= 1.0.2.  There is a good, simple tutorial for getting it going here, with the standard installation instructions here.

One thing about Hadoop that is not widely discussed is that it isn't just good at scaling up to manage vast quantities of computing resources for parallel processing - it is also good at handling highly constrained resources.  The typical trade-off is time, as the job is broken down into smaller pieces and handled serially - never more apparent than when testing examples such as the following on a single-node cluster like my laptop!
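
To give a flavour of how small a Hadoop job can be from R, here is a minimal sketch (essentially the standard rmr2 tutorial example, run on the local backend described further down, so no cluster is required) that squares a list of integers - the keyval, mapreduce, to.dfs and from.dfs calls are the same ones used in the travel agency example later on:

library(rmr2)
rmr.options(backend = "local")          # run in-process - no Hadoop cluster required
small.ints <- to.dfs(1:1000)            # push the input into the (local) dfs
out <- from.dfs(
    mapreduce(input = small.ints,
              map = function(k, v) keyval(v, v^2)))  # emit each integer keyed to its square
head(out$key)                           # the original integers
head(out$val)                           # their squares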

R, SAP, and Hadoop

R is a natural interface for shaping and interpreting data, as it inherently expects data to come from multiple sources, especially with RStudio as a workbench (covered here previously).  I find that it allows you to incrementally approach a problem, backtrack, and put the problem aside for a while when necessary (without losing the thread) through features such as session and history saving.  All of this is highly conducive to the craft (art?) of data analytics, where most of the time it is hard to know in advance what paths will be taken to the answer, or indeed whether the answer is anything like what you thought it might be.

These data-exploration characteristics of R are one of open source's best-kept secrets.

SAP is the core/central data authority in the business.  Other data sources exist inside and outside the enterprise, but SAP (in most cases) is the single source of truth for key data describing the business.  In many cases these data sets are not huge, but they are critical - e.g. business partner lists and account balances.

Data sources that lie outside of SAP are often difficult to access in real time, and may come in many and varied formats with differing access mechanisms - which is where R and Hadoop can help.

In the following example, I want to demonstrate how Hadoop integration works with R, and give a flavour of how it can be used to solve problems that need to span SAP and other data sources.

Solving a Problem

Using the standard flight booking example tables in SAP - SBOOK and SFLIGHTS2 - I have a fictional travel agency, and I want to know who my most valued customers are.  I define my valued customers in terms of (a) how much money they spend, and (b) how far they fly.  The SFLIGHTS2 table does give me an estimate of the mileage of each flight, but I don't trust this value, so I need to go outside of SAP to calculate it.  It so happens that Google has a geolocation API, and I know the airport codes for the journey start and end points.  Given this, I want to take each completed booking, determine the start and end locations, calculate the distance between them, and then summarise the total distance and total spend per customer per year.

Using RSAP, I'm going to extract the booking and flight data from SAP, and then use Hadoop to farm out the 'heavy processing' (it's not really heavy, but hey - this is just an example) of data formatting, geolocation, distance calculation and summarisation.

Setup

Firstly, you don't actually need a Hadoop cluster to try out these examples, but it makes it more fun.  RHadoop has a local processing option that allows you to debug your code prior to hitting the cluster by setting rmr.options(backend = "local") (see the DEBUG flag in the example below).

However, you must have a memcached server available, as the geolocating process stashes discovered locations there.  This is necessary because of Google's limit on API calls, and is also an interesting twist on how to improve the performance of map/reduce jobs.
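
Once the rmemcached package listed below is installed, it is worth a quick smoke test to confirm that R can actually reach the cache before kicking off the job - a small sketch (assuming memcached is listening on the default port 11211 on localhost) using the same calls as the mapper below:

library(rmemcached)
server <- cache.connect("127.0.0.1", 11211)   # host/port of your memcached server
cache.set(server, "TEST", c(-33.95, 151.18))  # stash an example lat/lng pair
cache.get(server, "TEST")                     # should return the same pair back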

From R, you need to install the following packages - ensure that they are installed in a site wide package location:

  • RSAP (from CRAN - further instructions can be found here)
  • rmemcached (from github)
  • The two required RHadoop modules (rmr2, rhdfs) as described here
  • RJSONIO (from CRAN)
  • fossil - for calculating distances between locations (from CRAN)

Basic R package installations take the form of install.packages('<package name>') from the R command line.
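
For example, the CRAN packages can be pulled down in one go from an R session started with write access to the site library; rmemcached, rmr2 and rhdfs are then installed from their downloaded source packages (a sketch - the tarball name is a placeholder for whatever version you downloaded):

# CRAN packages used in this example
install.packages(c('RSAP', 'RJSONIO', 'fossil'))
# packages obtained from github / the RHadoop downloads are installed from source, e.g.:
# install.packages('rmr2_<version>.tar.gz', repos = NULL, type = 'source')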

Extract the SFLIGHTS2 and SBOOK tables from SAP:

dataDir = '/a/directory/R/'
# Grab data from SAP via RFC call and store in R data file
if (!exists('cbook')) {
    if (!file.exists(file.path(dataDir, 'cbook.RData'))) {
        print("Could not find cbook - generating it...")
        # read SFLIGHTS2  for airports and SBOOK for bookings
        library(RSAP) 
        conn <- RSAPConnect("sap.yml") 
        print(system.time((sflights <- RSAPReadTable(conn, "SFLIGHTS2"))))  # 4880 rows
        print(system.time((sbook <- RSAPReadTable(conn, "SBOOK",
                                                  fields=list('CARRID', 'CONNID', 'FLDATE', 'INVOICE', 'CANCELLED', 'CUSTOMID', 'PAYMENTSUM'))))) # 1370282 rows
        RSAPClose(conn) 
        # merge the tables together on flight and date
        cbook <- merge(sflights, sbook, by=c('CARRID', 'CONNID', 'FLDATE'))
        save(cbook, file=file.path(dataDir, 'cbook.RData'), ascii=FALSE)
        print("generated and saved cbook for next time")
    } else {
        # grab the previously saved dataset from disk
        print("Loading cbook from disk")
        print(load(file.path(dataDir, 'cbook.RData')))
    }
}

We now have the two source data frames, sflights and sbook, containing 4,880 and 1,370,282 rows respectively, merged into cbook and stashed in an R data file so we don't have to go back to SAP again.
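
A quick sanity check on the merged data set before shipping it off to Hadoop (a sketch - the columns inspected are the ones used by the mapper below):

dim(cbook)     # merged rows x columns
str(cbook[, c('AIRPFROM', 'AIRPTO', 'CUSTOMID', 'PAYMENTSUM', 'FLDATE', 'INVOICE', 'CANCELLED')])
head(cbook[, c('AIRPFROM', 'AIRPTO', 'CUSTOMID', 'PAYMENTSUM')])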

Run the map/reduce

Key things to note about the map/reduce job:

  • set the environment variables for your Hadoop cluster
  • the local backend option uses temp files, and chops the data up into 10,000 record chunks to emulate Hadoop processing
  • Ensure that you do not exceed the Google API query limit or the job will fail

Please read the notes in the code:

# sort out loss of env with RStudio for Hadoop
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
library(rmr2)
# switch on debug to run local - no Hadoop
DEBUG=TRUE
if (DEBUG) {
    rmr.options(backend = "local")
} else {
    rmr.options(backend = "hadoop")
}
# the returned data structure from the completed job - this is the natural output from a map/reduce call
# clear out results from any previous run
if (exists('out')) {
    rm(out)
}
if (exists('result.df')) {
    rm(result.df)
}
out = from.dfs(
    mapreduce(
        to.dfs(cbook), # save the bookings data to the Hadoop file system
        map =              # define the mapper that will do the formatting and geo location
            function(k, v) {
                # connect to server here as global so that the connection can be reused
                library(rmemcached)
                server <- cache.connect("127.0.0.1", 11211) # change to your memcache server
                # geo coding routine shamelessly taken from Jitender Aswani http://allthingsbusinessanalytics.blogspot.com/
                getGeoCode <- function(gcStr) {
                    library("RJSONIO")
                    # fix up airport codes that are wrong
                    if (gcStr == 'TYO') {
                        gcStr = 'HND'
                    } else if (gcStr == 'KIX') {
                        gcStr = 'KANSAI'
                    } else if (gcStr == 'THF') {
                        gcStr = 'BRANDENBURG'
                    }
                    # use memcache to store results so we don't spam the Google API
                    stash <- cache.get(server, gcStr)
                    if (!is.null(stash)) {
                        names(stash) <- c("Lat", "Lng")
                        return(stash)
                    } else {
                        # stop(paste("could not geocode from cache", gcStr))
                        gcStr <- gsub(' ','%20',gcStr) #Encode URL Parameters
                        connectStr <-
                          paste('http://maps.google.com/maps/api/geocode/json?sensor=false&address=', gcStr, '%20airport', sep="")
                        con <- url(connectStr)  # call the API
                        data.json <- fromJSON(paste(readLines(con), collapse="")) # decode the JSON results
                        close(con)
                        #Flatten the received JSON
                        data.json <- unlist(data.json)
                        # grab the lat and long from the JSON structure
                        if(data.json["status"]=="OK") {
                            lat <- data.json["results.geometry.location.lat"]
                            lng <- data.json["results.geometry.location.lng"]
                            gcodes <- c(lat, lng)
                            names(gcodes) <- c("Lat", "Lng")
                            # save in the cache for next time
                            cache.set(server, gcStr, as.numeric(gcodes))
                            return(as.numeric(gcodes))
                        } else {
                           # if you got here then you probably exceeded the API limit
                            stop(paste("could not geocode", gcStr, '-', data.json))
                        }
                    }
                }
                # function to calculate the distance between two geo loc points
                calcDistance <- function(origin, dest) {
                    library(fossil) # has earth.dist
                    return(as.numeric(earth.dist(as.table(rbind(getGeoCode(origin), getGeoCode(dest))))))
                }
                # now the real mapper function - notice how it uses native R structures of data.frame or matrix
                # to describe inputs and outputs
                # select only relevant records and columns:
                #   * invoiced
                #   * not cancelled
                v <- v[v$INVOICE == 'X' & v$CANCELLED == "", c('AIRPFROM', 'AIRPTO', 'CUSTOMID', 'PAYMENTSUM', 'FLDATE')]
                if (nrow(v) > 0) {
                    v$year <- substr(v$FLDATE, 1, 4) # grab the year from the flight date
                    v$distance <- unlist(lapply(1:nrow(v), function(i) {calcDistance(v[i,'AIRPFROM'], v[i,'AIRPTO'])})) # calculate distance
                } else {
                    v$year <- c()        # empty result placeholders
                    v$distance <- c()
                }
                # we return a matrix of keys and a matrix of values
                return(keyval(cbind(v$year, v$CUSTOMID), cbind(rep.int(1, nrow(v)), round(as.numeric(v$distance), 2), as.numeric(v$PAYMENTSUM))))
          },
        # use the reducer to summarise the distance and cost by customer by year
        reduce =
          function(k, vv)  if(nrow(vv) > 0) keyval(k, cbind(sum(vv[, 1], na.rm=T),       # number of flights
                                                            sum(vv[, 2], na.rm=T),       # total distance
                                                            sum(vv[, 3], na.rm=T))),     # total paid
        combine=TRUE # use the reducer as the combiner too
  ))
# reformat output to data.frame
result.df <- data.frame(cbind(out$key, out$val), stringsAsFactors = FALSE)
colnames(result.df) <- c('year', 'customid', 'flights', 'distance', 'cost')

You may receive a warning like:

In to.dfs(cbook) : Converting to.dfs argument to keyval with a NULL key

This is because no key has been specified for the dataset, and can be disregarded.
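
One other thing to note: because the keys and values come back as matrices and are bound together with cbind, every column of result.df ends up as a character string - hence the as.numeric() calls in the queries below.  If you prefer, the numeric columns can be converted up front (a small sketch):

result.df$flights  <- as.numeric(result.df$flights)
result.df$distance <- as.numeric(result.df$distance)
result.df$cost     <- as.numeric(result.df$cost)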

The Results

And there we have it in the results below.  Not surprisingly, we get a different set of customers when viewed by mileage than by revenue.  The higher-mileage customers might think they have more of a claim on a customer reward scheme, but we may see it differently, as revenue by customer may indicate higher value.  Either way, we can now reward our top customers.

# 2012 top 10 most valued customers by revenue
subset(result.df[order(as.numeric(result.df$cost), decreasing=TRUE),], year=="2012")[1:10,]

year  customid  flights   distance         cost
2012      1731       77  669032.78  27619910.84
2012      1764       74   626387.3  26672217.89
2012      3077       65  567615.38  26331673.76
2012      1919       65  524266.25  26318934.19
2012      4015       69  589793.49  26160147.92
2012      1402       75  616431.91  25427669.25
2012      2793       76   692470.6  25044591.17
2012      1922       70  644232.32  24998612.27
2012      4528       75  636527.83  24915862.29
2012        52       70  643509.53  24757153.08

# 2012 top 10 most valued customers by airmiles
subset(result.df[order(as.numeric(result.df$distance), decreasing=TRUE),], year=="2012")[1:10,]

year  customid  flights   distance         cost
2012      2793       76   692470.6  25044591.17
2012      3348       80  674154.39  21708568.64
2012      2525       78  672130.08  22931827.02
2012      1731       77  669032.78  27619910.84
2012      1511       83  663782.33  22940185.46
2012      3429       84  661984.52  22914627.49
2012      4268       77  655323.39  21989782.32
2012        18       72   653851.2  21678816.85
2012      2150       68  650492.96  20314354.86
2012       861       73   650359.4  23346561.28

In Summary

RHadoop provides new and novel ways of blending data within the R workbench, adding another useful tool to the life of a data analyst.  The interface to the map/reduce process is very R-like, allowing us to relate to the data in the pipeline as R data types.

R is a blank canvas with a whole raft of widgets for 'painting' a path to data analytics, but the same is true of Hadoop.  What you put inside a map/reduce job could literally be anything in terms of how you source and manipulate data - in fact, there is nothing stopping you from mixing and matching steps from different programming languages and data interfaces to get the job done.

Almost too much choice!
