This blog post is intended as a primer, making the case for Hadoop and SAP integration with R.  You may just want to skip to the example at the bottom - I promise not to be offended.

Data Conduction

... I think is a reasonable description of what we do all the time with data - we pull it in from multiple sources, in multiple formats, and then manipulate it in order to make sense of it, or in the hope that, through this process of orchestration, it will reveal secrets about the places it comes from.

Hadoop epitomises this, where the orchestration spans as many nodes as you have in your cluster, and the data is shunted through various pipelines of processing, changing format, sorting, summarising, and eventually coming out the other end in the desired marshalled state.

But Hadoop isn't the management layer, or even a particularly good interface to the data for an analyst to work with - this is where R comes in, with its natural expressiveness for interpreting the data that we wield.

In this blog, I want to explore using SAP as a primary data authority, and Hadoop as another tool to add to the list that can be seamlessly integrated with R, giving us more options when we tackle our large scale data analytics problems.

Hadoop

In the last year or so, Hadoop got very easy to use with scripting languages when a relatively small change was added to Hadoop Streaming - something called Typed Bytes (first proposed in 2009, but only included in mainstream releases relatively recently), which enables the flexible definition of data passed in and out of streaming map or reduce jobs.  Suddenly scripting language interfaces could format data for Hadoop in ways that suited their internal structures - game on.  This has led to a number of frameworks for Hadoop, most notably Dumbo for Python and RHadoop (Antonio Piccolboni of Revolution Analytics) for R.

Hadoop takes a bit of setting up - to take advantage of the new features you need to ensure that you have version >= 1.0.2.  There is a good, simple tutorial for getting it going here, with the standard installation instructions here.

One thing about Hadoop that is not widely discussed is that it isn't just good at scaling up to manage vast quantities of computing resources for parallel processing - it is also good at handling highly constrained resources.  The typical trade-off is time, as the job is broken down into smaller pieces and handled serially - never more apparent than when testing examples such as the following on a single-node cluster like my laptop!
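
To give a flavour of how small a Hadoop job can be from R, here is a minimal sketch (essentially the standard rmr2 tutorial example, run on the local backend described further down, so no cluster is required) that squares a list of integers - the keyval, mapreduce, to.dfs and from.dfs calls are the same ones used in the travel agency example later on:

library(rmr2)
rmr.options(backend = "local")          # run in-process - no Hadoop cluster required
small.ints <- to.dfs(1:1000)            # push the input into the (local) dfs
out <- from.dfs(
    mapreduce(input = small.ints,
              map = function(k, v) keyval(v, v^2)))  # emit each integer keyed to its square
head(out$key)                           # the original integers
head(out$val)                           # their squares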

R, SAP, and Hadoop

R is a natural interface for shaping and interpreting data, as it inherently expects data to come from multiple sources, especially with RStudio as a workbench (covered here previously).  I find that it allows you to incrementally approach a problem, backtrack, and put the problem aside for a while when necessary (without losing the thread) through features such as session and history saving.  All of this is highly conducive to the craft (art?) of data analytics, where most of the time it is hard to know in advance what paths will be taken to the answer, or indeed whether the answer is anything like what you thought it might be.

These data-exploration characteristics of R are one of open source's best-kept secrets.

SAP is the core/central data authority in the business.  Other data sources exist inside and outside the enterprise, but SAP (in most cases) is the single source of truth for key data describing the business.  In many cases these data sets are not huge, but they are critical - e.g. business partner lists and account balances.

Data sources that lie outside of SAP are often difficult to access in real time, and may come in many and varied formats with differing access mechanisms - which is where R and Hadoop can help.

In the following example, I want to demonstrate how Hadoop integration works with R, and give a flavour of how it can be used to solve problems that need to span SAP and other data sources.

Solving a Problem

Using the standard flight booking example tables in SAP - SBOOK and SFLIGHTS2 - I have a fictional travel agency, and I want to know who my most valued customers are.  I define my valued customers in terms of (a) how much money they spend, and (b) how far they fly.  The SFLIGHTS2 table does give me an estimate of the mileage of each flight, but I don't trust this value, so I need to go outside of SAP to calculate it.  It so happens that Google has a geolocation API, and I know the airport codes for the journey start and end points.  Given this, I want to take each completed booking, determine the start and end locations, calculate the distance between them, and then summarise the total distance and total spend per customer per year.

Using RSAP, I'm going to extract the booking and flight data from SAP, and then use Hadoop to farm out the 'heavy processing' (it's not really heavy, but hey - this is just an example) of data formatting, geolocation, distance calculation and summarisation.

Setup

Firstly, you don't actually need a Hadoop cluster to try out these examples, but it makes it more fun.  RHadoop has a local processing option that allows you to debug your code prior to hitting the cluster by setting rmr.options(backend = "local") (see the DEBUG flag in the example below).

However, you must have a memcached server available, as the geolocating process stashes discovered locations there.  This is necessary because of Google's limit on API calls, and is also an interesting twist on how to improve the performance of map/reduce jobs.
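
Once the rmemcached package listed below is installed, it is worth a quick smoke test to confirm that R can actually reach the cache before kicking off the job - a small sketch (assuming memcached is listening on the default port 11211 on localhost) using the same calls as the mapper below:

library(rmemcached)
server <- cache.connect("127.0.0.1", 11211)   # host/port of your memcached server
cache.set(server, "TEST", c(-33.95, 151.18))  # stash an example lat/lng pair
cache.get(server, "TEST")                     # should return the same pair back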

From R, you need to install the following packages - ensure that they are installed in a site wide package location:

  • RSAP (from CRAN - further instructions can be found here)
  • rmemcached (from github)
  • The two required RHadoop modules (rmr2, rhdfs) as described here
  • RJSONIO (from CRAN)
  • fossil - for calculating distances between locations (from CRAN)

Basic R package installations take the form of install.packages('<package name>') from the R command line.
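
For example, the CRAN packages can be pulled down in one go from an R session started with write access to the site library; rmemcached, rmr2 and rhdfs are then installed from their downloaded source packages (a sketch - the tarball name is a placeholder for whatever version you downloaded):

# CRAN packages used in this example
install.packages(c('RSAP', 'RJSONIO', 'fossil'))
# packages obtained from github / the RHadoop downloads are installed from source, e.g.:
# install.packages('rmr2_<version>.tar.gz', repos = NULL, type = 'source')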

Extract the SFLIGHTS2 and SBOOK tables from SAP:

dataDir = '/a/directory/R/'
# Grab data from SAP via RFC call and store in R data file
if (!exists('cbook')) {
    if (!file.exists(file.path(dataDir, 'cbook.RData'))) {
        print("Could not find cbook - generating it...")
        # read SFLIGHTS2  for airports and SBOOK for bookings
        library(RSAP) 
        conn <- RSAPConnect("sap.yml") 
        print(system.time((sflights <- RSAPReadTable(conn, "SFLIGHTS2"))))  # 4880 rows
        print(system.time((sbook <- RSAPReadTable(conn, "SBOOK",
                                                  fields=list('CARRID', 'CONNID', 'FLDATE', 'INVOICE', 'CANCELLED', 'CUSTOMID', 'PAYMENTSUM'))))) # 1370282 rows
        RSAPClose(conn) 
        # merge the tables together on flight and date
        cbook <- merge(sflights, sbook, by=c('CARRID', 'CONNID', 'FLDATE'))
        save(cbook, file=file.path(dataDir, 'cbook.RData'), ascii=FALSE)
        print("generated and saved cbook for next time")
    } else {
        # grab the previously saved dataset from disk
        print("Loading cbook from disk")
        print(load(file.path(dataDir, 'cbook.RData')))
    }
}

We now have the two source data frames, sflights and sbook, containing 4,880 and 1,370,282 rows respectively, merged into cbook and stashed in an R data file so we don't have to go back to SAP again.
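
A quick sanity check on the merged data set before shipping it off to Hadoop (a sketch - the columns inspected are the ones used by the mapper below):

dim(cbook)     # merged rows x columns
str(cbook[, c('AIRPFROM', 'AIRPTO', 'CUSTOMID', 'PAYMENTSUM', 'FLDATE', 'INVOICE', 'CANCELLED')])
head(cbook[, c('AIRPFROM', 'AIRPTO', 'CUSTOMID', 'PAYMENTSUM')])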

Run the map/reduce

Key things to note about the map/reduce job:

  • set the environment variables for your Hadoop cluster
  • the local backend option uses temp files, and chops the data up into 10,000 record chunks to emulate Hadoop processing
  • Ensure that you do not exceed the Google API query limit or the job will fail

Please read the notes in the code:

# sort out loss of env with RStudio for Hadoop
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
library(rmr2)
# switch on debug to run local - no Hadoop
DEBUG=TRUE
if (DEBUG) {
    rmr.options(backend = "local")
} else {
    rmr.options(backend = "hadoop")
}
# the returned data structure from the completed job - this is the natural output from a map/reduce call
# clear out results from any previous run
if (exists('out')) {
    rm(out)
}
if (exists('result.df')) {
    rm(result.df)
}
out = from.dfs(
    mapreduce(
        to.dfs(cbook), # save the bookings data to the Hadoop file system
        map =              # define the mapper that will do the formatting and geo location
            function(k, v) {
                # connect to server here as global so that the connection can be reused
                library(rmemcached)
                server <- cache.connect("127.0.0.1", 11211) # change to your memcache server
                # geo coding routine shamelessly taken from Jitender Aswani http://allthingsbusinessanalytics.blogspot.com/
                getGeoCode <- function(gcStr) {
                    library("RJSONIO")
                    # fix up airport codes that are wrong
                    if (gcStr == 'TYO') {
                        gcStr = 'HND'
                    } else if (gcStr == 'KIX') {
                        gcStr = 'KANSAI'
                    } else if (gcStr == 'THF') {
                        gcStr = 'BRANDENBURG'
                    }
                    # use memcache to store results so we don't spam the Google API
                    stash <- cache.get(server, gcStr)
                    if (!is.null(stash)) {
                        names(stash) <- c("Lat", "Lng")
                        return(stash)
                    } else {
                        # stop(paste("could not geocode from cache", gcStr))
                        gcStr <- gsub(' ','%20',gcStr) #Encode URL Parameters
                        connectStr <-
                          paste('http://maps.google.com/maps/api/geocode/json?sensor=false&address=', gcStr, '%20airport', sep="")
                        con <- url(connectStr)  # call the API
                        data.json <- fromJSON(paste(readLines(con), collapse="")) # decode the JSON results
                        close(con)
                        #Flatten the received JSON
                        data.json <- unlist(data.json)
                        # grab the lat and long from the JSON structure
                        if(data.json["status"]=="OK") {
                            lat <- data.json["results.geometry.location.lat"]
                            lng <- data.json["results.geometry.location.lng"]
                            gcodes <- c(lat, lng)
                            names(gcodes) <- c("Lat", "Lng")
                            # save in the cache for next time
                            cache.set(server, gcStr, as.numeric(gcodes))
                            return(as.numeric(gcodes))
                        } else {
                           # if you got here then you probably exceeded the API limit
                            stop(paste("could not geocode", gcStr, '-', data.json))
                        }
                    }
                }
                # function to calculate the distance between two geo loc points
                calcDistance <- function(origin, dest) {
                    library(fossil) # has earth.dist
                    return(as.numeric(earth.dist(as.table(rbind(getGeoCode(origin), getGeoCode(dest))))))
                }
                # now the real mapper function - notice how it uses native R structures of data.frame or matrix
                # to describe inputs and outputs
                # select only relevant records and columns:
                #   * invoiced
                #   * not cancelled
                v <- v[v$INVOICE == 'X' & v$CANCELLED == "", c('AIRPFROM', 'AIRPTO', 'CUSTOMID', 'PAYMENTSUM', 'FLDATE')]
                if (nrow(v) > 0) {
                    v$year <- substr(v$FLDATE, 1, 4) # grab the year from the flight date
                    v$distance <- unlist(lapply(1:nrow(v), function(i) {calcDistance(v[i,'AIRPFROM'], v[i,'AIRPTO'])})) # calculate distance
                } else {
                    v$year <- c()        # empty result placeholders
                    v$distance <- c()
                }
                # we return a matrix of keys and a matrix of values
                return(keyval(cbind(v$year, v$CUSTOMID), cbind(rep.int(1, nrow(v)), round(as.numeric(v$distance), 2), as.numeric(v$PAYMENTSUM))))
          },
        # use the reducer to summarise the distance and cost by customer by year
        reduce =
          function(k, vv)  if(nrow(vv) > 0) keyval(k, cbind(sum(vv[, 1], na.rm=T),       # number of flights
                                                            sum(vv[, 2], na.rm=T),       # total distance
                                                            sum(vv[, 3], na.rm=T))),     # total paid
        combine=TRUE # use the reducer as the combiner too
  ))
# reformat output to data.frame
result.df <- data.frame(cbind(out$key, out$val), stringsAsFactors = FALSE)
colnames(result.df) <- c('year', 'customid', 'flights', 'distance', 'cost')

You may receive a warning like:

In to.dfs(cbook) : Converting to.dfs argument to keyval with a NULL key

This is because no key has been specified for the dataset, and can be disregarded.
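
One other thing to note: because the keys and values come back as matrices and are bound together with cbind, every column of result.df ends up as a character string - hence the as.numeric() calls in the queries below.  If you prefer, the numeric columns can be converted up front (a small sketch):

result.df$flights  <- as.numeric(result.df$flights)
result.df$distance <- as.numeric(result.df$distance)
result.df$cost     <- as.numeric(result.df$cost)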

The Results

And there we have it in the results below.  Not surprisingly, we get a different set of customers when viewed by mileage than by revenue.  The higher-mileage customers might think they have more of a claim on a customer reward scheme, but we may see it differently, as revenue by customer may indicate higher value.  Either way, we can now reward our top customers.

# 2012 top 10 most valued customers by revenue
subset(result.df[order(as.numeric(result.df$cost), decreasing=TRUE),], year=="2012")[1:10,]

year  customid  flights   distance         cost
2012      1731       77  669032.78  27619910.84
2012      1764       74   626387.3  26672217.89
2012      3077       65  567615.38  26331673.76
2012      1919       65  524266.25  26318934.19
2012      4015       69  589793.49  26160147.92
2012      1402       75  616431.91  25427669.25
2012      2793       76   692470.6  25044591.17
2012      1922       70  644232.32  24998612.27
2012      4528       75  636527.83  24915862.29
2012        52       70  643509.53  24757153.08

# 2012 top 10 most valued customers by airmiles
subset(result.df[order(as.numeric(result.df$distance), decreasing=TRUE),], year=="2012")[1:10,]

year  customid  flights   distance         cost
2012      2793       76   692470.6  25044591.17
2012      3348       80  674154.39  21708568.64
2012      2525       78  672130.08  22931827.02
2012      1731       77  669032.78  27619910.84
2012      1511       83  663782.33  22940185.46
2012      3429       84  661984.52  22914627.49
2012      4268       77  655323.39  21989782.32
2012        18       72   653851.2  21678816.85
2012      2150       68  650492.96  20314354.86
2012       861       73   650359.4  23346561.28

In Summary

RHadoop provides new and novel ways of blending data within the R workbench, adding another useful tool to the life of a data analyst.  The interface to the map/reduce process is very R-like, allowing us to relate to the data in the pipeline as R data types.

R is a blank canvas with a whole raft of widgets for 'painting' a path to data analytics, but the same is true of Hadoop.  What you put inside a map/reduce job could literally be anything in terms of how you source and manipulate data - in fact, there is nothing stopping you from mixing and matching steps from different programming languages and data interfaces to get the job done.

Almost too much choice!
