
SAP Predictive Analytics


Hi guys,


This blog addresses some of the basic challenges and issues many of us face when integrating R with SAP PA. SAP Predictive Analytics provides the smart feature of R integration, R being a very well-known statistical language. But a few challenges, which I discuss below, can come up while doing this integration.


Problem 1: You have installed SAP PA, but when you go to install R, an error message may appear that prevents you from installing and configuring R internally.




You have to install R manually and configure it with SAP PA. Below are the steps you can follow:



If you are on 64-bit Windows 7, make sure that you have a 64-bit JRE; if you are on 32-bit Windows 7, ensure that you have a 32-bit JRE.


To check the version of the JRE, run "java -version" at the command prompt.




Download & Install R from the following location




After installing R, you need to install the R packages used by SAP Predictive Analysis 2.2.

Below is the list of required packages:

  1. rJava
  2. RODBC
  3. RJDBC
  4. DBI
  5. monmlp
  6. AMORE
  7. XML
  8. pmml
  9. arules
  10. caret
  11. reshape
  12. plyr
  13. foreach
  14. iterator

You can install these packages in either of two ways: download the zip files from https://cran.r-project.org/bin/windows/base/old/3.1.2/ and install them locally, or install them directly from the internet via the R GUI console. In the snapshot below you can see the two highlighted boxes: one is for installing packages from local zip files, and the other, labeled "Install packages", downloads packages from the internet.



To test whether the required packages are installed correctly, go to

  • Start->Programs->R->R3.1.2
  • In the R console, type library(rJava) and press Enter. The package is successfully installed if you don't see any error.

Eg : > library(rJava)

  • Try this for all the packages to confirm each one.



Go to the SAP PA Expert mode "Install and Configure R" option. As shown below, there will be a predefined default path for the R installation folder; change it to the path where you have installed R. Then go back to the Installation tab and click the Install R option again. Restart the tool; it should work now.





Problem 2: Every time you install a new package in R to build a new custom R algorithm, you have to make that new package installation reflect in your SAP PA Expert mode too; otherwise your algorithm won't work in the SAP PA environment. While doing so you need to repeat the same step as step 5 above, but some folks may get the error attached below:




Solution: Go to the Configuration tab, change the default path to C:\Users\[Username]\.sappa, then click Install. Restart the tool and give it a try.




Problem 3: Another common error is "REngine Initialization Failed". This mostly occurs when you run a model based on an R algorithm rather than on SAP PA Expert mode's out-of-the-box algorithms.


Solution: In the Configuration tab, change the default path to the path where you have installed your R GUI, then go to the Installation tab and click the Install R option. Restart the tool and rerun the model.



Above are a few of the challenges I faced. If anyone has feedback or anything to add that could be useful to others, you are more than welcome to comment.

Hope this blog has been helpful.




Visualizing a PAL decision tree using d3



In this tutorial we will visualize a Hana PAL decision tree using d3.js.

For most visualization purposes, it is most convenient to use SAP UI5 and SAP Lumira. At the moment, however, these solutions do not offer a way to visualize a decision tree determined by one of the decision tree algorithms in SAP Hana. In this tutorial, we will create such a visualization using d3.js.



  1. Parse a tree model created by Hana PAL
  2. Create a visualization using d3.js
  3. Add functionality to the visualization





In order to visualize a decision tree that has been created by Hana PAL, the output of the algorithm has to be set to PMML. Information on how this can be done and a decision tree sample can be found in the Hana PAL guide.


What is PMML?

PMML is short for Predictive Model Markup Language, an XML standard that describes a predictive model; in our case, this is a decision tree. The definition of the standard can be found on the website of the Data Mining Group at dmg.org.


We use PMML because it can be created by all of the decision tree algorithms available in the Hana PAL library (C4.5, CART, CHAID). Moreover, since PMML is an XML format, the browser's DOM parser can be used to parse the tree.
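To make this concrete, here is a minimal, hand-written TreeModel fragment of the kind the parser in this tutorial expects. The field names and values are purely illustrative, not taken from a real model:

```xml
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <TreeModel modelName="DT" functionName="classification">
    <Node id="0" score="yes" recordCount="100">
      <Node id="1" score="yes" recordCount="60">
        <SimplePredicate field="AGE" operator="lessThan" value="40"/>
        <ScoreDistribution value="yes" recordCount="55" confidence="0.92"/>
      </Node>
      <Node id="2" score="no" recordCount="40">
        <SimplePredicate field="AGE" operator="greaterOrEqual" value="40"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
```

Each <Node> carries a score and record count, an optional predicate describing the split, and nested child <Node> elements.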



Parse a tree model created by Hana PAL



At first we need to make the PMML document accessible through JavaScript.

We create a new XSJS Project in our repository and call it “vis_tree”.

(For information on how to set up an XSJS project, refer to:  Creating an SAP HANA XS Application)

In this project we create a new XSJS file and call it “getPmml.xsjs”. Edit that file and put the following lines of code in it (Where <SCHEMA NAME> has to be changed to your schema name and <PMML MODEL TABLE> to the name of your PMML output table):




// open a connection to the database and select the PMML model column from the table
try {
       var conn = $.db.getConnection();
       var pstmt = conn.prepareStatement("SELECT PMML_MODEL FROM \"<SCHEMA NAME>\".\"<PMML MODEL TABLE>\"");
       var rs = pstmt.executeQuery();

       var data = "";

       // iterate through the rows; depending on the settings, the PMML can be stored
       // in multiple rows, however this is not the case in our example
       while (rs.next()) {
              data += rs.getString(1);
       }
       rs.close();
       pstmt.close();
       conn.close();

       // set the response
       $.response.contentType = "text/xml";
       $.response.setBody(data);
} catch (e) {
       $.response.status = $.net.http.INTERNAL_SERVER_ERROR;
       $.response.setBody(e.message);
}

If you now point your browser to "http://<SERVER IP>:8000/vis_tree/getPmml.xsjs", you should see an output like the following:




As you can see, the decision tree is stored in the <TreeModel> node. We want to parse this tree into a JSON object that satisfies the condition that the children of a node are stored in a list called "children". This JSON tree structure, together with a "name" in each node, is commonly used for d3.js and referred to as flare.json. We will utilize d3.layout and thus need a flare-like structure; the "name" string itself is not strictly needed, so instead of relying on it we will put all of the node's information into each node of the tree.
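To make the target structure concrete, here is a minimal hand-written flare-like node of the kind our parser produces (the field values are illustrative, not from a real model), together with a tiny helper for sanity-checking a parsed tree:

```javascript
// A minimal flare-like tree of the shape the parser emits: every inner node
// carries its PMML attributes plus a "children" array; leaves omit "children".
var sampleFlare = {
  name: "0",
  score: "yes",
  recordCount: "100",
  predicateList: { predicates: [] },
  children: [
    { name: "1", score: "yes", recordCount: "60",
      predicateList: { predicates: [{ field: "AGE", operator: "lessThan", value: "40" }] } },
    { name: "2", score: "no", recordCount: "40",
      predicateList: { predicates: [{ field: "AGE", operator: "greaterOrEqual", value: "40" }] } }
  ]
};

// Count the leaves of a flare tree, e.g. to sanity-check a parsed model.
function countLeaves(node) {
  if (!node.children) { return 1; }
  return node.children.reduce(function (sum, c) { return sum + countLeaves(c); }, 0);
}

console.log(countLeaves(sampleFlare)); // 2
```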


In our project we create a new file and call it "pmmlTree2Flare.js". First we need to pull the XML file from the xsjs script; then we will parse it recursively into a JSON object.



function getFlare() {

       var flareNode = {};

       // pull the PMML document synchronously from the xsjs service
       var xmlHttp = new XMLHttpRequest();
       xmlHttp.open("GET", "./getPmml.xsjs", false);
       xmlHttp.send();

       var pmml = xmlHttp.responseText;
       var xmlDoc;

       if (window.DOMParser) {
              var parser = new DOMParser();
              xmlDoc = parser.parseFromString(pmml, "text/xml");
       } else { // code for IE
              xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
              xmlDoc.async = false;
              xmlDoc.loadXML(pmml);
       }

       // start the recursion at the root <Node> of the <TreeModel>
       var flare = pmml2Flare(xmlDoc.getElementsByTagName("TreeModel")[0].getElementsByTagName("Node")[0],
                     flareNode);

       return flare;
}

function pmml2Flare(pmmlNode, flareNode) {

       // Fill the node with data
       flareNode["name"] = pmmlNode.getAttribute("id");
       flareNode["scoreDistribution"] = getScoreDistribution(pmmlNode);
       flareNode["score"] = pmmlNode.getAttribute("score");
       flareNode["recordCount"] = pmmlNode.getAttribute("recordCount");
       flareNode["predicateList"] = getPredicates(pmmlNode);

       // If there are no child nodes this is an endnode; we leave out the
       // children array and return the node with its score attached.
       if (pmmlNode.getElementsByTagName("Node").length === 0) {
              return flareNode;
       }

       flareNode["children"] = [];

       // Recurse over all nodes that are direct (top level) children of the
       // active node
       for (var i = 0; i < pmmlNode.getElementsByTagName("Node").length; i++) {
              if (pmmlNode.getElementsByTagName("Node")[i].parentNode === pmmlNode) {
                     var node = {};
                     pmml2Flare(pmmlNode.getElementsByTagName("Node")[i], node);
                     flareNode["children"].push(node);
              }
       }

       return flareNode;
}


function getScoreDistribution(node) {

       var scoreList = [];
       var scoreDistribution = node.getElementsByTagName("ScoreDistribution");
       for (var i = 0; i < scoreDistribution.length; i++) {
              if (scoreDistribution[i].parentNode === node) {
                     scoreList.push({
                            value: scoreDistribution[i].getAttribute("value"),
                            recordCount: scoreDistribution[i].getAttribute("recordCount"),
                            confidence: scoreDistribution[i].getAttribute("confidence")
                     });
              }
       }

       return scoreList;
}




//if the predicate is compound, we have to figure out the simple predicates in
//the compound

function getPredicates(node) {
       var predicateList = {
              predicates: []
       };

       var compound = node.getElementsByTagName("CompoundPredicate")[0];

       if (!compound || compound.parentNode !== node) {
              if (node.getElementsByTagName("SimplePredicate").length === 0) {
                     return predicateList;
              }
              predicateList.predicates.push(predicate2Json(node
                     .getElementsByTagName("SimplePredicate")[0]));
       } else {
              for (var j = 0; j < compound.getElementsByTagName("SimplePredicate").length; j++) {
                     predicateList.predicates.push(predicate2Json(compound
                            .getElementsByTagName("SimplePredicate")[j]));
              }
              predicateList.operator = compound.getAttribute("booleanOperator");
       }

       return predicateList;
}



function predicate2Json(simplePredicate) {

       var predicate = {};
       predicate.field = simplePredicate.getAttribute("field");
       predicate.operator = simplePredicate.getAttribute("operator");
       predicate.value = simplePredicate.getAttribute("value");
       return predicate;
}





Now we have a simple parser that pulls the XML data into a simple flare.json object.

What is left is to use d3 to create a simple tree in html.

Create a visualization using d3.js


What is d3?

D3.js is a JavaScript library that makes it easy to bind data to document (DOM) elements. This can be used to manipulate SVG images dynamically according to a dataset. The basic idea is to select (or create) nodes and then bind data to them, which is simply done by calling selectAll(<Node_Id>).data(<dataset>).

Afterwards, each of the XML nodes can be manipulated according to the data.


What is d3.layout?

D3.layout is the part of d3 that makes it easy to calculate common graphic structures.

In our example we want to create a tree, so we would have to calculate the position of each node according to the depth of the tree and the maximum number of nodes on one level. D3.layout does all that work and returns x and y values for each node, as well as its depth and its parent and child nodes.
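The essence of what such a layout computes can be sketched in plain JavaScript. This is a deliberately simplified stand-in, not the real d3.layout.tree algorithm: it walks the tree, assigns each node its depth, gives each leaf its own horizontal slot, and centers inner nodes over their children.

```javascript
// Simplified tree layout sketch: depth per node, one x slot per leaf,
// inner nodes centered over their children (d3.layout.tree is more refined).
function layout(root) {
  var nextSlot = 0;
  function visit(node, depth) {
    node.depth = depth;
    if (!node.children || node.children.length === 0) {
      node.x = nextSlot++; // each leaf gets the next free slot
      return;
    }
    node.children.forEach(function (c) { visit(c, depth + 1); });
    // center the parent over its first and last child
    node.x = (node.children[0].x + node.children[node.children.length - 1].x) / 2;
  }
  visit(root, 0);
  return root;
}

var tree = { children: [{}, { children: [{}, {}] }] };
layout(tree);
console.log(tree.depth, tree.children[1].x); // 0 1.5
```

The real layout additionally spaces nodes by the width and height we pass to size(), which is why each node ends up with pixel coordinates rather than slot indices.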





In order to create a visualization we create a new file in our repository named “createSvg.js”.

We will use a <div> pane in our html which has the id “viz”. To keep things simple, we first create the most basic tree:



function createSvg(treeData) {

       // width and height are assumed to be defined globally (e.g. in the html page)

       // Create an svg canvas and shift it away from the left border, such that the
       // first circle is still visible
       var vis = d3.select("#viz").append("svg:svg").attr("width", width).attr(
                     "height", height).append("svg:g").attr("transform",
                     "translate(10, 0)");

       // Create a tree layout using d3.layout
       var tree = d3.layout.tree().size([ width - 300, height - 200 ]);

       var diagonal = d3.svg.diagonal()
       // in order to have a left to right tree, we have to change x and y axis
       .projection(function(d) {
              return [ d.y, d.x ];
       });

       // call tree.nodes in order to compute the x and y position of each subnode
       // in the tree
       var nodes = tree.nodes(treeData);
       // call tree.links to create an array of the links between the nodes
       var links = tree.links(nodes);

       // draw a path for each of the links
       var link = vis.selectAll("link").data(links).enter().append("svg:path")
                     .attr("class", "link").attr("d", diagonal);

       // position each node, relative to the drawing pane
       var node = vis.selectAll("node").data(nodes).enter().append("svg:g")
                     .attr("transform", function(d) {
                            return "translate(" + d.y + "," + d.x + ")";
                     });

       // draw a circle for every node
       node.append("svg:circle").attr("r", 3.5);

       // attach a simple text to each node, which states the splitting condition
       // of the node, and the score for an endnode
       node.append("svg:text")
                     .attr("dx", 8)
                     .attr("dy", 3)
                     .attr("text-anchor", function(d) {
                            return "start";
                     })
                     .text(function(d) {
                            var name = "";
                            if (d.predicateList) {
                                   for (var j = 0; j < d.predicateList.predicates.length; j++) {
                                          name += d.predicateList.predicates[j].field
                                                        + " " + d.predicateList.predicates[j].operator
                                                        + " " + d.predicateList.predicates[j].value;
                                          if (d.predicateList.operator) {
                                                 name += " " + d.predicateList.operator + " ";
                                          }
                                   }
                            }
                            if (!d.children) {
                                   name += " => " + d.score;
                            }
                            return name;
                     });
}






Now let’s put things together: create a new file in the repository called “simpleTree.html” and put the following into it:






<!DOCTYPE html>
<html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<style type="text/css">
path.link {
       fill: none;
       stroke: #F0AB00;
       stroke-width: 1.5px;
}

text {
       font-size: 11px;
}
</style>

<script src="./.lib/d3.js" charset="utf-8"></script>
<script src="./.lib/d3.layout.js" charset="utf-8"></script>
<script src="pmmlTree2Flare.js"></script>
<script src="createSvg.js"></script>
</head>
<body>
       <div id="viz"></div>
       <script type="text/javascript">
              var width = 1000, height = 600; // canvas size used by createSvg
              createSvg(getFlare());
       </script>
</body>
</html>


Point your browser to “http://<YOUR SERVER>/vis_tree/simpleTree.html”

You should see the following:







As you can see, our basic tree looks rather ugly and is inconvenient for bigger tree structures. Nevertheless, parsing the tree structure into a flare.json object allows us to reuse any given d3 tree example. Just remember that we have not set the "name" string in our flare.json.

In the next step, we reuse existing code for a collapsible tree and add a mouseover text box which contains all of the information we have.




Add functionality to the visualization



Our goal is to add two features:


  1. Make the tree collapsible, so that bigger tree structures are convenient to view
  2. Make all the information visible in the tree structure


In the end our tree structure will look as follows: blue nodes indicate the existence of children, and a click on a blue node expands its top-level children. The blue box is shown when hovering over a node; it lists every bit of information associated with that node.

The tag on each node shows the most likely outcome value on that branch of the tree.
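The expand/collapse behavior described here boils down to one trick used in the code further below: hidden children are parked in a `_children` field and swapped back on the next click. A minimal sketch of that state machine:

```javascript
// Hidden children are parked in _children; toggling swaps the two fields.
function toggle(d) {
  if (d.children) {
    d._children = d.children;
    d.children = null;   // node is now collapsed (drawn filled/blue)
  } else {
    d.children = d._children;
    d._children = null;  // node is expanded again
  }
}

var node = { children: [{ name: "a" }] };
toggle(node);
console.log(node.children === null && node._children.length === 1); // true
toggle(node);
console.log(node.children.length); // 1
```

Because the layout only walks `children`, a collapsed subtree simply disappears from the drawing while staying available for re-expansion.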




For a collapsible tree, a good tutorial can be found at




In order to create a mouseover text box, we create a <div> object containing all the information and append it to each node. The following is the complete source, which can be tested by copying and pasting it into createSvg.js.


Collapsible Tree with textboxes:



function complexTree(flare) {

       var m = [20, 120, 20, 120],
              w = 1280 - m[1] - m[3],
              h = 800 - m[0] - m[2],
              i = 0,
              root;

       var tree = d3.layout.tree().size([h, w]);

       var diagonal = d3.svg.diagonal().projection(function(d) {
              return [d.y, d.x];
       });

       var vis = d3.select("#tree").append("svg:svg").attr("width",
                     w + m[1] + m[3]).attr("height", h + m[0] + m[2]).append("svg:g")
              .attr("transform", "translate(" + m[3] + "," + m[0] + ")");

       // This function produces a div text box, which we will append to each node
       function tooltip(d) {

              var textdiv = d3.select("body").append("div").style("position",
                            "absolute").style("z-index", "10").style("opacity", 0).style(
                            "background-color", "rgb(176, 196, 222)").attr("id", "toolbox")
                     .text("Score: " + d.score);

              textdiv.append('tspan').text("Record Count : " + d.recordCount).append('br');

              if (d.predicateList.operator) {
                     textdiv.append('tspan').text(
                            "Predicate Logic : " + d.predicateList.operator).append('br');
              }

              for (var i = 0; i < d.predicateList.predicates.length; i++) {
                     textdiv.append('tspan').text(function() {
                            var text = d.predicateList.predicates[i].field + " " + d.predicateList.predicates[i].operator + " ";
                            if (d.predicateList.predicates[i].value) {
                                   text += d.predicateList.predicates[i].value;
                            }
                            return text;
                     }).append('br');
              }

              for (var i = 0; i < d.scoreDistribution.length; i++) {
                     textdiv.append('tspan').text(
                            "Score for " + d.scoreDistribution[i].value + ": records " + d.scoreDistribution[i].recordCount + ", confidence: " + d.scoreDistribution[i].confidence).append('br');
              }

              return textdiv;
       }

       root = flare;
       root.x0 = h / 2;
       root.y0 = 0;

       function toggleAll(d) {
              if (d.children) {
                     d.children.forEach(toggleAll);
                     toggle(d);
              }
       }

       // Initialize the display to show a few nodes.
       root.children.forEach(toggleAll);
       update(root);

       function update(source) {
              var duration = d3.event && d3.event.altKey ? 5000 : 500;

              // Compute the new tree layout.
              var nodes = tree.nodes(root).reverse();

              // Normalize for fixed-depth
              nodes.forEach(function(d) {
                     d.y = d.depth * 180;
              });

              // Update the nodes…
              var node = vis.selectAll("g.node").data(nodes, function(d) {
                     return d.id || (d.id = ++i);
              });

              // Enter any new nodes at the parent's previous position.
              var nodeEnter = node.enter().append("svg:g").attr("class", "node")
                     .attr("transform", function(d) {
                            return "translate(" + source.y0 + "," + source.x0 + ")";
                     }).on("click", function(d) {
                            toggle(d);
                            update(d);
                     }).on("mouseover", function(d) {
                            if (!d.tooltip) {
                                   d.tooltip = tooltip(d);
                            }
                            d.tooltip.style("visibility", "visible");
                            return d.tooltip.transition().style("opacity", 0.9);
                     }).on("mousemove",
                            function(d) {
                                   return d.tooltip.style("top", (event.pageY - 10) + "px").style("left", (event.pageX + 10) + "px");
                            }).on("mouseout", function(d) {
                            d.tooltip.transition().style("opacity", 0).duration(1000);
                            return d.tooltip.style("visibility", "hidden");
                     });

              nodeEnter.append("svg:rect").attr("height", 0).attr("width", 0).attr(
                     "transform", function(d) {
                            var length;
                            d.score ? length = d.score.length * 5 + 20 : length = 30;
                            return "translate(-" + length / 2 + ",-20)";
                     }).style("fill", function(d) {
                     return d._children ? "lightsteelblue" : "#fff";
              });

              nodeEnter.append("svg:text").attr("text-anchor", "middle").text(
                     function(d) {
                            return d.score;
                     }).style("fill-opacity", 0);

              // Transition nodes to their new position.
              var nodeUpdate = node.transition().duration(duration).attr("transform",
                     function(d) {
                            return "translate(" + d.y + "," + d.x + ")";
                     });

              nodeUpdate.select("rect").attr("height", 30)
                     .transition().duration(duration / 4).attr("width", function(d) {
                            return d.score ? d.score.length * 5 + 20 : 30;
                     }).style("fill", function(d) {
                            return d._children ? "lightsteelblue" : "#fff";
                     });

              nodeUpdate.select("text").transition().duration(duration / 2).style("fill-opacity", 1);

              // Transition exiting nodes to the parent's new position.
              var nodeExit = node.exit();

              nodeExit.select("rect")
                     .transition().duration(duration / 2).attr(
                            "width", 0)
                     .transition().duration(duration / 2).attr(
                            "height", 0);

              nodeExit.transition().duration(duration).attr("transform",
                     function(d) {
                            return "translate(" + source.y + "," + source.x + ")";
                     }).remove();

              nodeExit.select("text").style("fill-opacity", 0);

              // Update the links…
              var link = vis.selectAll("path.link").data(tree.links(nodes),
                     function(d) {
                            return d.target.id;
                     });

              // Enter any new links at the parent's previous position.
              link.enter().insert("svg:path", "g").attr("class", "link").attr("d",
                     function(d) {
                            var o = {
                                   x: source.x0,
                                   y: source.y0
                            };
                            return diagonal({
                                   source: o,
                                   target: o
                            });
                     }).transition().duration(duration).attr("d", diagonal);

              // Transition links to their new position.
              link.transition().duration(duration).attr("d", diagonal);

              // Transition exiting links to the parent's new position.
              link.exit().transition().duration(duration).attr("d", function(d) {
                     var o = {
                            x: source.x,
                            y: source.y
                     };
                     return diagonal({
                            source: o,
                            target: o
                     });
              }).remove();

              // Stash the old positions for transition.
              nodes.forEach(function(d) {
                     d.x0 = d.x;
                     d.y0 = d.y;
              });
       }

       // If the children are currently visible, toggle moves them into _children
       // (hiding them); otherwise it restores them.
       function toggle(d) {
              if (d.children) {
                     d._children = d.children;
                     d.children = null;
              } else {
                     d.children = d._children;
                     d._children = null;
              }
       }
}







This is part 6 of the blog series; the previous blogs can be found here.

From R to Custom PA Component Part 1

From R to Custom PA Component Part 2

From R to Custom PA Component Part 3

From R to Custom PA Component Part 4

From R to Custom PA Component Part 5



In this blog I'm going to focus on how to create our own component for Predictive Analytics. The component we are creating will be a time series component. Predictive Analytics already ships with time series components; we will nevertheless create our own, so that we can learn and also compare.




Standard Component



Before we start creating our own component, we will create a predictive model on the same data as our custom component. We can then compare our custom component with the standard components.


We will create a simple component as shown below. The source data is a dow_jones_index.csv file that is attached to the blog. This data came from http://archive.ics.uci.edu/ml/datasets.html


I have edited the data slightly. The data basically contains information on specific stocks: their open price, etc.



The file has many stock items. We will analyse a specific stock, so make sure to filter and only look at the stock 'PFE'.





Ensure the R-Single Exponential Smoothing component has the settings below. Note we are using custom periods; this is because the file does not contain data for the whole year, and the data is in weeks. For this example we are going to analyse 30 weeks; we have data for 25 weeks, so we will predict the remaining 5 weeks.





Once we run the model we should have the below diagram and supporting data.











Create R Code for Custom Component



We will now create our custom component and hopefully get similar results. To start creating the component it is best to open R. In order to test our code we need a function, plus a part that reads the data and calls the function. The latter is just for testing; once we are ready to move into Predictive Analytics we will copy only the function.







We will now create a variable holding a data frame with the values from the file, except that we will be filtering by stock == 'PFE'. Also note that the return statement now returns the filtered data.







We have now added two more lines. In the first we calculate the smoothed values and store them in a variable named datafilter.mean; we do this by making use of the HoltWinters function. We then predict for 5 periods.
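For intuition, single exponential smoothing (what a Holt-Winters fit reduces to when the trend and seasonal terms are disabled) can be sketched in a few lines. This is illustrative pseudologic in JavaScript, not the blog's actual R code:

```javascript
// Single exponential smoothing: each smoothed value is a weighted blend of
// the new observation and the previous smoothed level.
function ses(series, alpha) {
  var level = series[0];
  var fitted = [level];
  for (var t = 1; t < series.length; t++) {
    level = alpha * series[t] + (1 - alpha) * level;
    fitted.push(level);
  }
  return { level: level, fitted: fitted };
}

// The h-step-ahead forecast of simple smoothing is flat: it repeats the last level.
function forecast(model, h) {
  var out = [];
  for (var k = 0; k < h; k++) { out.push(model.level); }
  return out;
}

var m = ses([10, 12, 11, 13], 0.5);
console.log(forecast(m, 2)); // [ 12, 12 ]
```

This is why the predicted stretch of the curve in the screenshots is essentially level: the forecast simply carries the last smoothed value forward.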





You will start seeing values in R that correspond to our already created model. For example, below you can see the alpha value and confidence level. You will also see that the number of periods to predict is 5, as it was in the model.



We then show the values on a graph. In R it is a plot.







We should already start to see similarities between our component and the Predictive Analytics model. If we look below, they don't seem similar at first. But they are: the graphs look different because in R our y axis goes up to 21 in increments of 0.5, whereas the Predictive Analytics model's y axis goes up to 25.





By defining the y axis ourselves, we can see that the graphs match.





We can now also add two more lines of code to indicate an optimistic value and a pessimistic value: in other words, the highest value one can expect and the lowest value one can expect.
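Conceptually, the optimistic and pessimistic lines are the upper and lower bounds of a prediction interval around the forecast. The following is an illustrative sketch of that idea, not the exact formula R's predict uses:

```javascript
// A forecast band: point forecast +/- z * standard deviation of the
// one-step-ahead fit errors (z of about 1.96 gives a 95% interval).
function band(forecastValue, residuals, z) {
  var mean = residuals.reduce(function (s, r) { return s + r; }, 0) / residuals.length;
  var variance = residuals.reduce(function (s, r) {
    return s + (r - mean) * (r - mean);
  }, 0) / residuals.length;
  var sd = Math.sqrt(variance);
  return { lower: forecastValue - z * sd, upper: forecastValue + z * sd };
}

var b = band(20, [-1, 0, 1], 1.96);
console.log(b.lower < 20 && b.upper > 20); // true
```

The wider the spread of the fit errors, the further apart the optimistic and pessimistic lines sit around the forecast.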






I have now added several lines of code to create new data sets with fewer columns and to combine the forecast (predicted) values with the original actuals; we need to return a single list to show in Predictive Analytics. I also added date values to the predicted values.




The R code will now return this new formatted list.




Create Custom Predictive Component


We are now ready to take our component and place it into Predictive Analytics. Go to the location below to create the component.






The below will be displayed. Enter the component name. Click next.




You then place the R code in the Script Editor. Only paste the code of the function; do not paste the part that reads the file and calls the function, as we will do this in Predictive Analytics.




We then list the columns, and the data type of each column, that we expect the function to return. We can then click Finish.




Under our algorithms, at the end, you can see your custom one. We can now model with it. I have not used a filter component, as the filtering happens inside our component.






We can now run the model and see the results in Predictive Analytics





The data we returned.





Hope the above helped. Part 7 will focus on making the above more dynamic and will be posted soon.




Find more info on twitter @louisdegouveia

This is the second part of my blog related to predictive & rugby.


In part 1, I explained the data preparation, the creation of the predictive model and the findings.


The tragedy of modern rugby players is that, despite their amazing performances, few of them will be remembered in the long run.


Didier MAZOUE pointed out to me that a performance data set does not capture 100% of the reasons why certain players become legends.


Call it charisma, leadership, fair play, beauty or sometimes evil character traits (this is rugby after all!), but it participates in our collective memories.


Sebastien "Caveman" Chabal is a retired French player who was featured a lot in the sports media over the past years.

I will first test my model on that player, to see if his fame comes from his performance or from his "beardy" look.

Run + Simulation.PNG

I fill the 6 variables that are part of the model, using the data I collected.


The score I get is quite low (0.2 out of 1), which means that Mr. Chabal's popularity will probably never exceed France's boundaries in the coming decades.


Now the fun part starts: trying to predict the future.


I am going to place a bet on Jonathan Sexton becoming an all-star of the next world cup, so as to please my Dublin colleagues and also because I feel he deserves it!


Let’s back this up with some degree of rationale:

Let’s be positive and say Ireland will reach the semi-finals (I don’t want to know what that would mean for France ).

If such is the case, Jonathan Sexton will play 6 additional matches, win 5 of them and unfortunately lose the 6th one.

Let’s pretend he will score an average 8 points per match, that means a total of 48 points.

Here are his current RWC stats:


  • 2 matches started in the field
  • 21 points scored
  • 4 matches won (with the team).
  • He never scored a try during a Rugby World Cup.
  • He is not a captain and will not be this year - this goes to Paul O'Connell
  • He never won the Rugby World Cup.

Sexton Stats.PNG


Jonathan Sexton’s stats after the World Cup would be

  • 8 matches started in the field
  • 69 points scored (in the two RWC editions)
  • 9 matches won (with the team).
  • Let’s assume he would do 3 tries.
  • He is not a captain and will not be this year .
  • Ireland (and him) would not win the Rugby World Cup (in that scenario)

Prediction Sexton - take 1.PNG

There is roughly one chance in two that Jonathan Sexton will become one of my greatest players of all time for the Rugby World Cup.


Let’s now unleash fantasy and explore different scenarios.


  • Twice as many points per game. That means Mr Sexton would reach 117 points in total, 16 points per game. The score varies very little (0.53).

Prediction Sexton - take 2.PNG


  • Let's say he also scores twice as many tries. The score gets better (0.58) but is not there yet.

Prediction Sexton - take 3.PNG

  • Total fantasy scenario (or… could it be? That's the glorious uncertainty of sport). Ireland beats England and New Zealand and becomes the world champion for the next four years. Mr Sexton is a key artisan of the victory with 15 points per game and 6 tries overall. Go Ireland! After such a successful campaign, I would certainly make some room for Jonathan Sexton in my personal rugby hall of fame!

Prediction Sexton - take 4.PNG

Thanks for reading this blog post. I hope you have enjoyed it; please comment on who you think will be the next RWC 2015 rock stars.


I wish all of you a nice summer break, see you on SCN in September!



This is the first part of a two-blog story. Part 2 is here.

I spent two years in Africa and was given the chance to learn how to play rugby in an "international team" (easy, as it was the only one in the country).


Now, the 2015 Rugby World Cup is coming in just 46 days!


I was wondering if we can find out the reasons why a player can become a rugby all-star during the World Cup.

Data Preparation

From the site here, I extracted the statistics for all the players that ever participated in a World Cup.

This represents 2440 players, 50 all-star players but also unsung heroes from Ivory Coast or Portugal.

The value of my target variable is 0 for a normal player, 1 for an all-star player.

My 50 all-star players originate from 9 countries, basically the Six Nations (Italy set apart) and the 4 nations.

Top 50.PNG

(Apologies to my colleague Pierpaolo Vezzosi for not shortlisting Italian players )


This becomes a classification problem. I am going to see how the different input variables can help explain and predict the output variable. The full final dataset is attached to the post.


I open SAP Predictive Analytics 2.2, click on Modeler and Create a Classification/Regression Model.

Screenshot 1.PNG

I load the Excel file containing all the players and their stats.

Screenshot 2.PNG

I need to describe the data. Here is the final screen, with all the information filled in (I attached the description file to the post).

Data Description.PNG


I set the target variable and exclude the Player variable as it is not useful for the model.

Screenshot 4.PNG

I check the Compute Decision Tree check box.

Screenshot 5.PNG

The model overview gives me useful information:

  • The Predictive Power and Prediction Confidence indicators are both very good.
  • 6 variables only were kept in the model.

There is one suspicious variable, Matches Won, as it alone is a very strong predictor of the output. SAP Predictive Analytics is warning us about the strong correlation between this input and the output variable.

The more you and your team win World Cup matches, the higher the chances are that you become a rugby legend. Makes sense, right?

Model Overview.PNG


Let’s move to the variable contributions.

Contributions by Variables.PNG

Some more facts: 

  • The more matches you start in the field as a main player, the more chances you have to become a rugby legend. This is true if the player participated in 8, 9 or 10 matches, and even truer if the player played more than 11 matches (which means he probably participated in several World Cups). To be noted: to start matches, you must be a consistent, strong player.

Variables - Matches Started.PNG

  • “History is written by the victors”. It’s not only about starting the matches but about winning them.

Variables - Matches Won.PNG

  • As an individual player and an all-star you have to bring some decisive points to your team. For exceptional individuals, this would be more than 26 points and could even reach 277 points for one all-star player (quiz time: who is this guy?)

Variables - Points.PNG

  • Self-explanatory, right? What can make you a star is to bring back the Webb Ellis cup (aka “Bill”) home. And if you can do it two times…

Variables - Nr of RWCs Won.PNG

  • Scoring tries is also key to being recognized as an all-star player. Like Jonah Lomu, who scored a record 15 tries across two World Cup editions.

Variables - Tries.PNG

  • Being a captain and leading the team to victory is key to staying in people’s hearts & memories.

Variables - Captain.PNG

I move to the Decision Tree and use the two most important variables in my model. 

Decision Tree.PNG

Here is my interpretation:

  • On top, we see the overall player population, including 50 all-star players (2.05% of the total).
  • If the player starts at least 11 matches, the percentage of legend players in that population of 93 individuals climbs to 29%. 27 of my 50 greatest fall into this category. 
  • If I add the criterion “player has won at least 10 matches”, the percentage climbs to 44%, and 22 of the greatest players fall into this category.

I will keep this first blog post short and leave the rest for the second part (teaser: it's all about predictions ).

Thanks for reading! Your comments are welcome!


Since it caught your attention last time, I'm happy to repeat what I did for Q1, as you may not have seen all of them: here are the top 10 articles the community voted for in Q2. Summer is a good period to catch up on what you did not have time to read before.


#1 Announcing SAP Predictive Analytics 2.2! - by Ashish Morzaria

Discover all the new capabilities released in this new version of our product whether you are a business analyst with the Automated Analytics mode or a data scientist with the Expert Analytics mode!

#2 7 #Predictive Sessions You Should Attend @ #BI2015 in Nice, France - by P Leroux

Going to attend SAP Insider BI? This post gives you an overview of the predictive sessions you will have the opportunity to attend!

#3 Predictive Smackdown: Automated Algorithms vs The Data Scientist - by Ashish Morzaria

3 profiles, 1 tool. Whether you are a Data Scientist, a Business Analyst or a Business User, SAP Predictive Analytics will meet your needs and expectations. Learn more through a funny analogy!

#4 Gartner BI Summit 2015: Big Data = Predictive Analytics - by Ashish Morzaria

A detailed article on one of the key Gartner summit outcomes or how far Big data and Predictive Analytics are linked and work together. Read more!

#5 Learning about Automated Clustering with SAP Predictive Analytics 2.0 - by Tammy Powlas

Discover the step-by-step explanation of how Tammy ran her first clustering analysis with the Automated mode of SAP Predictive Analytics, and the ease of use she found in it.


#6 Using Application Function Modeler To Create Forecast (APL) Procedure in SAP HANA - by Angad Singh

Angad shared the experience he gained while creating a forecast procedure using the following SAP HANA features: the APL (Automated Predictive Library) Forecast function and the Application Function Modeler (AFM). Learn from it!


#7 Predicting the Future using a BEx Query as a Data Source - by Tammy Powlas

A progressive approach to using a BEx query of actual expenses by project by month to forecast the future with the Expert mode of SAP Predictive Analytics.


#8 Announcing ASUG SAP Predictive Analytics Council Launch - Roadmap Preview - by Tammy Powlas

Learn more about our ASUG SAP Predictive Analytics Council focusing on Exploratory Analytics! And if you are interested in joining the council, please complete the council’s participation survey.

#9 How to install SAP Predictive Analytics Desktop 2.2 - by Tammy Powlas

A detailed post to help you install SAP Predictive Analytics Desktop 2.2. Very useful!


#10 Predicting Happiness - by Kurt Holst & Savaneary SEAN

An end-to-end data mining case using SAP Predictive Analytics to uncover new information related to happiness, to predict whether a country is happy or not, and to identify which features impact happiness. Read the paper now!

And finally a repeat session of what you may have missed earlier in the year:

The 10 SAP Predictive Analytics Community Most Viewed Q1 Articles

Again, there are many predictive resources available on SCN and sap.com/predictive! Here are 3 ways to get engaged:

- Follow the SAP Predictive Analytics community to be informed as soon as there is something new posted or discussed here

- Check the ‘Content’ tab to make discoveries here

- Follow your favorite authors to be informed when they publish a new piece

And don’t forget the tutorials page that is updated on a regular basis and where you find tons of crucial tips!




This is part 5 of a series of blogs.

Please refer to the other parts that can be found here

From R to Custom PA Component Part 1

From R to Custom PA Component Part 2

From R to Custom PA Component Part 3

From R to Custom PA Component Part 4




In this blog I'm going to focus on how to use swirl. Swirl is a package you can load to learn R from within R. This will be the last blog in my series explaining plain R syntax and programming; the blogs that follow will focus purely on using R code in Predictive Analytics to make a component.


So I thought it would be good to end the R learning part of my blogs with a way for readers to further learn and expand their knowledge on R on their own.


Swirl is not part of the default installation, so we will need to install this library to use it. There are two ways to install libraries: one is straightforward, simple and mostly automatic; the other is more manual, but is good to know in cases where you have problems with the automated method.




Swirl - Automatic Library Install


As mentioned in part 1 of my blogs, we can see the libraries available for us to use in this directory. You will see that there is nothing for swirl.



If you try to load the library swirl you will get a message saying it does not exist.





The simplest and easiest way to install is to run the command install.packages("swirl")
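As a minimal sketch, the same install can be run directly from the console (passing repos skips the mirror selection dialog):

```r
# Install swirl from CRAN; specifying repos avoids the mirror prompt
install.packages("swirl", repos = "https://cran.r-project.org")

# Load it to confirm the installation worked
library(swirl)
```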





Then choose a CRAN mirror; being from South Africa, I select the Johannesburg one.





R will then download all the required packages automatically and attempt to install them.




Once it is completed you will see that it installed the packages it downloaded. It will also let you know where it downloaded the packages to.




You will also see that your library is updated with the new packages.




Swirl should now work, jump to the "Start Swirl" section.






Swirl - Manual Library Install


As mentioned in part 1 of my blogs, we can see the libraries available for us to use in this directory. You will see that there is nothing for swirl.





If you try to load the library swirl you will get a message saying it does not exist.




You will need to download the required package zip files from https://cran.r-project.org/bin/windows/contrib/3.1/

You will download a zip file.

You will need to download swirl, httr, R6 and yaml as a minimum, but it is recommended that you also include jsonlite, mime and curl.





Then to install the package, open the RGui. Select "Install package(s) from local zip files". Select the zip file you downloaded previously.





You will then get a message in RGui that it installed successfully. You will also now see it in your library folder.




You will need to repeat this for the libraries swirl, httr, R6, yaml, jsonlite, mime and curl.






Start Swirl


Once you have done the above you can load the library swirl, then start swirl with the command swirl(). It will guide you, ask questions, and you need to provide answers. There are a few courses, and you will need to install them; if you are connected to the internet it will do this for you. Alternatively, you can ask it to open the GitHub repository and do a manual install.
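The startup sequence described above boils down to two commands:

```r
library(swirl)  # load the package
swirl()         # start the interactive tutor; it prompts for your name
                # and offers to install courses (e.g. "R Programming")
```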



You can now carry on learning R at your own pace; swirl will guide you and show you the commands step by step.



Hope this is helpful. Hope you will use swirl to learn more R. Part 6 coming soon.




This is part 4 of a series of blogs.


Please refer to the other parts that can be found here

From R to Custom PA Component Part 1

From R to Custom PA Component Part 2

From R to Custom PA Component Part 3



In this blog I will be focusing on how to debug in R. This will help you debug your own R code, and also help you understand what is going on at each step of the code.


So I will cover debugging in RGui and in RStudio




RGui Debug


So let's look at debugging the function we wrote in a previous blog. You will need to execute the function definition as shown below.





You then need to enter the debug command and give it the name of the function we want to debug. Once you have done this, when we call the function it will go into debug mode.




You will now see that when we call the function it shows Browse. At every step we can now look at what variables are defined and the values in each variable.





If you press enter it will step through each line. If you enter ls() it will show you the variables that have been declared. To see the value of any variable, just type the variable name. Below you can see I went through the whole function, looked at the variables declared, and looked at their values.




Every time you execute the function it will debug the function; to stop this you must issue the undebug command.





The other way to go into debug mode is to insert the command browser() in your R code. It will then go into debug mode from the line where the browser() call is located.
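Putting the RGui debugging steps together, here is a minimal sketch (the addTwo function is just an illustrative example, not code from the earlier blogs):

```r
addTwo <- function(a, b) {
  z <- a + b
  z
}

debug(addTwo)    # flag the function for debugging
addTwo(3, 4)     # now runs at the Browse> prompt: press enter to step,
                 # ls() lists declared variables, typing a name shows its value
undebug(addTwo)  # stop debugging the function on every call

# Alternative: drop browser() into the code to break at that exact line
addTwo2 <- function(a, b) {
  browser()      # execution pauses here in debug mode
  a + b
}
```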






RStudio Debug




Debugging in RStudio is pretty simple: click in the margin next to the line you want to stop at, which creates a breakpoint. Then click Source. Note that to use the debugger your script must be saved.




You will then be in debug mode and be able to step through each line.







Hope this helps, part 5 can be found here From R to Custom PA Component Part 5

Predictive analytics and predictive models that extract more insight and value from data are on the wish lists of many organisations today.  The ability to use predictive modelling tools to help predict outcomes such as which customers to target, which customers are more likely to buy other products, and the likelihood of customers leaving is more achievable than ever before.  However, when we ask our predictive model to answer a business question, we must not forget the importance of getting the data right.


Clearly, if you put rubbish data into a predictive model then it doesn’t take a genius to work out that a rubbish model will be generated.  Predictive models are only as good as the data going into them.  Making sure that the source data that you use is properly managed and organised is key to this.


Extraction, transformation and loading (ETL) tools have been around in the marketplace for many years; however, many organisations are using out-of-date tools to just “lift and shift” data from one environment into another.  The ETL process is a key point where problems and issues in your data can be identified and rectified before they end up in the models.  Ideally, problems and issues should be resolved in the source systems, but this is not always possible.


Here are some key areas where help is needed on data: 

  • Removing duplicates
  • Integrity checks
  • Names and address checking
  • Text checking to pull out sentiment in the data
  • Putting in place rules and analytics to check the quality of the data


Coding SQL is never a scalable and sustainable option: imagine having to trawl through code just to change one business rule.


SAP has a complete range of tools that provide functionality to profile data, add business rules and move data without writing code.  Data stewards can create repeatable jobs that are quick and easy to maintain, all through graphical interfaces.  For more information about these tools, click here: http://scn.sap.com/community/enterprise-information-management

As a user of Predictive Analysis, I thought I would share this hidden gem inside the Expert Analytics option: the ability to do further analysis and create visualisations on the results of your predictive models.


Let me show you an example of what I mean by this.


Below is a screenshot of a cluster analysis that I have created inside the Predict tab of Expert Analytics.


Cluster Analysis.JPG


Looking at the results I can see how the clusters are represented using the standard visuals that are available for cluster analysis.


This is great; however, what I also want to do is analyse the clusters themselves and look at the data in more detail: i.e. use the Visualise panel to create different visuals to understand what data makes up each cluster, and filter on clusters to analyse further.


You can do this, but it is not very obvious.  To do it, follow these steps:


1. After you have created and run your predictive model click on the Visualise tab.


2. Just below the Visualise tab there is a "Select Component" drop down option

select component.png

3. Notice that I can select "Auto Clustering" - this is the results set of the clustering that I have just performed in the Predict tab.


4. Select this.


Now I can create new visuals.  The data includes the extra predictive column created through my predictive model.  This enables me to filter on specific clusters and analyse further what makes up these clusters.


visualise cluster analysis.JPG


You can do this on any predictive model results set, making further analysis of the results of your models very easy to do.


Hope that this helps in your work.






This is part 3 of a series of blogs.


Please refer to part 1 and part 2 that can be found here

From R to Custom PA Component Part 1

From R to Custom PA Component Part 2


In this blog I will be focusing on syntax that is intermediate for a beginner; for an experienced R developer this will still be considered basic. You will, however, need this knowledge to create a Predictive Analytics component.


I will cover the following

  • Vectors
  • Matrix
  • Data Frames
  • R Scripts
  • Functions
  • Loops
  • Read files
  • Graphs


Then we will review the differences between RGui and RStudio.





In R, a vector is basically what we would call an array in other programming languages. A vector's values can be numbers, characters or any other type, but they should all be the same type.


Start by opening RGui again and type straight into the console. To create a vector we use the c function, which is short for combine. It combines the values to make a vector. Below is an example of two vectors.




In the previous example we are not storing the result in a variable; in reality we want to store everything in a variable. So here I have stored a list of values in x, then displayed the contents of x.



You can also assign a range of values using the syntax below. So here I'm saying the variable y will have the values from 5 to 9.




We can also access a single value in the vector. Below I'm accessing the value at position 3.



We can append values with the below syntax.




We can change one of the values, below we are changing the second value.




We can assign names to each value in the vector.




You can then access a value by the name you assigned in the previous step.





All the vector steps shown above can be applied to other data types. Below I'm doing it with text.
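The vector steps above can be sketched in one run (the values are illustrative):

```r
x <- c(4, 7, 9)                    # combine values into a vector
y <- 5:9                           # a range of values from 5 to 9
x[3]                               # access the single value at position 3
x[4] <- 10                         # append a value at position 4
x[2] <- 1                          # change the second value
names(x) <- c("a", "b", "c", "d")  # assign a name to each value
x["b"]                             # access a value by its name
words <- c("hello", "world")       # the same steps work with text
```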








In R, a matrix is basically what we would call a two-dimensional array in other programming languages.


So here is an example of how to create a matrix. mat is a variable. The matrix function creates the matrix; I have said create it with 3 rows and 3 columns, and default the values to 1. You can see the result in the variable mat.





We can change the value of a specific item in the matrix with similar syntax to a vector, except we must give both the row and the column. So below I have said: change row 1, column 3 to have the value 5.



We can also access all the values in a row or a column. Below I first access by row, showing row 2; then I access and display column 3.
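In code, the matrix steps above look like this:

```r
mat <- matrix(1, nrow = 3, ncol = 3)  # 3x3 matrix, values defaulted to 1
mat[1, 3] <- 5                        # set row 1, column 3 to the value 5
mat[2, ]                              # all values in row 2
mat[, 3]                              # all values in column 3
```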






Data Frames



A data frame is similar to a matrix, except that a data frame can have different types of data in different columns, where a matrix can't. A data frame is also easier to work with, although a matrix is more efficient in terms of performance.


In the example below I create two vectors, one with employee names and another with salaries. I then combine them into a data frame. employee.data is a variable.





You can then access the data frame the same way you would access a matrix.
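A sketch of the data frame example, with hypothetical names and salaries:

```r
employee.names <- c("Alice", "Bob", "Carol")  # vector of employee names
salaries       <- c(50000, 62000, 58000)      # vector of salaries
employee.data  <- data.frame(employee.names, salaries)

employee.data[1, ]      # first row, accessed like a matrix
employee.data$salaries  # a whole column, accessed by name
```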





R Scripts


Up until now we have entered everything in the console. We have done this to learn the syntax and understand how the console works. In reality, though, you would not work directly in the console; you would create an R script and enter everything there.


To create a script go to File->New script




You can now add R code to the script. So in this example I created variable X with value 10, created variable Y with value 2, created variable Z as X*Y, and then output Z.
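The script described above would contain:

```r
X <- 10     # create X with value 10
Y <- 2      # create Y with value 2
Z <- X * Y  # create Z as the product
print(Z)    # prints 20
```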





To execute the lines, highlight them, then right-click and select "Run line or selection". You can also select one line at a time and execute it.

RGui17.jpg


It will then execute the commands in the console.








Being able to write functions is important; you will need this to create a custom PA component.


The basic syntax for a function is


myfunction <- function(arg1, arg2, ...) {

    function body

}




Below is an example of a function. In this function I receive two values, add them together into variable Z and print Z out. So this is a very basic function. Please note that from now on I will always create a script, place the R code in the script, and execute from the script.






The example above is very basic; we would not normally write a function like that. We would usually create a function that returns a value, and we will need to return a value when creating a component for Predictive Analytics. So be sure to understand the example below.
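The two patterns can be sketched like this (the function names are illustrative):

```r
# Basic function: adds two values and prints the result
addAndPrint <- function(a, b) {
  z <- a + b
  print(z)
}

# The pattern to understand: return the value to the caller,
# which is what a custom PA component has to do
addAndReturn <- function(a, b) {
  z <- a + b
  return(z)
}

result <- addAndReturn(3, 4)  # result now holds 7
```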









There are different types of loops in R. I will cover two of them.


Here is an example of a for loop.  I assign the variable X a range of values from 1 to 10. The variable Z is created and set to NULL. Then in the for loop we say the variable i will start from 1 and go up to 10, so the loop will repeat the code between the curly brackets 10 times. On each pass I get the value in X at the position given by i. So if the loop is on its third cycle, the variable i will contain 3, and we get the value in X at position 3 (in this case the value is 3, but it could have been different). We then add 1 to that value and assign it to Z. Each time I replace Z, then print the value.






Here is an example of a while loop. The biggest difference is that in a for loop the number of iterations is determined before the loop starts, while a while loop keeps going as long as its condition holds, and the condition changes as the loop runs. In the example below you can see that inside the while loop I increment the i variable, and the loop stops once i is no longer <= 10.
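Both loops, as described above:

```r
# for loop: the number of iterations is fixed before the loop starts
x <- 1:10
z <- NULL
for (i in 1:10) {
  z <- x[i] + 1  # value of x at position i, plus 1
  print(z)       # prints 2 through 11
}

# while loop: repeats as long as the condition holds;
# the condition changes inside the loop body
i <- 1
while (i <= 10) {
  print(i)
  i <- i + 1
}
```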






Read Files


You will need to be able to read files to work with the sets of data you have in text files.


Below is an example where I use the setwd function to set the directory where the file is. I then read in the data by specifying the file name, indicating that the first row is the header, and then display the data. Because the columns have different data types, reading the file produces a data frame.
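As a sketch (the directory and file name here are hypothetical):

```r
setwd("C:/data")                                      # directory containing the file
mydata <- read.table("employees.txt", header = TRUE)  # first row is the header
mydata                                                # a data frame, since columns
                                                      # can hold different types
```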








There are several options and libraries for graphs.


I'm going to stick to the basic ones.


Here is an example of a bar graph, called a bar plot in R. Basically I create a vector called graphvalues, then assign names to the vector as we have done previously. These names are used on the x axis of the bar plot. Then I call the barplot function and pass the vector as a parameter.




When you execute this the following bar plot will be displayed.






Using the same code as above, just change barplot(graphvalues) to plot(graphvalues)





Using the same code as above, just change plot(graphvalues) to plot.ts(graphvalues)
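The three graph variants above, with illustrative values:

```r
graphvalues <- c(12, 7, 3, 9)                    # example values
names(graphvalues) <- c("Q1", "Q2", "Q3", "Q4")  # labels used on the x axis

barplot(graphvalues)  # bar plot
plot(graphvalues)     # point plot of the same values
plot.ts(graphvalues)  # time-series line plot
```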








RStudio Differences



Below you can see the differences when doing everything above in RStudio. The script takes up the top left window and the console moves to the bottom. Your plots/graphs are shown at the bottom right.






When working with data frames we can see another difference: displaying the data in the data frame is easier. You can see the data in an easy-to-scroll window that shows it in an Excel-like grid.





Hope you find this useful. Part 4 can be found here From R to Custom PA Component Part 4




This is part 2 of a series of blogs.


Please refer to part 1 that can be found here From R to Custom PA Component Part 1


In this blog I will be focusing on the beginner syntax required to create a Predictive Analytics component. I will cover the following:

  • Basic syntax
  • Functions
  • Variables
  • Help


We will then have a look at RStudio again to understand the differences between RStudio and RGui.



Basic Syntax



So open the RGui as shown in part 1 of this blog series. You can immediately use it as a calculator by typing in 4+4 and pressing enter; 8 will then be shown as the answer. As seen below, I have done +, / and *. Give it a try and see the same results as shown below.




You can also evaluate Boolean expressions by typing 5<7 and pressing enter. R will respond with TRUE or FALSE, in this case TRUE. Note that to test whether 8 equals 7+1, equality in R is written as ==, not =.



In R, the value T is short syntax for TRUE; likewise, the value F is short for FALSE.
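Typed into the console, the expressions above behave like this:

```r
4 + 4       # 8
10 / 2      # 5
5 < 7       # TRUE
8 == 7 + 1  # TRUE: equality is tested with ==, not =
T           # shorthand for TRUE
F           # shorthand for FALSE
```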




When working with text you should enter the text in double quotes. As shown below, I first typed the text without quotes and got an error back; the second time I entered the text correctly. With text, the console just responds with what you entered.







R comes with built-in functions. So let's go through some of them to see how they work.


To use the sum function, type sum(value1, value2, value3, ...). Below are some examples.




To use the square root function, type sqrt(value1)




There is a repeat function, below is an example.
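The function calls above, with example values:

```r
sum(2, 3, 5)          # 10
sqrt(16)              # 4
rep("hi", times = 3)  # "hi" "hi" "hi"
```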




Here are more examples of functions. Give them a try, and feel free to google more functions.









When working with variables in most programming languages, you need to declare the variable and say whether it is a string or an integer. In R you don't need to do that; you can just type x<-10. This syntax says the variable is x and we are assigning the value 10 to it.


So below you can see I assigned x the value 10; after pressing enter, the console waits for another command. If you just type x again it will tell you the value in x.



You can then start working with x. So if x has the value 10 and I want to divide by 2, I simply type x/2 and press enter. Below you can see I divided by 2 and later added 10. Bear in mind I'm not changing the value of x; I'm just calculating what x would be if I divided it by 2 or added 10 to it. x still has the value 10.



To change the value you must use <- to assign it. So below I'm saying x has a new value, x/2, which is 5.




We can do the same with text. Here I have a variable called y, and I assign the value Hello World to it.



You can check the data types R has assigned to x and y using the str function. It will show the type and the value.
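The variable steps above, in one run:

```r
x <- 10     # no type declaration needed; <- assigns the value
x           # typing the name prints the value: 10
x / 2       # 5, but x itself is unchanged
x <- x / 2  # now x really becomes 5
y <- "Hello World"
str(x)      # num 5
str(y)      # chr "Hello World"
```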







You can also use the help function to get more info. To get help on a function, just type help(function); below I have typed help(sum). This launches a browser that shows the help for this function.






RStudio Differences




So as mentioned in part 1 of my blog series, we can code in RGui or RStudio. Let's see what RStudio offers.



In the R console we can type commands the same way we did in the RGui console. Looking at the top right window, we can see that it shows the x and y variables and their values. So in RGui we have to call x and y to see the values, whereas in RStudio we can see them immediately in the top right window. In the bottom right window you can access the help.




When working in the console you can press Ctrl+Space and RStudio will suggest available functions or syntax matching the text you typed. It will also provide info on how to use the function. For example, below we can see that substr requires x (the string), a start value, and a stop value.




Hope this helps, part 3 can be found here From R to Custom PA Component Part 3.




SAP has come a long way with its Predictive Analytics solution, which has some excellent functionality. But even with that functionality, from time to time you will need to create your own component or code in R.


Creating your own component in Expert Analytics mode can be difficult if you don't know where to start. There is some good content on SCN, like the following links:




However, I didn't find a step-by-step guide that explains it as simply as possible from beginning to end. So I'm going to write several blogs that will take you from the beginning to the end of creating your own component. And when I say the beginning, I mean starting with raw R and understanding the syntax, all the way up to creating a custom component.


I'm hoping to create blogs covering these topics.

  • From R to custom PA component Part 1 - will focus on R environment (IDE) where you can script
  • From R to custom PA component Part 2 - will focus on basic R syntax and developing in R
  • From R to custom PA component Part 3 - will focus on more intermediate R syntax and developing in R
  • From R to custom PA component Part 4 - will focus on how to debug in R
  • From R to custom PA component Part 5 - will focus on swirl, continue to learn R on your own.
  • From R to custom PA component Part 6 - will focus on creating a basic component in Predictive
  • From R to custom PA component Part 7 - will focus on making the component from Part 6 more dynamic and complex


I will edit the list as I create the blogs and update each one with links.


So, this being Part 1, we will focus on how to code in an R environment. This environment will help us learn the basic syntax we need to create the R component, and will also provide an environment to test the R code for our component.






In order to follow this series of blogs you will need to have Predictive Analytics installed. I currently have Predictive Analytics 2.2 installed, and I would recommend the same version to follow these blogs.





R Integrated Development Environment (IDE)



There are a few R IDEs available in the market, but for this series of blogs I'm going to focus on two products.

  • Option 1 - RGui, installed and readily available after the Predictive Analytics 2.2 installation.
  • Option 2 - RStudio, a popular R IDE that you can download for free.


I would recommend installing both, as in my series of blogs I will refer to both environments, and by the end of the blogs you will know how to use either one.

R IDE Option 1 - RGUI


You need to ensure that you have installed and configured R as part of your installation.


R Install.png





Once installed, you will see an icon on your desktop as shown below. Predictive Analytics from SAP makes use of R and its libraries, which is why the application is installed and available after installing Predictive Analytics.




From within this environment we can write R code, learn R syntax, and get to the point where we have fully functional R code to take into Predictive Analytics for our custom component.

R IDE.png



The small window shows the R console. This console acts like a command prompt, but for R. To test that everything is in order, type in 4+4 as shown in red. The console will then reply with the answer, which is 8.


R Console.png



You will also see that the R has installed in the following directory C:\Users\Public\R-3.1.2


From here you can navigate to the libraries. As we learn R you will find these libraries important: many functions come from different libraries, some functions will be from libraries not installed yet which we can then install, and each newly installed library creates a new folder here.





R IDE Option 2 - RStudio


Before proceeding, I would recommend ensuring that you have installed and configured R as shown in option 1. We will then install RStudio and make use of the same R libraries, which allows us to use both environments against the same libraries.


You need to download RStudio from the link below; choose the appropriate version.



I have selected Windows/Vista/7/8 version as indicated below.





Once you have downloaded it, you can install RStudio. The installation is very simple; just click Next, Next.


Once installed you can open RStudio. It will open as shown below, with the R console where you can type in 4+4 and get the answer, as in Option 1.




By navigating to Tools->Global Options you can see which R installation is being used; this also dictates which libraries are used. I'm using the same installation as option 1.





You can see the libraries in the bottom right window under the Packages tab. This shows the contents of C:\Users\Public\R-3.1.2\library as discussed in option 1.







Hope you have found this helpful. Part 2 can be found here: From R to Custom PA Component Part 2



In the previous post we talked about the concept of Exploratory Analytics.

As a quick reminder, exploratory analytics is not a product, it is rather an approach to data analysis and a set of functionalities where you let mathematical algorithms and computer automation work on your dataset to surface, automatically, some interesting results (correlations in data, outliers in your dataset, groups of items with similarities, etc.).


You, as a business savvy person, can look at those results, see what their business value is and take strategic decisions based on them.

Exploratory analytics are complementary to both classic analytics and advanced analytics as shown in the picture below:


In the classic analysis approach, you decide, step by step, what to do with your data. You create tables, filters, slice and dice with the goal to surface some knowledge which you expect to see in the dataset. You usually work manually towards a specific goal in mind with a trial and error approach. With this approach you can easily answer questions such as “How many customers are buying my product?”


In advanced analysis you let mathematical algorithms work on the data to build a predictive model.  You then use this model to take operational decisions. With advanced analytics you can, for example, build a model which answers (in real time if you want) a question such as “Is this prospect likely to buy my product?”.


Finally, in an exploratory analysis, you make use of the same algorithms as advanced analytics to obtain insights which help answer questions such as “Why are customers buying my product?”.  Knowing the ‘why’ behind a decision can help you change your business to improve it.



What you’ll learn reading this blog

In this blog we show how SAP Predictive Analytics, with its Automated Analytics module can provide you the instruments you need to do exploratory analytics.

Practically speaking, we show that after performing a classification, you are automatically presented with various insights which can be used to drive your decisions.


Supposing that you are analyzing a dataset showing customer characteristics, and your target is a flag saying whether or not a customer has purchased a product, after running the classification you automatically get various insights:


“Key Influencers” are the variables which most explain the target (e.g. which customer characteristics are most related to the decision to purchase a product or not). You can get insight on specific values of key influencers, but you also automatically get “groups” or “bands” of values with a similar influence.

The values are automatically “grouped” together when the variable is categorical (e.g. “customer’s country is France, USA or Italy”); they are automatically “banded” when the values are continuous (e.g. “age between 29 and 45”). Groups and bands greatly simplify the analysis, and the tool does a great job automatically proposing the best ones without you having to worry about the best way to bin your data.


Finally, the tool quickly and automatically points out “segments” of interest. These are sets of records having similar characteristics which have a strong influence on the target (e.g. the tool can show that customers “living in the USA and aged between 18 and 25” show the highest likelihood to purchase your product).


It is time now to see some action and understand, with an example, how you can improve your business based on insights coming out of an exploratory analytics approach.


The whitest napkins you have ever seen!

Imagine that you are working in a company specialized in cleaning table cloths and napkins for restaurants. In the past few months you created a new offer called “Premium Service” which guarantees restaurants to have the whitest napkins in the whole country! You proposed the service to several of your existing customers; some of them purchased it, some did not.


You created a list of all the restaurants to whom you proposed the service. In this list you put all the characteristics about your customers (e.g. how many seats the restaurants have, if they are located downtown, in the suburbs, in the country, the average price of a meal, if they have a valet, etc.). For each customer you marked whether or not they purchased the Premium Service.

The dataset might look like this:



You can use this list for two tasks: create a predictive model which can tell if a prospect is likely to accept the service (advanced analytics) and/or see if you can find some interesting patterns in the restaurant profiles which you can use to improve your business (exploratory analytics).


Typically a marketing manager focused on a short term marketing campaign (where the goal is to maximize the return and minimize the cost) would use the predictive model in an operational mode.


A business strategist who wants to improve the business globally on the long term would be more interested in the exploratory approach.


To accomplish both tasks you can use SAP Predictive Analytics and its Automated Analytics module.


The basic question you want to answer is whether or not a customer is likely to buy the service. This is a typical classification problem, so you apply the Classification module and set the Premium Service flag as the target variable.  (If you have never used SAP Predictive Analytics you can watch this video to see how to use Classification: http://scn.sap.com/docs/DOC-62236 )


All other variables (excluding IDs) are going to be analyzed to understand their influence in the purchase decision. The screen where you set the variables looks like the following:


Now you click a few Next buttons and, after the Classification completes its execution, the model has been created automatically for you.

First of all you need to check that the quality of the model is good; to do that, look for the Predictive Power (also known as “Ki”) and Prediction Confidence (“Kr”) in the model summary.


If the model is good you can now use it in an operational mode to ask “Is this prospect likely to purchase the service?”, or you can use it in an exploratory mode to ask “What are the typical profiles of customers who purchase the service?”. This second mode helps you take strategic decisions.


You should notice here that you are using information from the past (your list of customers who, you already know, purchased or not your product). The Classification module is able to discover patterns in the past data. The tool can then apply the same patterns on new data (prospects) or help you analyze them to understand what is influencing a purchase decision.
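Automated Analytics builds and applies the model for you, but as a rough illustration of the underlying idea, here is a minimal R sketch of learning a pattern from past data and applying it to a new prospect. The dataset and column names are made up, and a plain logistic regression stands in for the tool's own algorithms:

```r
# Hypothetical historical data: restaurant profiles plus a purchase flag
restaurants <- data.frame(
  covers  = c(90, 40, 110, 60, 35, 85),  # number of seats
  premium = c(1, 0, 1, 0, 0, 1)          # 1 = purchased Premium Service
)

# Learn a pattern from the past: purchase flag explained by the profile
# (suppressWarnings because this toy dataset is perfectly separable)
model <- suppressWarnings(
  glm(premium ~ covers, data = restaurants, family = binomial)
)

# Apply the same pattern to a new prospect (a 95-cover restaurant)
prospect <- data.frame(covers = 95)
p <- predict(model, newdata = prospect, type = "response")
cat("Estimated likelihood to purchase:", round(p, 2), "\n")
```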


For an operational usage you can immediately go to the Run or to the Save/Export sections. From there you can check in real-time which new prospects are likely to purchase the service. Alternatively, if you are a developer or work with developers, you can export the model in various programming languages (like Java, C, C++, SQL, and many others) so that it can be embedded in an application suggesting to your sales team which restaurants to approach. SAP Predictive Analytics automatically provides you the code in the language you need. You copy it and paste it into your application.


In this blog we are more interested in an exploratory analytics usage, let’s see how to proceed with it.


For strategic decisions you can look at various information generated with the model, see if they make business sense and decide how to use them.

You can start your exploration by looking at the key influencers.

In the Automated Analytics module you open the Contributions by Variables section under Display. You see a visualization similar to the one below:

3.varaible contribution (2).jpg

This graphic tells you that the variables which are most related to the decision to purchase the service are, in order of priority, the Price Segment, the Location and the Number of Covers of a restaurant. Those are the key influencers of the Premium Service target.


If you double click on Price Segment you see the following visualization:

4.price segment (2).jpg

This is telling you that, according to your past data, a very expensive restaurant (80 and more USD for a dinner, shown on the left with a positive value) is more likely to purchase the service. On the contrary, inexpensive restaurants (19 USD for a dinner, shown on the right with a negative value) are less likely to purchase the service.


While you are on this screen, you can also see that all other price segments were automatically grouped under the label “KXOther”. Those other price ranges are not really meaningful and the tool simplifies the visual analysis for you by grouping them together.


If you now open the second variable, Location, you see something like the following screen:


This screen tells you that Downtown restaurants are more likely to purchase your service, while Countryside restaurants will probably not purchase it. Here again you see that a new group has been created automatically with restaurants in Small Town or in Suburbs. They have the same (negligible) influence, so there is no need to make the analysis more complex by showing separate entries.


When opening the third variable, Covers, you have the following screen:


We won’t actually use this information for our analysis, but you can see that you automatically obtained bands of values which have a similar influence. If the visualization had a bar for each value of “Covers” it would have been almost useless, because it would be too difficult to read and too detailed to be effective.  With the automated banding you can immediately see that restaurants in the band of 76 to 106 seats are the most likely to purchase the service.
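To get a feeling for what banding does, here is how a continuous variable can be binned in plain R with `cut()`. The break points and values are chosen by hand for illustration; Automated Analytics finds the best bands for you:

```r
# Hypothetical number of covers for a handful of restaurants
covers <- c(20, 45, 76, 88, 106, 140)

# Band the continuous values into ranges;
# intervals are (0,75], (75,106], (106,Inf)
bands <- cut(covers, breaks = c(0, 75, 106, Inf),
             labels = c("up to 75", "76 to 106", "107 and more"))
print(table(bands))
```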


Let’s see how we can use the knowledge we already gained.

First of all, you have now identified your most important variables in the list of key influencers. You could decide to simplify your analysis (even a classic analysis) by taking only them into account. In this example it might not seem very useful, but if you think of a scenario where you have thousands of sensors, you might be able to identify the few that are really important for your analysis and use only their data.


Restaurants which are more likely to purchase the Premium Service are expensive; they are probably luxury restaurants. You could propose to your marketing team to refresh your brand to make it look high-end. New expensive restaurant prospects might be attracted by this luxury aspect of the brand.

On the other hand you might take a completely different approach: reduce your pricing to be more attractive for inexpensive restaurants.


On the location front, you could decide to focus your business on the downtown areas of large cities. This could reduce your cost of transport while making sure your trucks are faster on site when an important customer calls for something urgent. This decision could even mean that you decide to disregard completely restaurants located in the countryside.


We can go even further in our exploratory analysis.

If you open the Decision Tree section of SAP Predictive Analytics you can look at the combined influences of multiple variables. The screen below shows the root and some leaves of the tree (you can actually choose the leaves you want to display or have SAP Predictive Analytics automatically open the most influencing leaves one after the other). The decision tree helps you identify segments of interest for your analysis.

6.decision tree.jpg

Looking at the Decision Tree you see that the customers most likely to purchase the service are restaurants which are expensive AND located downtown (20.65% of them purchased your service). You could be tempted to create a specific marketing campaign for that kind of restaurant, but if you look at the absolute numbers you see that there are only 431 restaurants of that type in a whole population of more than 8000 restaurants. This segment contains only 5% of restaurants. This should make you think: is it a good idea to target such a small population of restaurants? Shouldn’t you have two different marketing campaigns, one for expensive restaurants, wherever they are, and one for downtown restaurants, whatever their prices? You can talk about this with the marketing team and bring the numbers and visualizations with you to support the discussion.
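You can reproduce the flavour of this debriefing in R with the `rpart` package, which grows a comparable classification tree. The dataset below is simulated, so the node sizes and purchase rates will not match the screenshot above:

```r
library(rpart)

# Simulated dataset: 400 restaurants with a profile and a purchase flag
set.seed(42)
n <- 400
df <- data.frame(
  price    = sample(c("expensive", "mid", "cheap"), n, replace = TRUE),
  location = sample(c("Downtown", "Suburbs", "Countryside"), n, replace = TRUE)
)
# Make purchases most likely for expensive downtown restaurants
prob <- ifelse(df$price == "expensive" & df$location == "Downtown", 0.9,
        ifelse(df$price == "expensive" | df$location == "Downtown", 0.5, 0.1))
df$premium <- factor(as.integer(runif(n) < prob))

# Grow the tree: each leaf is a segment with its own purchase rate
tree <- rpart(premium ~ price + location, data = df, method = "class")
print(tree)
```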



To summarize, you have seen that in a few clicks, using the classification module and looking at the model debriefing, you were able to do exploratory analytics to take strategic decisions for your company. Those decisions were taken on real data, based on a good model automatically provided by SAP Predictive Analytics. You didn’t have to think about how to manipulate the data, how to filter it, how to visualize it. The tool did all of that for you. You could then concentrate on deciding how the mathematically correct and interesting output could be used to improve your business.


Here we took the example of napkins but you can use the same concepts in many different situations.

Just think of your business and of some things you want to improve, of the data you already have in stock and you are likely to find good examples.

If you have any idea or example, just post it under this blog so that all the community can benefit from it!


I hope this paper inspired you to try out SAP Predictive Analytics and do exploratory analytics with it. If you want, you can download a free trial version here: www.sap.com/trypredictive.

And if you have any feedback or idea on how to improve SAP Predictive Analytics you can post it here:


Have fun and happy explorations!

Let’s face it – “predictive analytics” can be a bit of a complicated topic once you get into it. The “why” and the “what” are pretty easy to get your head around but the “how” is where the rubber hits the road.  Before we even get to the complexity of which algorithms to use and how to configure them, there’s a higher level consideration to deal with first – which technologies to use and how do they fit together? 


In my last article: Predictive Smackdown: Automated Algorithms vs The Data Scientist, I discussed where our Automated Analytics and Expert Analytics fit into the bigger picture, so in this entry, let’s turn our attention to SAP HANA and the predictive options available there. 

Predictive Alphabet Soup?

Predictive Alphabet soup.jpg

Sometimes the options on SAP HANA look like “Predictive Alphabet Soup” because with R, PAL, and APL, it is not just the letters in the acronyms that are important but also what order they are arranged in.  Unfortunately for the uninitiated, some customers see these three technologies as disjoint and confusing – when do you use R and when do you use PAL?  What are the differences between PAL and APL? 

I have even heard from a few customers that they would like to wait until these technologies “merge” into one (hint: it’s like saying you want to wait until an apple and an orange become one fruit).  A better way to look at this is to understand the pros and cons of each and how they work together.  Let’s take a look at each one of them individually and then what that means overall.

R – The De Facto Predictive Language

R.png

It’s pretty impossible to read anything about predictive analysis and not hear about the open source language “R”.  What is R? It is a language used by statisticians and data scientists to analyze data sets with complex mathematical algorithms.  There are well over 5,800 "packages" (and growing) that implement statistical techniques, data manipulation, graphing, reporting, and more.

R is extremely popular because it is freely available, easy to extend, and there are lots of resources (and people) to learn from.  But R is a statistical language made for the mathematically inclined and therefore isn’t something you just pick up a book on and learn in a few hours unless you have some background already. 
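To give a small taste of the language, here is the kind of one-liner analysis R makes trivial (the numbers below are made up):

```r
# Two made-up monthly series: sales and web visits
sales  <- c(120, 135, 160, 155, 180, 210)
visits <- c(300, 320, 400, 390, 450, 520)

print(summary(sales))       # min, quartiles, mean and max in one call
print(cor(sales, visits))   # correlation between the two series

# Fit a linear model relating sales to visits
fit <- lm(sales ~ visits)
print(coef(fit))
```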

SAP Predictive Analytics 2.x uses R and provides a graphical modelling environment on top to make the creation and comparison of predictive models much easier than by invoking R on the command line.  You can even add your own custom R components so there’s virtually no limit to the types of modelling you can do.


If you have SAP HANA, you can deploy an R server as a sidecar to run predictive algorithms on your data.  This opens up all new possibilities because the full breadth of R’s capabilities can be unleashed on your data in HANA.  However as an external system, this type of deployment requires data extraction from HANA to feed the R server which will crunch the numbers and return the results back to HANA.  In addition to the obvious I/O bottlenecks involved in bringing data to an external system, you lose the parallel processing that SAP HANA is legendary for.

The SAP Predictive Analytics client tool can use R locally, but can also be used for scenarios where you want to leverage an external R server that is connected to SAP HANA.

SAP Predictive Analysis Library (PAL)

The SAP PAL is a native C++ implementation on HANA of the most commonly used predictive algorithms in data science.  The goal of this library is to enable up to 80% of the common predictive scenarios that you would normally use an external R server for.  Note the goal is 80% of the use cases, not 80% of the algorithms - you can imagine that, of the more than 5,000 R algorithms in the world, only a tiny fraction are used very frequently.

By using SAP PAL you can leverage all the in-memory goodness and near-linear parallelism performance that SAP HANA offers to perform training, scoring, categorization, and more without your data leaving the server.  So what’s the problem?

Well, if you need an algorithm that is not in the SAP PAL, you may still need to deploy an external R server.  Additionally, many data scientists develop their own R algorithms - something within their skill set whereas developing those same algorithms in C++ to be deployed natively on HANA may not be.

How do you use the SAP PAL? You can call it directly in SQLScript, but fortunately SAP Predictive Analytics not only supports R, but it also supports SAP PAL – and even a combination of the two.  Of course this only makes sense when you are using an external R server connected to SAP HANA as PAL itself is native to HANA.

But there’s also another thing to consider – what if you aren’t a data scientist? 

SAP Automated Predictive Library (APL)

The APL is a native C++ implementation on HANA of SAP’s patented automated machine learning technologies that make Automated Analytics so cool. Instead of rehashing the benefits of Automated Analytics here, please take a look at my previous blog entry that details it more fully: Predictive Smackdown: Automated Algorithms vs The Data Scientist

You could perform automated analytics with HANA before the creation of the APL, but you would have needed to deploy a sidecar predictive server to run the automated machine learning algorithms.  The APL was introduced at the beginning of 2015 to bring all of that “automagic” goodness to HANA, and just like the PAL, the APL does not need to extract data from the HANA system to do its predictive magic.


You can find out more about the APL here: What is the SAP Automated Predictive Library (APL) for SAP HANA? .

Predictive Peanut Butter and Jelly (or Chocolate & Peanut Butter)

PB and J.jpg

A more interesting analogy to the Gestalt Principle is the concept of peanut butter and jelly sandwiches (or chocolate and peanut butter cups if you prefer).  Peanut butter is rich and creamy, but adding the sweetness and tartness of jelly somehow creates a magical combination that is better than either topping by itself. "PB&J" is one of the best inventions in the world.

The predictive options are pretty much like peanut butter and jelly – you can use R by itself, you could use SAP PAL by itself, or you could go the automated route with SAP Automated Predictive Library.  Each has its own purpose, but being able to use one or more of these together based on your needs is where things get very interesting:

  • Need the flexibility of custom R algorithms but use SAP HANA?
    • No problem, deploy R because HANA can be connected to it.

  • Want the speed of HANA but still need R’s flexibility?
    • Deploy both R and PAL together and do as much in PAL as you can.

  • Want to have some intelligent auto-clustering algorithms but still need some hardcore data science requirements?
    • Simple! – deploy PAL and APL.

Know Your Predictive OPTIONS

predictive solutions.png

A prerequisite to “knowing what you are doing” is understanding what is available and what you need. Personas and use cases are usually good hints:

SAP PAL and R:

  • Data Scientists and Mathematicians creating models themselves (typically by hand).


SAP APL:

  • Business and Data Analysts as well as Data Scientists wanting automatic model creation.

kc Choice.png

One reason some of our customers get confused is that they think "all predictive is the same" and assume that if their HANA system has predictive capabilities, they also have the Automated Predictive Library (APL). However the APL is part of the “Predictive Option for SAP HANA” license, so you want to ensure you know whether you are licensed for the APL or need to get it.  A future article will go into this option in more detail.

You must resist the urge to try to rearrange the letters and assume that you can replace the APL with PAL or vice versa.  Hopefully this article shows you how each predictive technology on SAP HANA has its place and they are not interchangeable.

Of course, SAP Predictive Analytics 2.x operates with all of these - R, PAL, and APL.

Most customers realize the ROI of their existing investment in SAP HANA can be greatly enhanced by enabling users of all types to benefit from automated predictive analytics, and adopt the Predictive Option for SAP HANA.  Whether you like your peanut butter with jelly or chocolate, you have to admit, it tastes pretty damn good. 

