
SAP Predictive Analytics


Any time a (relatively) specialized or obscure topic gets subjected to worldwide hype before finally becoming part of the mainstream, there is an interesting phenomenon that occurs between the “original believers” and the “newcomers”.   I’m old enough to remember a time before the Internet and even Linux, when e-mail had to be sent using a Unix command-line program called (imaginatively enough) “mail” and could only be sent to people at other universities.


When the Internet became more and more available, some of us “computer geeks” were very proud that we used e-mail long before people even figured out if the word needed a hyphen or not.  By the time residential cable and DSL modems became ubiquitous, it was a badge of honor to be running your own e-mail and Web servers at home because back then, you really had to know what you were doing (for the record, I run both in my own on-premise private Cloud to this day - Why? because I *can*).


Watching data scientists interact with regular business users reminds me of my own evolution from a “computer geek” into a “software engineer” (which was much cooler and definitely paid more).  The tipping point, though, was when my background as a computer geek no longer qualified me to lord my knowledge over those less enlightened than myself.  It was not that my experience became less valid, it simply became less relevant.

Technology has progressed far enough that making it easier to use and accessible to everyone no longer sacrifices cost or performance.  In the case of e-mail servers, it’s actually now cheaper to outsource the whole thing in the cloud than to run your own.  That started to make my experience seem really expensive, and in some cases, unnecessary.



Predictive Analytics Has Hit Primetime


Let’s face it, “Predictive Analytics” is the evolution of that much more boring-sounding topic of “Statistics”.  However, now that predictive is “cool”, I’m seeing the same phenomenon: many of those who previously had a math or statistics background became “data scientists” because they have a deep understanding of what makes predictive tick.  Today, data scientists are in an enviable (and financially lucrative) position: statistics will never get easier and the fundamentals of mathematics are unlikely to change in... well... ever.


The truth is that the job of a (good) data scientist is not that easy, and it can be infinitely boring.  Significant time is spent cleaning, preparing, and deriving data before it even makes sense to start the predictive modelling process.  Creating the models themselves requires a delicate understanding of the (now augmented) dataset and which algorithms are most applicable to it.  This is a highly iterative process which requires an understanding of statistics to decide how to refine the models for the best possible accuracy.  If this sounds repetitive, you would be right (and if you don’t think this is repetitive, you are likely a data scientist yourself).



Predictive Automation to the Rescue


The good news (for us non-data scientists) is that technology is poised to blow the fortified towers of data science wide open just as easily as it devalued my computer science degree.   The industry’s focus on predictive analytics has shifted from pure performance and efficiency to ease of use and accessibility for the larger business audience.   Encoding self-tuning algorithms into an autonomous “data scientist in a box” application has always been the Holy Grail of predictive analytics, but how realistic is it?


Ironically predictive analytics software can be made smarter specifically because the rules of mathematics cannot change.  Many of the previous attempts have been to create an “uber-data scientist” application that can automatically pick the best algorithm for a given problem.   This approach has some merits: there are a number of algorithms that apply specifically to classification problems so you could simply run all of them on a target dataset and pick the best one (and in fact, this is what some products do).


However a data scientist will tell you that one of the dangers is that this can yield drastically different results between runs.   The algorithm chosen this week may not be chosen as the best one next week and therefore the results of the two weeks are not directly comparable.   Choosing a specific algorithm from week to week may not be the best idea either – it’s possible the data for the first week’s run would yield a different algorithmic decision than the next four weeks.  In the end, you need a data scientist to tell you which of the algorithms you could stick with (defeating the purpose of a “data scientist-in-a-box” approach).



The solution to this is to have this “auto-selection” intelligence built into the algorithm itself rather than have it sit above.  That means the application can pick this “uber-algorithm” which will then have the intelligence to handle any type of data and always make the right modelling choices.   In practice, this super-smart algorithm would do many of the same steps a data scientist would.  It would use statistical analysis to determine the optimum analysis parameters and then iteratively create many candidate models before finally coming down to the winner.  The difference is that a computer can create hundreds or even thousands of models before choosing the most optimal one.
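To make that iteration concrete, here is a loose, generic illustration (not SAP’s actual implementation) of “create many candidate models and keep the winner”: it sweeps the regularization strength of a single algorithm family and keeps whichever candidate generalizes best on held-out data. The data set and scikit-learn usage are purely illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for a prepared analytical data set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Iterate over many candidate models within one algorithm family and keep the
# one that performs best on held-out data -- a crude analogue of a self-tuning algorithm.
best_score, best_model = -np.inf, None
for alpha in np.logspace(-3, 3, 50):
    model = RidgeClassifier(alpha=alpha).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model

print(f"best alpha={best_model.alpha:.4f}, validation accuracy={best_score:.3f}")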


SAP Predictive Analytics has an automated mode where you basically pick the type of problem to solve (classification, association, etc.) and the software handles the rest, in order to put predictive analytics in the hands of more users (for a more detailed discussion, see How does Automated Analytics do it? The magic behind creating predictive models automatically).



Predictive Analytics Is Not a “Zero” or “One”


I am continually surprised by the number of customers who think you either are, or aren’t, a data scientist.  Digging into this a little bit further, I find many times a data scientist has created this binary distinction to differentiate their skills from the “regulars” (or in Harry Potter terms, “muggles”) – who readily agree they don’t want to be anywhere near a mathematical equation, much less an actual algorithm.


However, the field of data science is really a spectrum that blends really quickly into the analytics/business intelligence world.  If you accept that predictive analytics “is an algorithmic analysis of past data to find patterns that can be applied to new data to improve a future outcome”, I would argue that business intelligence is "a visual and calculation oriented analysis of past data to make better decisions about the future."

That means every business intelligence user can benefit from automated predictive analytics to do what they are currently doing – better.


The sooner you can get your organization out of the “data scientist or not” mentality, the quicker you can make everyone more effective at understanding and solving business problems.   If you get stuck on this, look for the strongest opponents and likely there’s a non-technical reason for their resistance to “opening the predictive gates”.



Don’t Worry Data Scientists, We Will Still Need You


Does predictive automation completely replace data scientists?  Definitely not – a human can understand the semantics in the data, derive new data fields based on their domain knowledge, and can create far more complicated models without so many iterations.  However for those that do not have the knowledge or skills, the automated way gets you pretty darn close - and a whole lot faster.


The massive influx of business users with access to predictive analytics reduces the burden on (typically) overloaded data scientists by freeing them from some of the simpler problems that users can handle directly and letting them work on the more sophisticated predictive problems.  Sounds like a win-win, doesn’t it?


Interestingly, many data scientists can also benefit from the use of automated predictive technology to better understand their data and create baseline models *before* they dive into creating their own models by hand.  By having a computer do the initial analysis, a data scientist can save hours to days of data profiling before getting down to the core predictive modelling they are being paid those big bucks for.  In some cases, the automatically generated models may be solid enough to solve a problem without requiring any manual modelling.



In The Future, It Won’t Matter


The field of predictive analytics is maturing at its fastest rate ever, and the move towards more ease of use and simplicity will eventually reach a plateau, just like standing up a full Hadoop cluster in the cloud can be done in under ten minutes today.   The focus will shift from “which is the best algorithm to use when?” to “how can I use these predictive results to improve the business?”.

The number of predictive models needed will explode as companies explore micro-segmentation and the (potentially) small gain of a single manually crafted model will give way to the need to create hundreds of models per day.


So to you data scientists out there: We will always need your skills, your experience, and your wisdom.


But take it from someone who has been through this cycle and had his computer geekdom commoditized – automation opens up predictive analytics to everybody, so just remember: “It’s not all about you”.



You can try SAP Predictive Analytics and all of its automated predictive goodness for free by downloading it at http://www.sap.com/trypredictive


Most data scientists and statisticians agree that predictive modeling is both art and science, yet relatively little air time is given to describing the art. This post describes one piece of the art of modeling called feature engineering, which expands the number of variables you have to build a model.  I offer six ways to implement feature engineering and provide examples of each. Using methods like these is important because additional relevant variables increase model accuracy, which makes feature engineering an essential part of the modeling process. The full white paper may be downloaded at Feature Engineering Tips for Data Scientists.


What Is Feature Engineering?

A predictive model is a formula or method that transforms a list of input fields or variables (x1, x2, …, xn) into some output of interest (y). Feature engineering is simply the thoughtful creation of new input fields (z1, z2, …, zm) from the existing input data (x). Thoughtful is the key word here. The newly created inputs must have some relevance to the model output and generally come from knowledge of the domain (such as marketing, sales, climatology, and the like). The more a data scientist interacts with the domain expert, the better the feature engineering process.

Take, for example, the case of modeling the likelihood of rain given a set of daily inputs: temperature, humidity, wind speed, and percentage of cloud cover. We could create a new binary input variable called “overcast” where the value equals “no” or 0 whenever the percentage of cloud cover is less than 25% and equals “yes” or 1 otherwise. Of course, domain knowledge is required to define the appropriate cutoff percentage and is critical to the end result.
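As a concrete illustration, here is a minimal sketch of that transformation, assuming a hypothetical daily weather table held in a pandas DataFrame (the column names are invented for the example):

import pandas as pd

# Hypothetical daily weather observations.
weather = pd.DataFrame({
    "temperature": [21.0, 14.5, 18.2],
    "humidity": [0.55, 0.82, 0.60],
    "wind_speed": [12, 30, 8],
    "cloud_cover_pct": [10, 85, 40],
})

# Domain knowledge supplies the 25% cutoff; everything below it is "not overcast".
weather["overcast"] = (weather["cloud_cover_pct"] >= 25).astype(int)
print(weather)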

The more thoughtful inputs you have, the better the accuracy of your model. This is true whether you are building logistic, generalized linear, or machine learning models.


Six Tips for Better Feature Engineering

Tip 1: Think about inputs you can create by rolling up existing data fields to a higher/broader level or category. As an example, a person’s title can be categorized into strategic or tactical. Those with titles of “VP” and above can be coded as strategic. Those with titles “Director” and below become tactical. Strategic contacts are those that make high-level budgeting and strategic decisions for a company. Tactical contacts are those in the trenches doing day-to-day work. Other roll-up examples include the following (a short code sketch follows the list):

  • Collating several industries into a higher-level industry: Collate oil and gas companies with utility companies, for instance, and call it the energy industry, or fold high tech and telecommunications industries into a single area called “technology.”
  • Defining “large” companies as those that make $1 billion or more and “small” companies as those that make less than $1 billion.
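A minimal pandas sketch of these roll-ups, using hypothetical title and revenue fields (the title list used to decide “strategic” is an illustrative assumption):

import pandas as pd

contacts = pd.DataFrame({
    "title": ["VP Sales", "Director IT", "CEO", "Manager Ops"],
    "annual_revenue_usd": [2.5e9, 400e6, 1.2e9, 80e6],
})

# Roll titles up into strategic (VP and above) vs. tactical (Director and below).
strategic_titles = ("VP", "SVP", "EVP", "CEO", "CFO", "CIO", "President")
contacts["seniority"] = contacts["title"].apply(
    lambda t: "strategic" if t.startswith(strategic_titles) else "tactical"
)

# Roll revenue up into large ($1B or more) vs. small companies.
contacts["company_size"] = (contacts["annual_revenue_usd"] >= 1e9).map(
    {True: "large", False: "small"}
)
print(contacts)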


Tip 2: Think about ways to drill down into more detail in a single field. As an example, a contact within a company may respond to marketing campaigns, and you may have information about his or her number of responses. Drilling down, we can ask how many of these responses occurred in the past two weeks, one to three months, or more than six months in the past. This creates three additional binary (yes=1/no=0) data fields for a model. Other drill-down examples include the following (a sketch follows the list):

  • Cadence: Number of days between consecutive marketing responses by a contact: 1–7, 8–14, 15–21, 21+
  • Multiple responses on same day flag (multiple responses = 1, otherwise =0)
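A sketch of those recency flags, assuming a hypothetical response log with one row per marketing response:

import pandas as pd

today = pd.Timestamp("2015-09-01")
responses = pd.DataFrame({
    "contact_id": [1, 1, 1, 2],
    "response_date": pd.to_datetime(["2015-08-25", "2015-07-10", "2015-01-05", "2015-02-20"]),
})
days_ago = (today - responses["response_date"]).dt.days

# Three binary drill-down fields derived from a single response history.
responses["resp_last_2_weeks"] = (days_ago <= 14).astype(int)
responses["resp_1_to_3_months"] = days_ago.between(31, 92).astype(int)
responses["resp_over_6_months"] = (days_ago > 183).astype(int)

# Aggregate to one row per contact for modeling.
features = responses.groupby("contact_id")[
    ["resp_last_2_weeks", "resp_1_to_3_months", "resp_over_6_months"]
].max()
print(features)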


Tip 3: Split data into separate categories, also called bins. For example, annual revenue for companies in your database may range from $50 million (M) to over $1 billion (B). Split the revenue into sequential bins: $50–$200M, $201–$500M, $501M–$1B, and $1B+. Whenever a company falls within a revenue bin it receives a one; otherwise the value is zero. There are now four new data fields created from the annual revenue field. Other examples are listed below, followed by a short sketch:

  • Number of marketing responses by contact: 1–5, 6–10, 10+
  • Number of employees in company: 1–100, 101–500, 502–1,000, 1,001–5,000, 5,000+
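A binning sketch using pandas pd.cut on hypothetical revenue figures (in millions of dollars):

import pandas as pd

companies = pd.DataFrame({"annual_revenue_musd": [75, 320, 800, 2500]})

# Four sequential revenue bins; one-hot encoding turns them into four 0/1 fields.
bins = [50, 200, 500, 1000, float("inf")]
labels = ["50-200M", "201-500M", "501M-1B", "1B+"]
companies["revenue_bin"] = pd.cut(companies["annual_revenue_musd"], bins=bins, labels=labels)
companies = pd.concat([companies, pd.get_dummies(companies["revenue_bin"], prefix="rev")], axis=1)
print(companies)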


Tip 4: Think about ways to combine existing data fields into new ones. As an example, you may want to create a flag (0/1) that identifies whether someone is a VP or higher and has more than 10 years of experience. Other examples of combining fields include:

  • Title of director or below and in a company with less than 500 employees
  • Public company and located in the midwestern United States

You can even multiply, divide, add, or subtract one data field by another to create a new input.
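A sketch that combines two fields into a new flag and derives an arithmetic combination, again on hypothetical data:

import pandas as pd

contacts = pd.DataFrame({
    "is_vp_or_above": [1, 0, 1],
    "years_experience": [12, 15, 6],
    "marketing_touches": [20, 5, 8],
    "responses": [4, 1, 0],
})

# Flag: VP or above AND more than 10 years of experience.
contacts["senior_veteran"] = (
    (contacts["is_vp_or_above"] == 1) & (contacts["years_experience"] > 10)
).astype(int)

# Arithmetic combination: response rate = responses divided by touches.
contacts["response_rate"] = contacts["responses"] / contacts["marketing_touches"]
print(contacts)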


Tip 5: Don’t reinvent the wheel – use variables that others have already fashioned.

Tip 6: Think about the problem at hand and be creative. Don’t worry about creating too many variables at first, just let the brainstorming flow. Feature selection methods are available to deal with a large input list; see the excellent description in Matthew Shardlow’s “An Analysis of Feature Selection Techniques.” Be cautious, however, of creating too many features if you have a small amount of data to fit. In that case you may overfit the data, which can lead to spurious results.


Hundreds, even thousands of new variables can be created using the simple techniques described here. The key is to develop thoughtful additional variables that seem relevant to the target or dependent variable. So go ahead, be creative, have fun, and enjoy the process.

One of the key challenges facing enterprises today is making sense of the explosion in data generated by employees, partners and customers.  With the trend towards the Internet of Things (IoT), this problem is getting worse.  Every new product, service, device and system is becoming computerized and generating data.


Executives understand that data is valuable and can be used to help their organizations run more efficiently but they struggle to work out how to achieve this:

  1. The data becomes very big very quickly.  Storing it for long periods on traditional database platforms is very expensive, and businesses have to turn to lower-cost alternatives – such as Hadoop.
  2. A large percentage of the data captured has very low information density and does not have the same profile as traditional enterprise data.  Finding insight through manual analysis, where a business analyst compiles reports and dashboards to measure and understand KPIs, is not viable. The way to extract value from this mountain of data is to use Machine Learning and Data Mining techniques.


So how do we use advanced analytic techniques to extract insight from huge volumes of data stored on Hadoop?


Hadoop has incredibly powerful data mining capabilities, from languages like Python and Scala to frameworks such as Spark ML and Spark/R, but these technologies all require very skilled practitioners.  They need to understand the problem domain, the data science techniques to solve it, and the programming languages to implement it in. The custom-coded solutions need to be integrated into operational systems such as ERP, CRM, and Finance and maintained by IT departments.  This is a real challenge for IT departments struggling to maintain a balance between delivering innovations for the business and containing operational costs.


This is where Predictive Analytics from SAP can help.  We have an end-to-end solution that turns data mining on Hadoop into a predictable, repeatable and easy to manage process. 


So how does this work in practice?  Let’s take the example of Alan who works as a Data Analyst in the IT department of a Fortune 500 company. 

  1. Alan’s employer has created a Data Lake on Hadoop.  They have created a replica of their enterprise data and mashed it up with IoT data generated from their latest mobile application.
  2. The business wants to know how geolocation information from the mobile app can be used to identify potential high value customers early.
  3. Alan builds an Analytical Dataset (ADS) using Predictive Analytics Data Manager.   Alan uses his domain knowledge to enrich the customer record with derived attributes and make the data more predictive.  This ADS will be reused in the future to answer other questions about the customer. 
  4. Alan uses Predictive Analytics Modeler to automatically identify high value customers.  Once he has built the model he creates a report directly from the tool which shows his executive management the ROI they can expect from the solution.
  5. Finally Alan uses Model Manager to automate the deployment of the model into a production environment.  Model Manager will automatically monitor and maintain the model to ensure accuracy.
  6. Because the model is embedded directly in the CRM application, every time a new customer is added the system automatically validates if they are likely to become high value.
  7. New high value customers are given a differentiated experience and there is a clear improvement in marketing effectiveness.


By the end of the process Alan has successfully prepared the data, built a model and deployed it into production without writing a single line of code.  The whole project was completed faster than expected. The other good news is that as Alan develops Data Science skills, he can use Predictive Analytics Expert Mode to answer more complex problems using traditional Data Science languages and frameworks. 


SAP will be in New York for the Strata+Hadoop World from Sep 29-Oct 1.  Stop by and we’ll show you how you can take advantage of SAP Predictive Analytics on your Hadoop data.

Predictive analytics is the technology through which you can predict future outcomes and trends by extracting information and identifying patterns in existing historical data sets.

This blog post has been written as part of an assignment for the Predictive Analytics unit studied at Victoria University. Getting practical hands-on experience of SAP products at this university has always been the best part of the learning experience. We are very grateful to our lecturer Dr Shah Miah, who helped us understand the value and scope of predictive analytics.

Competition in the market is rising, and there are loads of data sitting in databases, also known as big data. This big data comprises historical data sets and can be used to produce valuable information about a company and its operations and processes. The information is extracted with the help of data mining tools and a predictive model is built. The predictive model is trained on the historical data; when a prediction needs to be made on new real-time data, that data is run through the model and the model predicts the outcomes. These outcomes about the future help us make better-informed decisions.

A 2014 report found that the top five things predictive analytics are used for are to:

  • Identify trends.
  • Understand customers.
  • Improve business performance.
  • Drive strategic decision making.
  • Predict behaviour.


Predictive analytics is used in almost all fields of fraud detection and security, marketing, operations and risk. The most common areas of application are credit cards, banking and financial services, governments and the public sector, health care providers, manufacturers, media and entertainment, oil, gas and utility companies, retailers, sports franchises, health insurers and insurance companies.

We decided to do our predictive analytics study with the help of a healthcare example. According to most journals, obesity is considered one of the nemeses of the Australian economy. The prevalence of obesity across Australia has been increasing in recent years and is expected to continue this trend until a proper solution is proposed. We collected data from the Australian Bureau of Statistics and some health journals, which gave us the percentage of the population that is obese across Australia. The screenshot below shows the data from 2000 to 2015 at five-year intervals.



We consolidated the given data into a flat file and used the data history to calculate our predictions for the prevalence of obesity in 2020 and 2025. We used the prediction tools to calculate the mean and variance in order to find the trend and the increase of obesity over the years relative to the previous data. This trend was then applied to the existing data to fill in the predicted percentage increase that will happen in the future. The screenshot below of the Excel file with the consolidated data shows the final figures, i.e. the percentage increase that would probably occur if the problem is not taken care of.
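Since the screenshots are not reproduced here, the sketch below shows the same idea with hypothetical prevalence values (not the actual ABS figures): fit a simple linear trend to the historical points and extrapolate it to 2020 and 2025.

import numpy as np

# Hypothetical obesity prevalence (%) at five-year intervals -- placeholder values only.
years = np.array([2000, 2005, 2010, 2015])
prevalence = np.array([21.0, 23.5, 26.0, 28.0])

# Fit a straight-line trend and extrapolate it forward.
slope, intercept = np.polyfit(years, prevalence, deg=1)
for year in (2020, 2025):
    print(f"{year}: predicted prevalence = {slope * year + intercept:.1f}%")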


The consolidated table is simply a way of representing the data in the format required for the study. As data analysts, we have to consider the importance of collecting data from the past in order to study the patterns that develop in particular fields.

In our case study, predictive analytics is used for a research-oriented analytical case. With the predicted figures found above, the government can take action against the various factors that cause this physical sickness among young Australians, and can concentrate on particular areas or suburbs to create awareness centres to curb this economic nemesis. The same type of study can be applied to various other fields of business or research. The sales industry can analyze an individual customer's portfolio to find that person's buying patterns and offer him or her suitable products based on their data history. Another case is a human resources department planning its operations to meet the company's workforce requirements ten years in advance, rather than focusing only on immediate prospects.

In conclusion, this approach to prediction can also be used in many other cases for predicting future outcomes in research or business. It does not mean the values are real and perfect; rather, they are approximations of the future. It is enough to say that predictive analytics is the future of business intelligence.

Thank You.

Pranav Gulati

Karthik Nagaraj

A new video is available on SAP HANA Academy showing APL Recommendation functions



The APL play list: http://bit.ly/hanaapl



Predictive Analytics is getting more and more mainstream – the advantages of being able to understand  trends by algorithmically analyzing historical data are so obvious, the question of whether to use predictive is not “Why?”, but “How?”.   This indicates that while predictive analytics are intuitively a good idea, customers want to understand how to derive real business value before they dive into a large project.  It is this need for tangible value that is driving both the need for real-world use cases and a more accessible way of unlocking that value.


That’s where SAP Predictive Analytics comes in: Once the exclusive realm of data scientists and mathematics PhDs, predictive analytics capabilities are becoming more accessible to business users through the power of automated machine learning technologies that reduce the need for such specialists.   This has opened new opportunities in churn prediction, fraud detection, and predictive maintenance – just to name a few.


In the rush for “new” use cases, it is sometimes surprising that customers are not looking at some of their existing scenarios – such as Business Planning and Consolidation (BPC).  The very nature of BPC is to improve the operation of the business by making better decisions – whether that is to increase profitability, improve customer service, or to help the business run better.   All of these are predictive use cases with a very real, very tangible return – it’s what you might call a “no brainer”.


If you are near Santa Clara on September 15th or near Costa Mesa on September 17th, you really want to attend a lunch event co-hosted with our very good partner, Column5, and explore the possibilities of bringing predictive analysis to your BPC environment.


A few possibilities when you combine BPC + Predictive Analytics:


  • Anticipate trends and spot important deviations from them to identify future opportunities
  • Gain insight into the main influences of customer satisfaction, customer churn and employee turnover, and their impact on company success
  • Understand how historical sales, costs, and key performance metrics predict future performance
  • Identify cross-sell and upsell opportunities



The event is free for registered guests – Space is limited, so secure your spot NOW:



9/15 - Register for Santa Clara

9/17 - Register for Costa Mesa


If you cannot attend this event but want to know more, you can also reach Column5 directly at www.column5.com or contact Steve Sussman at ssussman (at) column5.com

SAP PA 2.png

September is when summer vacations come to an end, students head back to school – and it is time for another quarterly release of SAP Predictive Analytics.  Our product teams were busy this summer cranking out new features and improvements based on overwhelmingly positive customer feedback on the previous 2.2 version.


On September 2nd, we formally released SAP Predictive Analytics 2.3, which is available on SMP now for licensed customers.  For those of you who are not already using it, you can download the 30-day trial here.


What’s New In 2.3?


You can learn more about everything in SAP Predictive Analytics 2.3 in the What’s New document.  Here are some of the highlights broken out by our release themes:


Go Faster and Do It Easier:


  • Better experience when editing R scripts:  New keyword highlighting, line numbering, and window expansion capabilities make it easier to work with R without jumping into another environment.
  • Unified keycode system: Users can now add their license keys in the main PA window and have it apply to both Automated and Expert Analytics.  Users can also view them through the Help menu.
  • Easier access to resources: A new welcome page enables users to browse the Predictive Analytics Community, suggest ideas in SAP Idea Place, or subscribe to the new SAP Predictive Analytics Newsletter.


Do More and Do it Smarter:


  • More sophisticated model comparison:  Perform model comparison on three or more algorithm chains at once within Expert Analytics
  • More flexibility with model comparison metrics: Select how models should be measured and which order their KPIs should be compared in Expert Analytics
  • New capabilities in SAP HANA Automated Predictive Library (APL): Applications using the SAP HANA APL can now use a new recommendation type for “similarity”


Analyze All of your Data:


  • Updated support of major database versions:  Take advantage of the latest versions of all major supported databases (Check out the latest support matrix for more info)
  • Better HANA View Support: Recommendation and Social analysis now support HANA Views in Automated Analytics.
  • Improved Hadoop performance:  Enhancements to Automated Analytics processing when using Hadoop as a source.



NEW! Deep Dives by Product Managers


Also with this release, some of our product managers have created deep dive articles to discuss some of the new features in 2.3:


Jayanta Roy :



NEW! SAP Predictive Analytics Newsletter


We're searching the global SAP Predictive Analytics community to bring you the most popular and useful blog posts, tutorials, how-to resources, expert tips, and more, direct to your inbox.


>> SUBSCRIBE Here <<


How To Get Started?


  1. Download the trial!
  2. Check out the online materials and tutorials:
  3. Participate in the SCN Community: SAP Predictive Analytics
    • Learn, ask questions, get answers!


Antoine CHABERT has a great blog entry with common questions and answers: Frequently Asked Questions - Downloading, Installing and Activating SAP Predictive Analytics



The (Road)Map Ahead


September also marks the beginning of the fall conference season and that means our product team has pressure to deliver an even bigger and better release with SAP PA 2.4.   While the ASUG SAP BusinessObjects User  Conference in Austin Texas and SAP Insider in Singapore are already behind us, SAP TechED 2015 is just around the corner.


Our next version is shaping up to be a major release for us – so look out for SAP Predictive Analytics lectures, webinars, and demos that will show off the latest achievements you can expect in another great release coming up next!

Version 5 of the Predictive Analytics RDS was released in August.  There are now about 30 different scenarios available across 9 different domains: 4 lines of business (Manufacturing, Sales & Marketing, Finance, and Portfolio & Project Management) and 5 industries (Telco, Retail, Consumer Products, Banking, and Insurance).


As in the earlier versions of the RDS, the use cases from the different domains are built using either Automated Analytics or Expert Analytics. The end results are written back into HANA tables and visualized with SAP Lumira (Expert Analytics of SAP Predictive Analytics), SAP UI5, or SAP BusinessObjects Explorer.


  • One of the highlights of version 5 is the set of customer co-innovation opportunities that are rolled into the RDS as customer POCs.
  • A home care customer from the UK collaborated with the RDS team in developing a use case for the debt collection scenario.
  • Another customer, a consulting projects business from Sweden, collaborated with the RDS team in developing a use case that helps identify profitable projects and thereby improve the bottom line.
  • Different predictive modeling approaches as part of the total list of v5 use cases
    • Automated Analytics with Automated Analytics predictive library on desktop
    • Expert Analytics with HANA PAL
    • Expert Analytics with Open source R
    • SAP HANA Scripts with SAP HANA PAL
  • Visualizations of the end results after predictive modeling based on the different tools
    • SAP Lumira or SAP Predictive Analytics (Expert Analytics)
    • SAP Predictive Analytics (Automated Analytics)
    • SAP UI 5 (HTML 5)
    • SAP Business Objects Explorer views


Quick snapshot of the business scope as of v5:

Bus Scope.png

Quick snapshot of the use cases as of v5:


Use cases v5.png

An upgraded version of the HANA cloud image will be made available in the SAP Store in the latter half of September 2015. This upgraded HANA cloud image will include all the different scenarios in the context of SAP HANA using the SAP predictive tools and technologies. An announcement will be made later this month.

The ability to compare models (or data sets) using a component in a chain makes things simpler for analysts. The first step was taken in a previous release, where this was achieved by dropping a new component into an analysis chain which internally compared the Key Performance Indicators (KPIs), and a Best Component Identifier highlighted the best node. Now, with this new release, more control is afforded to the analyst, who can select the KPIs they want to compare performance on and also specify an order to break a tie if such a condition occurs. The compare node also lets the user pick a partition set for the comparison, thus requiring the partition component to be present before the algorithms in the chain.


Let's have a look at this process flow step by step as we build a chain using a data set on SAP HANA.



Step 1: Select a data set on SAP HANA to build the chain

We'll start the analysis by picking a view or table on SAP HANA and connecting a Partition node so the data set is split into three sets. These will be used by the Model Statistics and Compare components following training. We'll then add three algorithms: a classification from APL, Naive Bayes from PAL, and a CNR tree in R.

Configure the Partition component to slice the data set into three parts

Let's configure the partition node to split the data set into three slices: 80% for training, 10% for validation, and 10% for testing the generated models.

We'll then configure the individual algorithms so they are ready to be trained on the train slice of the data set.

Note: the size of Validation and Test sets has to be greater than or equal to 10%.
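For readers who want to see what an 80/10/10 partition means in plain code, here is a minimal sketch on a hypothetical data set; the Partition component performs this split for you inside the chain, so the code is only an illustration.

import numpy as np
import pandas as pd

# Hypothetical data set standing in for the SAP HANA view used in the chain.
df = pd.DataFrame({"feature": np.arange(1000), "target": np.random.randint(0, 2, 1000)})

# 80% / 10% / 10% random partition, mirroring what the Partition component does.
rng = np.random.default_rng(42)
labels = rng.choice(["train", "validation", "test"], size=len(df), p=[0.8, 0.1, 0.1])
train = df[labels == "train"]
validation = df[labels == "validation"]
test = df[labels == "test"]
print(len(train), len(validation), len(test))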



Step 2: Add Model Statistics

Now, with the algorithms configured, let's add the Model Statistics component to the algorithms so their performance statistics can be computed.

Model Stats.png

The three components automatically take the name of their parents, and these nodes can now be configured through a very simple properties panel by double-clicking on the components or through the context menu.

Stats Config.png

The Target Column is the column used as the target or dependent column in the supervised training of the parent component, and the Predicted Column is the column that lists the predicted values generated by the trained algorithm/model. The availability of this independent component makes it possible for the analyst to compute performance statistics not just for the standard algorithms available in the solution but for custom components and data sets too.



Step 3: Add Model Compare component

With the Model Statistics nodes configured, let's add the Model Compare component from the right-hand panel.


Now, the Compare component can be connected to the other Model Statistics components by dragging and dropping on them one by one.

Note: For simplicity, this example uses three parents, but the Model Compare node can be used to compare more than that at a time.

Add 3 parenents.png


Once all parent components are connected, the Model Compare component can now be configured.

Ready to Run.png


Step 4: Configure the Model Compare node to define how you would like it to compare models

The Model Compare component can be configured in four different ways:

  1. Comparing Classification models (two class)
  2. Comparing Regression models
  3. Comparing two Classification or Regression models.
  4. Comparing three or more Classification or Regression models.

A Model Statistics component is required to be the parent of Model Compare since it relies on the statistics calculated by it. It also gets the algorithm type (classification or regression) from it and displays different KPIs based on it.


Configuring Model Compare for Classification

When comparing Classification algorithms, the properties panel lists the KPIs relevant to classification modeling as below:

MPC Config - Classification.png

Here, the analyst can pick which slice of the partitioned data set should be used to compare the performance of the individual KPIs selected. The order of the KPIs will be used in situations where there is a tie in the KPI being compared.


Configuring Model Compare for Regression

The properties panel lists KPIs relevant to regression modeling and the analyst can choose the partition slice to use for comparing the performance of training the models.

MPC Config - Regression.png


Step 5: Extend the analysis further when comparing two models

To enable extending the analysis chain with more components, it's important for the Model Compare to have a results table at the end of its execution. When comparing models, there will only ever be one winner, and there needs to be a way to map it to the output of the compare node. Therefore, it is possible to map the columns from a two-parent comparison into a single set that will form the output table of the Model Compare node. In all other cases, the Model Compare node becomes the terminal or leaf node in the analysis chain.

When comparing two algorithms/models, the following mapping screen will be enabled to allow column mapping.

Column Mapping in MPC.png

Icons: The icon on the Model Compare node will change to indicate whether or not the analysis can be further extended.

Model Compare with two parents: indicates the Model Compare is working with two parents and the chain can be extended.

Model Compare with multiple parents: indicates the Model Compare is working with multiple parents and that it is a terminal/leaf node in the analysis.


Step 6: Run the analysis

Now that all the components in the analysis chain have been configured, clicking on the run button will execute the training of the chain. It can also be triggered from the context menu on the last component in the chain.

Export Best as SP.png

The Results tab will show the results and performance of the training. Also, it is now possible to export the best algorithm as a stored procedure directly from the Model Compare component.


Step 7: Results of training

More detailed summary

The Results tab will have the summary of the individual components. The Model Statistics and Compare summaries now show more details per partition. Here is a sample summary from the Model Compare component comparing three classification algorithms/models:

MPC Summary.png

It indicates the best model based on the KPIs selected and the partition used for comparison. Regression comparison results will list the regression KPIs. The KPIs in bold highlight the ones selected in the Model Compare properties panel, and the algorithms/models in bold indicate the best performer in a partition. The best node chosen based on the Model Compare configuration is indicated in blue with the star icon.



More charts

Charts in the Model Statistics component display performance details from individual partitions, and the Model Compare component overlays the outputs from the different algorithms/models being compared and groups them by the individual partitions used in the process:

Composite Charts of MPC.png

KPI Definition


The following table lists all the classification KPIs with descriptions:

Ki: Predictive power. A quality indicator that corresponds to the proportion of information contained in the target variable that the explanatory variables are able to explain.
Kr: Model reliability, i.e. the ability to produce similar results on new data. A robustness indicator of the generated models. It indicates the capacity of the model to achieve the same performance when it is applied to a new data set exhibiting the same characteristics as the training data set.
Ki & Kr: Predictive power and model reliability. Gives equal importance to the robustness and generalizing capabilities of the model. For more information, see the definitions above.
AUC: Area Under the Curve. Rank-based measure of the model performance or the predictive power, calculated as the area under the Receiver Operating Characteristic (ROC) curve.
S(KS): The distance between the distribution functions of the two classes in binary classification (for example, class 1 and class 0). The score that generates the greatest separability between the functions is considered the threshold value for accepting or rejecting the target. Separability measures how well the model is able to distinguish between the records of the two classes; even with minor deviations in the input data, the model should still be able to identify these patterns and differentiate between the two. In this way, separability is a metric of how good the model is: the greater the separability, the better the model. Note that the predictive model producing the greatest amount of separation between the two distributions is considered the superior model.
Gain % (Profit %): The gain or profit realized by the model based on a percentage of the target population selected.
Lift %: The amount of lift that the trained model gives in comparison to a random model. It enables you to examine the difference between a perfect model, a random model, and the model created.
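For reference, AUC and the KS separability statistic can be computed from a set of model scores and actual labels with a few lines of generic code. This is only a sketch with hypothetical values, not the tool's internal implementation.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical binary targets and model scores for a validation partition.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.3, 0.7, 0.8, 0.4, 0.65, 0.2, 0.9, 0.55, 0.35])

auc = roc_auc_score(y_true, scores)   # area under the ROC curve
fpr, tpr, _ = roc_curve(y_true, scores)
ks = np.max(tpr - fpr)                # maximum distance between the two class distributions

print(f"AUC = {auc:.3f}, KS = {ks:.3f}")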



The following table lists all the regression KPIs with descriptions:

Ki: Predictive power. A quality indicator that corresponds to the proportion of information contained in the target variable that the explanatory variables are able to explain.
Kr: Model reliability, i.e. the ability to produce similar results on new data. A robustness indicator of the generated models. It indicates the capacity of the model to achieve the same performance when it is applied to a new data set exhibiting the same characteristics as the training data set.
Ki & Kr: Predictive power and model reliability. Gives equal importance to the robustness and generalizing capabilities of the model. For more information, see the definitions above.
R2: The determination coefficient R2 is the proportion of variability in a data set that is accounted for by the statistical model; the ratio between the variability (sum of squares) of the prediction and the variability (sum of squares) of the data.
L1: The mean absolute error L1 is the mean of the absolute values of the differences between predictions and actual results (for example, city block or Manhattan distance).
L2: The mean square error L2 is the square root of the mean of the squared errors (that is, Euclidean distance or root mean squared error, RMSE).
Linf: The maximum error Linf is the maximum absolute difference between the predicted and actual values (upper bound); also known as the Chebyshev distance.
ErrorMean: The mean of the differences between predictions and actual values.
ErrorStdDev: The dispersion of errors around the actual result.
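The error-based regression KPIs correspond to simple formulas; here is a sketch with hypothetical actuals and predictions (again, an illustration rather than the tool's internal code):

import numpy as np

# Hypothetical actuals and predictions for a validation partition.
y_true = np.array([10.0, 12.5, 9.0, 14.0, 11.0])
y_pred = np.array([9.5, 13.0, 9.8, 13.2, 11.4])
err = y_pred - y_true

r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # determination coefficient
l1 = np.mean(np.abs(err))          # mean absolute error
l2 = np.sqrt(np.mean(err ** 2))    # root mean squared error
linf = np.max(np.abs(err))         # maximum absolute error (Chebyshev distance)
error_mean = err.mean()
error_std = err.std()

print(r2, l1, l2, linf, error_mean, error_std)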


In the last decade companies like Google, Facebook and Netflix have led the way in collecting and monetizing huge amounts of data generated by consumers’ everyday activity.  They look on this data as a strategic asset – every decision in their organizations is data driven, every product they sell is data driven.  This has created enormous interest amongst traditional enterprises, who can easily see the benefit of putting their data to work in the same way.

The techniques they have used to achieve this are based on Hadoop and its surrounding ecosystem of technologies which allow the collection, storage and processing of gigantic amounts of data at a low economic cost.  Initially Hadoop was primarily used for batch workloads but in the last couple of years Apache Spark has emerged as a technology which enables both batch and real time capabilities to work in parallel in the same infrastructure. This capability is known as a Lambda Architecture.  It is particularly suited to predictive analytics where typically patterns are identified within a historical dataset on a regular basis (Model Training) but then new incoming records are checked in real time to see if they correspond to these patterns (Scoring).  For example a bank will identify what constitutes a profitable customer during Model Training.  They will then look at new customers in real time to see if they have high potential profitability.  It is incredibly important for scoring to happen quickly – if a profitable new customer is not given the correct treatment they may walk away.   This is why having a real time capability is very important.

The following document shows how SAP Predictive Analytics can be used to generate models on existing facts in Spark and then deploy them to Spark Streaming, where they can score incoming data in real time.

Spark and Spark Streaming

Spark Streaming provides a real-time streaming capability for Apache Spark, while Spark SQL provides one of the batch mechanisms. Spark Streaming enables Spark to process 100,000-500,000 records per node per second, and to reach sub-second latency.


A key feature of Spark is fault tolerance, which refers to the ability of a system to continue to operate properly in the case of a failure. As in Spark, raw data in Spark Streaming is distributed and replicated in memory across a cluster so that it can be reproduced if data is lost. Likewise, the data stream in Spark Streaming can be recomputed if a node in the cluster fails.

Spark Streaming provides APIs in Scala, Java, Python and R. The APIs can read data from or write results into multiple resources including Flume, HDFS, Kafka and raw TCP stream. What’s more, users can create Resilient Distributed Datasets (RDDs) -- the basic abstraction in Spark -- by normal Spark programming. A combination of the RDDs with data from the multiple resources works as the input for Spark Streaming.



Bank's potential customer workflow

In the bank's potential-customer scenario, the bank can use SAP PA to train the model on existing customer data that may reside in Hadoop. Thanks to the big data technologies embedded into SAP Predictive Analytics starting with version 2.2, models can be trained in seconds or minutes on tens of gigabytes of data. Now, to keep ahead of the market competition, let's see how the bank can make use of Spark Streaming along with SAP PA to perform real-time scoring - in other words, to predict 'new potential customers' in real time.


  1. Steps 1 and 2 depicted above are carried out in SAP Predictive Analytics. The Automated mode of SAP Predictive Analytics (PA) automates the whole life cycle of data science, from data preparation to analysis result validation. The analytics models that Automated Analytics generates can be exported in different languages to reproduce the operations made by Data Encoding, Clustering and Classification/Regression. The generated code can then be used outside SAP to apply models on new data. [For more details on applying a model see - how to integrate code generated by scorer ]. In this case the model is exported as a Java class file.
  2. Embed the model in Spark by invoking KxJModelFactory, a utility method of the Java API offered by SAP PA, and then deploy it on Spark using spark-submit.
  3. The model score generator is a Java project that imports the KxJRT library from SAP PA to apply the model. It receives target data and requests from users via a given port in Spark. (We set port 10999 to receive scoring requests over a TCP stream in our POC, but the port number can be configured flexibly.) Internally, Spark Streaming divides the input data streams into batches, which are then processed by the Spark engine to generate the final stream of results in batches. [More details: spark streaming programming ]
  4. Apply the PA model, which generates scores and outputs the results from Spark Streaming. (In our POC we used text files for the output scores; a minimal streaming sketch follows this list.)
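For orientation, below is a bare-bones PySpark Streaming skeleton for the scoring side of such a setup. The actual POC embeds the exported Java model through the KxJRT API; the score_record function here is only a hypothetical placeholder for that call, and the output path is illustrative.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def score_record(line):
    # Placeholder: in the real POC the exported KxJRT Java model is applied here.
    fields = line.split(",")
    return fields[0], float(fields[-1]) * 0.5  # dummy "score"

sc = SparkContext(appName="RealTimeScoring")
ssc = StreamingContext(sc, 5)                  # 5-second micro-batches

# Listen for incoming records on the TCP port used in the POC (10999).
records = ssc.socketTextStream("localhost", 10999)
scores = records.map(score_record)
scores.saveAsTextFiles("scores/output")        # text-file output, as in the POC

ssc.start()
ssc.awaitTermination()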




Final Notes:

With SAP PA's wide variety of model-scoring APIs and the simplicity of the Spark Streaming integration, it did not take a huge amount of effort or time to build this integration scenario. Furthermore, it is now possible for the business to react to situations immediately and with higher stability in the big data space.

Special thanks to Lei Xu for implementing this PoC and giving us the opportunity to do more in this area.
Please let us know what you think of these use cases, and if you have any questions related to this blog please reach out to me or the Predictive Analytics Big Data Product Manager - Richard Mooney.

We have seen in the first part how to create a recommendation model using an SAP HANA information view and asking the user to choose a country. However, more work needs to be done since the business requires us to define a recommendation model for each country. In order to avoid doing the work manually through the user interface, we are going to use a KxShell script. The nice thing about KxShell scripts is that even if you don’t know how to write them, you can ask SAP Predictive Analytics to produce them for you. In the summary screen (the step just before the generation of the model), you can click on: Export KxShell Script.


You need to indicate the folder where to store the script and the name of the kxs file.


The script can be run interactively. Our SAP Predictive Analytics desktop version is installed on a Windows machine. With the cd command we change our working directory to CPP (the name of the folder where the kxshell.exe file is found).

cd C:\Program Files\SAP Predictive Analytics\Desktop 2.3\Automated\EXE\Clients\CPP

We then run the script with country as an argument having the value: Germany.


To obtain the name of the SAP HANA country variable, we looked at the automatically generated script.

# Declaring arguments/variables used in the Events data set (ANV_INVOICES).

# ## Hana view Variable 1 : VAR_COUNTRY (Country)

default EVENTS_HVIEW_VAR_1_ANSW_1_VAL1 "France"

default EVENTS_HVIEW_VAR_1_ANSW_1_OP "="

In fact this script must be modified a little bit before it can be used. We replaced the strings $STORE_USER and $STORE_PWD with the SAP HANA user and password.

#Declaring the 'EVENTS' store and space

default EVENTS_STORE_TYPE "Kxen.ODBCStore"



default EVENTS_STORE_PWD "drowssap_elbakaerbnu"

Perhaps you don’t want to leave the password in the KxShell script, in which case you need to use this syntax:


The password must then appear as an argument in the command.

kxshell C:\ADV_ANALYTICS\RECO_FOR_A_COUNTRY.kxs -DEVENTS_HVIEW_VAR_1_ANSW_1_VAL1="Germany" -DSTORE_PWD="drowssap_elbakaerbnu"

Another section that we modified has to do with saving the model.

default MODEL_SAVE_STORE_TYPE "Kxen.FileStore"






Note that we made the parameter related to the country variable part of the model name.

So far we’ve run our KxShell script in interactive mode, which is useful for adjusting and testing the script. Now that it’s working properly, we will switch to batch mode. We prepared a bat file with two lines in it.

"C:\Program Files\SAP Predictive Analytics\Desktop 2.3\Automated\EXE\Clients\CPP\kxshell.exe" C:\ADV_ANALYTICS\RECO_FOR_A_COUNTRY.kxs -DEVENTS_HVIEW_VAR_1_ANSW_1_VAL1="Spain" > C:\ADV_ANALYTICS\log_create_reco_spain.txt

"C:\Program Files\SAP Predictive Analytics\Desktop 2.3\Automated\EXE\Clients\CPP\kxshell.exe" C:\ADV_ANALYTICS\RECO_FOR_A_COUNTRY.kxs -DEVENTS_HVIEW_VAR_1_ANSW_1_VAL1="UK" > C:\ADV_ANALYTICS\log_create_reco_uk.txt

We redirect the output of the command to a text file in order to keep a detailed trace of the recommendation model generation.
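As a side note, if you have many countries you could also generate and launch these command lines from a small script instead of maintaining the bat file by hand. A minimal sketch follows; the executable path, script path, and parameter name are taken from the commands above, while the country list is purely illustrative. Add the -DSTORE_PWD argument in the same way if the password is not stored in the script.

import subprocess

KXSHELL = r"C:\Program Files\SAP Predictive Analytics\Desktop 2.3\Automated\EXE\Clients\CPP\kxshell.exe"
SCRIPT = r"C:\ADV_ANALYTICS\RECO_FOR_A_COUNTRY.kxs"

for country in ["Spain", "UK", "Germany", "France"]:
    log_path = rf"C:\ADV_ANALYTICS\log_create_reco_{country.lower()}.txt"
    with open(log_path, "w") as log:
        # One kxshell run per country, with its output redirected to a log file.
        subprocess.run(
            [KXSHELL, SCRIPT, f"-DEVENTS_HVIEW_VAR_1_ANSW_1_VAL1={country}"],
            stdout=log,
            check=True,
        )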

After running the batch file, we can see the models from the desktop application.



The models created with a script are ready to be used for recommending products to customers.

Note that using a KxShell script to automate the creation of multiple models is not specific to recommendation. It is applicable to other data mining functions like classification, regression or time series forecasting.


All of us, at one time or another, have rented or bought an item based on a suggestion made by a friend. This friend, who influenced your purchase decision, shared your taste, your interests, or thought he knew you enough. Most likely he himself saw that movie, read that book or used that device he talked to you about. Nowadays, when renting or shopping on-line, we get suggestions from recommender systems. These collaborative filtering systems have access to much more data than our friends do. Their algorithms, the ones that are able to tell us "people who bought/rented item X also bought/rented item Y", can access information on the entire population of users (not just our friends) and look at the complete catalog of items (not just the few items known by our friends).

A data set suitable for use by a recommendation engine must be made up of a minimum of two fields: user and item. Usually the data set also includes a timestamp and a rating or quantity. The data set for this example uses an SAP HANA view on sales invoices.


This analytic view includes three variables. One variable requests that the user choose a country (mandatory). The two other variables let us filter invoices on flags that tell when the purchase occurred, either during the weekend or at night.

Let’s use this sales invoice view to create a recommender with SAP Predictive Analytics version 2.3.


First, we connect to the database through the SAP HANA ODBC driver and choose a view. The view can come from an SAP BW system or any custom application built on SAP HANA.


We ask the application to “Analyze” that view and, since it includes variables, a window pops up with prompts.


After choosing a country, we specify the roles required for a recommendation.


We go ahead and generate the model that will be specific to the country we have chosen.


We can now display the recommended products for a given French customer.


At this stage we have created a recommendation model based on purchases made in French stores. We will see in part 2 how we can automatically generate one recommendation model per country.


Hi Guys


This blog addresses some of the basic challenges and issues that many of us face when doing R integration with SAP PA. SAP Predictive Analytics provides this smart feature of integration with R, a very well-known statistical language, but sometimes the challenges discussed below can come up while doing this integration.


Problem 1: You have installed SAP PA, but when you go to install R, an error message may sometimes appear that prevents you from installing and configuring R internally.




You have to install R manually and configure it with SAP PA. Below are the steps you can follow:



If you are on a 64-bit Windows 7, make sure that you have a 64-bit JRE, and if you are on a 32-bit Windows 7, ensure that you have a 32-bit JRE.


To check the version of the JRE, run "java -version" at the command prompt.




Download & Install R from the following location




After installing R, you need to install the R packages used by SAP Predictive Analytics 2.2.

Below is the list of packages that are required.

  1. rJava
  2. RODBC
  3. RJDBC
  4. DBI
  5. monmlp
  6. AMORE
  7. XML
  8. pmml
  9. arules
  10. caret
  11. reshape
  12. plyr
  13. foreach
  14. iterators

You can either download the package zip files from https://cran.r-project.org/bin/windows/base/old/3.1.2/ and install them locally, or install them directly from the internet via the R GUI console. In the snapshot below you can see the two highlighted boxes: one is for installing packages from local zip files, and the other one ("Install packages") is for downloading packages from the internet.



To test if the required packages are installed correctly, go to

  • Start->Programs->R->R3.1.2
  • At the command prompt, type library(rJava) and press Enter. The package is successfully installed if you don’t see any error

Eg : > library(rJava)

  • Try this for all the packages for confirmation



Go to the SAP PA Expert mode Install and Configure R option. As shown below, there will be a predefined default path for the R installation folder; change it to the path where you have installed R. Then go back to the Installation tab and click again on the Install R option. Restart the tool; it should work now.





Problem 2: Every time you install a new package in R to build a new custom R algorithm, you have to make that new package installation visible to SAP PA Expert mode too, otherwise your algorithm won’t work in the SAP PA environment. While doing so you need to do the same step as step 5 above. But some folks may get the error attached below:




Solution: Go to the Configuration tab, change the default path to C:\Users\[Username]\.sappa and then click Install. Restart the tool and give it a try.




Problem 3: One more common error is "REngine Initialization Failed". This mostly occurs when you run a model that is based on an R algorithm rather than on the SAP PA Expert mode out-of-the-box algorithms.


Solution: In the Configuration tab, change the default path to the path where you installed your R GUI, then go to the Installation tab and click the "Install R" option. Restart the tool after this and rerun the model.



These are a few of the challenges I faced. If anyone has any feedback or anything to add that could be useful to others, you are more than welcome to comment.

Hope this blog has been helpful.




Visualizing a PAL decision tree using d3



In this tutorial we will visualize a Hana PAL decision tree using d3.js.

For most visualization purposes, it is most convenient to use SAP UI5 and SAP Lumira. At the moment, however, these solutions do not offer a way to visualize a decision tree produced by one of the decision tree algorithms in SAP Hana. In this tutorial, we will create such a visualization using d3.js.



  1. Parse a tree model created by Hana PAL
  2. Create a visualization using d3.js
  3. Add functionality to the visualization





In order to visualize a decision tree that has been created by Hana PAL, the output of the algorithm has to be set to PMML. Information on how this can be done and a decision tree sample can be found in the Hana PAL guide.


What is PMML?

PMML is short for Predictive Model Markup Language, an XML standard for describing predictive models. In our case, the model is a decision tree. The definition of the standard can be found on the website of the Data Mining Group at dmg.org.


The reason we use PMML is that it can be produced by all of the decision tree algorithms available in the Hana PAL library (C4.5, CART, CHAID). Moreover, since PMML is an XML format, the browser's DOM parser can be used to parse the tree.



Parse a tree model created by Hana PAL



At first we need to make the PMML document accessible through JavaScript.

We create a new XSJS Project in our repository and call it “vis_tree”.

(For information on how to set up an XSJS project, refer to:  Creating an SAP HANA XS Application)

In this project we create a new XSJS file and call it “getPmml.xsjs”. Edit that file and put the following lines of code in it (Where <SCHEMA NAME> has to be changed to your schema name and <PMML MODEL TABLE> to the name of your PMML output table):




// open a connection to the database and select the PMML model column from the table
try {
       var conn = $.db.getConnection();
       var pstmt = conn.prepareStatement("SELECT PMML_MODEL FROM \"<SCHEMA NAME>\".\"<PMML MODEL TABLE>\"");
       var rs = pstmt.executeQuery();

       var data = "";

       // iterate through the rows; depending on the settings the PMML can be stored in
       // multiple rows, however this is not the case in our example
       while (rs.next()) {
              data += rs.getString(1);
       }

       rs.close();
       pstmt.close();
       conn.close();

       // set the response
       $.response.contentType = "text/xml";
       $.response.setBody(data);

} catch (e) {
       $.response.status = $.net.http.INTERNAL_SERVER_ERROR;
       $.response.setBody(e.message);
}

If you now point your browser to “http://<SERVER IP>:8000/vis_tree/getPmml.xsjs”,

you should see output like the following:




As you can see, the decision tree is stored in the <TreeModel> node. We want to parse this tree into a JSON object in which the children of each node are stored in a list called “children”. This JSON tree structure, together with a “name” in each node, is commonly used with d3.js and referred to as flare.json. We will use d3.layout and therefore need a flare-like structure; a descriptive “name” string is not needed for the layout, so we simply store the PMML node id as “name” and put all of the remaining node information into each node as well.
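
For illustration, a single node of the structure we are aiming for might look like this (the values are invented; the field names follow the parser we build next):

{
       "name": "1",                       // the id of the PMML <Node>
       "score": "high",                   // predicted class of this node
       "recordCount": "350",
       "scoreDistribution": [
              { "value": "high", "recordCount": "300", "confidence": "0.857" },
              { "value": "low",  "recordCount": "50",  "confidence": "0.143" }
       ],
       "predicateList": {
              "predicates": [
                     { "field": "AGE", "operator": "lessThan", "value": "40" }
              ]
       },
       "children": [ /* nested nodes of the same shape; leaves have no children array */ ]
}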


In our project we create a new file and call it “pmmlTree2Flare.js”. First we pull the PMML document from the XSJS service, then we parse the tree recursively into a JSON object.



function getFlare() {

       var flareNode = {};

       // pull the PMML document from the XSJS service (synchronous request)
       var xmlHttp = new XMLHttpRequest();
       xmlHttp.open("GET", "./getPmml.xsjs", false);
       xmlHttp.send();

       var pmml = xmlHttp.responseText;
       var xmlDoc;

       if (window.DOMParser) {
              var parser = new DOMParser();
              xmlDoc = parser.parseFromString(pmml, "text/xml");
       } else { // code for IE
              xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
              xmlDoc.async = false;
              xmlDoc.loadXML(pmml);
       }

       // start the recursion at the root <Node> of the <TreeModel>
       var flare = pmml2Flare(xmlDoc.getElementsByTagName("TreeModel")[0].getElementsByTagName("Node")[0],
              flareNode);

       return flare;
}

function pmml2Flare(pmmlNode, flareNode) {

       // Fill the node with data
       flareNode["name"] = pmmlNode.getAttribute("id");
       flareNode["scoreDistribution"] = getScoreDistribution(pmmlNode);
       flareNode["score"] = pmmlNode.getAttribute("score");
       flareNode["recordCount"] = pmmlNode.getAttribute("recordCount");
       flareNode["predicateList"] = getPredicates(pmmlNode);

       // If there are no child nodes this is an endnode: we return it without a
       // "children" array, so that d3 treats it as a leaf and the score is attached.
       if (pmmlNode.getElementsByTagName("Node").length === 0) {
              return flareNode;
       }

       flareNode["children"] = [];

       // Recurse into all nodes which are top level children of the active node
       for (var i = 0; i < pmmlNode.getElementsByTagName("Node").length; i++) {
              if (pmmlNode.getElementsByTagName("Node")[i].parentNode === pmmlNode) {
                     var node = {};
                     pmml2Flare(pmmlNode.getElementsByTagName("Node")[i], node);
                     flareNode["children"].push(node);
              }
       }

       return flareNode;
}

function getScoreDistribution(node) {

       var scoreList = [];
       var scoreDistribution = node.getElementsByTagName("ScoreDistribution");
       for (var i = 0; i < scoreDistribution.length; i++) {
              // only take distributions that belong directly to this node
              if (scoreDistribution[i].parentNode === node) {
                     scoreList.push({
                            value: scoreDistribution[i].getAttribute("value"),
                            recordCount: scoreDistribution[i].getAttribute("recordCount"),
                            confidence: scoreDistribution[i].getAttribute("confidence")
                     });
              }
       }
       return scoreList;
}

// if the predicate is compound, we have to figure out the simple predicates in
// the compound
function getPredicates(node) {
       var predicateList = {
              predicates: []
       };

       var compound = node.getElementsByTagName("CompoundPredicate")[0];

       if (!compound || compound.parentNode !== node) {
              // no compound predicate: the node carries at most one simple predicate of its own
              var simple = node.getElementsByTagName("SimplePredicate")[0];
              if (!simple || simple.parentNode !== node) {
                     return predicateList;
              }
              predicateList.predicates.push(predicate2Json(simple));
       } else {
              for (var j = 0; j < compound.getElementsByTagName("SimplePredicate").length; j++) {
                     predicateList.predicates.push(predicate2Json(compound.getElementsByTagName("SimplePredicate")[j]));
                     predicateList.operator = compound.getAttribute("booleanOperator");
              }
       }

       return predicateList;
}

function predicate2Json(simplePredicate) {

       var predicate = {};
       predicate.field = simplePredicate.getAttribute("field");
       predicate.operator = simplePredicate.getAttribute("operator");
       predicate.value = simplePredicate.getAttribute("value");
       return predicate;
}




Now we have a simple parser to pull out the XML data into a simple flare.json object.

What is left is to use d3 to create a simple tree in HTML.

Create a visualization using d3.js


What is d3?

D3.js is a JavaScript library which makes it easy to bind data to document (DOM) nodes. This can be used to manipulate SVG images dynamically according to a dataset. The basic idea is to select (or create) nodes and then bind data to them, which is done by calling selectAll(<selector>).data(<dataset>). Afterwards, each of the selected nodes can be manipulated according to the data bound to it.
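
As a minimal, self-contained sketch of that pattern (it is not part of the tutorial files and only assumes a page that loads d3.js and has a <body>):

// bind three numbers to <p> elements; .enter() creates one element per unmatched datum
d3.select("body").selectAll("p")
       .data([4, 8, 15])
       .enter().append("p")
       .text(function(d) { return "value: " + d; });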


What is d3.layout?

D3.layout is a part of d3 which makes it easy to calculate common graphic structures.

In our example, we want to create a tree, so we would have to calculate the position of each node according to the depth of the tree and the maximum number of nodes on one level. D3.layout does all that work and returns x and y values for each node, as well as its depth and its parent and child nodes.
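
A tiny sketch of what the layout returns (the data here is hand-written and unrelated to our PMML tree):

// compute positions for a small two-level tree
var layout = d3.layout.tree().size([200, 100]);
var sample = { name: "root", children: [ { name: "a" }, { name: "b" } ] };
var positioned = layout.nodes(sample);
// every entry now carries x, y, depth and a reference to its parent
positioned.forEach(function(n) {
       console.log(n.name + ": x=" + n.x + ", y=" + n.y + ", depth=" + n.depth);
});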





In order to create a visualization we create a new file in our repository named “createSvg.js”.

We will use a <div> pane in our HTML page which has the id “viz”. To keep things simple, we first create the most basic tree:



function createSvg(treeData) {

       // drawing pane dimensions; the concrete values are an assumption, adjust them to your page
       var width = 1200;
       var height = 800;

       // Create a svg canvas and shift it away from the left border, such that the
       // first circle is still visible
       var vis = d3.select("#viz").append("svg:svg").attr("width", width).attr(
                     "height", height).append("svg:g").attr("transform",
                     "translate(10, 0)");

       // Create a tree layout using d3.layout
       var tree = d3.layout.tree().size([ width - 300, height - 200 ]);

       var diagonal = d3.svg.diagonal()
       // in order to have a left to right tree, we have to swap the x and y axis
       .projection(function(d) {
              return [ d.y, d.x ];
       });

       // call tree.nodes in order to compute the x and y position of each subnode
       // in the tree
       var nodes = tree.nodes(treeData);
       // call tree.links to create an array of the links between the nodes
       var links = tree.links(nodes);

       // draw a path for each of the links
       var link = vis.selectAll("link").data(links).enter().append("svg:path")
                     .attr("class", "link").attr("d", diagonal);

       // place each node group at its computed position, relative to the drawing pane
       var node = vis.selectAll("node").data(nodes).enter().append("svg:g")
                     .attr("transform", function(d) {
                            return "translate(" + d.y + "," + d.x + ")";
                     });

       // draw a circle for every node
       node.append("svg:circle").attr("r", 3.5);

       // attach a simple text to each node, which states the splitting condition
       // of the node, and the score for an endnode
       node.append("svg:text")
              .attr("dx", 8)
              .attr("dy", 3)
              .attr("text-anchor", function(d) {
                     return "start";
              })
              .text(function(d) {
                     var name = "";
                     if (d.predicateList) {
                            for (var j = 0; j < d.predicateList.predicates.length; j++) {
                                   name += d.predicateList.predicates[j].field
                                          + " " + d.predicateList.predicates[j].operator
                                          + " " + d.predicateList.predicates[j].value;
                                   if (d.predicateList.operator) {
                                          name += " " + d.predicateList.operator + " ";
                                   }
                            }
                     }
                     if (!d.children) {
                            name += " => " + d.score;
                     }
                     return name;
              });
}


Now let’s put things together: create a new file in the repository called “simpleTree.html” and put the following into it:






<!DOCTYPE html>
<html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">

<style type="text/css">
path.link {
       fill: none;
       stroke: #F0AB00;
       stroke-width: 1.5px;
}

text {
       font-size: 11px;
}
</style>

<script src="./.lib/d3.js" charset="utf-8"></script>
<script src="./.lib/d3.layout.js" charset="utf-8"></script>
<script src="pmmlTree2Flare.js"></script>
<script src="createSvg.js"></script>
</head>

<body>
       <div id="viz"></div>
       <script type="text/javascript">
              // parse the PMML into a flare object and draw the basic tree
              createSvg(getFlare());
       </script>
</body>
</html>


Point your browser to “http://<YOUR SERVER>/vis_tree/simpleTree.html”

You should see the following:







As you can see, our basic tree looks rather ugly and is inconvenient for bigger tree structures. Nevertheless, parsing the tree into a flare.json object allows us to reuse any existing d3 tree example. Just remember that our flare.json stores the node id in “name” rather than a descriptive label.

In the next step, we take existing code for a collapsible tree and add a mouseover text box which contains all of the information we have.




Add functionality to the visualization



Our goal is to add two features:


  1. Make the tree collapsible, so that bigger tree structures are convenient to view
  2. Make all the information visible in the tree structure


In the end our tree structure will look as follows: blue nodes indicate the existence of children, and a click on a blue node expands its top-level children. The blue box becomes visible when hovering over a node and shows every bit of information associated with that node.

The label on each node shows the most probable outcome value on that branch of the tree.




For a collapsible tree, a good tutorial can be found at




In order to create a mouseover text box, we create a <div> element containing all the information and attach it to each node. The following is the complete source, which can be tested by copying and pasting it into createSvg.js.


Collapsible Tree with textboxes:



// assumes an HTML pane with id "tree" (e.g. <div id="tree">) and a call such as
// complexTree(getFlare()); this wiring is an assumption, adapt it to your page
function complexTree(flare) {

       var m = [20, 120, 20, 120],
              w = 1280 - m[1] - m[3],
              h = 800 - m[0] - m[2],
              i = 0,
              root;

       var tree = d3.layout.tree().size([h, w]);

       var diagonal = d3.svg.diagonal().projection(function(d) {
              return [d.y, d.x];
       });

       var vis = d3.select("#tree").append("svg:svg").attr("width",
                     w + m[1] + m[3]).attr("height", h + m[0] + m[2]).append("svg:g")
              .attr("transform", "translate(" + m[3] + "," + m[0] + ")");

       // This function produces a div text box, which we will append to each node
       function tooltip(d) {

              var textdiv = d3.select("body").append("div").style("position",
                            "absolute").style("z-index", "10").style("opacity", 0).style(
                            "background-color", "rgb(176, 196, 222)").attr("id", "toolbox")
                     .text("Score: " + d.score);
              textdiv.append("br");

              textdiv.append('tspan').text("Record Count : " + d.recordCount).append("br");

              if (d.predicateList.operator) {
                     textdiv.append('tspan').text(
                            "Predicate Logic : " + d.predicateList.operator).append("br");
              }

              for (var i = 0; i < d.predicateList.predicates.length; i++) {
                     textdiv.append('tspan').text(function() {
                            var text = d.predicateList.predicates[i].field + " " + d.predicateList.predicates[i].operator + " ";
                            if (d.predicateList.predicates[i].value) {
                                   text += d.predicateList.predicates[i].value;
                            }
                            return text;
                     }).append("br");
              }

              for (var i = 0; i < d.scoreDistribution.length; i++) {
                     textdiv.append('tspan').text(
                            "Score for " + d.scoreDistribution[i].value + ": records " + d.scoreDistribution[i].recordCount + ", confidence: " + d.scoreDistribution[i].confidence).append("br");
              }

              return textdiv;
       }

       root = flare;
       root.x0 = h / 2;
       root.y0 = 0;

       // Collapse a node and all of its descendants
       function toggleAll(d) {
              if (d.children) {
                     d.children.forEach(toggleAll);
                     toggle(d);
              }
       }

       // Initialize the display to show a few nodes.
       if (root.children) {
              root.children.forEach(toggleAll);
       }
       update(root);

       function update(source) {
              var duration = d3.event && d3.event.altKey ? 5000 : 500;

              // Compute the new tree layout.
              var nodes = tree.nodes(root).reverse();

              // Normalize for fixed-depth
              nodes.forEach(function(d) {
                     d.y = d.depth * 180;
              });

              // Update the nodes…
              var node = vis.selectAll("g.node").data(nodes, function(d) {
                     return d.id || (d.id = ++i);
              });

              // Enter any new nodes at the parent's previous position.
              var nodeEnter = node.enter().append("svg:g").attr("class", "node")
                     .attr("transform", function(d) {
                            return "translate(" + source.y0 + "," + source.x0 + ")";
                     }).on("click", function(d) {
                            toggle(d);
                            update(d);
                     }).on("mouseover", function(d) {
                            if (!d.tooltip) {
                                   d.tooltip = tooltip(d);
                            }
                            d.tooltip.style("visibility", "visible");
                            return d.tooltip.transition().style("opacity", 0.9);
                     }).on("mousemove", function(d) {
                            return d.tooltip.style("top", (d3.event.pageY - 10) + "px").style("left", (d3.event.pageX + 10) + "px");
                     }).on("mouseout", function(d) {
                            d.tooltip.transition().style("opacity", 0).duration(1000);
                            return d.tooltip.style("visibility", "hidden");
                     });

              nodeEnter.append("svg:rect").attr("height", 0).attr("width", 0).attr(
                     "transform", function(d) {
                            var length;
                            d.score ? length = d.score.length * 5 + 20 : length = 30;
                            return "translate(-" + length / 2 + ",-20)";
                     }).style("fill", function(d) {
                            return d._children ? "lightsteelblue" : "#fff";
                     });

              nodeEnter.append("svg:text").attr("text-anchor", "middle").text(
                     function(d) {
                            return d.score;
                     }).style("fill-opacity", 0);

              // Transition nodes to their new position.
              var nodeUpdate = node.transition().duration(duration).attr("transform",
                     function(d) {
                            return "translate(" + d.y + "," + d.x + ")";
                     });

              nodeUpdate.select("rect").attr("height", 30)
                     .transition().duration(duration / 4).attr("width", function(d) {
                            return d.score ? d.score.length * 5 + 20 : 30;
                     }).style("fill", function(d) {
                            return d._children ? "lightsteelblue" : "#fff";
                     });

              nodeUpdate.select("text").transition().duration(duration / 2).style("fill-opacity", 1);

              // Transition exiting nodes to the parent's new position.
              var nodeExit = node.exit();

              nodeExit.select("rect")
                     .transition().duration(duration / 2).attr("width", 0)
                     .transition().duration(duration / 2).attr("height", 0);

              nodeExit.transition().duration(duration).attr("transform",
                     function(d) {
                            return "translate(" + source.y + "," + source.x + ")";
                     }).remove();

              nodeExit.select("text").style("fill-opacity", 0);

              // Update the links…
              var link = vis.selectAll("path.link").data(tree.links(nodes),
                     function(d) {
                            return d.target.id;
                     });

              // Enter any new links at the parent's previous position.
              link.enter().insert("svg:path", "g").attr("class", "link").attr("d",
                     function(d) {
                            var o = {
                                   x: source.x0,
                                   y: source.y0
                            };
                            return diagonal({
                                   source: o,
                                   target: o
                            });
                     }).transition().duration(duration).attr("d", diagonal);

              // Transition links to their new position.
              link.transition().duration(duration).attr("d", diagonal);

              // Transition exiting links to the parent's new position.
              link.exit().transition().duration(duration).attr("d", function(d) {
                     var o = {
                            x: source.x,
                            y: source.y
                     };
                     return diagonal({
                            source: o,
                            target: o
                     });
              }).remove();

              // Stash the old positions for transition.
              nodes.forEach(function(d) {
                     d.x0 = d.x;
                     d.y0 = d.y;
              });
       }

       // Toggle children: if they are currently visible we stash them in _children,
       // otherwise we restore them from _children.
       function toggle(d) {
              if (d.children) {
                     d._children = d.children;
                     d.children = null;
              } else {
                     d.children = d._children;
                     d._children = null;
              }
       }
}

