
SAP Predictive Analytics


Dear community,

I recently installed PA 2.2 to test the "Time series analysis" functionality.


This blog describes my steps to the final result, and I have some questions, since the final outcome seems very poor to me.


After having watched this video: http://scn.sap.com/docs/DOC-62239


I prepared a list of ~130 German companies with ~20 stock market key figures for the last 115 consecutive weeks.

The import file contained weekly data from CW 36/2013 to CW 46/2015, and my expected outcome was the share value by CW 5/2016.


Fortunately I had good help from a working student, who developed the "structure" file for me.

It seems to work, but we are not sure whether it is the best setup (further information is appreciated).


This blog focuses on one example company, "SAP", for which a trend line was generated.

Question 1: Why do some results show trend lines and others don't?


In my variables I used only "total value" key figures and avoided mixing them with percentage key figures.


I chose 12 future weeks to predict:


Warning message shown:


Apparently it can "predict" only 4 weeks?

Question 2: What does this warning mean? I found some warnings with 2 or 3 as the maximum horizon; this one shows 4.


UPDATE: I forgot to include the following screenshot:


However, I continued and this is the result ... quite ...hm...  strange ... or ridiculous :-D

SAP Forecast.png

The table shows the whole "catastrophe"... a variance of almost 40% between minimum and maximum.


...This and several other results seem to roll the dice to find the forecast.

Another highlight, Lufthansa: up & down and up & down:

LH_Wuerfeln.png FC_vs_Signal_LH.png


Finally I have some more questions and would love to learn more about the tool and "Time series analysis":

3) How can the structure file be optimized? Is there a how-to or SCN document/blog available?


4) Is there a way to analyze more than only one company at a time?

I would like to load the whole DAX (the German main index) and use the same ~20 key figures for all companies to find results per company.

Since all shares get the "same attention" (e.g. within DAX, SDAX, or MDAX), I would like to use additional market "trends" in the analysis.

Is there somehow a "learning effect" I can initiate in the tool by using different data with the same variables?

5) Is there a way to combine no. 4) "more companies at once" with getting only the trend lines per company as a result, not the single predicted values?

6) How do I find out which of the 20 key figures I should keep or change for better results?

7) How does PA deal with mixed input of "total values" and "percentage key figures"?

8) How can I tell PA which relationships exist between key figures, e.g. those which are the outcome of a formula using the weekly share value?

9) I checked the logs and found statements like:

"The automatic variable selection process discarded all the extra-predictable variables when estimating the trend(<list-of-variables>)" or

"The trend model (Regression<list-of-variables>...has been discarded from the competition." What does this mean?

Are all my 20 key figures in the file neglected, so that the forecast is based only on the historic share values? What could be the reason?

Thanks for reading... and any feedback is appreciated :-)

Best regards,


The next major release of SAP Predictive Analytics (the 3.x product line) is currently planned for release in the first half of 2016.

SAP plans to deprecate Windows 32-bit operating system support in this release, in order to speed up product innovation deliveries on other operating systems, including Windows 64-bit operating systems.

The last SAP Predictive Analytics version that will provide Windows 32-bit operating system support is a minor release (part of the 2.x product line) that is currently planned for release during the first quarter of 2016.

For customers who wish to continue installing and using SAP Predictive Analytics on 32-bit operating systems, critical fixes for the SAP Predictive Analytics 2.x product line will be available until the 10th of February 2017.

For more information about this communication, feel free to contact me directly via email (see my SCN profile).


Legends of the Fall

Posted by Antoine CHABERT Nov 9, 2015

The Rugby World Cup (RWC) 2015 is over! I can't wait until 2019!

This has been a fantastic edition, fully packed with emotion.

I enjoyed the Brave Blossoms' resilience, I was inspired by the fighting spirit of “Los Pumas”, I was delighted by the wonderful moves of the All Blacks' “golden generation”, and I was saddened by the wrecking of “Les Bleus”.

In part 1 and part 2 of this blog series, I used SAP Predictive Analytics to create my predictive model based on historical data and tested some scenarios.

I now apply the predictive model to determine the players that will make it into my hall of fame based on their overall RWC performance. The focus is not really on the new talents that emerged in this particular edition, as my data sums up performances across the different editions of the RWC.


I reload the model I had created and saved, then I click on Run and Apply Model.

Load a Model.png

Apply Model.png


In the Applying a model screen:

  • The Application Data Set is the data set on which I will apply my model to determine which players should be considered legends. Mine is named RWC 2015 Player List; it contains the figures across the different editions for the players that participated in RWC 2015.
  • The Generation Options determine the output that is generated from the model. In this case I am selecting the Probability & Error Bars option. If a player is given a probability greater than 0.5 in the resulting file, he should be considered a legend. More generation options are possible; I find probabilities quite easy to interpret.
  • The Results Generated by the Model is the place where I output the results. Here I am generating the results into an Excel file.
  • I click on Apply so that the file gets generated.

Applying the Model.png

I open the Excel file and look at column D, which corresponds to the probability of each player being considered (by me) a legend. A probability is a figure between 0 and 1: 0 means that the player is very probably not a legend (for me!), and 1 means that the player is very probably a legend.
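The same 0.5 cutoff can be applied programmatically once the scores are exported. Here is a minimal sketch with invented player names and probabilities (the real values come from the generated Excel file):

```python
# Classify players as "legends" using the 0.5 probability cutoff described above.
# The names and probabilities below are purely illustrative.
scored_players = [
    ("Player A", 0.91),
    ("Player B", 0.42),
    ("Player C", 0.55),
]

def legends(scores, threshold=0.5):
    """Return the players whose predicted probability exceeds the threshold."""
    return [player for player, proba in scores if proba > threshold]

print(legends(scored_players))  # ['Player A', 'Player C']
```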

My New Rugby Legends.png


I loaded the Excel file into SAP Lumira and selected the players with a probability of more than 0.5:

  • The list contains 24 players in total.
  • Most of the players originate from Southern Hemisphere teams, and the All Blacks are well represented!


Frederic Michalak ranks high (#26) in the overall list. He falls a bit short of becoming one of my legendary players, with a probability of 0.42. OK, I’ll give him a bonus because he is a French guy ;-).


Jonathan Sexton and Sergio Parisse are not yet legends for me.



Now I’ll remove the players that I was already considering legends in the RWC history.

My list previously included Dan Carter, Bryan Habana, Richie McCaw, Fourie du Preez, Drew Mitchell, Kieran Read and Victor Matfield. All these legendary players shone this year and throughout their RWC careers!

Old Legends.PNG

My final shortlist includes 17 players, from 5 different countries.

Player by Country.PNG

10 All-Blacks:

3 Australian players:

2 South-African players:

1 Argentinian player:

1 Irish player:


I do agree with most predictions:


As we have seen, it’s very easy to apply a predictive model to generate results on new data samples. 


I hope you enjoyed my blogs and the RWC 2015!

What are your personal RWC legends? 


You can follow me on Twitter: @ChabertAntoine

Fellow SCN Predictive Enthusiasts,


This post aims to familiarize you with the process of building predictive models using the APL (Automated Predictive Library) through an example.


For those who have not heard of the APL yet: SAP APL is a native C++ implementation of the automated predictive capabilities of SAP Predictive Analytics, running directly in SAP HANA. The key differentiator of the SAP APL over other predictive components within SAP HANA is the “A” for “automated”. Using the APL you can run real-time automated predictive algorithms on your data stored in SAP HANA without requiring a data extraction process.


Another advantage of an APL-based model is that it simply needs to be set up and instructed which type of data mining function to apply. The APL then takes over from there, composing its own models, creating and selectively eliminating metadata as required, and ultimately arriving at the most optimal model for the data provided, in a mostly automated way.


I have put together a document which shows, through a step-by-step example, how an insurance company can analyze past insurance fraud data to create a predictive model in SAP HANA using the Automated Predictive Library (APL) that identifies potentially fraudulent future auto insurance claims.


You can also see this example in action in this recorded webinar, in which we cover an overview of predictive analytics in SAP HANA and a live demonstration of SAP Predictive Analytics and the APL.




D-7 before SAP TechEd 2015! Held in Barcelona from November 9-12, this is a great opportunity for you to catch up on our latest predictive innovations, learn new skills by taking a workshop, see what other SAP customers are doing in this domain, or simply network with your peers. To help you get the most out of it, I put together a list of sessions, workshops, and activities you should put on your agenda now.

My Top 5 List of Predictive Sessions


Tuesday, November 10


BA160 Use SAP Predictive Analytics with SAP Business Warehouse on SAP HANA

14:30-18:30 - Hands-On Workshop

Learn how you can use SAP Predictive Analytics software in combination with your data from SAP Business Warehouse powered by SAP HANA.


Wednesday, November 11


BA111 Become a Data-Driven Business: Exploratory and Prescriptive Analytics

11:15-12:15 - Lecture

The automation of predictive analytics is the key basis of two new categories of analytics: Exploratory analytics (for showing executives what is really driving business) and prescriptive analytics (to improve operations).


BA806 Road Map Q&A: SAP Predictive Analytics

16:00-17:00 - Road map session

Join us for an exclusive introduction and Q&A to our SAP Predictive Analytics strategy and road map.


BA272 Automated Predictive Analytics Integration and Scripting
16:45-18:45 - Hands-On Workshop

Discover the integration and scripting capabilities offered by the automated module of SAP Predictive Analytics software.


Thursday, November 12


BA112 Predictive Maintenance and Service: Practical Internet of Things Experience

16:45-17:45 - Lecture

The SAP Predictive Maintenance and Service solution is in the domain where customers merge large amounts of machine sensor and failure event data with structured ERP data. Learn what more than a dozen SAP customers have done in this domain.


Don't Miss...

SS34 Predictive Demo on the Showfloor

Tuesday, Wednesday and Thursday

A chance to see live demos and chat 1:1 with members of the predictive team.

DG107 Developer's Garage

Bring your laptop and ask questions of our predictive expert, Abdel Dadouche. Abdel will also showcase how you can build an application on SAP HANA Cloud Platform and use cool predictive services.


Looking for more?

We have over 24 lectures, hands-on workshops, sessions and demos showcasing predictive.

View all the TechEd predictive sessions here.


The SAP Predictive Analytics team looks forward to seeing you in Barcelona!

This post aims to help you diagnose and fix problems with executing APL components in Expert Analytics.


Symptom: Executing an analysis on HANA with Model Compare, Model Statistics, and/or components prefixed with HANA Auto fails with errors relating to PROCEDURE_SIGNATURE_T.



Step 1: Check whether the user you log in with has been configured to use APL.


Go to the HANA server instance in HANA Studio and try to find the table PROCEDURE_SIGNATURE_T.


In the snapshot above, the user TEST is the only user configured to use APL; any other user trying to use APL will get the table-related error. If the search doesn't return the table, the configuration is incomplete; follow Step 2 below to fix it.


Step 2: Configure APL for every user that will use it from Expert Analytics


The HANA APL user guide explains the steps to install and configure the APL (http://help.sap.com/pa), under section 2.1.5. There is a set of SQL scripts that needs to be run for every user that needs access to the APL. Here is a snippet:


connect USER_APL password Password1;

create type PROCEDURE_SIGNATURE_T as table (
    ...

USER_APL should be replaced by each individual user ID that will use the APL, and the entire script should be re-executed.


Once done, log in to HANA with that user from Expert Analytics, and all APL components, including Model Compare and Model Statistics, should run fine.
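Since the same script must be re-run once per user, it can help to generate the per-user SQL mechanically. A small sketch, assuming the guide's script is held as a template string (the statements shown are placeholders, not the full script):

```python
# Render the APL setup script once per user by substituting the USER_APL
# placeholder. The template body is abbreviated here; the full script is in
# the HANA APL user guide.
APL_SETUP_TEMPLATE = """connect {user} password {password};
-- ... remaining statements from the APL user guide script ...
"""

def render_setup_script(user, password):
    """Return the setup script with the placeholder user and password filled in."""
    return APL_SETUP_TEMPLATE.format(user=user, password=password)

for user in ("ANALYST1", "ANALYST2"):
    print(render_setup_script(user, "Initial1"))
```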

I recently completed a full iron-distance triathlon (2.4-mile swim, 112-mile bike, 26.2-mile run) after 9 months of training.  There are similarities between my experience as a triathlete and as a Big Data Analytics team member at SAP.  The elements of success are the same in both venues.


(1)    The first similarity is the acquisition of multiple skill sets. While triathlons involve swimming, biking and running, the Big Data Analytics team must have skills in machine learning/statistics, coding, business, and communication. Having all the skill sets leads to the best chances of success.


But what is success? In triathlons there is a well-defined finish line, clearly marked as such; in Big Data Analytics it is more obscure. I believe that crossing the finish line is not when models are built or even when results are placed into business reports.  In Big Data Analytics we cross the finish line when predictive results are baked into everyday business decisions (total business integration).  This requires that we communicate with and train business experts and leaders on why predictive is important and even how to use predictive results in decision making.

This is hard work. Business integration can take longer than building predictive models and can encounter obstacles along the way.  In business we face aversion to change and ego while in triathlons we face flat tires and wind.  But only by overcoming the obstacles can we attain full success.

(2) The second similarity is that Big Data Analytics success requires dedication and determination, like an iron-distance triathlon. This is not a sprint race! It took me 9 months to prepare for a 140.6-mile race. You can expect business integration to take months of hard effort.

(3) Success also takes the right mix of ingredients, like triathlon nutrition.  A few of the right ingredients are:


      • Access to the right data
      • Personnel devoted to business integration
      • Curriculum that teaches business stakeholders the value of predictive results
      • Communication that describes what specific predictions are made… without techno-speak
      • Descriptions of how predictive fits into the business story


So what is the current status of predictive success in organizations?

I propose that many organizations fall far short of success and struggle to solve the problem of how to infuse predictive results into decision making processes. 

(4) At SAP, the Big Data Analytics team uses “Adoption Leads” to drive successful integration.  Adoption Leads are “coaches”: their role is to communicate with and train business leaders about predictive and to work with them to achieve optimal results.  They are as key to the success of the Big Data Analytics program as coaches are to a triathlete.


Adoption leads enter the process at the very start, just as coaches do.  They facilitate project objectives and work to build relationships with stakeholders. Regular meetings help shape strong, trusting relationships that make it more likely that model results will be used for decision making. This is at the heart of the SAP program.

So whether you are a triathlete or a predictive analytics expert … multiple skill sets, dedication and determination, the right mix of ingredients and coaching will get you across the finish line and guarantee your success. 

This is part 2 of the Person of Interest series – you can find part one here: The Predictive Science Behind TV’s “Person of Interest”




"You are being watched

The government has a secret system - a “machine”.

It spies on you every hour of every day

I know, because I built it.


I designed the machine to detect acts of terror

But it sees everything

Violent crimes involving ordinary people

People like you.


Crimes the government considered irrelevant

They wouldn’t act, so I decided I would.

But I needed a partner

Someone with the skills to intervene.


Hunted by the authorities, we work in secret

You will never find us.

But if your number is up, we will find you."



Eyes and Ears Everywhere


In my first post in this series (The Predictive Science Behind TV’s “Person of Interest”), I revealed that the “machine” has access to an immense amount of data – it can tap into many forms of communication, look up data from thousands of sources, and of course track you everywhere you go. 


poi1.jpg

It looks for schemers, plotters, malicious intent, and suspicious transactions – but it has to analyze everything because individual event data doesn’t say anything about what data is relevant or why.


When he started building the “machine”, Finch had to teach it to collect information from any source without deciding whether the information was useful or not.  This is very different from the world of analytics we live in, where we usually do not have the option to collect “everything” due to cost, privacy, or other issues.   Even in Person of Interest, it is unrealistic for even “the machine” to store every byte collected on every one of the city’s 8 million inhabitants.



The Secret To Better Predictive Analytics


personofinterest_daddy.jpg

So how does Finch’s “machine” work its magic without actually storing and then trying to process all of that data on the fly?  Metadata.  Finch taught his machine to perform image and facial recognition, voice-to-text transcription, and textual sentiment analysis to create metadata – data about the data.

This turns a single image of a person into a name, a birthdate, work history, GPS location, even metadata about other people visible in the same picture.  An intercepted email contains entities, what they think, what they have done, and who they know.   The machine continuously processes everything it collects and automatically derives the additional metadata. 


In many cases, this metadata is all that needs to be kept – for example, if the machine has identified all the people in a photograph, where and when it was taken, and any other extractable metadata such as sentiment, there may be no other reason to keep the original binary image.   The same metadata processing happens for all audio, video, and any other data the machine collects – so while the machine collects terabytes of data per second, it doesn’t need to store the raw binary streams.

POI Techniques In The Real World


poi-pic5.png

Hopefully this gives you some ideas – you don’t need to do hyper-speed text analysis to learn more about a customer.   If you are a retail operation, is your customer buying items from the men’s and women’s departments or just their own gender’s?   If you are in a services industry, did the customer phone into the call center recently?  What was the nature of their call, and was their sentiment positive or negative?  Maybe you want to run speech-to-text and capture the transcript of every call instead of storing all the audio.  These extra pieces of metadata could be useful in determining buyer behavior or understanding whether a customer is happy with you or not.

What metadata should you collect?  Anything and everything you can.  This is what the “machine” does, and what every good data scientist wishes they could do.  Whether you store the data in your SQL-based data warehouse, an SAP HANA system, or your own Hadoop cluster, it doesn’t matter – you can always transfer or blend your data later once you know what you want to do with it.  I’ve met some customers that think this is “overkill” and for some organizations it might be – the problem is you never know until it’s too late.  There is no requirement to put in this extra thought (or processing in some cases), but if it could improve your profits by even 10%, wouldn’t it be worth it?
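As one concrete illustration of this kind of derived metadata, here is a minimal sketch (the transaction records are invented) that flags customers who buy across both departments:

```python
# Flag customers who buy from more than one department - a simple piece of
# derived metadata of the kind described above. The rows are invented.
transactions = [
    {"customer": "C1", "department": "mens"},
    {"customer": "C1", "department": "womens"},
    {"customer": "C2", "department": "mens"},
]

def cross_department_buyers(rows):
    """Return the set of customers seen in more than one department."""
    departments_by_customer = {}
    for row in rows:
        departments_by_customer.setdefault(row["customer"], set()).add(row["department"])
    return {c for c, depts in departments_by_customer.items() if len(depts) > 1}

print(cross_department_buyers(transactions))  # {'C1'}
```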



Metadata about Metadata


But how do you actually use this data to get that boost to your business?  Not so fast – there’s another step that we typically don’t consider in the business intelligence world: derived data sets.   There may be patterns in the data that cannot be detected by analyzing a few rows, so a more sophisticated way of looking at the data is needed.


For example, it may be interesting that a perpetrator was at Central Park at 2 pm on Monday, but knowing that she has been there every weekday, but only when another person is at the same location becomes very interesting.  Is the perpetrator stalking a new victim?  How are they related?  Is she shadowing the victim or does she know the victim’s routine?  By the way, how do we even know that we are analyzing a perpetrator and not a victim?

The reason predictive analytics needs disaggregated (non-summarized) data is that while a single row may not be significant, a combination of events may mean something. The secret lies in creating additional fields in the data based on a higher-level understanding of the data that is lost when looking only at summarized records. This is where data science becomes non-obvious and a data scientist earns their wage.


As a trivial example, a data scientist may add seven additional binary fields to every event/transaction to encode the day of the week for easy analysis.  Time of day? That might be another 24 fields.  This can be extended to other types of data as well: it is easier for some algorithms to use a sales order record if each product has its own binary field that is set to “true” if it is included in that order.
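The seven-field day-of-week expansion can be sketched as follows (the field names are my own invention):

```python
from datetime import date

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def add_day_of_week_fields(event):
    """Append seven binary day-of-week fields to an event record."""
    weekday = event["date"].weekday()  # Monday == 0
    for i, day in enumerate(DAYS):
        event["is_" + day] = (i == weekday)
    return event

event = add_day_of_week_fields({"date": date(2015, 11, 9)})  # a Monday
print(event["is_mon"], event["is_tue"])  # True False
```

The same pattern generalizes to the 24 time-of-day fields mentioned above, or to one binary field per product on an order.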

Big Data = Wide Data

This creates a massive explosion in the number of fields to analyze, and it is a very important concept in predictive analytics: when we say “Big Data”, we mean very wide data sets, not just long ones.

If you thought collecting extra metadata was “overkill”, stay tuned for part 3 of this series where we’ll uncover the true secret to why the “machine” can be as accurate as it is.

As with part one, feedback drives the posting frequency for this series – rate this article (below) or "like" it above to help get the next installment sooner.

If you want to be alerted when there are new posts, either subscribe to the Predictive SCN space or follow me directly (to receive email notifications).

Some months back I published an executive summary on considerations when using BW as a data source for predictive modelling, and also on how BW can connect to SAP Predictive Analytics Expert mode. Since then we have seen a number of exciting enhancements to both SAP Predictive Analytics and SAP BW on HANA which have changed the way we approach predictive when using BW on HANA. In this blog we will go through some of these new features and how they can be leveraged when using SAP Predictive Analytics with SAP BW on HANA.


Before I delve into what they are and the benefits they bring, it is worth recapping some of the key features of SAP HANA. Since its inception in late 2010, SAP HANA's key differentiators have been the ability to leverage in-memory technology to increase speed, reduce complexity and data redundancy, and enable real-time analytics. However, it is also worth noting that SAP HANA is more than just a database: it is also a platform with numerous capabilities, such as an embedded predictive engine housing numerous predictive libraries, such as PAL, APL and R, to name but a few. Keeping this in mind, I will start by outlining four of the key changes in SAP Predictive Analytics 2.2 and SAP BW on HANA 7.4 and above.


1. Generation of HANA view from BW Objects:

The first feature is the HANA View Generation functionality introduced in SAP BW 7.40 SP05 on HANA for InfoProviders and since SP08 for BW Queries (BEx Queries and Query in Eclipse). The BW on HANA Guide to HANA View Generation outlines all the prerequisites and steps that are required to generate Calculation, Analytical and Attribute views from BW objects.


2. Consumption of HANA Views

HANA Analytical & Calculation views have been supported by SAP Predictive Analytics Expert mode for some time now, and as of PA 2.2, Automated mode is also able to connect to both Analytical & Calculation views. This means that HANA views generated from BW objects can be consumed by both PA modes.

3. Writing to HANA

Not only can SAP Predictive Analytics consume HANA views, but both Automated and Expert modes also have the ability to write scores back into HANA tables. This brings us a step closer to returning the insight to the business user or decision maker.



4. Export Scoring Equation for real time scoring

Last but by no means least is the ability of SAP Predictive Analytics Expert to operate in HANA online mode. Although this feature has been available in Expert Analytics since its release, with the addition of HANA view generation, BW on HANA can now also operate in online mode. The benefits include no limit on data acquisition and faster model training, since data does not leave HANA and processing is delegated to HANA's own predictive engine; use of the predictive libraries installed on the HANA database; and, finally, the ability to seamlessly inject scoring equations back into HANA.




In summary, this is just a starting point; there are additional features and functions under review to enable smooth interoperability between SAP Predictive Analytics and SAP BW on HANA. Remember that advanced analytics does not end once a predictive model has been created: the focus is on getting the insight back to the decision maker and ensuring simple management of the model's life cycle. For more details please see our how-to guide, or, even better, if you are attending TechEd this year, join me at the hands-on session BA160 to see for yourself.

The updated version of the HANA cloud image for SAP Predictive Analytics hosted on AWS is now available in the SAP Store. As with the earlier version of the cloud image, customers and partners can run the available scenarios with the sample data sets or evaluate the scenarios with custom data sets. The HANA cloud image hosted on AWS showcases the pre-built HANA models and predictive models, with specific data visualizations, for a few customer-centric scenarios.


Update on RDS:

As of this writing, version 5 of SAP Predictive Analytics RDS was released on Aug 10th, 2015. This release has a few updates in terms of predictive models, data visualization capabilities, and the customer POCs that were developed. As part of the RDS offering there are about 30 scenarios across 9 domains (5 industries and 4 LoBs). A new POC in the Finance domain was developed for a home care customer in the UK, and another POC in the Portfolio & Project Management domain for a consulting firm in Denmark.


Scenarios available in the HANA cloud image on AWS:

The updated cloud image has many more pre-built scenarios available to run on the hosted AWS environment. There are about 24 scenarios across 8 domains (5 industries and 3 LoBs). These scenarios help customers and partners get a complete end-to-end story in the context of HANA.


  • Banking LoB
    • Customer Attrition Analysis
  • Consumer Products Industry
    • Brand Sentiment and Sales Analysis
    • Demand Data Analysis
    • Product fulfillment and Optimization
  • Finance LoB
    • Company Performance Analysis
    • Late-Payment Management
    • Customer Cash Collection Analysis
  • Manufacturing LoB
    • Customer Demand and Inventory Management
    • Overall Equipment Effectiveness
    • Asset Breakdown Analysis
    • Maintenance Cost Analysis
  • Portfolio & Project Management LoB
    • Project Profitability Analysis
  • Retail Industry
    • Market Basket Opportunities
    • Customer Loyalty Programs
    • Store Clustering
  • Sales & Marketing LoB
    • Customer Segmentation
    • Market Segmentation
    • Market Campaign Success
    • Product Recommendation
    • Pipeline and Revenue Forecasting
  • Telco Industry
    • Churn Modeling and Offer recommendation
    • Post-paid Analysis
    • Rotational Churn Detection
    • Multi-SIM Detection



Architecture of the Cloud Image:

The pre-built HANA cloud image is available (in the SAP Store) for a trial period of 30 days. During this time there is no additional cost to access this image as long as you have an AWS account. The cloud image is hosted on AWS, so you will need to pay a nominal fee to AWS, but access to the HANA cloud image is available for trial and evaluation purposes in the context of customer-centric scenarios.


Follow this blog, which explains in detail how to get access to AWS and an SAP CAL account. The blog has a link to a video that explains the steps in detail.


In the upcoming weeks, the cloud image will be updated with some more scenarios and later versions of the SAP predictive tools.


Please click here to access the HANA Enterprise Cloud image for Predictive Analytics on AWS in the SAP store.

Spoiler Alert: This blog series talks about the TV show Person of Interest but does not aim to give away any “spoilers” – however, it is impossible to talk about the show without giving away its basic premise.  I apologize if I let one or two through – there should be nothing here that you wouldn’t find out if you started watching right now.



The Tom Cruise movie “Minority Report” came out in 2002 and provided us a glimpse into the not-so-distant future: the ability to foresee crimes before they happen and the chance to stop the perpetrator before it is too late.  Whether you haven’t seen the movie before or just haven’t seen it recently, I highly recommend getting some popcorn and taking a look at a much younger Tom Cruise grappling with a world where his future is foreseen by three “Precogs” who are never wrong in their predictions.

Back then it was truly futuristic, but the question I’m going to answer in this blog series is: “Is predicting future crimes really science fiction, or is it science fact?”

CBS's Person of Interest

In 2010, CBS started the TV show “Person of Interest”, our modern-day “Minority Report” with a twist – instead of mythical “Precogs” having visions of the future, a reclusive billionaire named Harold Finch has a “machine” that has premonitions of violent crimes that are about to happen.  Teaming up with John Reese, an ex-special-forces soldier, the duo set out to prevent these crimes from ever happening.

Over the course of season one, we learn some of Finch’s secrets.   We learn that Finch built the machine for the government to detect potential acts of terror after the events of 9/11.   After the attacks, the government gave itself the power to read every email, listen to every phone call, and monitor every video feed – watching with ten thousand eyes and listening with a million ears.   But they needed something that could sort through it all – something that could pick the terrorists out of the population before they could act.  The public of course wanted to be protected, but didn’t want to know how they were being protected, so the government kept the existence of the “machine” secret.

The “machine” worked perhaps too well – it was designed to prevent the next 9/11, but it was identifying all types of potential violent crimes.  The machine was eventually taught to classify these into crimes relevant to national security (ones that would cause massive loss of life) and ones that are considered “irrelevant”.  The problem of course is that an event may be irrelevant to the government, but it was very relevant to the victim of a crime.  

Finch built a backdoor into the system to get access to this irrelevant list – however, to ensure the inner workings of the system remained protected, it only provided a nine-digit number: the social security number of someone involved in the crime – either the perpetrator or a victim – but no other details.

That’s all Finch and Reese have to stop the crime.  Convenient premise for a TV show’s plot isn’t it?

Science Fiction or Science Fact?

You can probably imagine that a lot of the show is based in science – predictive science – so the real question is how much of it is “science fact”.  We know that governments around the world have been collecting this type of information; the question is not whether the collection of so much data is possible, but how to detect meaningful relationships within it.  How to create the thinnest of threads connecting the rarest of events into a prediction that a violent crime is about to happen.  Predictions the “machine” makes are always correct, so this is not a typical predictive analytics case of guessing “who is most likely to be involved in a crime”, or “propensity to commit a crime”, but who actually will be involved.

The TV show doesn’t suggest the “machine” is at all sentient – in fact, it depicts the machine as cold, and as calculating as only a predictive algorithm could be, but with a knack for algorithmic self-preservation.  Its predictions also don’t contain any interpretation, judgment, or even recommended actions.  While this provides sufficient mystery for Reese and Finch to fill a 60-minute TV show every week, it’s also how predictive analytics really work – we can understand the key influencers to a prediction, or even see a decision tree for it, but it still requires a human to interpret the results and do something useful with them.

The Predictive Science Facts

Over the course of the next few weeks (or months depending on interest), we're going to explore the predictive science behind "Person of Interest" in a series of blog posts.   While you are very likely not building your own “machine”, I think we’d all like an analytical system that can predict the future – or at least help us make better decisions.  Join me on the journey to pick apart how the machine works and see how this parallels our real business problems.  

The entries in this blog series are going to look at specific aspects of the predictive process and how the show deals with them, so I might not be able to completely avoid spoilers. If you haven’t seen Person of Interest, this is the time to start – at least season one.


Part 2 has been posted! Part 2: Predictive Science Behind POI - Eyes and Ears Everywhere

Your feedback counts! This blog series is planned to be at least six parts, publishing every week or two. If you find the series interesting and want the posts accelerated, just rate this article below or hit the “like” button.

If you want to be alerted when there are new posts, either subscribe to the Predictive SCN or follow me directly (to receive email notifications).

Fellow SCN friends,


SAP Predictive Analytics provides its Automated Analytics module for creating, applying and exporting  high-quality predictive models with a guided and highly-automated approach.


The first time I launched the application, I looked at the screen and started wondering: “Cool! But now what should I do?”.


It took me a very short time to get up to speed thanks to the advice of a colleague and friend (hello Armelle!) but that help was key  to becoming proficient with the tool.


After conversing a bit, Gaetan Saulnier and I thought that it might be a good idea to lay out the same kind of help in a short and comprehensive format for everyone to utilize.


With SAP Press we have just published a new electronic book, coming in at less than 100 pages, that guides the reader through the most often used parts of the Automated Analytics module: data import and variable encoding, classification, regression, clustering, time series analysis and recommendation.


This E-Bite comes with sample data which you can download and use to follow the provided examples step-by-step.


If you are interested, you can find more details at this link:


Predictive Modeling with Automated Analytics (SAP PRESS) - by SAP PRESS




Only a few days to go before SAP TechEd 2015, held in Las Vegas from October 19-23. This is a great opportunity for you to catch up on our latest predictive innovations, learn new skills by taking a workshop, see what other SAP customers are doing in this domain, or simply network with your peers. To help you get the most out of it, I put together a list of sessions, workshops, and activities you should check out during that week.


My Top 5 List of Predictive Sessions

Tuesday, October 20

BA160 Use SAP Predictive Analytics with SAP Business Warehouse on SAP HANA

2:15-6:15 PM  -  Hands-On Workshop

Learn how you can use SAP Predictive Analytics software in combination with your data from SAP Business Warehouse powered by SAP HANA.

BA111 Become a Data-Driven Business: Exploratory and Prescriptive Analytics

3:15-4:15 PM  -  Lecture

The automation of predictive analytics is the key basis of two new categories of analytics: Exploratory analytics (for showing executives what is really driving business) and prescriptive analytics (to improve operations).


Wednesday, October 21

BA272 Automated Predictive Analytics Integration and Scripting

10:30-12:30 PM  -  Hands-On Workshop

Discover the integration and scripting capabilities offered by the Automated Analytics module from SAP Predictive Analytics software.


BA806 Road Map Q&A: SAP Predictive Analytics

2:00-3:00 PM  -  Roadmap session

Join us for an exclusive introduction and Q&A to our SAP Predictive Analytics strategy and road map.

Thursday, October 22

BA112 Predictive Maintenance and Service: Practical Internet of Things Experience

3:15-4:15 PM  -  Lecture

The SAP Predictive Maintenance and Service solution is in the domain where customers merge large amounts of machine sensor and failure event data with structured ERP data. Learn what more than a dozen SAP customers have done in this domain.

Don't Miss...
SS43 Predictive Demo on The Showfloor

Tuesday, Wednesday and Thursday

A chance to see live demos and chat 1:1 with members of the predictive team.

DG107 Developer's Garage

Tuesday, Wednesday

Bring your laptop and ask questions of our predictive expert, Abdel Dadouche. Abdel will also showcase how you can build an application on SAP HANA Cloud Platform and use cool predictive services.


Finally... 1 Cool Museum to Visit if You Have The Time: The Mob Museum
The Mob Museum provides a world-class, interactive journey through true stories, from the birth of the Mob to today’s headlines. Shadows and whispers. G-Men and Made Men. Cool!

Looking for more?
We have over 47 lectures, hands-on workshops, sessions and demos showcasing predictive. View all the TechEd predictive sessions here.

Got questions about our predictive analytics presence at SAP TechEd? Contact me at @pileroux.


The SAP Predictive Analytics team looks forward to seeing you in Las Vegas!

Any time a (relatively) specialized or obscure topic gets subjected to worldwide hype before finally becoming part of the mainstream, there is an interesting phenomenon that occurs between the “original believers” and the “newcomers”.   I’m old enough to remember a time before the Internet and even Linux, when e-mail had to be sent using a Unix command-line program called (imaginatively enough) “mail” and could only be sent to people at other universities.


When the Internet became more and more available, some of us “computer geeks” were very proud that we used e-mail long before people even figured out whether the word needed a hyphen or not.  By the time residential cable and DSL modems became ubiquitous, it was a badge of honor to be running your own e-mail and Web servers at home because back then, you really had to know what you were doing (for the record, I run both in my own on-premise private Cloud to this day – why? Because I *can*).


Watching data scientists interact with regular business users reminds me of my own evolution from a “computer geek” into a “software engineer” (which was much cooler and definitely paid more).  The tipping point, though, was when my background as a computer geek no longer qualified me to lord my knowledge over those less enlightened than myself.  It was not that my experience became less valid; it simply became less relevant.

Technology has progressed far enough that making it easier to use and accessible to everyone no longer sacrifices cost or performance.  In the case of e-mail servers, it’s actually now cheaper to outsource the whole thing in the cloud than to run your own.  That started to make my experience seem really expensive, and in some cases, unnecessary.



Predictive Analytics Has Hit Primetime


Let’s face it: “Predictive Analytics” is the evolution of that much more boring-sounding topic of “Statistics”.  However, now that predictive is “cool”, I’m seeing the same phenomenon: many of those who previously had a math or statistics background became “data scientists” because they have a deep understanding of what makes predictive tick.  Today, data scientists are in an enviable (and financially lucrative) position: statistics will never get easier, and the fundamentals of mathematics are unlikely to change in... well... ever.


The truth is that the job of a (good) data scientist is not that easy, and it can be infinitely boring.  Significant time is spent cleaning, preparing, and deriving data before it even makes sense to start the predictive modelling process.  Creating the models themselves requires a delicate understanding of the (now augmented) dataset and which algorithms are most applicable to it.  This is a highly iterative process which requires an understanding of statistics to decide how to refine the models for the best possible accuracy.  If this sounds repetitive, you would be right (and if you don’t think this is repetitive, you are likely a data scientist yourself).



Predictive Automation to the Rescue


The good news (for us non-data scientists) is that technology is poised to blow the fortified towers of data science wide open just as easily as it devalued my computer science degree.   The industry’s focus on predictive analytics has shifted from pure performance and efficiency to ease of use and accessibility for the larger business audience.   Encoding self-tuning algorithms into an autonomous “data scientist in a box” application has always been the Holy Grail of predictive analytics, but how realistic is it?


Ironically predictive analytics software can be made smarter specifically because the rules of mathematics cannot change.  Many of the previous attempts have been to create an “uber-data scientist” application that can automatically pick the best algorithm for a given problem.   This approach has some merits: there are a number of algorithms that apply specifically to classification problems so you could simply run all of them on a target dataset and pick the best one (and in fact, this is what some products do).


However a data scientist will tell you that one of the dangers is that this can yield drastically different results between runs.   The algorithm chosen this week may not be chosen as the best one next week and therefore the results of the two weeks are not directly comparable.   Choosing a specific algorithm from week to week may not be the best idea either – it’s possible the data for the first week’s run would yield a different algorithmic decision than the next four weeks.  In the end, you need a data scientist to tell you which of the algorithms you could stick with (defeating the purpose of a “data scientist-in-a-box” approach).



The solution to this is to have this “auto-selection” intelligence built into the algorithm itself rather than have it sit above.  That means the application can pick this “uber-algorithm” which will then have the intelligence to handle any type of data and always make the right modelling choices.   In practice, this super-smart algorithm would do many of the same steps a data scientist would.  It would use statistical analysis to determine the optimum analysis parameters and then iteratively create many candidate models before finally coming down to the winner.  The difference is that a computer can create hundreds or even thousands of models before choosing the most optimal one.


SAP Predictive Analytics has an automated mode where you basically pick the type of problem to solve (classification, association, etc.) and the software handles the rest, putting predictive analytics in the hands of more users (for a more detailed discussion, see How does Automated Analytics do it? The magic behind creating predictive models automatically).



Predictive Analytics Is Not a “Zero” or “One”


I am continually surprised by the number of customers who think you either are or aren’t a data scientist.  Digging into this a little further, I often find a data scientist has created this binary distinction to differentiate their skills from the “regulars” (or, in Harry Potter terms, “muggles”) – who readily agree they don’t want to be anywhere near a mathematical equation, much less an actual algorithm.


However, the field of data science is really a spectrum that blends really quickly into the analytics/business intelligence world.  If you accept that predictive analytics “is an algorithmic analysis of past data to find patterns that can be applied to new data to improve a future outcome”, I would argue that business intelligence is "a visual and calculation oriented analysis of past data to make better decisions about the future."

That means every business intelligence user can benefit from automated predictive analytics to do what they are currently doing – better.


The sooner you can get your organization out of the “data scientist or not” mentality, the quicker you can make everyone more effective at understanding and solving business problems.   If you get stuck on this, look for the strongest opponents and likely there’s a non-technical reason for their resistance to “opening the predictive gates”.



Don’t Worry Data Scientists, We Will Still Need You


Does predictive automation completely replace data scientists?  Definitely not – a human can understand the semantics in the data, derive new data fields based on their domain knowledge, and can create far more complicated models without so many iterations.  However for those that do not have the knowledge or skills, the automated way gets you pretty darn close - and a whole lot faster.


The massive influx of business users with access to predictive analytics reduces the burden on (typically) overloaded data scientists by freeing them from some of the simpler problems that can be handled by users directly, letting them work on the more sophisticated predictive problems.  Sounds like a win-win, doesn’t it?


Interestingly, many data scientists can also benefit from the use of automated predictive technology to better understand their data and create baseline models *before* they dive into creating their own models by hand.  By having a computer do the initial analysis, a data scientist can save hours to days of data profiling before getting down to the core predictive modelling they are being paid those big bucks for.  In some cases, the automatically generated models may be solid enough to solve a problem without requiring any manual modelling.



In The Future, It Won’t Matter


The field of predictive analytics is maturing at its fastest rate ever, and the move towards more ease of use and simplicity will eventually reach a plateau, just as standing up a full Hadoop cluster in the cloud can be done in under ten minutes today.   The focus will shift from “which is the best algorithm to use when?” to “how can I use these predictive results to improve the business?”.

The number of predictive models needed will explode as companies explore micro-segmentation and the (potentially) small gain of a single manually crafted model will give way to the need to create hundreds of models per day.


So to you data scientists out there: We will always need your skills, your experience, and your wisdom.


But take it from someone who has been through this cycle and had his computer geekdom commoditized – automation opens up predictive analytics to everybody, so just remember: “It’s not all about you”.



You can try SAP Predictive Analytics and all of its automated predictive goodness for free by downloading it at http://www.sap.com/trypredictive


Most data scientists and statisticians agree that predictive modeling is both art and science, yet relatively little air time is given to describing the art. This post describes one piece of the art of modeling called feature engineering, which expands the number of variables you have to build a model.  I offer six ways to implement feature engineering and provide examples of each. Using methods like these is important because additional relevant variables increase model accuracy, which makes feature engineering an essential part of the modeling process. The full white paper may be downloaded at Feature Engineering Tips for Data Scientists.


What Is Feature Engineering?

A predictive model is a formula or method that transforms a list of input fields or variables (x1, x2, …, xn) into some output of interest (y). Feature engineering is simply the thoughtful creation of new input fields (z1, z2, …, zm) from existing input data (x). Thoughtful is the key word here. The newly created inputs must have some relevance to the model output and generally come from knowledge of the domain (such as marketing, sales, climatology, and the like). The more a data scientist interacts with the domain expert, the better the feature engineering process.

Take, for example, the case of modeling the likelihood of rain given a set of daily inputs: temperature, humidity, wind speed, and percentage of cloud cover. We could create a new binary input variable called “overcast” where the value equals “no” or 0 whenever the percentage of cloud cover is less than 25% and equals “yes” or 1 otherwise. Of course, domain knowledge is required to define the appropriate cutoff percentage and is critical to the end result.
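The “overcast” derivation above can be sketched in a few lines of Python (the field names, sample values, and the 25% cutoff are illustrative assumptions for the example, not a prescribed implementation):

```python
# Hypothetical daily weather observations.
days = [
    {"temperature": 18.2, "humidity": 0.82, "cloud_cover_pct": 10},
    {"temperature": 11.5, "humidity": 0.91, "cloud_cover_pct": 65},
    {"temperature": 24.0, "humidity": 0.40, "cloud_cover_pct": 30},
]

# Derive the binary "overcast" input: 0 when cloud cover is below 25%,
# 1 otherwise. The 25% cutoff is the domain-knowledge choice from the text.
for day in days:
    day["overcast"] = 1 if day["cloud_cover_pct"] >= 25 else 0

print([day["overcast"] for day in days])  # [0, 1, 1]
```

The new field can then be fed to the model alongside the original inputs; changing the cutoff is a one-line edit, which makes it easy to test alternatives with the domain expert.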

The more thoughtful inputs you have, the better the accuracy of your model. This is true whether you are building logistic, generalized linear, or machine learning models.


Six Tips for Better Feature Engineering

Tip 1: Think about inputs you can create by rolling up existing data fields to a higher/broader level or category. As an example, a person’s title can be categorized into strategic or tactical. Those with titles of “VP” and above can be coded as strategic. Those with titles “Director” and below become tactical. Strategic contacts are those that make high-level budgeting and strategic decisions for a company. Tactical are those in the trenches doing day-to-day work.  Other roll-up examples include:

  • Collating several industries into a higher-level industry: Collate oil and gas companies with utility companies, for instance, and call it the energy industry, or fold high tech and telecommunications industries into a single area called “technology.”
  • Defining “large” companies as those that make $1 billion or more and “small” companies as those that make less than $1 billion.
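The title roll-up from Tip 1 might look like this in Python (the exact set of “VP and above” titles is an illustrative assumption; adapt it to your own data):

```python
# Titles considered "VP and above" for this example (an assumed list).
STRATEGIC_TITLES = {"VP", "SVP", "EVP", "CFO", "CEO", "President"}

def role_level(title):
    """Roll a raw job title up into the strategic/tactical category."""
    return "strategic" if title in STRATEGIC_TITLES else "tactical"

titles = ["VP", "Director", "CEO", "Manager"]
print([role_level(t) for t in titles])
# ['strategic', 'tactical', 'strategic', 'tactical']
```

The same pattern (a lookup set or mapping plus a small function) works for the industry and company-size roll-ups in the bullets above.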


Tip 2: Think about ways to drill down into more detail in a single field. As an example, a contact within a company may respond to marketing campaigns, and you may have information about his or her number of responses. Drilling down, we can ask how many of these responses occurred in the past two weeks, one to three months, or more than six months in the past. This creates three additional binary (yes=1/no=0) data fields for a model. Other drill-down examples include:

  • Cadence: Number of days between consecutive marketing responses by a contact: 1–7, 8–14, 15–21, 22+
  • Multiple responses on same day flag (multiple responses = 1, otherwise =0)


Tip 3: Split data into separate categories, also called bins. For example, annual revenue for companies in your database may range from $50 million (M) to over $1 billion (B). Split the revenue into sequential bins: $50–$200M, $201–$500M, $501M–$1B, and $1B+. Whenever a company falls within a revenue bin it receives a one; otherwise the value is zero. There are now four new data fields created from the annual revenue field.  Other examples are:

  • Number of marketing responses by contact: 1–5, 6–10, 11+
  • Number of employees in company: 1–100, 101–500, 501–1,000, 1,001–5,000, 5,000+
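A minimal Python sketch of the revenue binning in Tip 3 (the bin edges follow the example in the text; the field names are invented for illustration):

```python
# Sequential revenue bins in $ millions: (field name, low edge, high edge).
BINS = [
    ("rev_50_200M",  50,   200),
    ("rev_201_500M", 201,  500),
    ("rev_501M_1B",  501,  1000),
    ("rev_1B_plus",  1001, float("inf")),
]

def bin_revenue(revenue_m):
    """Turn one company's annual revenue into four 0/1 data fields."""
    return {name: 1 if lo <= revenue_m <= hi else 0 for name, lo, hi in BINS}

print(bin_revenue(350))
# {'rev_50_200M': 0, 'rev_201_500M': 1, 'rev_501M_1B': 0, 'rev_1B_plus': 0}
```

Each company row gains four new columns, exactly one of which is set to one; the response-count and employee-count bins in the bullets above follow the same recipe with different edges.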


Tip 4: Think about ways to combine existing data fields into new ones. As an example, you may want to create a flag (0/1) that identifies whether someone is a VP or higher and has more than 10 years of experience. Other examples of combining fields include:

  • Title of director or below and in a company with less than 500 employees
  • Public company and located in the midwestern United States

You can even multiply, divide, add, or subtract one data field by another to create a new input.
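Tip 4’s combined flag (VP or above with more than 10 years of experience) reduces to a boolean conjunction in code; the records and field names here are hypothetical:

```python
# Hypothetical contact records with two existing input fields.
contacts = [
    {"name": "A", "is_vp_or_above": 1, "years_experience": 12},
    {"name": "B", "is_vp_or_above": 1, "years_experience": 8},
    {"name": "C", "is_vp_or_above": 0, "years_experience": 20},
]

# Combine the two fields into one new 0/1 flag:
# VP or above AND more than 10 years of experience.
for c in contacts:
    c["senior_vp_flag"] = int(c["is_vp_or_above"] == 1 and c["years_experience"] > 10)

print([c["senior_vp_flag"] for c in contacts])  # [1, 0, 0]
```

Arithmetic combinations (ratios, differences, products of two fields) are built the same way, with the new value assigned to an added field per record.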


Tip 5: Don’t reinvent the wheel – use variables that others have already fashioned.

Tip 6: Think about the problem at hand and be creative. Don’t worry about creating too many variables at first, just let the brainstorming flow. Feature selection methods are available to deal with a large input list; see the excellent description in Matthew Shardlow’s “An Analysis of Feature Selection Techniques.” Be cautious, however, of creating too many features if you have a small amount of data to fit. In that case you may overfit the data, which can lead to spurious results.


Hundreds, even thousands of new variables can be created using the simple techniques described here. The key is to develop thoughtful additional variables that seem relevant to the target or dependent variable. So go ahead, be creative, have fun, and enjoy the process.

