
SAP Predictive Analytics


Over the years, most organizations have focused their attention on the effectiveness and efficiency of separate planning functions. As a new way of doing business, a growing number of organizations have begun to realize the strategic importance of driver-based planning powered by SAP Business Planning and Consolidation (BPC) on SAP HANA. SAP BusinessObjects Predictive Analytics (PA) supports an algorithmic approach to forecasting future values in a time series while simultaneously identifying the underlying business drivers, using state-of-the-art machine learning algorithms.

In an effort to help organizations capture the synergy between PA and BPC, this blog will provide information on how to make better data-driven planning decisions leading to increased operational efficiency by describing how:

  • PA forecasts future values and identifies the business drivers
  • SAP Design Studio application enables the business user to simulate forecasting various scenarios running PA’s Machine Learning algorithms in the background
  • Predictive forecasts can be written back to BPC on SAP HANA

The current blog explains the conceptual process and the second part will focus on the technical implementation.

First, let us explore what predictive forecasting is and how it combines statistical analysis and key trends as they happen to predict results in the future.


The predictive insights enable companies to maximize profitability and operational efficiency when combined with BPC.

Over the years, enterprises have collected data from disparate systems and have been aggregating financial planning information with tools like BPC, while building up drivers offline. However, most of the time these drivers are estimated to the best of the planners' knowledge rather than derived from patterns found in the data.

Driver-based plans are not effective unless you start with accurate inputs, as planning accuracy depends heavily on the drivers and inputs used in the planning process. This gives life to the idea of adding predictive power to financial forecasting: PA can predict more accurate inputs for planning systems.
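As a toy illustration of the idea (this is not PA's actual algorithm, which is automated machine learning; the numbers are invented), a single business driver can be regressed against a planning outcome and then used to produce a forecast input:

```python
# Minimal sketch: fit a one-driver linear model by least squares and
# use it to forecast a planning input from a planned driver value.
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

# historical driver (e.g. sales volume) vs. outcome (e.g. revenue)
volume = [100, 120, 140, 160]
revenue = [205, 245, 285, 325]
a, b = fit_line(volume, revenue)
forecast = a + b * 180  # planned volume for the next period
print(round(forecast, 1))  # → 365.0
```

A real PA model would weigh many drivers at once and report each one's contribution, but the principle is the same: the forecast input is derived from driver data rather than estimated offline.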

So far we have established that combining PA and BPC increases planning accuracy. In the real world, inputs such as sales volumes and raw material prices have a significant impact on forecasted prices and profitability, and they change constantly over time. Financial Analysts therefore need the ability to respond to changing market dynamics.

Using SAP Design Studio, enterprises can build effective interactive analytical applications targeted at these Financial Analysts. A solution was built by integrating SAP Design Studio, PA, and BPC to provide the ability to simulate the drivers, predict the forecast using PA’s Automated Analytics time series forecasting, and save these forecasts to BPC.

You may watch the recorded webinar by Jon Essig from SimpleFi Solutions, who elaborates a scenario-based planning example of cotton price forecasting with multiple drivers. He also explains how a Financial Analyst can simulate the drivers, forecast using PA, and eventually persist the predictive forecast to BPC.

To realize the benefits of PA with BPC, one does not need to possess data scientist skills. Financial Analysts can simply build necessary PA models to identify the drivers, do forecast planning and hand over the baton to BPC.

Watch for the next blog, which explains how this solution was implemented technically.

As part of a broad announcement made at SAPPHIRE NOW 2016, SAP announced a range of new features and capabilities in its analytics solutions portfolio. Because predictive capabilities play an important role in the portfolio, I thought I’d take this opportunity to share the details of our innovations in both SAP BusinessObjects Cloud and SAP BusinessObjects Predictive Analytics.


Innovations in SAP BusinessObjects Cloud

Predictive analytics capabilities have been added to the SAP BusinessObjects Cloud offering. Business users can use an intuitive graphical user interface to investigate business scenarios by leveraging powerful built-in algorithmic models. For example, users can perform financial projections with time series forecasts, automatically identify key influencers of operational performance, and determine factors impacting employee performance with guided machine discovery.

Learn more about our predictive capabilities in SAP BusinessObjects Cloud.


Innovations in SAP BusinessObjects Predictive Analytics

Predictive analytics features that aim to help analysts easily deliver predictive insights across an enterprise’s business processes and applications are planned for availability in the near term.

Planned innovations include:

  • Automated predictive analysis of Big Data with native Spark modeling in Hadoop environments
  • Enhancements for SAP HANA including in-database social network analysis and embedding expert model chains
  • A new simplified user interface for the predictive factory and automated generation of segmented forecast models
  • Integration of third-party tools and external processes into predictive factory workflows
  • The ability to create and manage customized models that detect complex fraud patterns for the SAP Fraud Management analytic application

Learn more about what SAP BusinessObjects Predictive Analytics has in store.

Upcoming Release of SAP BusinessObjects Predictive Analytics

Watch the video about our upcoming release of SAP BusinessObjects Predictive Analytics for more information.



Don't know how SAP BusinessObjects Predictive Analytics works? See it in action by watching our Nuts and Bolts series of short videos on YouTube.


Blog originally appeared on Analytics from SAP and has been republished with permission.

First, I would like to send great thanks to Professor Paul Hawking, who gave us – the students at Victoria University – the opportunity to attend the CompleteStream KeyInsights Conference 2016, held in Melbourne on 18–19 April 2016. Among the many marvelous presentations from expert SAP practitioners, I was particularly impressed by the brilliant demonstration of SAP Predictive Analytics presented by Charles Gadalla from SAP Canada.

[Figure: the value of BI tools]

Source: Victoria University course syllabus on Business Intelligence 2015.

Generally, “Business Intelligence” (BI) is the dream goal for any organization setting up its information systems, especially ERP systems. There are a variety of ways to get there, yet exploiting SAP solutions can be considered one of the most practical approaches for an organization to achieve effective BI. Built on recent leading-edge advances in virtualization, storage, networking, and in-memory technologies, SAP S/4HANA was introduced as a revolution in integrated systems and BI, notably through the SAP Predictive Analytics application, which includes the Automated Analytics and Expert Analytics tools.


With just a small investment in specific sensor-tag devices from well-known manufacturers such as Texas Instruments, an organization can gain a great number of automation and prediction advantages. Used properly, these sensors make an organization's office buildings, factory machinery, and other fixed assets come alive: they connect to the SAP systems to communicate their working conditions and environment in terms of light, magnetic field, orientation, temperature, vibration, and so on. This uninterrupted stream of information can then be recorded in a database for later monitoring, analysis, and prediction.


Additionally, SAP S/4HANA ships with the Predictive Analysis Library, a bundle of algorithms that allows events to be triggered flexibly under configurable conditions. For instance, the system can automatically send an instant message to notify the relevant people if a machine is overheating or nearing the end of its usage time. This kind of predictive information is essential for managers to take crucial maintenance actions before any failure or damage occurs.
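As a trivial sketch of such a triggered event (the library itself offers far richer algorithms; the threshold and readings here are made up):

```python
# Hypothetical threshold-based trigger: flag readings above a limit
# so a notification could be sent before the machine fails.
def overheat_alerts(readings, limit=90.0):
    """Return indices of readings that exceed the limit."""
    return [i for i, temp in enumerate(readings) if temp > limit]

temps = [72.0, 88.5, 93.2, 95.1, 80.0]  # simulated sensor feed (degrees C)
print(overheat_alerts(temps))  # → [2, 3]
```

In a real deployment the alert would of course be a message to the responsible maintenance staff rather than a printed list of indices.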


The era of the Internet of Things (IoT) is just around the corner, and SAP S/4HANA embraces this trend with its powerful in-memory technology and Predictive Analytics tooling, which allow us to process and forecast raw data in real time. Strategic and risk management become easier and more routine than ever.


Here are some useful links for further information about the IoT and SAP S/4HANA in reality:

  1. https://www.youtube.com/watch?v=dAnjGhLnFhs
  2. https://www.youtube.com/watch?v=UhMKG761l78
  3. https://youtu.be/8NbP07OEGsQ
  4. https://www.youtube.com/watch?v=EiIInSB8pFk&list=PLkzo92owKnVxzjoxwJdaa400E_UqkzE8J
  5. http://scn.sap.com/community/developer-center/cloud-platform/blog/2015/10/26/the-cheapest-and-simplest-way-to-send-sensor-data-to-the-hana-cloud-platform

Reimagining the Predictive Experience


The Advanced Analytics mission is to help companies improve their bottom line by making their business processes and apps better with predictive techniques. Here are the key concepts of our approach...



Embedded in Business Processes and Apps

The real value of predictive analytics is only achieved by embedding it into the apps and tools business users work with daily. The integration must be done in such a way that (1) business applications offer specific workflows with off-the-shelf predictive functions, and (2) data scientists have the option of replacing those functions with proprietary, more advanced ones.


The Predictive Factory


No coding, just configuration!

  • Full predictive lifecycle – from data preparation through model building/rebuilding, model evaluation, model deployment and monitoring, and versioning
  • Modeling automation – ability to automate models for multiple segments by creating the first one and letting Predictive Analytics complete the task
  • Inclusive – embraces a collaborative approach among non-experts and experts


Low and High Touch User Experiences

  • Low touch – wizard-based, automated user interface
  • High touch – visual data pipeline framework supporting open languages (such as R, Scala, etc.)
  • Interoperability between both modes, so results – scores, predictions, forecasts, etc. – can be shared easily among coworkers and teams

Predictive Intellectual Property


  • Generic proprietary algorithms providing very good results in most data situations with no parameter tuning
  • Niche proprietary algorithms providing unmatched results in unique business contexts / data situations
  • Niche predictive IP to be made available via a marketplace open to partners and ISVs



Cloud

  • All-in-one Cloud for Analytics, initially targeting analysts and business users with Exploratory Analytics
  • Model consumption for pre-packaged business scenarios
  • HCP Predictive Services targeting partners and integrators
  • Framework for model consumption in on-demand business applications

On Premise


Will be compatible and interoperable with previous SAP investments:

  • Analytics platform (known as BI Platform)
  • SAP HANA computing platform – especially through XS Advanced and SAP HANA Streaming


Big Data

  • Data Preparation compatible with ultra-wide data sets
  • Distribution of algorithms on scale-out architectures
  • Integration into Big Data streaming environments

In this blog we want to introduce "The Nuts & Bolts of SAP Predictive Analytics" YouTube Video Series. In this series Pierre Leroux, Priti Mulchandani and I have created a number of short 2-3 minute videos to highlight some important predictive use cases.


  1. Data Preparation: In the first installment we walk through how to prepare your data using SAP HANA as a data source. In this short video we highlight some of the key features in SAP Predictive Analytics - Data Manager that enable users to join, aggregate, expand, and derive attributes from data sets for use in predictive modeling.
  2. Automated Analytics: We then focus on SAP Predictive Analytics - Automated Analytics and demonstrate how the wizard-like approach enables users who may not have a degree in Maths to train, build, and test a predictive model.
  3. Real Time with SAP HANA: Next we watch how SAP Predictive Analytics - Automated & Expert Analytics helps you train and apply predictive models in SAP HANA and also easily embed predictive models into your applications and score your data in real time.
  4. Native Spark Modeling: In our latest installment we learn how SAP Predictive Analytics helps you to delegate the predictive modeling processing to Spark on Hadoop and avoid time-consuming data transfer between the predictive engine and the data source.



We will continue to add useful bite-size predictive videos using the SAP Predictive Analytics Suite. Please check back in to SCN or to the Analytics Solutions from SAP YouTube channel for more videos: The Nuts & Bolts of SAP Predictive Analytics - YouTube

SAP Predictive Analytics 2.5 has been delivered on SAP Support Portal on March 17, 2016.

Start with the release announcement: SAP Predictive Analytics 2.5 Now Generally Available!.



If you are not yet a SAP Predictive Analytics user, you can download your own 30-day trial of SAP Predictive Analytics 2.5, desktop version here.

Our product managers are blogging about this release:

Here is a curated collection of useful links for SAP Predictive Analytics 2.5:

Links for APL 2.5:

We are currently preparing the next edition of our newsletter. Register now to know more!


Enjoy SAP Predictive Analytics 2.5, ask questions, send us feedback, start discussions in our SCN Predictive Analytics user community.

We are looking forward to hearing from you!

March is an exciting time for people all around the world – for many, it means winter is melting into spring and Easter (along with the requisite holidays, chocolate eggs, and of course bunnies) are just around the corner.  For our hardworking SAP Predictive Analytics team, it is also a time of celebration as we have just announced the general availability of SAP Predictive Analytics 2.5!


While every release is very special to us, this one is particularly sweet because it introduces some features that our team has been working on for quite some time. In addition to many product enhancements, optimizations, and new features (see the full “What’s New”), I’d like to highlight a few of the real “biggies” for SAP PA 2.5:



Native Spark Modeling


The thing about Big Data is that… well… it’s BIG… Zillions of rows are great, but where Big Data becomes really interesting is when the data become really wide (i.e. a large number of columns).  How would you end up with thousands of columns? Easy.  Take for example, the complexities of an airplane jet engine – and how many sensors it has to measure everything for engine bearings, temperature, and so on.  Now imagine those tens of thousands of sensors (each represented by a single column) being read every five seconds for the duration of the flight – multiplied by the number of engines.  That’s terabytes of wide data per hour.


Extracting that amount of data for analysis is simply not feasible: many databases can’t even handle that number of columns, and even if they could, the time required for the analysis might render the results meaningless. The Native Spark Modeling feature in SAP Predictive Analytics 2.5 (sometimes called “In-Database Modeling”, or IDBM, in the interface) delegates the predictive modeling processing to Spark on Hadoop, avoiding data transfer between the predictive engine and the data source.


Native Spark Modeling provides the following benefits when analyzing data using Spark on Hadoop:


  • Processing closer to the data source - reducing expensive I/O.
  • Faster response times – training models in less time to enable you to do more.
  • Higher scalability – create more models and use wider datasets than ever before.
  • Better CPU utilization – reduce costs and increase operational efficiency.
  • Easier access to Big Data – now business analysts can work with Hadoop without Spark coding skills.


You can find out more from Priti, one of our product managers, here: Big Data : Native Spark Modeling in SAP Predictive Analytics 2.5



Rare Event Stability


As data volumes increase, we have an ever greater ability to find ever smaller patterns in the data. However, some events happen so infrequently that it is hard to determine whether a genuine predictive pattern exists or whether there is only coincidental “noise” in the “signal”. Take a jet engine again: thankfully, jet engines fail very infrequently, but this presents a huge problem in predictive maintenance scenarios, because what we are trying to do is find a pattern in the data that could have “predicted” the failure so we can try to prevent the next one. The consequence of finding a pattern in random data instead of a true set of failure factors is potentially a very expensive and unnecessary engine servicing that could not only cost millions, but could also ground the plane it is mounted on.


SAP Predictive Analytics 2.5 has an improved ability to help in these “rare event” cases by generating a predictive model only when there is sufficient indication that the model can be trusted.  If the system determines the generated model cannot be trusted with enough confidence, it will alert you rather than providing a potentially inadequate model.  
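A quick back-of-envelope example (the failure rate here is invented) shows why rare events need this extra scrutiny – a model that simply never predicts failure looks almost perfect by accuracy alone:

```python
# With 1 failure in 10,000 observations, the trivial "never fails" model
# is 99.99% accurate yet catches zero failures -- so the tool must judge
# whether a learned pattern is trustworthy rather than rely on raw accuracy.
n_observations, n_failures = 10_000, 1
trivial_accuracy = (n_observations - n_failures) / n_observations
failures_caught = 0  # the trivial model never predicts a failure
print(trivial_accuracy, failures_caught)  # → 0.9999 0
```

This is why a confidence check, rather than a headline accuracy figure, is the right gatekeeper for rare-event models.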



IP Protection for Partner Extensions in R


One of the more attractive aspects of the open source language “R” is the ability to easily share and obtain predictive algorithms and libraries. While the exact number changes all the time, there are currently over 6,000 R libraries freely available. Why so many? Data scientists sometimes create their own algorithms from scratch or modify existing ones to solve specific problems of an industry, a target data source, or even a single customer. In these cases, the creator may not want to share their work, as it may represent a competitive advantage over other companies, or be part of intellectual property they wish to protect from being easily viewed or edited (unless they are paid for it!).


SAP Predictive Analytics 2.5 now includes a feature to create R extensions that are “encrypted in transit”, meaning they can be transported and used by others without disclosing the recipe to their secret sauce. Customers and partners can now invest in their R extensions while preserving their intellectual property. The SAP Analytics Extensions Directory also allows our partners to distribute and even monetize their extensions through an SAP-managed portal that can be launched directly from within the SAP Predictive Analytics interface.



More to Come (Soon!)


As exciting as March is for our Product Team, we’re driving really hard towards Q2 because we’ve got some great things lined up for SAPPHIRE NOW that will be in Orlando between May 17-19, 2016.   At the conference, we are also planning a number of ASUG Educational Sessions that will not only give you roadmap information, demo scenarios, and deep dive details, but also some exciting news about where the future of SAP Predictive Analytics is headed – so make sure you attend if you can!


I would also recommend that you sign up for our SAP Predictive Analytics Newsletter to keep up with the latest news, and as always, keep an eye on the Predictive SCN!

Apache Spark is the most popular Apache open-source project to date, and it has become a catalyst for the adoption of big data infrastructure. Spark uses in-memory technology and offers high performance for complex computation processes such as machine learning, streaming analytics, and graph processing.

Providing support for Hadoop and Spark in SAP Predictive Analytics is crucial to serve our customer needs because:

  • Data is getting bigger and wider
  • Performance and speed expectations are rising
  • Customers are looking for optimized processing with proper utilization of their Hadoop resources
  • Customers want to leverage their existing workforce to perform predictive analysis of their big data assets, as SAP Predictive Analytics is a business-friendly tool that does not demand data science or big data developer skills


Native Spark Modeling

Native Spark Modeling executes the automated predictive models directly on Hadoop using Spark engine.

Before Native Spark Modeling, your predictive modeling engine was essentially the SAP Predictive Analytics desktop or SAP Predictive Analytics server. Now, with Native Spark Modeling, data-intensive tasks are delegated to Spark, and the data transfer between the SAP Predictive Analytics client and the data source platform (in this case the Hadoop platform) is avoided.

The left side of the diagram shows the existing process without Spark, in which a huge data transfer took place at a heavy performance cost for big data; the right side shows the same process using Native Spark Modeling to execute the machine learning computation close to the data:




User Flow

  1. After installing the SAP Predictive Analytics tool, users will find a “SparkConnector” folder that contains the functionality packaged as a JAR. A user, typically an administrator, needs to define the Spark and YARN connection properties in the configuration files for each ODBC DSN intended for Native Spark Modeling. Refer to the Native Spark Modeling configuration section in the SAP Predictive Analytics documentation.
  2. They load up SAP Predictive Analytics
  3. They open up Modeler
  4. They make sure the 'Native Spark Modeling' flag in the Model Training Delegation option is switched ON under the "File->Preferences" menu


  5. They choose Classification/Regression. (As of PA 2.5, only classification models are supported on Spark; regression will follow soon.)
  6. They choose an existing Hive table using the 'Use Database Table' option, or an Analytical Data Set based on Hive tables using the 'Use Data Manager' option. SAP Automated Analytics proprietary algorithms are built to scale across any amount of information – both long and wide. The wider the dataset, the stronger the predictive power! If you are wondering how wider datasets arise, consider an e-commerce example where weblogs are analyzed to understand the trends behind purchases. As you build aggregates in Data Manager, even more columns are added for analysis.
  7. They load the description of the dataset from Hive or from a file or choose “Analyze”
  8. They choose a target field from the loaded dataset to run the training against e.g. Credit_card_Exist(=Yes/No)
  9. They generate the model, which is now executed on the Spark engine. Notice the progress bar, which now shows progress messages for Spark.
  10. They can observe the ongoing Spark jobs from the application web UI, which can be opened in the browser at http://localhost:4040/
  11. Once the model is generated, they have the same choices as with a traditional database, e.g. the Smart Variable Contribution report in Automated Analytics.
  12. Finally, they can manage the model lifecycle using the SAP Predictive Analytics Model Manager component. For example, if a user wants to retrain the model at frequent intervals, they can schedule the task from Model Manager for their data in Hadoop using the model that was trained on Spark. The retraining in this case is also processed on Spark.


Data will continue to grow, and enterprises will continue to shift more and more of it onto Hadoop platforms; they can now begin to apply predictive solutions on top to get meaningful insights.

Companies have several options for predictive analysis, such as SAP Predictive Analytics or open-source machine learning libraries. SAP Predictive Analytics, however, makes a difference with its full-stack support on Hadoop: from data manipulation on Big Data, to model training on Spark, and finally to in-database apply/retrain for production-ready Big Data.


In conclusion, Native Spark Modeling is a key foundation of SAP’s predictive Big Data architecture, enabling performance gains of 7–10 times and more for big data. It is also prepared to scale as your data and infrastructure grow. With these performance and scalability advantages, business analysts can build more predictive models and fail early, without having to worry about Big Data technology.


This blog describes issues that could occur when installing, configuring or running Native Spark Modeling. It explains the root causes of those issues and if possible provides solutions or workarounds.


The official SAP Predictive Analytics documentation including the "Connecting to your Database Management System on Windows" and "Connecting to your Database Management System on Unix" Guides can be found at SAP Predictive Analytics 2.5 – SAP Help Portal Page .




What is Native Spark Modeling?


Native Spark Modeling builds the Automated predictive models by leveraging the combined data store and processing power of Apache Spark and Hadoop.


Native Spark Modeling was introduced in Predictive Analytics 2.5. The concept is also sometimes called in-database processing or modeling. Note that both Data Manager and Model Apply (scoring) already support in-database functionality. For more details on Native Spark Modeling, have a look at:

Big Data : Native Spark Modeling in SAP Predictive Analytics 2.5









Native Spark Modeling does not start. (When configured correctly, you should see the "Negotiating resource allocation with YARN" progress message in the Desktop client.)



Check that the Native Spark Modeling checkbox is enabled in the preferences (under Preferences -> Model Training Delegation).


Check that you have at least the minimum properties in the configuration files: the hadoopConfigDir and hadoopUserName entries in the SparkConnections.ini file for the Hive DSN, and the Hadoop client XML files in the folder referenced by the hadoopConfigDir property.


The SparkConnections.ini file has limited support for full path names containing spaces on Windows. Prefer relative paths instead. For example, for an ODBC DSN called MY_HIVE_DSN, use a relative path instead of the full path for the hadoopConfigDir parameter.
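As a purely hypothetical sketch of such an entry (the exact key names and layout of SparkConnections.ini are documented in the “Connecting to your Database Management System” guides; the DSN name and path below are invented):

```ini
; Hypothetical SparkConnections.ini fragment for the DSN MY_HIVE_DSN.
; A relative path (resolved from the installation folder) avoids the
; known issue with spaces in full Windows paths.
MY_HIVE_DSN.hadoopConfigDir=SparkConnector\hadoopConfig\MY_HIVE_DSN
MY_HIVE_DSN.hadoopUserName=hdfs
```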





Error message includes "Connection specific Hadoop config folder doesn't exist".


Check the SparkConnections.ini file contains a valid path to the configuration folder.




Error message contains "For Input String".  For example "Unexpected Java internal error...For Input String "5s"".


Check the hive-site.xml file for the DSN and remove the property that is causing the issue (search for the string quoted in the error message).




Error message "JNI doesn't find class".


This can be a JNI (Java Native Interface) classpath issue. Restarting the Desktop client normally fixes it. Otherwise, check that the classpath settings in the KJWizardJni.ini file refer to the correct jar files.



Monitoring and Logging


The logs in native_spark_log.log can be limited.


For Desktop: to increase the amount of information in the native_spark_log.log file, change the KJWizardJni.ini configuration file to use the full path for the log4j configuration file (the log4j.properties file under the SparkConnector directory; the default value is a relative path). After that, the Spark logs will also be included.



The logging level can also be increased by modifying the log4j.properties file (under the SparkConnector directory); for example, change the log4j.rootLogger level from INFO to DEBUG to show more logging information for Spark.
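For illustration, a minimal log4j.properties along those lines might look as follows (the appender names in the shipped file may differ; only the rootLogger level change is the point here):

```properties
# Illustrative log4j.properties -- raising the level from INFO to DEBUG
log4j.rootLogger=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
```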



Also refer to the logs on the Hadoop cluster for additional logging and troubleshooting information.

For example, use the YARN Resource Manager web UI to monitor the Spark and Hive logs to help troubleshoot Hadoop-specific issues. The Resource Manager web UI URL is normally http://<resource-manager-host>:8088/



Support for Multiple Spark versions


There is a restriction that only one Spark version (jar file) can be used at a time with Native Spark Modeling.

HortonWorks HDP and Cloudera CDH run Spark 1.4.1 and Spark 1.5.0 respectively.

It is possible to switch the configuration to one or the other Spark version as appropriate before modeling.

See the “Connecting to your Database Management System” guide in the official documentation (SAP Predictive Analytics 2.5 – SAP Help Portal Page) for more information on switching between cluster types.

Please restart the Server or Desktop after making this change.

Training Data Content Advice


There is a limitation that the training data set content cannot contain commas in the data values – for example, a field containing the value "Dublin, Ireland".


Pre-process the data to cleanse commas from the data or disable Native Spark Modeling for such data sets.

Also be careful when creating a table in Hive that the data file does not contain a header row with the column names; otherwise the Hive “create table” statement will include the header information as a data row.
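One common way to handle a header row when the source file cannot be changed is Hive's skip.header.line.count table property (available since Hive 0.13; the table and columns below are invented for illustration):

```sql
-- Hypothetical example: tell Hive to ignore the first line of each file.
-- Tab-delimited here, which also sidesteps the comma-in-values limitation.
CREATE TABLE customer_training (
  customer_id INT,
  city        STRING,
  churned     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
TBLPROPERTIES ('skip.header.line.count'='1');
```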


KxIndex Inclusion


A crash occurs when KxIndex is included as an input variable. By default, the KxIndex variable is added by Automated Analytics to the training data set description, but it is normally an excluded variable. There is a limitation that the KxIndex column cannot be included in the input variable list with Native Spark Modeling.


Exclude the KxIndex variable (this is the default behaviour).


HadoopConfigDir Subfolder Creation


The configuration property HadoopConfigDir in Spark.cfg by default uses the temporary directory of the operating system.

This property is used to specify where to copy the Hadoop client configuration XML files (hive-site.xml, yarn-site.xml and core-site.xml).

If this is changed to use a subdirectory (e.g. \tmp\PA_HADOOP_FILES), a race condition can cause the files to be copied before the subdirectory is created.


Manually create the subdirectory.


Memory Configuration Tuning (Desktop only)


The Automated Desktop user interface shares the same Java (JVM) process memory with the Spark connection component (Spark Driver).

It is possible to misconfigure one or the other, but no specific warning will be issued in this case.



Modify the configuration parameters to get the correct memory balance for the Desktop user interface and the Spark Driver.

The KJWizardJni.ini configuration file contains the total memory available to the Automated Desktop user interface and SparkDriver.

The Spark.cfg configuration file contains the optional property DriverMemory. This should be configured to be approximately 25% less than the total memory configured in KJWizardJni.ini.

The SparkConnections.ini configuration file can be further configured to tune the Spark memory.

Please restart the Desktop client after making configuration changes.

Example Automated Desktop memory and Spark configuration settings:

In Spark.cfg


In KJWizardJni.ini


In SparkConnections.ini
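Since the concrete values depend on the machine, here is only a hypothetical illustration of how the two memory settings relate (the property names follow the descriptions above but should be verified against the shipped configuration files):

```ini
; KJWizardJni.ini -- total JVM heap shared by the Desktop UI and Spark Driver
-Xmx2048m

; Spark.cfg -- driver share, roughly 25% below the JVM heap above
DriverMemory=1536m
```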


Spark/YARN Connectivity


Virtual Private Network (VPN) connection issue (mainly Desktop).

Native Spark Modeling uses YARN for the connection to the Hadoop cluster.  There is a limitation that the connectivity does not work over VPN.


Revert to a non-VPN connection, or connect to a terminal/virtual machine that can reach the cluster without the VPN.


Single SparkContext issue (Desktop only).

A SparkContext is the main entry point for Spark functionality.  There is a known limitation in Spark that there can be only one SparkContext per JVM.

For more information see https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/SparkContext.html

This issue may appear when a connection to Spark cannot be created correctly (e.g. due to a configuration issue) and subsequently the SparkContext cannot be restarted.  This is an issue that only affects the Desktop installation.


Restart the Desktop client.



You may get the following error message:

Unexpected Spark internal error.  Error detail: Cannot call methods on a stopped SparkContext


Troubleshoot by looking at the diagnostic messages or logs on the cluster (for example, using the web UI).

One possible cause is committing too many CPU resources in the SparkConnections.ini configuration file.


Example of Hadoop web UI error diagnostics showing an over-commitment of resources:


SparkConnections.ini file content with too many cores specified:





Hive on Tez execution memory issue

Scope: Hortonworks clusters only (with Hive on Tez) and Data Manager functionality.

Hortonworks HDP uses Hive on Tez to greatly improve SQL execution performance. The SQL generated by the Data Manager functionality for Analytical Data Sets (ADS) can be complicated, and there is a possibility the Tez engine will run out of memory with default settings.


Increase the memory available to Tez through the Ambari web administrator console.

Go to the tez-configs under Hive and change the setting tez.counters.max to 16000. It is also recommended to increase the tez.task.resource.memory.mb setting. The Hive and Tez services must be restarted after this change. If this still does not work, it is possible to switch the execution engine back to MapReduce, again through Ambari.
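As a sketch, the fallback to MapReduce can also be applied per session from any Hive client (hive.execution.engine is a standard Hive property; the Ambari values are the ones recommended above):

```sql
-- Ambari (Hive > tez-configs): tez.counters.max = 16000,
-- plus a higher tez.task.resource.memory.mb, then restart Hive and Tez.

-- Fallback for a single session, from a Hive client:
SET hive.execution.engine=mr;
```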




It is possible to set the database name in the ODBC Hive driver connection configuration.  For example, instead of using the "default" database, it is possible to configure a different database in the ODBC Administrator dialog on Windows or the ODBC connection file for the UNIX operating system.

Native Spark Modeling requires the default database for the Hive connection.


Keep the database setting to default for the Hive DSN connection.  It is still possible to use a Hive table/view in a different database to default.



Data Manager


A user-defined target field in a time-stamped population is not included in a Temporal ADS (Analytical Data Set). That is, when you train your model using Data Manager with a "Time-stamped Population" that has a target variable, your target variable may not be visible in the list of variables in the modeler.


If you want to include the target field, you can either have it as part of the original data set or define a variable (with a relevant equation) in the "Analytical Record" instead.


Metadata Repository


The metadata repository cannot be in Hive. Also, output results cannot be written directly into Hive from In-database Apply (model scoring) or Model Manager.


Write the results to the local filesystem instead.

Let’s review Hadoop and SAP Predictive Analytics features in detail, and how these solutions can be utilized in the scenario below.


In this scenario, customers’ online interactions with the various products need to be captured from several weblogs to build predictive models that will recommend the right products to the right group of customers. To harness the power of Big Data, an infrastructure is required that can manage and process huge volumes of structured and unstructured data in real time, and that can protect data privacy and security.

In the above scenario the weblog dataset can be stored in a Hadoop environment. So, what is Hadoop? Let’s have a look at this technology first.


Data storage: Hadoop

Hadoop is an open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment (clusters). It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop is becoming widely popular among the big data user community mainly because of:


High Reliability:

Hadoop makes it possible to run applications on systems with thousands of nodes analyzing terabytes of data. Its distributed file system facilitates rapid data transfer among nodes and allows the system to continue operating uninterrupted in the case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.


Massive parallel processing:

Hadoop has been a game-changer in supporting the enormous processing needs of big data. A large data procedure that may take hours of processing time on a centralized relational database system may take only a few minutes when distributed across a large Hadoop cluster of commodity servers, as processing runs in parallel.

Distributive Design:

Hadoop is not a single product; it is an ecosystem. The same goes for Spark. Let’s cover the main components one by one.


Source: Wikipedia




Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.



YARN

Part of the core Hadoop project, YARN is a next-generation framework for Hadoop data processing, extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.



HCatalog

A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.



MapReduce Framework (MR)

MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
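The map/shuffle/reduce pattern that the framework distributes across a cluster can be sketched locally in a few lines (a toy Python word count; nothing here is SAP- or cluster-specific):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit one (word, 1) pair per word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values for one key.
    return key, sum(values)

lines = ["big data needs big clusters", "data moves fast"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster, the map and reduce functions run on many machines in parallel and the shuffle moves data over the network; the logic, however, is exactly this.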



Hive

Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.


Source: Hortonworks

Key Advantages of implementing Big Data projects using SAP Predictive Analytics

Easy to use Automated Analytics


  • Business users can generate predictive models in just a few clicks using Automated Analytics.
  • Predictive Models on top of Big Data can be managed through Model Manager
  • Deployment of the Predictive Models in Big Data systems can be automated using both Automated Analytics (Code generation or SQL generation)
  • You can monitor model performance and robustness automatically through the auto generated graphical summary reports in Automated Analytics.


Connect SAP Predictive Analytics to Big Data

Using Hive ODBC driver you can connect SAP Predictive Analytics to Hadoop.


Hive ODBC connector:

Apache Hive is the most popular SQL-on-Hadoop solution, including a command-line client. The Hive server compiles SQL into scalable MapReduce jobs. Data scientists or expert business users with good SQL knowledge can build an analytical data set in the Hive metastore (the service which stores the metadata for Hive tables and partitions in a relational database) by joining multiple Hive tables. Users can then connect to the Hadoop system through the Automated Analytics Hive connector to build predictive models using the analytical datasets (already persisted in Hadoop).


The DataDirect ODBC driver for Hive is prepackaged with the SAP Predictive Analytics installation and lets you connect to a Hive server without installing any other software. Hive technology allows you to use SQL statements on top of a Hadoop system.

  • From the user desktop you can open the ODBC console to create a new Hive ODBC connection.
  • You will need to input the host name of the Hive server, the port number (default 10000), and the wire protocol version (2 = Hive Server 2).


  • In the General tab, check the option “Use Native Catalog Functions”. Then click “Test Connection” to make sure the connection is working.
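On UNIX, the same settings go into the ODBC connection file instead of the ODBC Administrator dialog. A sketch of a DSN entry (the driver path and exact key names are illustrative; check the DataDirect driver documentation for your version):

```ini
[MyHiveDSN]
Driver=/path/to/datadirect/hive/driver.so   ; illustrative path
HostName=hadoop-master.example.com          ; Hive server host
PortNumber=10000                            ; default Hive port
Database=default                            ; keep "default" for Native Spark Modeling
WireProtocolVersion=2                       ; 2 = Hive Server 2
```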



For detailed information on how to set up your Hive connectivity, see the Hive section in the Connecting to your Database Management System on Windows or Connecting to your Database Management System on UNIX guide on the SAP Help Portal.


Predictive Modelling Process on Hive


Create very wide datasets using Data Manager

The options below enable users to prepare datasets in Hadoop for predictive modelling.


  • The Data Manager allows you to prepare your data so it can be processed in Automated Analytics. It offers a variety of data preparation functionalities, including the creation of analytical records and time-stamped populations.


An analytical dataset can be prepared by joining multiple Hadoop tables via Data Manager in Automated Analytics. Using Hive ODBC connector SAP Predictive Analytics can connect to Hive data store and you can build predictive models on Hive table dataset.



  • To create an analytical dataset, open the SAP Predictive Analytics application and click on the Data Manager section. Then select the Hive connection to connect to the Hadoop server.


  • Select the Hive tables and join them with the keys to create the analytical dataset.



  • An aggregate can be added in Data Manager, for example to count the number of clicks for a node/page.
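For readers who prefer SQL, the same kind of analytical dataset could be sketched by hand in HiveQL instead of Data Manager (all table and column names here are hypothetical):

```sql
-- Hypothetical tables: customers (one row per entity) and weblog (one row per click).
-- The join key is the entity (customer_id); the aggregate mirrors the
-- "number of clicks per node/page" example above.
CREATE VIEW ads_customer AS
SELECT c.customer_id,
       c.age,
       c.region,
       COUNT(w.page_id) AS nb_clicks
FROM customers c
JOIN weblog w
  ON w.customer_id = c.customer_id
GROUP BY c.customer_id, c.age, c.region;
```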



Improve operational performance of predictive modelling

  • Automated Analytics provides you with the flexibility to scale up model performance linearly with respect to the number of variables through a user-friendly, self-guided screen.
  • In the Automated Analytics Modeler, a predictive model can be trained and in-database scores can be generated directly on the Hadoop server on Hive tables (the “Use Direct Apply in the Database” option should be checked).


Hadoop support in SAP Predictive Analytics

1. Supports the full end-to-end operational predictive life-cycle on Hadoop: data preparation through Data Manager, model development through Modeler, and deployment through generated Spark/Hive SQL or Java code.

2. Automated Analytics support through Hive and Spark SQL ODBC drivers; no coding required.

3. Expert Analytics support through Hive connectivity.

4. Technical details:

Support of Hive


Supports data preparation & scoring directly in Hadoop (no data transfer).



With the new release of SAP Predictive Analytics, model training can be performed on Spark.


For more details on this topic please stay tuned for the upcoming blog by my colleague Priti Mulchandani.

This blog series will provide you with an overview of the various Big Data technologies around Hadoop which are supported in SAP Predictive Analytics. It will also cover how SAP Predictive Analytics can be used to apply predictive techniques to Big Data on Hadoop.



Big Data is more than just a buzzword nowadays; it is changing the way customers run their business. To gain previously unknown information and bring actionable insights into the business, all information generated by the business needs to be stored. Hadoop, being a more scalable storage platform than other databases, is becoming very popular among customers. The objective of this blog is to get you acquainted with Big Data technologies and briefly describe how you can use them with SAP Predictive Analytics.


What is Big Data?

Gartner analyst Doug Laney came up with the famous three Vs back in 2001. These 3V patterns are commonly observed in Big Data.


Source: Forbes


Big Data means a large dataset which cannot be processed using traditional computing techniques. Big Data is not merely a type of data; rather, it has become a complete subject involving various tools, techniques, and frameworks.


  • Volume refers to the vast amount of data generated every second. Think of all the emails, Twitter messages, photos, video clips and sensor data that we produce and share every second. We are not talking terabytes, but Yottabytes or Brontobytes of data.



  • Velocity refers to the speed at which new data is generated and the speed at which data moves around. For example think of social media messages going viral in minutes, or the speed at which credit card transactions are checked for fraudulent activities.


  • Variety refers to the different types of data we can use. Data generated from the "Internet of Things" is growing exponentially (for example, sensors on planes generating tons of data).


Source: blog.sap.com, HP


Big Data involves the data produced by different devices and applications. Below are the types of data that fall under the umbrella of Big Data.


  • Structured data: relational data, etc.
  • Semi-structured data: XML data, etc.
  • Unstructured data: Word, PDF, text, media logs, weblogs, etc.


SAP Predictive Analytics on Big Data

Example Scenario: Increasing online sales by analyzing weblogs


In this section, let us review a typical predictive scenario of an online retail store and understand how the SAP Predictive Analytics solution and Big Data technology (Hadoop) can work together. Nowadays, online retailers collect massive amounts of data, having access to clickstream data, user profiles, advertising data, and social network data, just to name a few. This huge amount of data can be stored in a Hadoop cluster. The Hadoop system can be scaled up very easily to store and manage this continuously growing data generated from all kinds of different sources. Hadoop also offers in-memory applications like Spark and prepackaged machine learning libraries like MLlib that can be utilized to build predictive models efficiently.

A typical SAP Predictive Analytics project on a Big Data scenario would be to connect to the Hadoop system, prepare a meaningful dataset in-database, and then train, build, and deploy the predictive model in the Hadoop system.

SAP Predictive Analytics enables the analyst to create predictive models that can identify the key influencers of customers going through with an online purchase.


For example, we may find out that customers under 25 are more likely to purchase products after 1 am on weekends, when certain types of advertising are shown and when they are redirected from YouTube.


Using the clustering module in SAP Predictive Analytics, a marketing manager can identify customer groups that have similar characteristics. These clusters can then be used in targeted marketing campaigns in the future.




In the next blog "Working with SAP Predictive Analytics and Big Data to Increase Online Sales by Analyzing Weblogs" I have discussed how SAP Predictive Analytics can be used to build predictive models for this Big Data scenario.



Several users of SAP Predictive Analytics have encountered issues running the R nodes in Expert Analytics after installing R for the first time using the download link provided in the user interface.


The provided download link points to R version 3.2.1, for which the R development community committed changes to the list of packages and dependencies, causing issues when trying to use some of the most common packages, like the caret package, which is missing its dependency pbkrtest.


For example, when you try to use the R-CNR-Tree algorithm, you may get the following message:


     Error from R: Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :

       there is no package called 'pbkrtest'


Here is the link describing the package: https://cran.r-project.org/web/packages/pbkrtest/index.html


After digging a little, it seems that other packages have also changed in the 3.2.1 version, like caret.


So let's fix this one!


First, let's check why we are having trouble:

  • Open R Studio (by running the shortcut on the desktop or in the start menu, or "C:\Users\Public\R-3.2.1\bin\x64\Rgui.exe")
  • Type in

require ("car")

  • You will get the following message

Loading required package: car

Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :

  there is no package called ‘pbkrtest’

In addition: Warning message:package ‘car’ was built under R version 3.2.3

  • Type in


  • You will get a prompt to select a CRAN mirror

--- Please select a CRAN mirror for use in this session ---

  • But this will result in the following message:

Warning message:

package ‘pbkrtest’ is not available (for R version 3.2.1)



So we need to manually install the package. Download the pbkrtest source archive (pbkrtest_0.4-4.tar.gz) from CRAN, then run from a DOS prompt:

"C:/Users/Public/R-3.2.1/bin/x64/R" CMD INSTALL -l "C:\Users\Public\R-3.2.1\library" "C:\temp\pbkrtest_0.4-4.tar.gz"

These are the logs you should get:


* installing *source* package 'pbkrtest' ...

** package 'pbkrtest' successfully unpacked and MD5 sums checked

** R

** data

** inst

** preparing package for lazy loading

Warning: package 'lme4' was built under R version 3.2.3

** help

*** installing help indices

** building package indices

** installing vignettes

** testing if installed package can be loaded

Warning: package 'lme4' was built under R version 3.2.3

* DONE (pbkrtest)



Now if you go back to R Studio and type in:

require ("car")

It should all be ok and you can start using SAP Predictive Analytics.


However, if you get the following message while installing pbkrtest:


Warning: package 'lme4' was built under R version 3.2.3

Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :  namespace 'nlme' 3.1-120 is being loaded, but >= 3.1.123 is required

Error : package 'lme4' could not be loaded

ERROR: lazy loading failed for package 'pbkrtest'

* removing 'C:/Users/Public/R-3.2.1/library/pbkrtest'


Then run the following command from R Studio:



This should give you the following output:

     trying URL 'http://cran.irsn.fr/bin/windows/contrib/3.2/lme4_1.1-11.zip'

     Content type 'application/zip' length 4746682 bytes (4.5 MB)

     downloaded 4.5 MB


     package ‘lme4’ successfully unpacked and MD5 sums checked


     The downloaded binary packages are in



Then run the following from a DOS prompt:

    "C:/Users/Public/R-3.2.1/bin/x64/R" CMD INSTALL -l "C:\Users\Public\R-3.2.1\library" "C:\Users\i304811\AppData\Local\Temp\Rtmp2xL4SS\downloaded_packages\lme4_1.1-11.zip"


Then run again the following command line from a DOS prompt:

"C:/Users/Public/R-3.2.1/bin/x64/R" CMD INSTALL -l "C:\Users\Public\R-3.2.1\library" "C:\temp\pbkrtest_0.4-4.tar.gz"


Hope this helps!



Hello dear Predictive community!


There is an active Idea Place for SAP Predictive Analytics where you can submit your product improvement ideas and get people to vote for them.


This place is monitored on a regular basis by our Product Management team, and the most relevant ideas get a chance to be integrated into our future releases.


The "top" ideas (looking at the number of votes) as of now are the following.



You certainly have your own ideas, so please do not wait and feel free to create them!


Looking forward to your feedback and great product ideas!


Antoine Chabert



As part of my day-to-day activities, I check the Idea Place for Predictive and try to provide feedback on some of the items.


One of them was about the need to handle Multi-class Classification in Automated Analytics:





First let's define what a Multi-class Classification model is:

“In machine learning, multi-class or multinomial classification is the problem of classifying instances into one of more than two classes (classifying instances into one of two classes is called binary classification). While some classification algorithms naturally permit the use of more than two classes, others are by nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of strategies.”


There are two ways to address a multi-class classification problem:

  • One-vs.-rest (OvR)

The one-vs.-rest (or one-vs.-all) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for its decision, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample.

  • One-vs.-one (OvO)

In one-vs.-one, one trains K (K − 1) / 2 binary classifiers for a K-way multiclass problem; each receives the samples of a pair of classes from the original training set, and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K (K − 1) / 2 classifiers are applied to an unseen sample, and the class that gets the highest number of "+1" predictions is predicted by the combined classifier.



Source: https://en.wikipedia.org/wiki/Multiclass_classification
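The bookkeeping behind these two strategies is easy to sketch in Python. For K = 26 classes, OvR needs 26 binary models and OvO needs 325. The pairwise "classifier" below is a stand-in rule, not a trained model:

```python
from itertools import combinations

classes = [chr(c) for c in range(ord("A"), ord("Z") + 1)]  # 26 classes "A".."Z"
K = len(classes)

# OvR: one binary classifier per class (that class vs. the rest).
ovr_models = K

# OvO: one binary classifier per unordered pair of classes.
ovo_pairs = list(combinations(classes, 2))
ovo_models = len(ovo_pairs)  # K * (K - 1) / 2

# OvO prediction: every pairwise classifier votes for one of its two
# classes; the class with the most votes wins. The "classifier" here is
# a stand-in rule (always prefer the later letter), not a trained model.
votes = {c: 0 for c in classes}
for a, b in ovo_pairs:
    winner = max(a, b)
    votes[winner] += 1
predicted = max(votes, key=votes.get)
```

With the stand-in rule, "Z" wins all 25 of its pairwise contests and is the combined prediction; with real classifiers the vote counts would of course depend on the sample.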


The Automated Analytics mode of SAP Predictive Analytics provides a way to build binary classifications only out of the box. The Expert Analytics mode may provide a way to handle this using one of the out-of-the-box algorithms, and certainly via an open source R script. But this blog post will only focus on the Automated Analytics mode, and we won't discuss the pros and cons of OvR vs. OvO.



There are multiple ways to handle an “n-way” multi-class model problem:

  • The “multi-target model” approach
    • Prepare a data set with n target variables for OvR or n * (n − 1) / 2 target variables for OvO
    • Build one model with all the targets


         The final model will probably be the worst, as it will have to fit all the targets and will not be optimal (encoding, binning, variable reduction, etc.).


  • The “build one, then replicate” approach
    • Prepare a data set with n target variables for OvR or n * (n − 1) / 2 target variables for OvO
    • Build one model with one of the targets
    • Then replicate using a KxShell script to run the n-1 other models for OvR or (n * (n − 1) / 2) - 1 other models for OvO


        The final models will be optimal for each target (encoding, binning, variable reduction, etc.), but there will be many models to build.


We will focus on the “build one, then replicate” approach as it provides more "optimal" models, and since SAP Predictive Analytics provides all the tools to "productize" models en masse, the number of models won't be an issue.

Now, the trick, or hard part, is in how to prepare the data set.

Prepare the Data Set

I’m a lazy guy, so I don’t want to build a fixed data set with n target variables for OvR or n * (n − 1) / 2 target variables for OvO, because if a new class appears, I will have to modify my data set to add the new class, and that's the last thing I want to do!

This is why I love Data Manager so much!


I will assume that everyone knows the different elements in play to build your Analytical Data Set in Data Manager.

Anyway, if you don't, here is a short summary of the objects that need to be created:

  • Entity: the subject of the analysis, your customer id or event id
  • Time Stamp Population: the list of entities to be used for training or scoring your model at a reference date (snapshot). It also includes your target if the population is to be used for training purpose
  • Analytical Record: the list of attributes to be associated with the entity at that reference date (time stamp)


So how to handle "Multi-class Classification in Automated Analytics" with Data Manager?

You will only need one Time Stamp Population! And you will be able to handle both OvR and OvO!


Let’s take an example. Our multi-class classification will have 26 classes, from “A” to “Z” (but it could just as well be from “1” to “26”).

Time Stamp Population for OvR:

  • I will assume that you already have your “class” variable/attribute with a value between “A” to “Z” available in your Timestamp Population (via a merge, a condition etc.)
  • You will need a prompt that will define the “one” class you want to use versus the rest. Let’s say it’s a String and the default value will be “A”


  • Once defined, we will use the prompt in a condition/expression to generate the target


  • And save it as "KxTarget" (this naming convention ensures the target variable is surfaced)


  • Now you have your target variable defined


  • Click “Next”, and switch to the “Target” tab where you can assign your target


  • If you click on “View Data”, you will get a prompt asking you for the “One” class you want to use



Time Stamp Population for OvO:

  • I will assume that you already have your “class” variable/attribute with a value between “A” to “Z” available in your Timestamp Population (via a merge, a condition etc.)
  • You will need a prompt that will define the “one” class you want to use versus the rest. Let’s say it’s a String and the default value will be “A”


  • You will need a prompt that will define the “other one” class you want to use versus the rest. Let’s say it’s a String and the default value will be “B”


  • Once defined, we will use the prompt in a condition/expression to generate the target like in OvR, so that KxTarget = 1 means it’s equal to “TheOne” and KxTarget = 0 means it’s equal to “TheOtherOne”
  • Then you will need to define a filter to exclude everything but the class equal to the “TheOne” or “TheOtherOne”


  • Now you have your target variable defined. Click “Next”, and switch to the “Target” tab where you can assign your target as for OvR
  • If you click on “View Data”, you will get a prompt asking you for the “One” class and the "Other One" class you want to use



So we are done with the data set generation. Let's build the models!


Build the models


So when you use Data Manager while building your classification, you will get the prompt popup asking you to enter the values to be used to extract the data set. Here is an example with OvR and an additional prompt:


You can click "Next", "OK", "Analyze", "Next", "Next" to reach the last step before creating the model itself for class "A".



Using KxShell Scripts

Click on "Export KxShell Script..." and save the "Learn" script on your Desktop for example.


If you open it in a text editor, you will find that the prompt values are stored in KxShell "macros" (like programming variables).


So, if I want to train that model using the generated script, I will have to execute the following command in a DOS prompt:

     "C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP\KxShell.exe" "C:\Users\i304811\Desktop\learn.kxs"


Now if I want to run it for class "B", I will run:

     "C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP\KxShell.exe" "C:\Users\i304811\Desktop\learn.kxs" -DTRAINING_STORE_PROMPT_1=B


and you can alter any of the macros from the script in the command line.


For the OvO approach the same logic applies, except that you will need to build n * (n − 1) / 2 models, which may require a little script to do the iteration properly.
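Such an iteration script could be sketched in Python; the paths and the first prompt macro come from the example above, while TRAINING_STORE_PROMPT_2 is a hypothetical name for the second OvO prompt (the real macro names are in your exported KxShell script):

```python
import string
from itertools import combinations

# Paths from the example above (adjust for your installation).
KXSHELL = r"C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP\KxShell.exe"
SCRIPT = r"C:\Users\i304811\Desktop\learn.kxs"

def ovr_commands(classes):
    # One training run per class, overriding the prompt macro seen in the script.
    return [f'"{KXSHELL}" "{SCRIPT}" -DTRAINING_STORE_PROMPT_1={c}' for c in classes]

def ovo_commands(classes):
    # One run per unordered pair; TRAINING_STORE_PROMPT_2 is a hypothetical
    # name for the second prompt macro (check your exported script).
    return [
        f'"{KXSHELL}" "{SCRIPT}" -DTRAINING_STORE_PROMPT_1={a} -DTRAINING_STORE_PROMPT_2={b}'
        for a, b in combinations(classes, 2)
    ]

classes = list(string.ascii_uppercase)
ovr = ovr_commands(classes)  # 26 command lines
ovo = ovo_commands(classes)  # 325 command lines
```

Each generated command line can then be run in a DOS prompt (or via a batch file) exactly like the class "B" example above.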

Hope this was helpful, and of course feel free to comment.


PS: I tried to keep the flow simple, so I may have taken some shortcuts or been brief in the explanations to keep this entry short.

The next major release of SAP Predictive Analytics (3.x product line) is currently planned to be released in the second half of 2016.


SAP plans to remove Solaris operating system support in this major release, in order to speed up product innovation deliveries on other operating systems, including Windows 64-bits and Linux.


The last SAP Predictive Analytics release that will provide Solaris operating system support is a minor release (version 2.5, part of the 2.x product line) that should be released during the first quarter of the year 2016. 


For customers who wish to continue using SAP Predictive Analytics (SAP InfiniteInsight) on the Solaris operating system, the end-of-mainstream-maintenance dates are the following:

  • SAP InfiniteInsight 7.x for Solaris: December 31st, 2018
  • SAP Predictive Analytics 2.x for Solaris: February 10th, 2017


To take advantage of the new features of SAP Predictive Analytics 3.x, we recommend that you contact customer support and adopt one of the supported operating systems (the same as in the current PAM, with the exception of Solaris).


For more information about this communication, feel free to contact me directly via email (see my SCN profile).

