SAP Predictive Analytics

First, I would like to send great thanks to Professor Paul Hawking, who gave us – the students at Victoria University – the opportunity to attend the CompleteStream KeyInsights Conference 2016, held in Melbourne on 18 and 19 April 2016. Among the marvelous presentations from expert SAP practitioners, I was particularly impressed by the brilliant demonstration of SAP Predictive Analytics given by Charles Gadalla from SAP Canada.

value of BI tools.png

Source: Victoria University course syllabus on Business Intelligence 2015.


Generally, “Business Intelligence” or BI is a long-standing goal for any organization setting up its information systems, especially ERP systems. There are many ways to get there, but exploiting SAP solutions is one of the more practical approaches to effective BI. Building on recent leading-edge advances in virtualization, storage, networking, and in-memory technologies, SAP S/4HANA was introduced as a step change in integrated systems and BI, most notably through the SAP Predictive Analytics application, which includes the Automated Analytics and Expert Analytics tools.

 

With just a small investment in a few specific sensor-tag devices from well-known manufacturers such as Texas Instruments, an organization can gain a great number of automation and prediction advantages. With these sensors in place, an organization's office buildings, factory machinery, and other fixed assets effectively come alive: they can connect to the SAP systems and report on their working conditions and environment in terms of light, magnetometer readings, orientation, temperature, vibration, and so on. This uninterrupted stream of information is recorded into a database for later monitoring, analysis, and prediction.

 

Additionally, SAP S/4HANA ships with the Predictive Analysis Library, a bundle of algorithms that let you flexibly configure events triggered under specified conditions – for instance, the system can automatically send an instant message to notify the relevant people if a machine is overheating or nearing the end of its usage time. This kind of predictive information is essential for managers to take crucial maintenance actions before any failure or damage hits the organization.

 

The era of the Internet of Things, or IoT, is just around the corner, and this trend is reinforced by SAP S/4HANA, whose powerful in-memory technology and Predictive Analytics tool allow us to process and forecast raw data in real time. Strategic and risk management become an easier and more convenient routine than ever.

 

Here are some useful links for further information about IoT and SAP S/4HANA in practice:

  1. https://www.youtube.com/watch?v=dAnjGhLnFhs
  2. https://www.youtube.com/watch?v=UhMKG761l78
  3. https://youtu.be/8NbP07OEGsQ
  4. https://www.youtube.com/watch?v=EiIInSB8pFk&list=PLkzo92owKnVxzjoxwJdaa400E_UqkzE8J
  5. http://scn.sap.com/community/developer-center/cloud-platform/blog/2015/10/26/the-cheapest-and-simplest-way-to-send-sensor-data-to-the-hana-cloud-platform

Reimagining the Predictive Experience

 

The Advanced Analytics mission is to help companies improve their bottom line by making their business processes and apps better with predictive techniques. Here are the key concepts of our approach...

SevenPillars.png

Embedded.png

Embedded in Business Processes and Apps

The real value of predictive analytics can only be achieved by embedding it into the apps and tools that business users work with on a daily basis. The integration must be done in such a way that (1) business applications offer specific workflows with off-the-shelf predictive functions, and (2) data scientists have the option of replacing them with proprietary, more advanced predictive functions.

 

The Predictive Factory

Factory.png

No coding, just configuration!

Full predictive lifecycle – data preparation, model building/rebuilding, model evaluation, model deployment and monitoring, and versioning

Modeling automation – ability to automate models for multiple segments by creating the first one and letting Predictive Analytics complete the task

Inclusive – embraces a collaborative approach between non-experts and experts

ux.png

Low and High Touch User Experiences

Low-touch user interface using a wizard-based, automated approach

High-touch user interface (visual data pipeline framework) supporting open languages (such as R, Scala, etc.)

Interoperability between both modes so that results – scores, predictions, forecasts, etc. – can be shared easily among coworkers and teams

Predictive Intellectual Property

IP.png

Generic proprietary algorithms providing very good results in most data situations with no parameter tuning

Niche proprietary algorithms providing unmatched results in unique business context / data situation

Niche predictive IP to be available via a marketplace open to partners and ISVs

Cloud.png

Cloud

All-in-one Cloud for Analytics initially targeting analysts and business users with Exploratory Analytics

Model consumption for pre-packaged business scenarios

HCP Predictive Services targeting partners and integrators

Framework for model consumption in business applications and on-demand applications

On Premise

Prem.png

Will be compatible and interoperable with the previous SAP investments:

Analytics platform (known as BI Platform)

SAP HANA computing platform – especially through  XS Advanced and SAP HANA Streaming

bd.png

Big Data

Data Preparation compatible with ultra-wide data sets

Distribution of algorithms on scale-out architectures

Integration into Big Data streaming environments

In this blog we want to introduce "The Nuts & Bolts of SAP Predictive Analytics" YouTube Video Series. In this series Pierre Leroux, Priti Mulchandani and I have created a number of short 2-3 minute videos to highlight some important predictive use cases.

 

  1. Data Preparation: In the first installment we walk through how to prepare your data using SAP HANA as a data source. In this short video we highlight some of the key features in SAP Predictive Analytics - Data Manager that enable users to join, aggregate, expand, and derive attributes from data sets for use in predictive modeling.
  2. Automated Analytics: We then focus on SAP Predictive Analytics - Automated Analytics and demonstrate how the wizard-like approach enables users who may not have a degree in Maths to train, build, and test a predictive model.
  3. Real Time with SAP HANA: Next we watch how SAP Predictive Analytics - Automated & Expert Analytics helps you train and apply predictive models in SAP HANA and also easily embed predictive models into your applications and score your data in real time.
  4. Native Spark Modeling: In our latest installment we learn how SAP Predictive Analytics helps you to delegate the predictive modeling processing to Spark on Hadoop and avoid time-consuming data transfer between the predictive engine and the data source.

 

 

We will continue to add useful bite-size predictive videos using the SAP Predictive Analytics Suite. Please check back in to SCN or to the Analytics Solutions from SAP YouTube channel for more videos: The Nuts & Bolts of SAP Predictive Analytics - YouTube

SAP Predictive Analytics 2.5 has been delivered on SAP Support Portal on March 17, 2016.


Start with the release announcement: SAP Predictive Analytics 2.5 Now Generally Available!.

SAP Predictive Analytics 2.5.PNG

 

If you are not yet a SAP Predictive Analytics user, you can download your own 30-day trial of SAP Predictive Analytics 2.5, desktop version here.


Our product managers are blogging about this release:


Here is a curated collection of useful links for SAP Predictive Analytics 2.5:


Links for APL 2.5:


We are currently preparing the next edition of our newsletter. Register now to know more!

 

Enjoy SAP Predictive Analytics 2.5, ask questions, send us feedback, start discussions in our SCN Predictive Analytics user community.


We are looking forward to hearing from you!

March is an exciting time for people all around the world – for many, it means winter is melting into spring and Easter (along with the requisite holidays, chocolate eggs, and of course bunnies) are just around the corner.  For our hardworking SAP Predictive Analytics team, it is also a time of celebration as we have just announced the general availability of SAP Predictive Analytics 2.5!

 

While every release is very special to us, this one is particularly sweet because it introduces some features that our team has been working on for quite some time.   In addition to many product enhancements, optimizations, and new features (see the full “What’s New”), I’d like to highlight a few of the real “biggies” for SAP PA 2.5:

 

 

Native Spark Modeling

 

The thing about Big Data is that… well… it’s BIG… Zillions of rows are great, but where Big Data becomes really interesting is when the data become really wide (i.e. a large number of columns).  How would you end up with thousands of columns? Easy.  Take for example, the complexities of an airplane jet engine – and how many sensors it has to measure everything for engine bearings, temperature, and so on.  Now imagine those tens of thousands of sensors (each represented by a single column) being read every five seconds for the duration of the flight – multiplied by the number of engines.  That’s terabytes of wide data per hour.

 

Extracting that amount of data for analysis is simply not feasible because many databases can’t even handle that number of columns, and even if they could, the time required for the analysis may make the results meaningless.  The Native Spark Modeling feature in SAP Predictive Analytics 2.5 (sometimes called “In-Database Modeling”/IDBM in the interface) delegates the predictive modeling processing to Spark on Hadoop, so data transfer between the predictive engine and the data source is avoided.

 

Native Spark Modeling provides the following benefits when analyzing data using Spark on Hadoop:

 

  • Processing closer to the data source - reducing expensive I/O.
  • Faster response times – training models in less time to enable you to do more.
  • Higher scalability – create more models and use wider datasets than ever before.
  • Better CPU utilization – reduce costs and increase operational efficiency.
  • Easier access to Big Data – now business analysts can work with Hadoop without Spark coding skills.

 

You can find out more from Priti, one of our product managers, here: Big Data : Native Spark Modeling in SAP Predictive Analytics 2.5

 

 

Rare Event Stability

 

As data volumes increase, we have an even greater ability to find ever smaller patterns in the data.  However, some events happen so infrequently that it is sometimes hard to determine whether a true predictive pattern exists or whether there is just coincidental “noise” in the “signal”.  Take the jet engine example again: thankfully, jet engines fail very infrequently, but this presents a huge problem in predictive maintenance scenarios, because what we are trying to do is find a pattern within the data that could have “predicted” the failure so we can try to prevent the next one.   The consequence of finding a pattern in random data, instead of a true set of factors for the failure, is potentially a very expensive and unnecessary engine servicing that could not only cost millions but also ground the plane it is mounted on.

 

SAP Predictive Analytics 2.5 has an improved ability to help in these “rare event” cases by generating a predictive model only when there is sufficient indication that the model can be trusted.  If the system determines the generated model cannot be trusted with enough confidence, it will alert you rather than providing a potentially inadequate model.  

 

 

IP Protection for Partner Extensions in R

 

One of the more attractive aspects of the open-source language “R” is the ability to easily share and obtain predictive algorithms and libraries.  While the exact number changes all the time, there are over 6,000 R packages freely available today.  Why so many?  Data scientists sometimes create their own algorithms from scratch or modify existing ones to solve specific problems of an industry, a target data source, or even a single customer.   In these cases, the creator may not want to share their work, as it may either represent a competitive advantage over other companies, or it may be part of their own intellectual property that they wish to protect from being easily viewed or edited (unless they are paid for it!).

 

SAP Predictive Analytics 2.5 now includes a feature to create R extensions that are “encrypted in transit”, meaning they can be transported and used by others without disclosing the recipe to their secret sauce.  Now, customers and partners are able to invest in their R extensions while preserving their intellectual property.   The SAP Analytics Extensions Directory also allows our partners to distribute and even monetize their extensions through an SAP-managed portal that can be launched directly from within the SAP Predictive Analytics interface.

 

 

More to Come (Soon!)

 

As exciting as March is for our Product Team, we’re driving really hard towards Q2 because we’ve got some great things lined up for SAPPHIRE NOW that will be in Orlando between May 17-19, 2016.   At the conference, we are also planning a number of ASUG Educational Sessions that will not only give you roadmap information, demo scenarios, and deep dive details, but also some exciting news about where the future of SAP Predictive Analytics is headed – so make sure you attend if you can!

 

I would also recommend that you sign up for our SAP Predictive Analytics Newsletter to keep up on the latest, and as always, keep an eye on the Predictive SCN for the latest!

Apache Spark is the most popular Apache open-source project to date, and it has become a catalyst for the adoption of big data infrastructure. Spark uses in-memory technology and offers high performance for complex computation processes such as machine learning, streaming analytics, and graph processing.

Providing support for Hadoop and Spark in SAP Predictive Analytics is crucial to serve our customer needs because:

  • Data is getting bigger and wider
  • Performance and speed expectations are rising
  • Customers are looking for optimized processing with proper utilization of their Hadoop resources
  • Customers want to leverage their existing workforce to perform predictive analysis on their big data assets, and SAP Predictive Analytics offers a business-friendly tool that does not demand data science or big data developer skills

 

Native Spark Modeling

Native Spark Modeling executes the automated predictive models directly on Hadoop using the Spark engine.

Before Native Spark Modeling, your predictive modeling engine was essentially the SAP Predictive Analytics desktop or the SAP Predictive Analytics server. Now, with Native Spark Modeling, data-intensive tasks are delegated to Spark, and data transfer between the SAP Predictive Analytics client and the data source platform (in this case the Hadoop platform) is avoided.

The left side of the diagram shows the existing process without Spark, where a huge data transfer takes place and performance suffers for big data; the right side shows the same process using Native Spark Modeling to execute the machine learning computation close to the data:

blog_25_Arch.JPG

 

 

User Flow

  1. After installing the SAP Predictive Analytics tool, users will find a “SparkConnector” folder, which contains the functionality in the form of a JAR file.  A user, typically an administrator, will need to define the Spark and YARN connection properties in the configuration files for each ODBC DSN they intend to use with the Native Spark Modeling capability.  Refer to the Native Spark Modeling configuration section in the SAP Predictive Analytics documentation.
  2. They load up SAP Predictive Analytics.
  3. They open Modeler.
  4. They make sure the 'Native Spark Modeling' flag in the Model Training Delegation option is switched ON under the "File -> Preferences" menu.

    blog_25_UI_Pref.JPG

  5. They choose Classification/Regression (as of PA 2.5, only classification models are supported on Spark; regression will follow soon).
  6. They can choose an existing Hive table using the 'Use Database Table' option, or an Analytical Dataset based on Hive tables using the 'Use Data Manager' option. SAP Automated Analytics proprietary algorithms are built to scale across any amount of information – both long and wide.  The wider the dataset, the stronger the predictive power! If you are wondering how datasets get this wide, consider an e-commerce example where weblogs are analyzed to understand the trends behind purchases. As you build aggregates in Data Manager, even more columns get added for analysis. blog_25_wide.JPG
  7. They load the description of the dataset from Hive or from a file, or choose “Analyze”.
  8. They choose a target field from the loaded dataset to run the training against, e.g. Credit_card_Exist (=Yes/No).
  9. They generate the model, which is now executed on the Spark engine. Notice the progress bar, which shows progress messages for Spark. blog_25_progressbar.JPG
  10. They can watch the ongoing Spark jobs from the application web UI, which can be opened in the browser at http://localhost:4040/  blog_25_sparkwebUi.JPG
  11. Once the model is generated, they have the same choices as with a traditional database, e.g. the Smart Variable Contribution report in Automated Analytics. blog_25_varsummary.jpg
  12. Finally, they can manage the model lifecycle using the SAP Predictive Analytics Model Manager component. For example, if a user wants to retrain the model at frequent intervals, they can schedule the task from Model Manager for their data in Hadoop, using the model that was trained on Spark. The retraining in this case is processed on Spark as well.

Summary:

Data will continue to grow, and enterprises will continue to shift more and more of it onto Hadoop platforms; they can now begin to apply predictive solutions on top to get meaningful insights.

Companies have different options for predictive analysis, such as SAP Predictive Analytics or open-source machine learning libraries. SAP Predictive Analytics, however, makes a difference with its full-stack support on Hadoop, from data manipulation on Big Data to model training on Spark and finally to in-database apply/retrain for production-ready Big Data.

 

In conclusion, Native Spark Modeling is a key foundation of SAP’s predictive Big Data architecture, enabling performance gains of 7-10 times and more for big data. It is also prepared to scale as your data and infrastructure grow in the future. With these performance and scalability advantages, business analysts can build more predictive models and fail early without having to worry about Big Data technology.

       

This blog describes issues that could occur when installing, configuring or running Native Spark Modeling. It explains the root causes of those issues and if possible provides solutions or workarounds.

 

The official SAP Predictive Analytics documentation including the "Connecting to your Database Management System on Windows" and "Connecting to your Database Management System on Unix" Guides can be found at SAP Predictive Analytics 2.5 – SAP Help Portal Page .

 

 

 

What is Native Spark Modeling?

 

Native Spark Modeling builds the Automated predictive models by leveraging the combined data store and processing power of Apache Spark and Hadoop.

 

Native Spark Modeling was introduced in Predictive Analytics 2.5.  The concept is also sometimes called in-database processing or modeling.  Note that both Data Manager and Model Apply (scoring) already support in-database functionality. For more details on Native Spark Modeling, have a look at:

Big Data : Native Spark Modeling in SAP Predictive Analytics 2.5

 

 

 

Troubleshooting

 

 

Configuration

Issue

Native Spark Modeling does not start.  When Native Spark Modeling is configured correctly, you should see the "Negotiating resource allocation with YARN" progress message in the Desktop client.

NegotiatingResourceAllocationWithYARN.png

Solution

Check that the Native Spark Modeling checkbox is enabled in the preferences (under Preferences -> Model Training Delegation).

PreferencesModelTrainingDelegation.png

Check that you have at least the minimum properties in the configuration files: the hadoopConfigDir and hadoopUserName entries in the SparkConnections.ini file for the Hive DSN, and the Hadoop client XML files in the folder referenced by the hadoopConfigDir property.
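As an illustration, a minimal SparkConnections.ini entry for a DSN called MY_HIVE_DSN might look like the sketch below (the folder path and the Hadoop user name are placeholders to adapt to your installation):

SparkConnection.MY_HIVE_DSN.hadoopConfigDir=../../../SparkConnector/hadoopConfig/MY_HIVE_DSN

SparkConnection.MY_HIVE_DSN.hadoopUserName=my_hadoop_user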



Issue

The SparkConnections.ini file has limited support for full path names containing spaces on Windows.

Solution

Prefer relative paths instead.

e.g. for an ODBC DSN called MY_HIVE_DSN, use the following relative path instead of the full path for the hadoopConfigDir parameter:

SparkConnection.MY_HIVE_DSN.hadoopConfigDir=../../../SparkConnector/hadoopConfig/MY_HIVE_DSN

 

Issue

Error message includes "Connection specific Hadoop config folder doesn't exist".

Solution

Check the SparkConnections.ini file contains a valid path to the configuration folder.

 

Issue

Error message contains "For Input String".  For example "Unexpected Java internal error...For Input String "5s"".

Solution

Check the hive-site.xml file for the DSN and remove the property that is causing the issue (search for the string in the error message).

 

Issue

Error message "JNI doesn't find class".

Solution

This can be a JNI (Java Native Interface) classpath issue.  Restart the desktop client and double-check the settings in the KJWizard.ini file.

 

 

Monitoring and Logging

Issue

The logs in native_spark_log.log can be limited.

Solution

For Desktop: to increase the amount of information in the native_spark_log.log file, change the KJWizard.ini configuration file to use the full path to the log4j configuration file (the log4j.properties file under the SparkConnector directory).  After that, the Spark logs will also be included.

# default value is relative path

vmarg.3=-Dlog4j.configuration=file:..\..\..\SparkConnector\log4j.properties

 

The logging level can also be increased by modifying the log4j.properties file (under the SparkConnector directory). For example, change the log4j.rootLogger level from INFO to DEBUG to show more logging information for Spark.

log4j.rootLogger=DEBUG,file

 

Also refer to the logs on the Hadoop cluster for additional logging and troubleshooting information.

For example use the YARN Resource Manager web UI to monitor the Spark and Hive logs to help troubleshoot Hadoop specific issues.  The Resource Manager web UI URL is normally

http://{resourcemanager-hostname}:8088/cluster/apps

 

Support for Multiple Spark versions

Issue

There is a restriction that only one Spark version (JAR file) can be used at a time with Native Spark Modeling.

Hortonworks HDP and Cloudera CDH run Spark 1.4.1 and Spark 1.5.0, respectively.

Solution

It is possible to switch the configuration to one or the other Spark version as appropriate before modeling.

See the “Connecting to your Database Management System” guide in the official documentation (SAP Predictive Analytics 2.5 – SAP Help Portal Page) for more information on switching between cluster types.

Please restart the Server or Desktop after making this change.


Training Data Content Advice

Issue

There is a limitation that the training data set content cannot contain commas in the data values, for example a field containing the value "Dublin, Ireland".

Solution

Pre-process the data to cleanse commas from the data or disable Native Spark Modeling for such data sets.

Also, when creating a table in Hive, be careful that the data does not contain a header row with the column names; otherwise the header row will be loaded as a data row in the Hive table.
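If your source files do carry a header row, one possible way to handle it is to have Hive skip the header when the table is created. The sketch below assumes a reasonably recent Hive version and a tab-delimited file; the table, column, and location names are placeholders:

CREATE EXTERNAL TABLE weblog_clicks (customer_id STRING, page STRING, clicks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs'
TBLPROPERTIES ("skip.header.line.count"="1");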

 

KxIndex Inclusion

Issue

Crash occurs when including KxIndex as an input variable.  By default the KxIndex variable is added by Automated Analytics to the training data set description but it is normally an excluded variable.  There is a limitation that the KxIndex column cannot be included in the included variable list with Native Spark Modeling.

Solution

Exclude the KxIndex variable (this is the default behaviour).

 

HadoopConfigDir Subfolder Creation

Issue

The configuration property HadoopConfigDir in Spark.cfg by default uses the temporary directory of the operating system.

This property is used to specify where to copy the Hadoop client configuration XML files (hive-site.xml, yarn-site.xml and core-site.xml).

If this is changed to use a subdirectory (e.g. \tmp\PA_HADOOP_FILES) it is possible to get a race condition that causes the files to be copied before the subdirectory is created.

Solution

Manually create the subdirectory.

 

Memory Configuration Tuning (Desktop only)

Issue

The Automated Desktop user interface shares the same Java (JVM) process memory with the Spark connection component (Spark Driver).

It is possible to misconfigure one or the other, but no specific warnings will be issued in this case.

Solution

Modify the configuration parameters to get the correct memory balance for the Desktop user interface and the Spark Driver.

The KJWizard.ini configuration file contains the total memory available to the Automated Desktop user interface and SparkDriver.

The Spark.cfg configuration file contains the optional property DriverMemory.  This should be configured to be approximately 25% less than the total JVM memory (the -Xmx value) set in KJWizard.ini.

The SparkConnections.ini configuration file can be further configured to tune the Spark memory.

Please restart the Desktop client after making configuration changes.

e.g. example Automated Desktop memory and Spark configuration settings

In Spark.cfg

Spark.DriverMemory=6144

In KJWizard.ini

vmarg.1=-Xmx8096m

In SparkConnections.ini

SparkConnection.MY_HIVE_DSN.native."spark.driver.maxResultSize"="4g"


Spark/YARN Connectivity

Issue

Virtual Private Network (VPN) connection issue (mainly Desktop).

Native Spark Modeling uses YARN for the connection to the Hadoop cluster.  There is a limitation that the connectivity does not work over VPN.

Solution

Revert to non-VPN connection or connect to a terminal/Virtual Machine that can connect to the cluster without the VPN.


Issue

Single SparkContext issue (Desktop only).

A SparkContext is the main entry point for Spark functionality.  There is a known limitation in Spark that there can be only one SparkContext per JVM.

For more information see https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/SparkContext.html

This issue may appear when a connection to Spark cannot be created correctly (e.g. due to a configuration issue) and subsequently the SparkContext cannot be restarted.  This is an issue that only affects the Desktop installation.

Solution

Restart the Desktop client.

 

Issue

Get error message

Unexpected Spark internal error.  Error detail: Cannot call methods on a stopped SparkContext

Solution

Troubleshoot by looking in the diagnostic messages or logs on the cluster (for example using the web UI).

One possible cause is over-committing CPU resources in the SparkConnections.ini configuration file.

 

Example of Hadoop web UI error diagnostics showing an over-commit of resources:

OvercommitResources.png

SparkConnections.ini file content with too many cores specified:

OvercommitResourcesSparkConnections.png
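As a sketch of the kind of correction involved (the property names reuse the native."…" pattern shown in the memory tuning section above; the values are placeholders to adapt to your cluster capacity), the number of requested executors and cores can be lowered in the SparkConnections.ini file:

SparkConnection.MY_HIVE_DSN.native."spark.executor.instances"="4"

SparkConnection.MY_HIVE_DSN.native."spark.executor.cores"="2"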


 

Hive

Issue

Hive on Tez execution memory issue

Scope: Hortonworks clusters only (with Hive on Tez) and Data Manager functionality.

Hortonworks HDP uses Hive on Tez to greatly improve SQL execution performance.  The SQL generated by the Data Manager functionality for Analytical Data Sets (ADS) can be complicated, and there is a possibility that the Tez engine will run out of memory with default settings.

Solution

Increase the memory available to Tez through the Ambari web administrator console.

Go to the Tez configs under Hive and change the tez.counters.max setting to 16000.  It is also recommended to increase the tez.task.resource.memory.mb setting.  Restart the Hive and Tez services after this change. If this still does not work, it is possible to switch the execution engine back to MapReduce through Ambari.

 

 

Issue

It is possible to set the database name in the ODBC Hive driver connection configuration.  For example, instead of using the "default" database, it is possible to configure a different database in the ODBC Administrator dialog on Windows or the ODBC connection file for the UNIX operating system.

Native Spark Modeling requires the default database for the Hive connection.

Solution

Keep the database setting as "default" for the Hive DSN connection.  It is still possible to use a Hive table/view in a database other than default.

HiveDriverDatabaseSetting.png

 

Data Manager

Issue

A user-defined target field in a time-stamped population is not included in a temporal ADS (Analytical Data Set), i.e. when you train your model using Data Manager with a "Time-stamped Population" that has a target variable, the target variable may not be visible in the list of variables in the Modeler.

Solution

If you want to include the target field, you can either have it as part of the original data set or define a variable (with the relevant equation) in the "Analytical Record" instead.

 

Metadata Repository

Issue

The metadata repository cannot be in Hive.  Also, output results cannot be written directly into Hive from in-database Apply (model scoring) or Model Manager.

Solution

Write the results to the local filesystem instead.

Let’s review Hadoop and SAP Predictive Analytics features in detail and see how these solutions can be used in the scenario below.

http://scn.sap.com/community/predictive-analytics/blog/2016/03/14/sap-predictive-analytics-on-big-data-for-beginners

In this scenario, customers’ online interactions with the various products need to be captured from several weblogs to build predictive models that can later recommend products to the right group of customers. To harness the power of Big Data, you need an infrastructure that can manage and process huge volumes of structured and unstructured data in real time, and can protect data privacy and security.

In the above scenario the weblog dataset can be stored in a Hadoop environment. So, what is Hadoop? Let’s have a look at this technology first.

 

Data storage: Hadoop

Hadoop is an open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment (clusters). It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop is becoming widely popular among the big data user community mainly because of:

 

High Reliability:

Hadoop makes it possible to run applications on systems with thousands of nodes analyzing terabytes of data. Its distributed file system facilitates rapid data transfer among nodes and allows the system to continue operating uninterrupted in the case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.

 

Massive parallel processing:

Hadoop has been a game-changer in supporting the enormous processing needs of big data. A large data-processing job that may take hours on a centralized relational database system may take only a few minutes when distributed across a large Hadoop cluster of commodity servers, as the processing runs in parallel.


Distributive Design:

Hadoop is not a single product; it is an ecosystem. The same goes for Spark. Let’s cover them one by one.

 

Source: Wikipedia

 

hadoop.png

 

Hadoop Distributed File system

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.

 

YARN

Part of the core Hadoop project, YARN is a next-generation framework for Hadoop data processing extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.

 

HCatalog

A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.

 

hcatalog.png

MapReduce Framework (MR)

MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.

 

Hive

Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.

 

Source: Hortonworks


Key Advantages of implementing Big Data projects using SAP Predictive Analytics

Easy to use Automated Analytics


click.png

  • Business users can generate predictive models in just a few clicks using Automated Analytics.
  • Predictive Models on top of Big Data can be managed through Model Manager
  • Deployment of the predictive models in Big Data systems can be automated from Automated Analytics (code generation or SQL generation)
  • You can monitor model performance and robustness automatically through the auto generated graphical summary reports in Automated Analytics.

 

Connect SAP Predictive Analytics to Big Data


Using the Hive ODBC driver, you can connect SAP Predictive Analytics to Hadoop.

 

Hive ODBC connector:

Apache Hive is the most popular SQL-on-Hadoop solution and includes a command-line client. The Hive server compiles SQL into scalable MapReduce jobs. Data scientists or expert business users with good SQL knowledge can build an analytical data set in the Hive metastore (the service that stores the metadata for Hive tables and partitions in a relational database) by joining multiple Hive tables. Users can then connect to the Hadoop system through the Automated Analytics Hive connector to build predictive models using the analytical datasets (already persisted in Hadoop).

 

The DataDirect ODBC driver for Hive is prepackaged with the SAP Predictive Analytics installation and lets you connect to a Hive server without needing to install any other software. Hive technology allows you to use SQL statements on top of a Hadoop system.


  • From the user desktop, you can open the ODBC console to create a new Hive ODBC connection.     odbc.png
  • You will need to enter the host name of the Hive server, the port number (the default is 10000), and the wire protocol version (2 for Hive Server 2).

        conect2.png

  • In the General tab, check the option “Use Native Catalog Functions”. Then click “Test Connection” to make sure the connection is working.

        con3.png
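On UNIX, the same settings go into an odbc.ini entry instead. The sketch below is only an illustration: the driver library path and the exact key names depend on the DataDirect driver shipped with your installation, so treat them as placeholders and verify them against the connectivity guide referenced below:

[MY_HIVE_DSN]
Driver=/opt/sap/PredictiveAnalytics/odbc/lib/hive.so
HostName=my-hive-server
PortNumber=10000
WireProtocolVersion=2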

 

For detailed information on how to set up your Hive connectivity, see the Hive section in the Connecting to your Database Management System on Windows or Connecting to your Database Management System on UNIX guide on the SAP Help Portal.

 

Predictive Modelling Process on HIVE

 

Create very wide datasets using Data Manager


The options below enable users to prepare datasets in Hadoop for predictive modelling.

 

  • The Data Manager allows you to prepare your data so it can be processed in Automated Analytics. It offers a variety of data preparation functionalities, including the creation of analytical records and time-stamped populations.

PD.png

An analytical dataset can be prepared by joining multiple Hadoop tables via Data Manager in Automated Analytics. Using the Hive ODBC connector, SAP Predictive Analytics can connect to the Hive data store, and you can build predictive models on the Hive table dataset.

 

 

  • To create an analytical dataset, open the SAP Predictive Analytics application and click on the Data Manager section. Then select the Hive connection to connect to the Hadoop server.

     con4.png

  • Select the Hive tables and join them with the keys to create the analytical dataset.

     cc1.png

  • An aggregate has been added in Data Manager, for example to get the number of clicks for a node/page.

       bc.jpg

 

Improve operational performance of predictive modelling


  • Automated Analytics gives you the flexibility to scale the performance of the model linearly with the number of variables through a user-friendly, self-guided screen.
  • In the Automated Analytics Modeler, a predictive model can be trained and in-database scores can be generated directly on the Hadoop server on Hive tables (the “Use Direct Apply in the Database” option should be checked).

     apply.png

Hadoop support in SAP Predictive Analytics


1. Supports the full end-to-end operational predictive life-cycle on Hadoop: data preparation through Data Manager, model development through Modeler, and deployment through generated Spark/Hive SQL or Java code.

2. Automated Analytics support through Hive and Spark SQL ODBC drivers. No coding required.

3. Expert Analytics support through Hive connectivity.

4. Technical details:

Supports Hive, with data preparation and scoring performed directly in Hadoop (no data transfer).

With the new release of SAP Predictive Analytics, model training can also be performed on Spark.

 

For more details on this topic please stay tuned for the upcoming blog by my colleague Priti Mulchandani.

This blog series will provide you with an overview of the various Big Data technologies around Hadoop that are supported in SAP Predictive Analytics. It will also cover how SAP Predictive Analytics can be used to apply predictive techniques to Big Data on Hadoop.

 

Introduction

Big Data is more than just a buzzword nowadays; it's changing the way customers run their business. To uncover previously unknown information and bring actionable insights into the business, all the information generated by the business needs to be stored. Hadoop, being a more scalable storage platform than other databases, is becoming very popular among customers. The objective of this blog is to get you acquainted with Big Data technologies and briefly describe how you can use them with SAP Predictive Analytics.

 

What is Big Data?

Gartner analyst Doug Laney came up with the famous three Vs back in 2001. The 3 Vs are the patterns most commonly observed in Big Data.

 

Source: Forbes

3v.png

Big Data means a dataset so large that it cannot be processed using traditional computing techniques. Big Data is not merely a type of data; rather, it has become a complete subject involving various tools, techniques, and frameworks.

 

  • Volume refers to the vast amount of data generated every second. Think of all the emails, Twitter messages, photos, video clips and sensor data that we produce and share every second. We are not talking terabytes, but Yottabytes or Brontobytes of data.

          bronto.png

         

  • Velocity refers to the speed at which new data is generated and the speed at which data moves around. For example think of social media messages going viral in minutes, or the speed at which credit card transactions are checked for fraudulent activities.

 

  • Variety refers to the different types of data we can use. Data generated from the "Internet of Things" is growing exponentially (for example, sensors on planes generating tons of data).

 

Source: Blog.sap.com, HP

 

Big Data involves the data produced by different devices and applications. Below are the types of data that fall under the umbrella of Big Data.

 

  • Structured data: Relational data, etc.
  • Semi Structured data: XML data, etc.
  • Unstructured data: Word, PDF, Text, Media Logs, weblogs, etc.

 

SAP Predictive Analytics on Big Data

Example Scenario: Increasing online sales by analyzing weblogs

 

In this section, let us review a typical predictive scenario for an online retail store and understand how the SAP Predictive Analytics solution and Big Data technology (Hadoop) can work together. Nowadays, online retailers collect massive amounts of data through access to clickstream data, user profiles, advertising data, and social network data – just to name a few. This huge amount of data can be stored in a Hadoop cluster. The Hadoop system can be scaled up very easily to store and manage this continuously growing data generated from all kinds of sources. Hadoop also offers in-memory engines like Spark and prepackaged machine learning libraries like MLlib that can be used to build predictive models efficiently.

A typical SAP Predictive Analytics project in a Big Data scenario would be to connect to the Hadoop system, prepare a meaningful dataset in-database, train the predictive model, and then build and deploy that model in the Hadoop system.


SAP Predictive Analytics enables the analyst to create predictive models that can identify the key influencers of customers going through with an online purchase.

 

For example, we may find out that customers under 25 are more likely to purchase products after 1 am on weekends, when certain types of advertising are shown and when they are redirected from YouTube.

 

Using the clustering module in SAP Predictive Analytics, a marketing manager can identify customer groups that have similar characteristics. These clusters can then be used in future targeted marketing campaigns.

 

weblog.png

 

In the next blog "Working with SAP Predictive Analytics and Big Data to Increase Online Sales by Analyzing Weblogs" I have discussed how SAP Predictive Analytics can be used to build predictive models for this Big Data scenario.


Hi,

 

Several users of SAP Predictive Analytics 2.4 encountered issues when installing R for the first time.

 

There have been some changes in the R packages delivered with the version for which we provide a download link from the Expert Analytics module.

 

For example, the pbkrtest package is missing, which leads to the following error message when you try to use the R-CNR-Tree algorithm:

 

     Error from R: Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :

       there is no package called 'pbkrtest'

 

Here is the link describing the package: https://cran.r-project.org/web/packages/pbkrtest/index.html

 

After digging a little, it seems that other packages, like caret, have also changed since the 3.2.1 version.

 

So let's fix this one!

 

First let's check why we have some troubles:

  • Open R Studio (by running the shortcut on the desktop or in the start menu, or "C:\Users\Public\R-3.2.1\bin\x64\Rgui.exe")
  • Type in

require ("car")

  • You will get the following message

Loading required package: car

Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :

  there is no package called ‘pbkrtest’

In addition: Warning message:package ‘car’ was built under R version 3.2.3

  • Type in

install.packages("pbkrtest")

  • You will get a prompt to select a CRAN mirror

--- Please select a CRAN mirror for use in this session ---

  • But this will result in the following message:

Warning message:

package ‘pbkrtest’ is not available (for R version 3.2.1)

 

 

So we need to manually install the package. Download the pbkrtest source archive (pbkrtest_0.4-4.tar.gz) from the CRAN page linked above, save it (here to C:\temp), then run the following from a DOS prompt:

"C:/Users/Public/R-3.2.1/bin/x64/R" CMD INSTALL -l "C:\Users\Public\R-3.2.1\library" "C:\temp\pbkrtest_0.4-4.tar.gz"

These are the logs you should get:

 

* installing *source* package 'pbkrtest' ...

** package 'pbkrtest' successfully unpacked and MD5 sums checked

** R

** data

** inst

** preparing package for lazy loading

Warning: package 'lme4' was built under R version 3.2.3

** help

*** installing help indices

** building package indices

** installing vignettes

** testing if installed package can be loaded

Warning: package 'lme4' was built under R version 3.2.3

* DONE (pbkrtest)

 

 

Now if you go back to R Studio and type in:

require ("car")

It should all be ok and you can start using SAP Predictive Analytics.

 

However, if you get the following message while installing pbkrtest:

 

Warning: package 'lme4' was built under R version 3.2.3

Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :  namespace 'nlme' 3.1-120 is being loaded, but >= 3.1.123 is required

Error : package 'lme4' could not be loaded

ERROR: lazy loading failed for package 'pbkrtest'

* removing 'C:/Users/Public/R-3.2.1/library/pbkrtest'

 

Then run the following command from the R studio:

     install.packages("lme4")

 

This should give you the following output:

     trying URL 'http://cran.irsn.fr/bin/windows/contrib/3.2/lme4_1.1-11.zip'

     Content type 'application/zip' length 4746682 bytes (4.5 MB)

     downloaded 4.5 MB

 

     package ‘lme4’ successfully unpacked and MD5 sums checked

 

     The downloaded binary packages are in

             C:\Users\i304811\AppData\Local\Temp\Rtmp2xL4SS\downloaded_packages

 

Then run the following from a DOS prompt:

    "C:/Users/Public/R-3.2.1/bin/x64/R" CMD INSTALL -l "C:\Users\Public\R-3.2.1\library" "C:\Users\i304811\AppData\Local\Temp\Rtmp2xL4SS\downloaded_packages\lme4_1.1-11.zip"

 

And then run the following command line again from a DOS prompt:

"C:/Users/Public/R-3.2.1/bin/x64/R" CMD INSTALL -l "C:\Users\Public\R-3.2.1\library" "C:\temp\pbkrtest_0.4-4.tar.gz"

 

Hope this helps!

 

@bdel

Hello dear Predictive community!

 

There is an active Idea Place for SAP Predictive Analytics where you can submit your product improvement ideas and get people to vote for them.

 

This place is monitored on a regular basis by our Product Management team, and the most relevant ideas get a chance to be integrated into our future releases.

 

The "top" ideas (looking at the number of votes) as of now are the following.

 

 

You certainly have your own ideas, so please do not wait and feel free to create them!

 

Looking forward to your feedback and great product ideas!

 

Antoine Chabert

Hi,

 

As part of my day-to-day activities, I check the Idea Place for Predictive and try to provide feedback on some of the items.

 

One of them was about the need to handle Multi-class Classification in Automated Analytics:

    https://ideas.sap.com/D32143

 

Context

 

First let's define what a Multi-class Classification model is:

“In machine learning, multi-class or multinomial classification is the problem of classifying instances into one of more than two classes (classifying instances into one of two classes is called binary classification). While some classification algorithms naturally permit the use of more than two classes, others are by nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of strategies.”

 

There are two ways to address a Multi-class Classification problem:

  • One-vs.-rest (OvR)

The one-vs.-rest (or one-vs.-all) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for its decision, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample.

  • One-vs.-one (OvO)

In the one-vs.-one (OvO) approach, one trains K (K − 1) / 2 binary classifiers for a K-way multiclass problem; each receives the samples of a pair of classes from the original training set and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K (K − 1) / 2 classifiers are applied to an unseen sample, and the class that gets the highest number of "+1" predictions is predicted by the combined classifier.
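To make the difference concrete: with the 26 classes "A" to "Z" used later in this post, OvR needs 26 binary classifiers (one per class), while OvO needs 26 × 25 / 2 = 325 binary classifiers (one per pair of classes).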

 

 

Source: https://en.wikipedia.org/wiki/Multiclass_classification

 

Out of the box, the Automated Analytics mode of SAP Predictive Analytics only provides a way to build binary classifications. The Expert Analytics mode may provide a way to handle this using one of the out-of-the-box algorithms, and certainly via an open-source R script. But this blog post will focus only on the Automated Analytics mode, and we won't discuss the pros and cons of OvR versus OvO.

 

Approach

There are multiple ways to handle an “n-way” multi-class model problem:

  • The “multi-target model” approach
    • Prepare a data set with n target variables for OvR or n * (n − 1) / 2 target variables for OvO
    • Build one model with all the targets

         

          The final model will probably be the worst, as it will have to fit all the targets and will not be optimal (encoding, binning, variable reduction, etc.).

 

  • The “build one, then replicate” approach
    • Prepare a data set with n target variables for OvR or n * (n − 1) / 2 target variables for OvO
    • Build one model with one of the targets
    • Then replicate using a KxShell script to run the n-1 other models for OvR or (n * (n − 1) / 2) - 1 other models for OvO

 

        The final models will be optimal for each target (encoding, binning, variable reduction, etc.), but there will be many models to build.

 

We will focus on the “build one, then replicate” approach, as this will provide more "optimal" models, and since SAP Predictive Analytics provides all the tools to "productize" models en masse, the number of models won't be an issue.


Now, the tricky part is how to prepare the data set.


Prepare the Data Set

I’m a lazy guy, so I don’t want to build a fixed data set with n target variables for OvR or n * (n − 1) / 2 target variables for OvO, because if a new class appears, I will have to modify my data set to add the new class, and that's the last thing I want to do!!


This is why I love Data Manager so much!

 

I will assume that everyone knows the different elements in play to build your Analytical Data Set in Data Manager.

Anyway, if you don't, here is a short summary of the objects that need to be created:

  • Entity: the subject of the analysis, your customer id or event id
  • Time Stamp Population: the list of entities to be used for training or scoring your model at a reference date (snapshot). It also includes your target if the population is to be used for training purpose
  • Analytical Record: the list of attributes to be associated with the entity at that reference date (time stamp)

 

So how to handle "Multi-class Classification in Automated Analytics" with Data Manager?

You will only need one Time Stamp Population! And you will be able to handle both OvR and OvO!

 

Let’s take an example. Our Multi-class Classification will have 26 classes, from “A” to “Z” (but they could just as well be “1” to “26”).


Time Stamp Population for OvR:

  • I will assume that you already have your “class” variable/attribute with a value between “A” to “Z” available in your Timestamp Population (via a merge, a condition etc.)
  • You will need a prompt that will define the “one” class you want to use versus the rest. Let’s say it’s a String and the default value will be “A”

2016-03-09_12-03-53.jpg

  • Once defined, we will use the prompt in a condition/expression to generate the target

2016-03-09_12-09-38.jpg

  • And save it as "KxTarget" (this naming convention ensures that the target variable is surfaced)

2016-03-09_12-10-57.jpg

  • Now you have your target variable defined

2016-03-09_12-11-50.jpg

  • Click “Next”, and switch to the “Target” tab where you can assign your target

2016-03-09_12-12-47.jpg

  • If you click on “View Data”, you will get a prompt asking you for the “One” class you want to use

2016-03-09_12-14-41.jpg

 

Time Stamp Population for OvO:

  • I will assume that you already have your “class” variable/attribute with a value between “A” to “Z” available in your Timestamp Population (via a merge, a condition etc.)
  • You will need a prompt that will define the “one” class you want to use versus the rest. Let’s say it’s a String and the default value will be “A”

2016-03-09_12-20-50.jpg

  • You will need a prompt that will define the “other one” class you want to use versus the rest. Let’s say it’s a String and the default value will be “B”

2016-03-09_12-18-21.jpg

  • Once defined, we will use the prompt in a condition/expression to generate the target like in OvR, so that KxTarget = 1 means it’s equal to “TheOne” and KxTarget = 0 means it’s equal to “TheOtherOne”
  • Then you will need to define a filter to exclude everything but the class equal to the “TheOne” or “TheOtherOne”

2016-03-09_12-22-56.jpg

  • Now you have your target variable defined. Click “Next”, and switch to the “Target” tab where you can assign your target as for OvR
  • If you click on “View Data”, you will get a prompt asking you for the “One” class and the "Other One" class you want to use

2016-03-09_13-23-31.jpg

 

So we are done with the data set generation. Let's build the models!

 

Build the models

 

So when you use Data Manager while building your classification, you will get the prompt popup that will ask you to enter the values to be used to extract the data set. Here is an example with OvR and an additional prompt:

2016-03-09_13-30-05.jpg

You can click "Next", "OK", "Analyze", "Next", "Next" to reach the last step before creating the mode itself for class "A".

2016-03-09_13-32-31.jpg

 

Using KxShell Scripts

Click on "Export KxShell Script..." and save the "Learn" script on your Desktop for example.

2016-03-09_13-34-48.jpg

If you open it in a text editor, you will find that the prompt values are stored in KxShell "macros" (like programming variables).

2016-03-09_13-36-10.jpg

So, if I want to train that model using the generated script, I will have to execute the following command in a DOS prompt:

     "C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP\KxShell.exe" "C:\Users\i304811\Desktop\learn.kxs"

 

Now if I want to run it for class "B", I will run:

     "C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP\KxShell.exe" "C:\Users\i304811\Desktop\learn.kxs" -DTRAINING_STORE_PROMPT_1=B

 

and you can alter any of the macros from the script in the command line.

 

For the OvO approach, the same logic applies, except that you will need to build n * (n − 1) / 2 models, which may require a little script to do the iteration properly.
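As a rough sketch of such an iteration for the OvR case, the Windows batch file below reuses the KxShell path and the TRAINING_STORE_PROMPT_1 macro shown above (adapt the class list and file locations to your own setup; for OvO you would nest a second loop and override the second prompt's macro as well):

@echo off
set KXSHELL="C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP\KxShell.exe"
rem Train one OvR model per class by overriding the prompt macro
for %%c in (A B C D E F G H I J K L M N O P Q R S T U V W X Y Z) do (
    %KXSHELL% "C:\Users\i304811\Desktop\learn.kxs" -DTRAINING_STORE_PROMPT_1=%%c
)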


Hope this was helpful, and of course feel free to comment.

 

PS: I tried to keep the flow simple, so I may have taken some shortcuts or been brief in the explanations to keep the entry short.

The next major release of SAP Predictive Analytics (3.x product line) is currently planned to be released in the second half of 2016.

 

SAP plans to remove Solaris operating system support in this major release, in order to speed up product innovation deliveries on other operating systems, including 64-bit Windows and Linux.

 

The last SAP Predictive Analytics release that will provide Solaris operating system support is a minor release (version 2.5, part of the 2.x product line) that should be released during the first quarter of the year 2016. 

 

For customers who wish to continue using SAP Predictive Analytics (SAP InfiniteInsight) on the Solaris operating system, the end of mainstream maintenance dates are the following:

  • SAP InfiniteInsight 7.x for Solaris: December 31, 2018
  • SAP Predictive Analytics 2.x for Solaris: February 10, 2017

 

To take advantage of the new features of SAP Predictive Analytics 3.x, we recommend that you contact customer support and adopt one of the supported operating systems (the same as in the current PAM, with the exception of Solaris).

 

For more information about this communication, feel free to contact me directly via email (see my SCN profile).

You are an SAP partner and want to shine within the SAP ecosystem?

You have built some expertise in predictive analysis and want all SAP Predictive Analytics customers to know about it?

Your customers expect you to deliver cloud solutions embedding high-end predictive technology?

 

As a member of the Analytics Product Management family, I’m happy to present two programs which have been built for you.

 

HANA Cloud Platform predictive services

Just a week ago, we released our HANA Cloud Platform predictive services.

I warmly encourage you to read Ashish’s blog to learn more about this offering, but in a nutshell, you can now leverage the power of HANA Cloud Platform and the richness of SAP Predictive Analytics technology to build ad-hoc cloud applications for your customers. The easy-to-use RESTful web services help you focus on the business value, as you need neither data science nor deep development skills to leverage them.

As we are convinced that this mix offers huge potential, we’re actively looking for early adopters with whom we want to collaborate and help build solutions that address real-life use cases.

 

 

SAP Analytics Extensions Directory

The other opportunity I want to shed some light on is the SAP Analytics Extensions Directory.

As you can see at first glance, this website showcases partner extensions so that SAP customers can build their solutions with a mix of products from SAP and extensions (add-ons, tools, content) from partners.

The beauty of it is that we help customers identify partners that can help them solve their business needs. Once the connection is established, we, SAP, are no longer in the middle. In other words, the relationship with customers is up to you, partners, as is the business model you want to use (paid extensions, or free extensions plus paid services).

In the context of SAP Predictive Analytics, this is a unique opportunity to showcase your know-how by publishing ad hoc R algorithms you have developed.

If we add that in the upcoming SAP Predictive Analytics 2.5 (planned to be generally available in the near future) there will be an icon linking to the directory, plus a way to encrypt your R code (and protect your IP), I guess the benefit for partners of participating in the program is obvious.

Not to mention the fact that participation in this program is totally free of charge.

 

Still need yet another argument to jump in?

The directory recently opened and the traffic is already pretty high.

Don’t miss out on this opportunity to showcase your expertise in predictive analysis, increase your visibility, and ultimately grow your business!
