Technology Blogs by SAP
AndreasForster
Product and Topic Expert

Stratified sampling creates a subset of the data in which a selected variable has a similar distribution to the full dataset. This component adds this functionality to SAP Predictive Analytics, Expert Mode.

Disclaimer

Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.

Prerequisites

The R library caTools must be installed.

Limitations

Please let me know should you encounter any limitations.

Usage

These parameters can be set by the user.

Parameter | Description
Desired Split (Percent or Count) | Specifies the size of the stratified subset. You can enter a percentage (e.g. 0.3 for 30%) or the absolute number of records (e.g. 2000).
Stratification Column | The categorical column whose distribution will be reproduced in the stratified subset.
Random Seed | Numerical value that makes the random sample reproducible.
Label 1st Subset | Label that identifies the stratified subset in a newly added column.
Label 2nd Subset | Label that identifies the remainder of the dataset.

Output column added by this component

Column | Description
SplitLabel | Identifies which subset the individual record belongs to. See "Label 1st Subset" and "Label 2nd Subset" above.
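
To make the interplay of the parameters and the SplitLabel output more concrete, here is a minimal sketch in plain Python. It is not the component's actual R implementation (which relies on caTools); the function name add_split_label, the default labels "Sample" and "Remainder", and the rule that values below 1 are read as a fraction while values of 1 or more are read as a record count are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def add_split_label(rows, column, desired_split, seed=1,
                    label_first="Sample", label_second="Remainder"):
    """Return copies of `rows` with a SplitLabel key, stratified on `column`.

    Hypothetical helper for illustration; not the component's own code.
    """
    if desired_split <= 0:
        raise ValueError("desired_split must be positive")
    if not rows:
        return []

    # Assumption: values below 1 are a fraction, values of 1 or more a count.
    if desired_split < 1:
        fraction = desired_split
    else:
        fraction = min(1.0, desired_split / len(rows))

    rng = random.Random(seed)  # a fixed seed makes the sample reproducible

    # Group row positions by the value of the stratification column.
    strata = defaultdict(list)
    for index, row in enumerate(rows):
        strata[row[column]].append(index)

    chosen = set()
    for indices in strata.values():
        # Draw the same share from every stratum, rounded to whole records,
        # so the total can differ slightly from a requested count.
        k = round(fraction * len(indices))
        chosen.update(rng.sample(indices, k))

    return [
        {**row, "SplitLabel": label_first if i in chosen else label_second}
        for i, row in enumerate(rows)
    ]
```

Calling add_split_label(rows, "relationship", 0.3, seed=42) on a list of dictionaries would label roughly 30% of the records in each "relationship" category as "Sample" and the rest as "Remainder". Repeating the call with the same seed gives the same labels.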

How to Implement

The component can be downloaded as a .spar file from GitHub and then deployed as described here. To import it, use the option "Import/Model Component", which you will find by clicking the plus sign at the bottom of the list of available algorithms.

Example

You can try the Stratified Sampling component on the common Census01.csv file from Automated Mode, for instance. The file is automatically installed with SAP Predictive Analytics. In Version 2.3 you will find it in "C:\Program Files\SAP Predictive Analytics\Desktop 2.3\Automated\Samples\Census". Setting the desired split to 0.3 and the stratification column to "relationship" creates a sample with 30% of the records of the whole dataset. Because the stratification is based on the "relationship" column, the sample will have a very similar distribution in this column to that of the total dataset.
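
One way to check the claim about similar distributions is to compare the category shares of the stratification column in the labelled subset with those in the whole dataset. The following is a rough sketch in plain Python, assuming the result has been exported as a list of records. The helper names shares and max_share_gap, the label values, and the toy rows are hypothetical and do not come from Census01.csv.

```python
from collections import Counter


def shares(rows, column):
    """Relative frequency of each value of `column` in `rows`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def max_share_gap(rows, column, label_column, label):
    """Largest absolute difference in category share, subset vs. whole."""
    subset = [row for row in rows if row[label_column] == label]
    whole = shares(rows, column)
    part = shares(subset, column)
    return max(abs(whole[value] - part.get(value, 0.0)) for value in whole)


# Toy rows standing in for an exported result; not real Census01.csv data.
rows = (
    [{"relationship": "Husband", "SplitLabel": "Sample"}] * 3
    + [{"relationship": "Husband", "SplitLabel": "Remainder"}] * 7
    + [{"relationship": "Wife", "SplitLabel": "Sample"}] * 3
    + [{"relationship": "Wife", "SplitLabel": "Remainder"}] * 7
)
gap = max_share_gap(rows, "relationship", "SplitLabel", "Sample")
print(f"largest share gap: {gap:.3f}")
```

A gap close to zero indicates that the subset reproduces the distribution of the stratification column well. With a purely random, non-stratified sample the gap would typically be larger, especially for rare categories.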