Box Plots in R+HANA

Former Member · ‎09-14-2016

I decided to write this blog after facing an issue that it seems it does not have a simple solution in the standard SAP BI frontend tools, so here is the challenge;

Challenge: To be able to make a boxplot chart to obtain the data distribution of sensors with millions of data points per year, you want to chart for each sensor one box per month (Data on HANA).

So lets review the standard SAP frontend suit as September 2016,

- Lumira, comes with a very neat boxplot solution which im using extensively, works like a charm...for less than 10,000 data points

- Design Studio, does not comes with boxplot solution out of the box, there are some SDK code that can be use to make one.

There is a couple of forum threads that highlight the issue quite clearly;

- Boxplots in Lumira How to define measure in HANA Studio for Lumira Box Plot

- Boxplots in Design Studio Lessons learned in creating SDK extensions. (yep he coded the whole thing)

Solution: As the chart was for an offline report to a client i decided it was a good chance to put those R skills to work, R have one of the most advance chart capabilities around.

Here is the step by step, assuming you don't even have R installed;

(*Note) For this exercise you just need the HANA server and your laptop/desktop, you will use R as a desktop app to analyze and chart the data.

Lets install R -> How to Install R - dummies (sorry for the "dummies" it picked up from the web page title)
Now lets install Rstudio (a nicer frontend for R) -> How to Install and Configure RStudio - dummies (same here)
Now we need two libraries, one for the OBDC connector to HANA and the nicer plot library (in R there are many way to do the same thing, in this case there is a base plot library, we will go for the nicer one as it the one it will be expected in a work environment)

First open Rstudio and go to Tools -> Install Packages and then install RODBC and plotly

4. Now we create the ODBC connection, please follow the steps in point 1) on this guide (also is a very good read on more details on how HANA+R connects) -> SAP HANA: My experiences on using SAP HANA with R

5. Below if the full code (start a new R script as in the image below), then i explain steps by step what each lines is used for.



library(RODBC)

library(plotly)

ch<-odbcConnect("hana_odbc_connection",uid="hana_user",pwd="*****")

sensors<-sqlQuery(ch,"SELECT * FROM \"SCHEMA_NAME\".\"T_MD_METERS\"")

for (meter in sensors$meter){

  q1<-paste0('SELECT * FROM "SCHEMA_NAME"."TABLE/VIEW_SENSOR_DATA" WHERE "Record_Flag" = \'OK\' AND "Sensor" = \'',meter,'\'', sep="")

  res<-sqlQuery(ch,q1)

  names(res)[2]<-'CWF'

  res$year_month<-substr(res$Time,0,7)

  bp<-ggplot(res, aes(x=year_month, y=CWF, fill=year_month)) + geom_boxplot()

  fn<-paste0(meter,'.jpg')

  ggsave(filename=fn, plot=bp)

}

(Note*) Full operational code in R version 3.3.1, volume of data in HANA aprox. 100 million Records for a full year of data for 40+ sensors

(Note*) You will see "<-" and "$" a lot, so it may deserve an explanation by itself, for "<-" in R you dont actually write code to define variables name or type, the action of "definition"&"assign data" happens all at the same type, so for example if i write "a<-1" immediately R interprets as Variable "a" type Vector (simple list of values) type Numeric have in its first position the value "1" (cool right?), this type of assignment works from simple values to chart, img, files, connections to DB, etc. For "$" is simply how you refer an specific column in a matrix using the column name, so for a matrix/data frame for an entity cars you will have the columns: maker, model, type, so if you just want to work or reference only the maker you will go for cars$maker

a. Call to libraries, this will load to your session the needed set of functions for your code to work, in this case we will be using functions to connect to a ODBC database (HANA) so we call RODBC and the prettier version of a boxplot chart function from "plotly" library



library(RODBC)

library(plotly)

b. Now we use the RODBC function to create a connection from R to HANA, for that we use odbcConnect with the key parameters as the ODBC connection name (as defined on step 4) HANA user name and password, finally we pass this function call to "ch"



ch<-odbcConnect("hana_odbc_connection",uid="hana_user",pwd="*****")

c. Here is when my first actual access to HANA occurs, we are bringing the list of sensors in the database from the table that contains the list of them all, for this we use sqlQuery function from RODBC package as well, which requires a connection (defined in "ch") and a query in HANA SQL valid language, the parsing of the querys can be a bit of a pain, but there is a couple of examples to guide in the code. If all successful the variable "sensors" will receive a table of X number records and X number of columns in this case (all sensors and its attributes), by default this transform "sensors" into a data frame (kind of internal memory table).



sensors<-sqlQuery(ch,"SELECT * FROM \"SCHEMA_NAME\".\"T_MD_METERS\"")

d. Then we start a loop operation (there is MANY other ways of doing this loop bit), i used "for" just as familiarity and the limited number of loops. this will execute for each meter as there are quite a few is not realistic to do it manually hence we automate this part, for just a few records this loop can be omitted



for (meter in sensors$meter){

e. now we elaborate the query to the HANA table using a variable component (the sensor name contained for each loop in "meter") for it we use "paste0" which is a concatenate function (part of the base R package so does not require extra libraries to be called) then we pass the query as a string vector to "q1"



  q1<-paste0('SELECT * FROM "SCHEMA_NAME"."TABLE/VIEW_SENSOR_DATA" WHERE "Record_Flag" = \'OK\' AND "Sensor" = \'',meter,'\'', sep="")

f. We execute the query and pass the results to "res", the data set will contain at least a time dimension, the sensor ID and the metric we want to plot



res<-sqlQuery(ch,q1)

g. We apply a couple of data manipulation functions (optional and very specific to my scenario, but worth mentioning) the first use the R base function "names" to change the name of a column, why do i need to do this? because the name of the column in HANA contain a SPACE and it is a bit hard to reference it later, then i use the function "substr" that is exactly what it looks like it is meant for, from a date "2016-09-01 12:01:01 PM" it will retrieve just "2016-09" and allow the plot function to aggregate per year-month, note that the column "year_month" does not exist in HANA and it is created on the fly in this code line.



  names(res)[2]<-'CWF'

  res$year_month<-substr(res$Time,0,7)

h. Finally the plot itself, to fully explain "ggplot" usage i will need a blog for itself, but in short is quite a powerful chart function that can plot most of all the basic and advance plots types with an extensive customizing capabilities even capable of overlying plot layers (a boxplot + a line chart) see more here (Quick-R: ggplot2 Graphs) in short we are calling ggplot from the plotly library with the parameters x axis = year_month, y axis = sensor flow metric (CWF) and plot type = geom_boxplot then the outcome we pass it to "bp", if you want to see the boxplot on your screen you just execute the line without the variable assignment ("bp<-") on the console



  bp<-ggplot(res, aes(x=year_month, y=CWF, fill=year_month)) + geom_boxplot()

i. finally we wrap it up creating a name for the file (one plot per sensor) and saving it to the R root folder, for the file name we simply paste the sensor name in "meter" and the extension ".jpg", a path could be added here for other output folders. the we use the function from plotly again ggsave, that requires a plot and a path, so we pass the plot stored in "bp" and the file name in "fn", if all successful your root folder will start getting filled with boxplots.



  fn<-paste0(meter,'.jpg')

  ggsave(filename=fn, plot=bp)

(Final Note *) Since we are on R i finalize with a real life story on how to extend its use, the plot use on point 5.h raised some question about the validity of the data plotted as ZEROs should not be use for the plots, if you check the plot by itself you will rise the same question as Boxplots use 5% 25% median 75% 95% and outliers representation (math of it here Box-and-Whisker Plots: Five-Number Summary) so to prove yourself right you can use a couple of simple base R functions, see below the use of "min", "median" and "quantiles" quite prove that no zero's were included, sure you can simply search for zero's on the data but this also prove the chart right

Box Plots in R+HANA

Challenge: To be able to make a boxplot chart to obtain the data distribution of sensors with millions of data points per year, you want to chart for each sensor one box per month (Data on HANA).

SAP PI for Beginners

ABAP 7.40 Quick Reference

Fiori: technical installation and configuration of one app from A - Z