
Welcome to part 4 in this series demonstrating how to analyze Wikipedia page hit data using Hadoop with SAP HANA One and Hive. In part 3 of this series, “Importing Wikipedia Hive data into SAP HANA One”, I discovered an anomaly in the March 2013 data. I will show how I used HANA Studio’s data preview feature to isolate the problem and then fix the resulting fact table. Next, I will show how to create dimension tables to add “color” to the data model so that we are not looking at coded values in our analysis of the data. The resulting data model will be the foundation for creating an Analytic View that I will cover in part 5. To put this blog into perspective, I will focus on the SAP HANA database and SAP HANA Studio components shown below.

                   

Let’s get started!

Cleaning data before adding to a fact table

If you have been following along, open up SAP HANA Studio and connect to your HANA database. In the last blog post, I showed how to use the Data Preview feature in HANA Studio to look at the number of page hits per month: in the object browser, navigate to the WIKIDATA schema -> Tables -> STAGE-PAGEHITS, right-click the table, and select the Open Data Preview command. Then, click on the Analysis tab. Drag the MONTH column into the Label axis region and then drag the PAGEHITCOUNTFORHOUR column into the Values axis to get a chart as shown below.
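If you prefer a SQL Console over the Data Preview UI, a roughly equivalent aggregation (a quick query of my own, not part of the load scripts) looks like this:

SELECT "MONTH", SUM("PAGEHITCOUNTFORHOUR") AS TOTALHITS
FROM "WIKIDATA"."STAGE-PAGEHITS"
GROUP BY "MONTH"
ORDER BY "MONTH";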

 

It looked odd that March has such a large number of page hits, so it’s time to dig in a little deeper. I am going to drag the DAY column into the Label axis region as well to look for the anomalies in March.

 

Clicking on the two abnormally long bars reveals that they correspond to Day=01 in March with a page count of 14,830,164,116 and Day=05 in March with a page count of 15,257,088,562. I am going to add the HOUR statistics as well by dragging it to the Labels axis. It looks like there are a few hours when something odd occurred.

 

To see the records in question, switch to the Raw Data tab. Click on the PAGEHITCOUNTFORHOUR column name twice to sort it in descending order.
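The same inspection can be done in SQL; this ad-hoc query (mine, for convenience) surfaces the largest hourly counts first:

SELECT "PROJECTCODE", "PAGENAME", "YEAR", "MONTH", "DAY", "HOUR", "PAGEHITCOUNTFORHOUR"
FROM "WIKIDATA"."STAGE-PAGEHITS"
WHERE "MONTH" = '03'
ORDER BY "PAGEHITCOUNTFORHOUR" DESC
LIMIT 20;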

You can see the four records with the zh code and very large page counts. zh is the code for the Chinese language, and these rows most likely resulted from a denial of service (DoS) attack on Wikipedia. This would not be the first time – see “Technical glitch causes Wikipedia outage”. In addition, the QQ and ICQ: pages relate to an instant messaging client, so one has to wonder if the attack was made through this client. Another interesting record is the large number of hits on the de.d – 0152 page. It turns out that this page in the German Wiktionary does not exist – it could be another DoS attempt. When it comes time to move these into the fact table, I’ll delete them since they are clearly a problem.

 

Checking for duplicate rows in the staging table

Initially, I thought that the large number of records was due to duplicate rows. This is because when I attempted to load the data into HANA in part three, the error list looked like this:

 

Notice the duplicate year, month, day, and hour values for the first two errors. These could add up. So, I came up with the following query to look at the number of duplicate records in the month of March.

SELECT "PROJECTCODE", "PAGENAME", SUM("PAGEHITCOUNTFORHOUR"), COUNT("PAGENAME"),
       "YEAR", "MONTH", "DAY", "HOUR"
FROM "WIKIDATA"."STAGE-PAGEHITS"
WHERE "MONTH" = '03'
GROUP BY "PROJECTCODE", "PAGENAME", "YEAR", "MONTH", "DAY", "HOUR"
HAVING (COUNT("PROJECTCODE") > 1) and (COUNT("PAGENAME") > 1) and (COUNT("YEAR") > 1)
   and (COUNT("MONTH") > 1) and (COUNT("DAY") > 1) and (COUNT("HOUR") > 1)
ORDER BY 3 DESC;

By using a combination of the GROUP BY clause on the fields that I want to compare for duplicates and the HAVING clause looking for a COUNT(field) > 1 on each of the fields, we can identify duplicate records as shown below.

 

As you can see, there are some relatively large values that could skew the results. The problem is, I don’t know if these are by design or just a bug with the way Wikipedia posts the data.
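To get a feel for how much the duplicates could skew the March totals, here is a sizing query of my own that builds on the check above; it sums only the page hits that sit in duplicated (PROJECTCODE, PAGENAME, YEAR, MONTH, DAY, HOUR) groups:

SELECT SUM(DUPHITS) AS TOTAL_DUPLICATED_HITS
FROM (
  SELECT SUM("PAGEHITCOUNTFORHOUR") AS DUPHITS
  FROM "WIKIDATA"."STAGE-PAGEHITS"
  WHERE "MONTH" = '03'
  GROUP BY "PROJECTCODE", "PAGENAME", "YEAR", "MONTH", "DAY", "HOUR"
  HAVING COUNT(*) > 1
) AS DUPS;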

Creating the Fact Table

Without going into a lot of detail, the fact table contains the specific data we want to track (measurements, metrics and facts of a business process) and foreign keys that form relationships to dimension tables that contain descriptive attributes. To learn more about how fact and dimension tables work together, check out the Data Warehouse topic on Wikipedia.

NOTE: For data warehouse experts, please show a little mercy when looking at the resulting design. I have intentionally bypassed creating surrogate keys in favor of natural keys to keep things simple. Given the repetitive nature of the data, I would expect even better compression if I separated PAGENAME out into a dimension table and used a surrogate key in the fact table.

In order to create the fact table from the staging table, we will need to do the following high-level tasks:

  1. Create the fact table
  2. Develop a SELECT statement that splits the language and Wikimedia project codes contained in PROJECTCODE, concatenates the Year, Month, and Day values into a natural key column (used to relate to a date dimension table), and inserts the data into the fact table. I will also need to convert the BYTESDOWNLOADEDFORHOUR column from a VARCHAR data type to a BIGINT data type to account for the original "NULL" values from the Hive import.
  3. Remove the dirty records from the fact table. NOTE: Normally I would delete the dirty records from the staging table, but I wanted to save time in preparing this article.

Create the fact table

It turns out that SAP HANA COLUMN tables are ideal for fact tables. This is because much of the data is repeated, and HANA compresses repeated values with several different encoding methods. Use the following statement to create the fact table:

CREATE COLUMN TABLE "WIKIDATA"."PAGEHITSFACTTABLE" (
  "PAGENAME" VARCHAR(2000),
  "YEAR" VARCHAR(4),
  "MONTH" VARCHAR(2),
  "DAY" VARCHAR(2),
  "HOUR" VARCHAR(2),
  "PAGEHITCOUNTFORHOUR" BIGINT,
  "BYTESPERHOUR" BIGINT,
  "EVENTDATE" VARCHAR(8),
  "LANGCODE" VARCHAR(50),
  "PROJCODE" VARCHAR(50)
);

Create computed columns

Next, I need to split the PROJECTCODE into two separate values. If the column value contains a '.' (period), then it is a combination of a language code and a wiki-project code; otherwise, it is the language code for the main Wikipedia project. I will use the code 'wp' for Wikipedia values in the fact table as a lookup value to the projects dimension table. Here is a test query to make sure I got the logic right.

SELECT DISTINCT "PROJECTCODE",
  Case When ("PROJECTCODE" like '%.%')
       Then LOCATE("PROJECTCODE",'.')-1
       Else 0
  End AS "POSITION",
  Case When ("PROJECTCODE" like '%.%')
       Then Substring("PROJECTCODE", 1, LOCATE("PROJECTCODE",'.')-1)
       Else "PROJECTCODE"
  End AS LANGCODE,
  Case When ("PROJECTCODE" like '%.%')
       Then Substring("PROJECTCODE", LOCATE("PROJECTCODE",'.')+1)
       Else 'wp'
  End AS PROJCODE
FROM "WIKIDATA"."STAGE-PAGEHITS"
WHERE "MONTH" = '03' AND "DAY" = '01' AND "HOUR" = '00' LIMIT 100;

The first Case When clause uses the like operator ("PROJECTCODE" like '%.%') to see if there is a period in the column value. If this evaluates to true, I use the locate function LOCATE("PROJECTCODE",'.') to find the position of the period. Subtracting 1 from that value gives the length of the language value. If the value does not contain a period, I return 0. Finally, I alias the resulting column as POSITION. NOTE: I will not use this debug value in the fact table query.

The second Case When clause performs the same ("PROJECTCODE" like '%.%') check, but this time uses the substring function Substring("PROJECTCODE", 1, LOCATE("PROJECTCODE",'.')-1) to extract the language code. If there is no period, the Else clause Else "PROJECTCODE" returns the column value, which is just the language code.

The third Case When clause extracts the project code located just after the period using Substring("PROJECTCODE", LOCATE("PROJECTCODE",'.')+1). If there is no period in the value, the Else clause returns 'wp' to denote the Wikipedia project.

NOTE: I added a WHERE clause to limit the data that HANA has to process. Here is the result:

To concatenate text from the YEAR, MONTH, and DAY columns into a new column called EVENTDATE, I will use the concatenation operator ||, as in "YEAR"||"MONTH"||"DAY" AS EVENTDATE. The EVENTDATE column will be used later to look up dates in the date dimension table I’ll create below.
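A quick way to sanity-check the concatenation is a throwaway query of my own like the following:

SELECT DISTINCT "YEAR", "MONTH", "DAY", "YEAR"||"MONTH"||"DAY" AS EVENTDATE
FROM "WIKIDATA"."STAGE-PAGEHITS"
WHERE "MONTH" = '03'
LIMIT 10;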

Load the data into the fact table

The following INSERT INTO … SELECT statement copies the data from the staging table into the fact table with the project and language values split out and with the new EVENTDATE column.

SELECT "PAGENAME","YEAR","MONTH","DAY","HOUR",

       "PAGEHITCOUNTFORHOUR",

Case When ("BYTESDOWNLOADEDFORHOUR" LIKE '%N%')

     Then 0

     Else TO_BIGINT("BYTESDOWNLOADEDFORHOUR")

End AS BYTESPERHOUR,

       "YEAR"||"MONTH"||"DAY" AS EVENTDATE,

Case When ("PROJECTCODE" LIKE '%.%')

     Then Substring("PROJECTCODE", 1, LOCATE("PROJECTCODE",'.')-1)

     Else "PROJECTCODE"

End AS LANGCODE,

Case When ("PROJECTCODE" LIKE '%.%')

     Then Substring("PROJECTCODE", LOCATE("PROJECTCODE",'.')+1)

     Else 'wp'

End AS PROJCODE

FROM "WIKIDATA"."STAGE-PAGEHITS"

INTO "WIKIDATA"."PAGEHITSFACTTABLE";

The Case When ("BYTESDOWNLOADEDFORHOUR" LIKE '%N%') clause looks for the original “NULL” values that contained a value of N instead of a number. If the LIKE condition is true, the value for BYTESPERHOUR returns as 0. Otherwise, use the TO_BIGINT() function to convert the numeric value stored in the VARCHAR column to a number.


Exclude bad records

I have noted the page hit counts of the outliers found while exploring the March data earlier in this post. Before I ever run a DELETE statement, I like to test the WHERE clause first. The query below should display the five suspect records that I will delete from the fact table.

SELECT * FROM "WIKIDATA"."PAGEHITSFACTTABLE"
WHERE "PAGEHITCOUNTFORHOUR" > 53000000;

Now that I am happy with the results, the following DELETE statement removes the five records.

DELETE FROM "WIKIDATA"."PAGEHITSFACTTABLE"
WHERE "PAGEHITCOUNTFORHOUR" > 53000000;

HANA Studio confirms the execution of the statement with 5 affected rows!
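To confirm that March now looks reasonable, you can re-run the per-day aggregation against the cleaned fact table (again, just a verification query of my own):

SELECT "MONTH", "DAY", SUM("PAGEHITCOUNTFORHOUR") AS TOTALHITS
FROM "WIKIDATA"."PAGEHITSFACTTABLE"
WHERE "MONTH" = '03'
GROUP BY "MONTH", "DAY"
ORDER BY "DAY";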

Now that we have our fact table, let us create the dimension tables.

Create the Wikipedia dimension tables

Time to create the dimension tables for projects and languages. I am going to create three dimension tables: DIMPROJECT, DIMLANGUAGE, and M_TIME_DIMENSION, the last of which HANA Studio generates in the _SYS_BI schema. I will first create two CSV files and import them into DIMPROJECT and DIMLANGUAGE. I’ll then show how to generate the M_TIME_DIMENSION table.

Creating the .csv file for the project dimension table

Let’s start with the projects. The “Page view statistics for Wikimedia projects” page contains the list of project abbreviations that includes the following:

wikibooks: ".b"

wiktionary: ".d"

wikimedia: ".m"

wikipedia mobile: ".mw"

wikinews: ".n"

wikiquote: ".q"

wikisource: ".s"

wikiversity: ".v"

mediawiki: ".w" 

The challenge is that when I browsed the data with the distinct values feature of HANA Studio, I found a few more codes than are listed above.
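The equivalent check in SQL, now that the fact table carries the split-out PROJCODE values, is a simple DISTINCT query (my own shortcut):

SELECT DISTINCT "PROJCODE"
FROM "WIKIDATA"."PAGEHITSFACTTABLE"
ORDER BY "PROJCODE";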

After digging around the “WikiMedia Projects” page and looking at the page names used in the missing values, I created a .CSV file with the unique project codes as follows:

wikibooks,b
wiktionary,d
wikimedia,m
wikipedia mobile,mw
wikinews,n
wikiquote,q
wikisource,s
wikiversity,v
mediawiki,w
wikivoyage,voy
wikimedia foundation,f
wikipedia,wp
wikimedia labs,labs
org,org
us,us
wikidata,wd

You can download the CSV file at http://wikipedia-proj-lang-codes.s3.amazonaws.com/UniqueProjectCodes.csv. You should save this file into your Downloads folder.

Creating the .csv file for the language dimension table

Wikimedia maintains a list of the languages at http://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered_by_code. Creating this list of over 300 languages by hand can be a bit tricky, so I decided to use Microsoft Excel to import the data. To save time, I used the new Microsoft Power Query for Excel add-in, which works with either Excel 2010 or 2013. You can download the add-in from here. To save space in the blog, I put together a short video that shows the steps.


You can download the CSV file at http://wikipedia-proj-lang-codes.s3.amazonaws.com/UniqueLanguageCodes.csv. You should save this file into your Downloads folder.

Creating the two dimension tables in HANA

To create the two dimension tables for projects and languages, execute the following commands:

CREATE COLUMN TABLE "WIKIDATA"."DIMPROJECT" (
  "ProjectTitle" VARCHAR(50) NOT NULL,
  "ProjectCode" VARCHAR(10) NOT NULL
);

CREATE COLUMN TABLE "WIKIDATA"."DIMLANGUAGE" (
  "LanguageCode" VARCHAR(25) NOT NULL,
  "Language" VARCHAR(50) NOT NULL
);

With smaller comma-separated files, you have a couple of options. You can upload the CSV files using the SFTP plug-in, as we did back in the “Importing Wikipedia Hive data into SAP HANA One” blog post, and then issue the following two IMPORT statements.

IMPORT FROM CSV FILE '/wiki-data/UniqueProjectCodes.csv'
INTO "WIKIDATA"."DIMPROJECT"
WITH RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ',';

IMPORT FROM CSV FILE '/wiki-data/UniqueLanguageCodes.csv'
INTO "WIKIDATA"."DIMLANGUAGE"
WITH RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ',';

Or, you can use the File | Import… command in SAP HANA Studio, guided by the blog post “Export and Import feature in HANA Studio”.

Without going into detail, here is how you would import the UniqueProjectCodes.csv file using HANA Studio assuming that you downloaded it from the S3 bucket at http://wikipedia-proj-lang-codes.s3.amazonaws.com/UniqueProjectCodes.csv.

  • Choose the File | Import… command
  • Type the word data into the Select an import source search box to locate Data from Local File and click Next >.
  • Select the HANA database you are working with and click Next >.
  • Click Browse and select the UniqueProjectCodes.csv file and click OK.
  • Select the Existing option under Target Table and then click the Select Table button.
  • Type DIMPROJECT in the Select Table dialog, select WIKIDATA.DIMPROJECT and click OK.
  • Click Next to go to the Manage Table Definition and Data Mappings step.
  • Drag COLUMN_0 in the Source File list over to the ProjectTitle row in the Target Table list.
  • Drag COLUMN_1 in the Source File list over to the ProjectCode row in the Target Table list.

  • Click Finish to perform the import operation.

Do the same steps to load the UniqueLanguageCodes.csv file into the DIMLANGUAGE table.
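After both imports, a quick row-count check (another query of my own) confirms the loads; it should show 16 project rows and 300-plus language rows if everything came across:

SELECT 'DIMPROJECT' AS TABLE_NAME, COUNT(*) AS NUM_ROWS FROM "WIKIDATA"."DIMPROJECT"
UNION ALL
SELECT 'DIMLANGUAGE', COUNT(*) FROM "WIKIDATA"."DIMLANGUAGE";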

Create the Date dimension table for name lookups of date values

HANA Studio has a feature to generate a date dimension table. You first need to be in the Modeler perspective. Go to the Window menu and choose Open Perspective > Modeler. If you have never launched the Modeler perspective, choose the Other… command and then select Modeler from the list as shown below.

You should then see the Welcome to Modeler page. Then, click on the Generate Time Data… command in the Data section.

In the Generate Time Data pop-up, select Gregorian for the Calendar type, enter 2007 and 2014 for the year range, select Day for Granularity and Monday for First day of the week. Then click Generate.

 

HANA Studio creates the M_TIME_DIMENSION table in the _SYS_BI schema. To see the structure of the table, enter the following command into a SQL Console window.

SELECT * FROM "_SYS_BI"."M_TIME_DIMENSION" LIMIT 50;

I will use the DATE_SAP column as the JOIN column to the EVENTDATE column in the PAGEHITSFACTTABLE  table when building out the Analytic View in the next blog post.
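As a preview of how that relationship will work, here is a quick ad-hoc join of my own (the real modeling happens in the Analytic View) that totals page hits by day of the week using the DAY_OF_WEEK_INT column from the generated table:

SELECT t."DAY_OF_WEEK_INT", SUM(f."PAGEHITCOUNTFORHOUR") AS TOTALHITS
FROM "WIKIDATA"."PAGEHITSFACTTABLE" f
INNER JOIN "_SYS_BI"."M_TIME_DIMENSION" t
   ON f."EVENTDATE" = t."DATE_SAP"
GROUP BY t."DAY_OF_WEEK_INT"
ORDER BY t."DAY_OF_WEEK_INT";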

I want to enhance the M_TIME_DIMENSION table by adding the English day of the week corresponding to the DAY_OF_WEEK_INT value, where 0 = Monday and so on. Also, I want to add the English month based on the MONTH_INT value, where 1 = January and so on. Here is how I added the columns.

ALTER TABLE "_SYS_BI"."M_TIME_DIMENSION"
ADD (ENGLISH_DAY_OF_WEEK VARCHAR(10) NULL);

UPDATE "_SYS_BI"."M_TIME_DIMENSION"
SET ENGLISH_DAY_OF_WEEK =
      CASE DAY_OF_WEEK_INT
      WHEN 0 THEN 'Monday'
      WHEN 1 THEN 'Tuesday'
      WHEN 2 THEN 'Wednesday'
      WHEN 3 THEN 'Thursday'
      WHEN 4 THEN 'Friday'
      WHEN 5 THEN 'Saturday'
      WHEN 6 THEN 'Sunday'
      END
WHERE ENGLISH_DAY_OF_WEEK IS NULL;

Note: I used ENGLISH_ as the prefix so that I could include other languages for the day of the week in the future. For example, for German days, I would create a column called GERMAN_DAY_OF_WEEK and add it to my Analytic View so that German users would see the correct day-of-week value.
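A complete set of statements for that GERMAN_DAY_OF_WEEK column might look like the following sketch, mirroring the English version above:

ALTER TABLE "_SYS_BI"."M_TIME_DIMENSION"
ADD (GERMAN_DAY_OF_WEEK VARCHAR(10) NULL);

UPDATE "_SYS_BI"."M_TIME_DIMENSION"
SET GERMAN_DAY_OF_WEEK =
      CASE DAY_OF_WEEK_INT
      WHEN 0 THEN 'Montag'
      WHEN 1 THEN 'Dienstag'
      WHEN 2 THEN 'Mittwoch'
      WHEN 3 THEN 'Donnerstag'
      WHEN 4 THEN 'Freitag'
      WHEN 5 THEN 'Samstag'
      WHEN 6 THEN 'Sonntag'
      END
WHERE GERMAN_DAY_OF_WEEK IS NULL;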

We can use a similar set of statements for the ENGLISH_MONTH column.

ALTER TABLE "_SYS_BI"."M_TIME_DIMENSION"
ADD (ENGLISH_MONTH VARCHAR(10) NULL);

UPDATE "_SYS_BI"."M_TIME_DIMENSION"
SET ENGLISH_MONTH =
      CASE MONTH_INT
            WHEN 1 THEN 'January'
            WHEN 2 THEN 'February'
            WHEN 3 THEN 'March'
            WHEN 4 THEN 'April'
            WHEN 5 THEN 'May'
            WHEN 6 THEN 'June'
            WHEN 7 THEN 'July'
            WHEN 8 THEN 'August'
            WHEN 9 THEN 'September'
            WHEN 10 THEN 'October'
            WHEN 11 THEN 'November'
            WHEN 12 THEN 'December'
      END
WHERE ENGLISH_MONTH IS NULL;

To see what the table looks like, you can issue the following query:

SELECT DATE_SAP, YEAR, QUARTER, MONTH_INT, ENGLISH_MONTH, WEEK, DAY_OF_WEEK_INT, ENGLISH_DAY_OF_WEEK
FROM "_SYS_BI"."M_TIME_DIMENSION" LIMIT 50;

 

Now we have our fact table and dimension tables ready in HANA. In the next blog post in this series, I will create an Analytic & Calculation View from these tables that SAP Lumira needs to do detailed analysis on the Wikipedia data.
