NielsWeigel
Product and Topic Expert

This blog is the starting point for a series of blog articles based on a guest lecture at the Hasso-Plattner-Institut für Softwaresystemtechnik in Potsdam, Germany.


The topic of data quality in general is not brand new. However, the maturity level of organizations with regard to data quality or data governance still differs widely. While some organizations have set up enterprise-wide data governance programs that manage and tightly integrate people, processes, policies and standards, metrics, and tools, other companies are still in the starting phase of mostly departmental data cleansing activities. Recent research, such as this study from Gartner, still indicates that poor data quality is a primary reason why about 40% of all business initiatives fail.


At SAP we look at Information Governance as a discipline that oversees the management of your enterprise's information. By bringing together the right people, processes, policies, and metrics, Information Governance delivers trusted, consistent, and clear information throughout your organization, enabling it to meet its business goals in areas such as Business Process Efficiency, Business Analytics, Mergers & Acquisitions, and Compliance.
Various Information Management initiatives, and especially the combination of these initiatives, support organizations in achieving their business goals.

Data Profiling and Data Cleansing are two essential building blocks of these Information Management initiatives. Before we get into the details of the use cases for these components, here are some basic definitions from Wikipedia:

Data Profiling (http://en.wikipedia.org/wiki/Data_profiling)

“Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data”


Data Validation (http://en.wikipedia.org/wiki/Data_validation)

“In computer science, data validation is the process of ensuring that a program operates on clean, correct and useful data.”


Data Quality Assessment (http://en.wikipedia.org/wiki/Data_quality_assessment)

“Data quality assessment is the process of exposing technical and business data issues in order to plan data cleansing and data enrichment strategies.”


Data Cleansing (http://en.wikipedia.org/wiki/Data_cleansing)

“Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.”
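To make the distinction between these terms concrete, here is a minimal sketch in Python; the records, column name, and validation rule are invented for illustration and do not come from any SAP product. Profiling collects statistics about the data as it is, while validation checks each record against a defined rule:

```python
# Minimal sketch contrasting profiling and validation.
# Column names and rules are illustrative assumptions.
from collections import Counter

records = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "RED"},
    {"id": 3, "color": "rd"},   # possible typo
    {"id": 4, "color": None},   # missing value
]

# Profiling: collect statistics about the data as it is.
frequencies = Counter(r["color"] for r in records)
print("Frequency distribution:", frequencies)

# Validation: check each record against a rule.
VALID_COLORS = {"red", "green", "blue"}
invalid = [r for r in records
           if r["color"] is None or r["color"].lower() not in VALID_COLORS]
print("Records failing validation:", invalid)
```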

After these high-level definitions, let's take a look at specific use cases where Data Profiling capabilities in particular support end users (either technical IT or business users) in their day-to-day work and enable them to gain a better understanding of, and insight into, the data they are using.


Let's start with a classical ETL / Data Warehouse use case, where a Data Architect or Data Warehouse designer is working on the data provisioning side of the data warehouse and is going to integrate data from multiple data sources. For the ETL job designer it is essential not only to see which tables, views, and technical objects are available in the databases or data sources, but also to understand what kind of content is stored within those tables. A label or headline like "Color" is only an indicator of what might be stored within the column. Data Profiling enables users to gain quick insight, e.g. by using the frequency distribution to see how many distinct values exist and which values occur most frequently. Based on the data profiling results, the ETL job designer can also identify whether different, inconsistent representations of the same information exist. Based on that finding, he can already set up the mapping tables within the data flow to standardize and unify the field content.
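As a rough illustration of this profiling-then-mapping step, here is a short Python sketch using pandas; the table, column name, and mapping values are hypothetical and not taken from any specific SAP tool:

```python
# Sketch of the profiling-then-mapping step described above.
import pandas as pd

source = pd.DataFrame({"color": ["red", "RED", "R", "blue", "bl", "blue"]})

# Step 1: the frequency distribution reveals inconsistent
# representations of the same information.
print(source["color"].value_counts())

# Step 2: based on the profiling result, define a mapping table that
# the ETL data flow can use to standardize the field content.
mapping = {"RED": "red", "R": "red", "bl": "blue"}
source["color_std"] = source["color"].replace(mapping)
print(source["color_std"].value_counts())
```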


Another typical use case is Data Modeling. Typically, a data or enterprise architect faces the challenge of creating and defining new data models, including logical record representations or specific lookup or reference tables, to be used within the enterprise system landscape or for business intelligence and reporting solutions. He needs to find and join data from different database tables or even different databases. To do so, he needs to know where the required information is stored within the enterprise's system landscape and which tables to join the data from, and he especially needs the right level of transparency about what kind of data is really stored within those tables or databases. Again, statistical data profiling gives a great first insight into the content by providing minimum and maximum values, length statistics, and frequency distributions, and also allows uniqueness, dependency, or redundancy profiling.
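The following sketch shows, for a small hypothetical table, how such a statistical profile (minimum/maximum values, length statistics, distinct counts, and a uniqueness ratio) could be computed; it is a simplified approximation of what profiling tools report, not their actual implementation:

```python
# Sketch of a per-column statistical profile; the sample table
# and column names are assumptions for illustration.
import pandas as pd

table = pd.DataFrame({
    "material_id": ["M-001", "M-002", "M-003", "M-003"],
    "weight_kg":   [1.5, 0.75, 12.0, 12.0],
})

for col in table.columns:
    series = table[col]
    lengths = series.astype(str).str.len()
    print(f"column={col}",
          f"min={series.min()}", f"max={series.max()}",
          f"len_min={lengths.min()}", f"len_max={lengths.max()}",
          f"distinct={series.nunique()}",
          f"uniqueness={series.nunique() / len(series):.2f}")
```

A uniqueness ratio below 1.0 for a candidate key column, as for material_id in this invented sample, would immediately flag it as unsafe to use as a primary key or join key.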

Customers who are starting their Data Cleansing or Data Quality Management activities are in many cases already aware of some high-impact data quality issues that they want to resolve. They may not know the exact number of issues or problems, but they know that they "have to do something". By applying automated Data Profiling functionality to the large amounts of data they have stored, they can gain a first insight into the data content for their cleansing activities by quickly identifying outliers or inconsistent data formats and representations.
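One simple way to surface such outliers and inconsistent formats is pattern profiling: each value is reduced to a character-class pattern, and the patterns are counted so that rare ones stand out. The following Python sketch uses invented phone-number values to illustrate the idea; it is an approximation of the concept, not any product's algorithm:

```python
# Approximate pattern profiling: map each value to a character-class
# pattern and count the patterns; rare patterns point to outliers or
# inconsistent representations. Sample values are made up.
import re
from collections import Counter

values = ["+49-30-1234567", "+49-40-7654321", "030/1234567", "unknown"]

def pattern(value: str) -> str:
    value = re.sub(r"[A-Za-z]", "A", value)  # letters -> A
    return re.sub(r"[0-9]", "9", value)      # digits  -> 9

counts = Counter(pattern(v) for v in values)
print(counts)
```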

Applying data discovery or data profiling methods to legacy data sources before their data is moved into a new SAP ERP or CRM system is one of the most common activities in the Data Migration use case. Whether the challenge is moving data from one single source into the new system or migrating and consolidating data from several systems into one single target, it is important to understand how the data is stored in the legacy system(s): what content or data domains are defined there, how the field lengths are defined (and populated) in the different systems, and whether there are different representations of the same kind of information, such as textual versus numerical color codes. Data profiling provides an understanding of whether your source data is fit for use in your new system.
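As a sketch of such a fitness-for-use check, the following Python snippet compares invented legacy records against assumed target field lengths and code lists; the length limit and country codes are illustrative assumptions, not actual SAP field definitions:

```python
# Pre-migration fitness check: compare legacy field lengths and code
# values against assumed target-system definitions (illustrative only).
legacy_rows = [
    {"customer_name": "ACME Industrial Holdings International GmbH",
     "country": "GER"},
    {"customer_name": "Smith Ltd", "country": "DE"},
]

TARGET_MAX_LEN = {"customer_name": 35}      # assumed target field length
TARGET_COUNTRY_CODES = {"DE", "US", "FR"}   # assumed target code list

for row in legacy_rows:
    if len(row["customer_name"]) > TARGET_MAX_LEN["customer_name"]:
        print("Name too long for target field:", row["customer_name"])
    if row["country"] not in TARGET_COUNTRY_CODES:
        print("Country code needs mapping:", row["country"])
```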

You can find more information on classical IM Use Cases and the related SAP solutions within the EIM Use Case Wiki (http://wiki.sdn.sap.com/wiki/display/EIM/Enterprise+Information+Management+Use+Case+Wiki) here on the SAP Community Network.


Stay tuned for the next articles on the usage of SAP Information Steward and SAP Data Services to support your Data Quality Management activities, following this step-by-step approach from initial data asset discovery to ongoing improvement and monitoring of the current data quality for any data, in any application.

Related links including detailed Product Tutorials:

http://scn.sap.com/community/information-steward

http://scn.sap.com/community/data-services
