Comparing the Data Warehouse approach with CalcVie...

werner_daehn · ‎12-01-2014

Apparently there is some discussion about Hana Live potentially replacing a Data Warehouse, either BW or built on the database natively. This stems probably from SAPs very own statements like “Hana is perfect for OLTP and OLAP” or “with Hana, you can do all your reporting on the ERP system”.
The latter statement is intriguing but not new.
In short: Use Calculation Views, like Hana Live does, for operational reporting. Build a Data Warehouse for Business Intelligence, for finding new insight. The good thing is, in past the Data Warehouse has very often been misused for operational reporting as well for query performance reasons. Thanks to Hana we can put the operational reports back to where they belong, the operational system. Might be hard though to draw the line, what is an operational report, what is Business Intelligence.

The idea of a Data Warehouse was to have one central place, with all the data required, in an easy to understand fashion so that everybody can do analysis without the help of the IT department. Offloading that to a database of its own was the technical approach chosen. Not because it is perfect but cost effective. It was a compromise and over the years software companies tried various approaches in order to get rid of this compromise with all its induced downsides.

SAPs own Business Objects propagated the multi-connection universe: “Hey, no need to have all data at a central place. The reporting tool can merge that data!”
Do you recall EII (Enterprise Information Integration) popular back in 2005? “Agreed, you need a central database but why move the data? Let this be a virtual database, each table actually being a Database View only, pointing to the remote table!”
The ETL (Extract Transform Load) tools, like SAP Data Services, on the other hand made the movement of the data from source systems to the central Data Warehouse database so simple these days.

The problem with Hana is, you have all three options easily accessible.Using Smart Data Access remote tables can be made accessible as Hana tables, with Calculation views a virtual data model can be designed and Hana as a database is very fast when querying data.

Examples with real life numbers can be found in part 2: Comparing the Data Warehouse approach with CalcViews - Examples

Common arguments against a Data Warehouse

Maybe we should start with the question "What are the downsides of a Data Warehouse?" in order to understand what technology to favor over the other.

The top response you always get is about storing the data twice. There is the source data and now you should store the data a second time in another database? For what reason? Actually, this downside has three aspects
1. Doubling the hardware and maintenance costs.
2. Moving the data does cost time and resources.
3. Building the ETL logic to define the data movement logic is cumbersome.
With Hana the realtime aspect is a key argument always. A typical Data Warehouse is loaded just once a day, not realtime. Not even near realtime.
Many Data Warehouse projects fail because the data is not trusted by the users.
Building a Data Warehouse is a complex and time consuming project involving Data Modeling, ETL, user management, BI tools.

All of these statements by themselves are correct. The consequence of these could be that it is too expensive to do Business Intelligence on the data or that the Data Warehouse approach as a technical solution is outdated and e.g. Calculation Views should be preferred. Let's try to put each statement into perspective...

1.1. Doubling the costs.

Let's take a very naive approach and assume we need 100 CPUs for the OLTP users entering orders and 20 CPUs for the OLAP users analyzing the data. We could hence upgrade the ERP hardware by another 20 CPUs. The Data Warehouse suggestion is to keep the OLTP system as is and buy a 20 CPU server for the OLAP users. Or we could argue that the 100 CPUs are needed at peak times only, most of the time the server is rather idle and can cope with the OLAP queries easily. All we need is a resource management to prioritize OLTP over OLAP queries.

There is no good answer which of the three options is the best. But there are a few more things to consider. If the OLTP server is a high end machine already, increasing its size is much more costly than adding another server. Also the service availability of an OLTP system has to be much higher than a OLAP system, because what happens if you cannot enter orders anymore? Disaster. What happens if you cannot run a report? Too bad. Point is, increasing the OLTP system hardware is way more expensive than adding yet another server with a much lower service level agreement.

So at the end, having a second server might be the cheapest way to add more computing power and to make sure no OLAP query does impact the OLTP users.

1.2. Moving data costs time and resources

Very true. On the other hand, if the data is moved during off business hours, the argument does not apply. And actually, reading the changed sales orders of a single day is less costly than a user reporting on the entire year's sales orders, simply because of the volume of data involved.

1.3. Building ETL logic is cumbersome

This argument is one huge misperception. Yes, building the transformations is complex compared to not doing anything. But is this really what we want to compare with? Building a Data Warehouse versus not doing Business Intelligence at all? No, we need to compare the effort to implement the ETL logic with calculation views and how difficult it is there.

The main mistake in this argument is, all options require the same transformations. All that is different is where this logic is entered. If you read the data off a table directly you might use a select statement like “select *, if(CANCELLED=’X’, NULL, AMOUNT) as REVENUE from BOOKINGS” in order to transform an AMOUNT into a REVENUE KPI. When using a Database View this logic will be part of a View. And when building a Data Warehouse this logic would be part of the ETL process.

Hence the logic has to be implemented anyway. Only question is what tool is more effective. ETL tools are very good at that, but you need to implement the loading and the delta logic. With Calculation View, Database Views or direct queries you are limited to the language used (SQL, SQL Script, Calc Scenarios) and require to implement it in a way so that the queries are fast.

2. Realtime

I have read statements that data should not be refreshed frequently, as then everybody will see different data constantly. Why is today's revenue so low? The user is searching and searching but cannot find the answer. And when he does refresh the initial report again he will find the revenue is not low at all any longer. What happened? The front desk entered a return of -1'000'000 and corrected it to -100'000 a second later.

Of course there is an argument in the other direction as well. The user is querying the manufacturing line status and all looks good - as of yesterday's data.

Guess it would be best if we would have the option, either to report on stable data or on realtime data.

Up to now there was no option for that, you had to decide to either build a Data Warehouse and refresh that at given times. Or to report directly on the operational data.

With the Hana SP9 Enterprise Information Management Option SAP made the first step towards realtime Data Warehouse, thus allowing to have realtime and stable data in one system.

Hana SP9 Data Provisioning - Overview

SAP HANA Enterprise Information Management – SAP Help Portal Page

3. Trusted data

This is again a so-so argument. When doing something wrong in the ETL logic or in the Calculation View, either way the data shown will be wrong. I would actually argue that the Data Warehouse approach has the lead here, as you can do much more checks and transformations when loading the change data as compared to execute the very same logic whenever somebody does access the data via a View.

4. Complexity of a Data Warehouse

Yes, building a Business Intelligence solution to be used by end users is a difficult job. But if this is implemented using the Data Warehouse philosophy or as virtual datamodel using Calculation Views and the such, that is a small difference only.

In addition there are multiple advantages the Data Warehouse approach has

5. Historical correct data

What happens when a customer moves from one city to another? In the ERP system the customer record is updated to the new address and whoever queries the sales revenue based on the city of the customer will find the entire revenue on the new address. Is this desired? Very likely not, old revenue should be assigned to the old region. When somebody queries the last year's revenue per region, the number should not have changed.

When reporting directly on the OLTP data the user is bound to the consistency of the ERP system. For data history is kept, historical correct reports can be built, for all others it is simply not possible. But when loading a Data Warehouse it can be decided how to change the data. Simply update it as well or keeping the history.

6. External data

What is the key to getting new insight? Comparing the data with other data. A historical comparison like "this year's revenue compared to last year" but also comparing with other data, data from other systems or public data even. Very likely you will not be allowed to copy data into the OLTP system just for reference. And even if, who maintains the join relationship with the external data? Every single query? In a Data Warehouse adding external data, cleansing and matching it is a normal operation, it is built for that.

With these thoughts I'd like to come up with a first finding...

Conclusion 1:

When using Views or querying the tables directly, you do not need an ETL tool because you are not moving any data. But the transformations previously performed in the ETL tool need to be implemented now in SQL or in the View.

In order to proof the points and to put some numbers behind, in the next posts we will go thought a couple of concrete examples, comparing the two approaches. Comparing the Data Warehouse approach with CalcViews - Examples

Example: Order Entry Amount per period for a customer

Example: Order Entry amount for all periods, all customers

Example: Joining master data

Example: Searching for data

Example: Data consistency with foreign keys

Example: Dealing with doubtful data

Example: Historical data

Summary

The idea of Calculation Views is to provide one view containing all data required for a typical query. Joining multiple Views together is to be avoided.

As long as the Views and the queries executed are not too complex, performance and resource consumption is low enough to not impact the ERP system performance. But certain kinds of transformations simply cannot be done.

Most important is data validation and data standardization, which cannot be done on the fly. The goal of Views therefore is to show the data as is.

Historical correct queries cannot be achieved in general as most master data tables and all relationships contain the current values only.

Complexity of transformations is high still, in addition under a tight execution time constraint.

As long as simple queries/Calculation Views are used the existing ERP Hana database can cope with the additional load executing these, hence saves a lot of money as no second database server for reporting is needed.

The more complex the queries get and the more heavily this is used, the ERP server needs to be more powerful, which could be expensive.

The idea of a Data Warehouse is to provide a simple to understand, easy to use data model which allows the database to query the data much faster than when using the ERP data model optimized for data entry.

Complex transformations and data preparation is no problem as this is done once only, whenever a source record got changed. This allows preserving history, validating and standardizing data and their relationships and other things which would never be possible at query time.

The complexity of transformations can be categorized in three areas:

Mandatory transformations, basically the same as required in Hana Live. These are not different to Hana Live, except that now the throughput is almost irrelevant as not performed at query time but load time.
Additional complexity due to delta loading. This is the task of the toolset to help as much as possible.
Additional transformations not possible in Hana Live but valuable.

As the database is hosted on a server of its own no negative impact on the source system performance due to resource intensive queries is prevented. However a second server is required although it would be a cheaper one as it does not have the same requirements on availability as the core ERP system.

Hana Live and Calculation Views in general are the perfect solution for operational reporting. The user wants to see the data as is, the report complexity is known and can be controlled.

Using Hana as a Data Warehouse server is the perfect solution for Business Intelligence, in other words to analyze fast amounts of data in order to find new insights. The complexity of queries will be rather high, read large amounts of data and the danger of drawing the wrong conclusions due to data inconsistency should be minimized as much as possible.