
HANA and Hadoop are very good friends. HANA is a great place to store high-value, often used data, and Hadoop is a great place to persist information for archival and retrieval in new ways - especially information which you don't want to structure in advance, like web logs or other large information sources. Holding this stuff in an in-memory database has relatively little value.

As of HANA SP06 you can connect HANA to Hadoop and run batch jobs in Hadoop to load more information into HANA, on which you can then perform super-fast aggregations within HANA. This is a very cooperative existence.

However, Hadoop is capable - in theory - of handling analytic queries. If you look at documentation from Hadoop distributions like Hortonworks or Cloudera, they suggest that this isn't the primary purpose of Hadoop, but it's clear that Hadoop is headed in this direction. Paradoxically, as it does so, Hadoop has evolved to contain structured tables using Hive or Impala. And with the ORC and Parquet file formats within the HDFS filesystem, Hadoop also uses columnar storage.

So in some sense Hadoop and HANA are converging, and I was interested to see how they compare from an aggregation perspective. With HANA, we get very good parallelization even across a very large system, and near-linear scalability. This translates to between 9 and 30m aggregations/sec/core, depending on query complexity. For most of my test examples I expect around 14m, with a moderate amount of grouping - say 1,000 groups. On my 40-core HANA system, that means I get about 500m aggregations/second.
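The ~500m/second figure is just the per-core rate multiplied out, which a quick shell calculation confirms (560m, rounded down to "about 500m" above):

```shell
# Back-of-envelope check of the HANA aggregation throughput claim:
# ~14m aggregations/sec/core across 40 cores.
PER_CORE=14   # millions of aggregations/sec/core, mid-complexity query
CORES=40
TOTAL=$((PER_CORE * CORES))
echo "${TOTAL}m aggregations/sec"   # prints: 560m aggregations/sec
```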

My research suggests that Cloudera Impala has the best aggregation engine, so I've started with that. I'd welcome your feedback.

Setup Environment

I'm using one 32-core AWS EC2 Compute Optimized c3.8xlarge instance with 60GB of RAM. In practice this is about 40% faster core-for-core than my 40-core HANA system. Yes, that's a nice secret - HANA One uses the same technology, and HANA One is also 40% faster core-for-core than on-premise HANA systems.

I've decked it out with RedHat Enterprise Linux 6.4 and the default options. A few notes on configuring Cloudera:

- Make sure you set an Elastic IP for your box and bind it to the primary interface

- Ensure that port 8080 is open in your security group

- Disable selinux by editing /etc/selinux/config and setting SELINUX to disabled

- Make sure you configure a fully qualified hostname in files /etc/sysconfig/network and /etc/hosts

- Reboot after the last two steps

- Disable iptables during installation using chkconfig iptables off && /etc/init.d/iptables stop
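The preparation steps above can be sketched as a single script for a throwaway EC2 host. This is a sketch, assuming standard RHEL 6.4 file locations; the FQDN and IP address are placeholders you must substitute:

```shell
# Host prep for Cloudera Manager on RHEL 6.4 (run as root).
HOSTFQDN=ip-10-0-0-1.example.internal   # hypothetical FQDN - use your own

# Disable SELinux (takes effect after reboot)
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Configure the fully qualified hostname
sed -i "s/^HOSTNAME=.*/HOSTNAME=${HOSTFQDN}/" /etc/sysconfig/network
echo "10.0.0.1 ${HOSTFQDN}" >> /etc/hosts

reboot

# After the reboot, disable iptables for the duration of the installation
chkconfig iptables off && /etc/init.d/iptables stop
```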

Installation is straightforward - just login as root and run the following:

wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin && chmod +x cloudera-manager-installer.bin && ./cloudera-manager-installer.bin

The only things to note during the installation are to use fully qualified hostnames, log in to all hosts as ec2-user, and use your AWS private key as the authentication method. This works for Cloudera and Hortonworks alike.

Testing using Hive

The first thing I did was to run my benchmark using Hive. My test data is Financial Services market data, and I'm using 28m rows for initial testing. With HANA, we get 100ms response times when aggregating this data, but let's start small and work up.

I can load data quickly enough - 5-10 seconds. We can't compare this directly to HANA (which takes a similar time), because HANA also sorts, compresses and dictionary-encodes the data as it loads; Hadoop just dumps it into a filesystem. A simple aggregation against TEXTFILE storage in Hive runs in around a minute - 600x slower than HANA.


That's roughly what we would expect, because Hive isn't optimized in any way.

CREATE TABLE trades ( tradetime TIMESTAMP, exch STRING, symb STRING, cond STRING, volume INT, price DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/var/load/trades.csv' INTO TABLE trades;

select symb, sum(price*volume)/sum(volume) from trades group by symb;
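The 600x figure quoted above is simply the ratio of the two response times:

```shell
# Hive full scan (~60s) vs. HANA in-memory aggregation (~100ms),
# both expressed in milliseconds for integer arithmetic.
HIVE_MS=60000
HANA_MS=100
echo "$((HIVE_MS / HANA_MS))x slower"   # prints: 600x slower
```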

Moving from Hive to Impala

I struggled a bit here, because Cloudera 5.0 Beta is more than a little buggy. Sometimes the Hive tables were visible from Impala, sometimes not. Sometimes it would throw random errors. This is definitely not software you could use in production.

I used Parquet with Snappy compression, which should provide a good blend of performance and compression. You can't load files directly into Impala - instead, you load into Hive and then insert into a Parquet table from Impala. That's quite frustrating.

create table trades_parquet like trades;

set PARQUET_COMPRESSION_CODEC=snappy;

insert into trades_parquet select * from trades;

Query: insert into trades_parquet select * from trades

Inserted 28573433 rows in 127.11s

So we are loading at around 220k rows/sec - on equivalent hardware we could expect nearer 5m rows/sec from HANA. This appears to be because Impala doesn't parallelize loading, so we are CPU-bound in a single thread. I've read that writes haven't been optimized in Impala yet, so that makes sense.
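The ~220k rows/sec load rate follows directly from the insert statistics above:

```shell
# 28,573,433 rows inserted in 127.11s (truncated to 127 for
# integer shell arithmetic) gives the rows/sec load rate.
ROWS=28573433
SECS=127
echo "$((ROWS / SECS)) rows/sec"   # prints: 224987 rows/sec
```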

select symb, sum(price*volume)/sum(volume) from trades group by symb;

The first time this runs, it takes 40 seconds. The next time, it takes just 7 seconds (still 70x slower than HANA). I see 4 active CPUs, so we have 10x less parallelization than HANA and around 7x less per-core efficiency, which translates to at least 7x less throughput in multi-user scenarios.
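Decomposing that gap: the warm query is 70x slower overall, of which a factor of 10 is explained by the parallelism difference, leaving roughly 7x in per-core efficiency:

```shell
# Warm-query times in milliseconds: Impala 7s vs. HANA 0.1s.
SLOWDOWN=$((7000 / 100))        # total gap: 70x
PARALLELISM=$((40 / 4))         # HANA uses 40 cores, Impala ~4: 10x
PER_CORE=$((SLOWDOWN / PARALLELISM))
echo "${PER_CORE}x per-core efficiency gap"   # prints: 7x per-core efficiency gap
```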

Final Words

For me, this confirms what I already suspected - Hadoop is pretty good at consuming data (and I'm sure with more nodes it would be even better) and good at batch jobs that process data. It's no better than HANA in this respect, but its $/GB is much lower, of course, and if your data isn't that valuable to you and isn't accessed often, storing it in HANA will be cost-prohibitive.

But when it comes to aggregation, even in the best-case scenario, Hadoop is 7x less efficient on the same hardware. Add HANA's feature set, its simplicity of operation, and the fact that data is stored only once, and if your data is hot - accessed and aggregated often, in different ways - HANA is the king.

And we haven't even covered in this blog the incredible maturity of HANA's SQL and OLAP engines compared to what's in Hadoop, or the fact that Impala - the fastest Hadoop engine - is supported only by Cloudera and still very immature.

Since Smart Data Access lets us store our hot data in HANA and our cold data in Hadoop, HANA and Hadoop are very good friends rather than competitors.

What do you think?
