From my previous blog (http://scn.sap.com/blogs/SomaMukherjee/2015/08/06/hadoop-overview) you have got basic idea about Hadoop. Now let’s go a little deeper and learn different components of Hadoop ecosystem. This blog talks about one component of Hadoop ecosystem that is Hive.
You have learnt how the data is stored in Hadoop file system. But we should know how we can use that data for analysis.
What is Hive?
Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.
The data is organized in three different formats in HIVE.
Tables: They are very similar to RDBMS tables and contains rows and columns. Hive is just layered over the Hadoop File System (HDFS), hence tables are directly mapped to directories of the filesystems. It also supports tables stored in other native file systems.
Partitions: Hive tables can have more than one partition. They are mapped to subdirectories and file systems as well.
Buckets: In Hive, data may be divided into buckets. Buckets are stored as files in partition in the underlying file system.
Why do Hive and Pig exist when they seem to do much of the same thing?
Pig: Pig is a high-level platform for creating MapReduce programs used in Hadoop. The language for this platform is called Pig Latin.
Hive | Pig |
Because of its SQL like query language it’s often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are used to using SQL for querying data. | Pig fits in through its data flow strengths where it takes on the tasks of bringing data into Apache Hadoop and working with it to get it into the form for querying. |
Hive is for structured data | Pig is for unstructured data. |
Developed by Yahoo | Developed by Facebook |
Language name HiveQL | Language name Pig Latin |
Hive has web interface | Pig does not have web interface |
HiveQL is a declarative language like SQL | PigLatin is a data flow language. The output of one PigLatin construct can be sent as input to another PigLatin construct and so on. |
Steps to explore Hive on your own:
- Need a system with Hadoop installed, up and running.
You can install the sandbox system and start working on Hadoop.
Please find the links below,
Download Virtual box:
https://www.virtualbox.org/wiki/Downloads
Download the sandbox for virtual box:
http://hortonworks.com/hdp/downloads/
2. Start the sandbox system.
3. Launch the Hive using below link.
http://127.0.0.1:8000/
4. Click on the Hive UI on the top left corner.
Here you can write queries and explore more about Hive.
QUERIES
Function | MySQL | HiveQL |
Retrieving information | SELECT from_columns FROM table WHERE conditions; | SELECT from_columns FROM table WHERE conditions; |
All values | SELECT * FROM table; | SELECT * FROM table; |
Some values | SELECT * FROM table WHERE rec_name = “value”; | SELECT * FROM table WHERE rec_name = "value"; |
Multiple criteria | SELECT * FROM table WHERE rec1=”value1” AND rec2=”value2”; | SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2"; |
Selecting specific columns | SELECT column_name FROM table; | SELECT column_name FROM table; |
Retrieving unique output records | SELECT DISTINCT column_name FROM table; | SELECT DISTINCT column_name FROM table; |
Sorting | SELECT col1, col2 FROM table ORDER BY col2; | SELECT col1, col2 FROM table ORDER BY col2; |
Sorting backward | SELECT col1, col2 FROM table ORDER BY col2 DESC; | SELECT col1, col2 FROM table ORDER BY col2 DESC; |
Counting rows | SELECT COUNT(*) FROM table; | SELECT COUNT(*) FROM table; |
Grouping with counting | SELECT owner, COUNT(*) FROM table GROUP BY owner; | SELECT owner, COUNT(*) FROM table GROUP BY owner; |
Maximum value | SELECT MAX(col_name) AS label FROM table; | SELECT MAX(col_name) AS label FROM table; |
Selecting from multiple tables (Join same table using alias w/”AS”) | SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name; | SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name); |
METADATA
Function | MySQL | HiveQL |
Listing databases | SHOW DATABASES; | SHOW DATABASES; |
Listing tables in a database | SHOW TABLES; | SHOW TABLES; |
Creating a database | CREATE DATABASE db_name; | CREATE DATABASE db_name; |
Dropping a database | DROP DATABASE db_name; | DROP DATABASE db_name (CASCADE); |
General Information:
Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as struts, maps and arrays.
Pros
Similarity with SQL
Multiple schemas can be projected on the same data -- no ETL required
Works with popular query tools that connect via JDBC and ODBC
Able to query very large data sets interactively
Cons
DDL requires a bit of learning and different thinking
Tuning Hive is very different from tuning RDBMS
Here are the basics of Hive which will enable you to start exploring more about Hive. If you want to know more about how sap is connecting HANA to Hadoop, you can follow the below links.
http://scn.sap.com/thread/3773791
http://scn.sap.com/community/developer-center/hana/blog/2013/08/22/smart-data-access-and-hadoop
http://scn.sap.com/thread/3226387
References:
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/
http://stackoverflow.com/questions/3356259/difference-between-pig-and-hive-why-have-both
http://www.dezyre.com/article/difference-between-pig-and-hive-the-two-key-components-of-hadoop-ecosy...
http://www.thecloudavenue.com/2012/12/introduction-to-apache-hive-and-pig.html
http://www.orzota.com/hive-tutorial-for-beginners/
http://www.infoworld.com/article/2608271/hadoop/hadoop-review-apache-hive-brings-real-time-queries-t...
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
http://blog.sqlauthority.com/2013/10/21/big-data-data-mining-with-hive-what-is-hive-what-is-hiveql-h...