Hive Basics

Former Member · ‎08-20-2015

From my previous blog (http://scn.sap.com/blogs/SomaMukherjee/2015/08/06/hadoop-overview) you have got basic idea about Hadoop. Now let’s go a little deeper and learn different components of Hadoop ecosystem. This blog talks about one component of Hadoop ecosystem that is Hive.

You have learnt how the data is stored in Hadoop file system. But we should know how we can use that data for analysis.

What is Hive?

Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.

The data is organized in three different formats in HIVE.

Tables: They are very similar to RDBMS tables and contains rows and columns. Hive is just layered over the Hadoop File System (HDFS), hence tables are directly mapped to directories of the filesystems. It also supports tables stored in other native file systems.

Partitions: Hive tables can have more than one partition. They are mapped to subdirectories and file systems as well.

Buckets: In Hive, data may be divided into buckets. Buckets are stored as files in partition in the underlying file system.

Why do Hive and Pig exist when they seem to do much of the same thing?

Pig: Pig is a high-level platform for creating MapReduce programs used in Hadoop. The language for this platform is called Pig Latin.

Hive	Pig
Because of its SQL like query language it’s often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are used to using SQL for querying data.	Pig fits in through its data flow strengths where it takes on the tasks of bringing data into Apache Hadoop and working with it to get it into the form for querying.
Hive is for structured data	Pig is for unstructured data.
Developed by Yahoo	Developed by Facebook
Language name HiveQL	Language name Pig Latin
Hive has web interface	Pig does not have web interface
HiveQL is a declarative language like SQL	PigLatin is a data flow language. The output of one PigLatin construct can be sent as input to another PigLatin construct and so on.

Steps to explore Hive on your own:

Need a system with Hadoop installed, up and running.

You can install the sandbox system and start working on Hadoop.

Please find the links below,

Download Virtual box:

https://www.virtualbox.org/wiki/Downloads

Download the sandbox for virtual box:

http://hortonworks.com/hdp/downloads/

2. Start the sandbox system.

3. Launch the Hive using below link.

http://127.0.0.1:8000/

4. Click on the Hive UI on the top left corner.

Here you can write queries and explore more about Hive.

QUERIES

Function	MySQL	HiveQL
Retrieving information	SELECT from_columns FROM table WHERE conditions;	SELECT from_columns FROM table WHERE conditions;
All values	SELECT * FROM table;	SELECT * FROM table;
Some values	SELECT * FROM table WHERE rec_name = “value”;	SELECT * FROM table WHERE rec_name = "value";
Multiple criteria	SELECT * FROM table WHERE rec1=”value1” AND rec2=”value2”;	SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2";
Selecting specific columns	SELECT column_name FROM table;	SELECT column_name FROM table;
Retrieving unique output records	SELECT DISTINCT column_name FROM table;	SELECT DISTINCT column_name FROM table;
Sorting	SELECT col1, col2 FROM table ORDER BY col2;	SELECT col1, col2 FROM table ORDER BY col2;
Sorting backward	SELECT col1, col2 FROM table ORDER BY col2 DESC;	SELECT col1, col2 FROM table ORDER BY col2 DESC;
Counting rows	SELECT COUNT(*) FROM table;	SELECT COUNT(*) FROM table;
Grouping with counting	SELECT owner, COUNT(*) FROM table GROUP BY owner;	SELECT owner, COUNT(*) FROM table GROUP BY owner;
Maximum value	SELECT MAX(col_name) AS label FROM table;	SELECT MAX(col_name) AS label FROM table;
Selecting from multiple tables (Join same table using alias w/”AS”)	SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;	SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);

METADATA

Function	MySQL	HiveQL
Listing databases	SHOW DATABASES;	SHOW DATABASES;
Listing tables in a database	SHOW TABLES;	SHOW TABLES;
Creating a database	CREATE DATABASE db_name;	CREATE DATABASE db_name;
Dropping a database	DROP DATABASE db_name;	DROP DATABASE db_name (CASCADE);

General Information:

Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as struts, maps and arrays.

Pros

Similarity with SQL

Multiple schemas can be projected on the same data -- no ETL required

Works with popular query tools that connect via JDBC and ODBC

Able to query very large data sets interactively

Cons

DDL requires a bit of learning and different thinking

Tuning Hive is very different from tuning RDBMS

Here are the basics of Hive which will enable you to start exploring more about Hive. If you want to know more about how sap is connecting HANA to Hadoop, you can follow the below links.

http://scn.sap.com/thread/3773791

http://scn.sap.com/community/developer-center/hana/blog/2013/08/22/smart-data-access-and-hadoop

http://scn.sap.com/thread/3226387

References:

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/

http://stackoverflow.com/questions/3356259/difference-between-pig-and-hive-why-have-both

http://www.dezyre.com/article/difference-between-pig-and-hive-the-two-key-components-of-hadoop-ecosy...

http://www.thecloudavenue.com/2012/12/introduction-to-apache-hive-and-pig.html

http://www.orzota.com/hive-tutorial-for-beginners/

http://www.infoworld.com/article/2608271/hadoop/hadoop-review-apache-hive-brings-real-time-queries-t...

http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf

http://blog.sqlauthority.com/2013/10/21/big-data-data-mining-with-hive-what-is-hive-what-is-hiveql-h...