Monday, September 16, 2013

Setting up and Running Hive on Hadoop


Hive is a data warehousing infrastructure built on top of Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. Consequently, latency for Hive queries is generally high (minutes) even when the data sets involved are very small (say, a few hundred megabytes). Hive therefore cannot be compared with systems such as Oracle, where analyses are conducted on significantly smaller amounts of data but proceed much more iteratively, with response times between iterations of less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries.

Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best suited for batch jobs over large sets of immutable data (such as web logs).

Below we discuss how to set up and run Hive on our Hadoop cluster. Please note that this blog does not discuss the concepts and architecture of Hive; for that, you should go through Hadoop: The Definitive Guide by Tom White or Programming Hive from O'Reilly.

For this tutorial, the prerequisite is a Hadoop cluster that is already up and running. If you don't have one, don't worry: you can refer to my previous blogs about setting up Hadoop.

Installing Hive:

Step 1: Download a stable Hive release from the Apache Hive download site: http://www.apache.org/dyn/closer.cgi/hive/. I downloaded Hive 0.11 for my setup.

Step 2: Place the tar.gz file in your bigdata folder and then untar it. (Remember the folder structure I created for my Hadoop install in my previous blog?)

cd bigdata
tar -xzvf hive-0.11.0.tar.gz

Step 3: Rename the extracted folder to hive:
mv hive-0.11.0 hive

Step 4: Add export commands for HIVE_HOME and add Hive's bin directory to the system PATH, as we did earlier for HADOOP and PIG.

export HIVE_HOME=/home/hduser/bigdata/hive
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PIG_HOME/bin:$HIVE_HOME/bin
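Assuming you added these exports to your ~/.bashrc (as in the earlier Hadoop post), reload it and quickly verify that Hive is on the PATH:
source ~/.bashrc
echo $HIVE_HOME
which hive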

Step 5: Create the warehouse folder on HDFS, which defaults to /user/hive/warehouse as defined in Hive's configuration. You can see this in the hive-default.xml.template file in $HIVE_HOME/conf.
hadoop fs -mkdir /user/hive
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod -R 755 /user/hive/warehouse
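Hive also writes intermediate job data to a scratch directory on HDFS (hive.exec.scratchdir, which defaults to /tmp/hive-<username>). If /tmp does not exist on your HDFS yet, create it too:
hadoop fs -mkdir /tmp
hadoop fs -chmod g+w /tmp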

Hive needs an RDBMS to store its metadata (information about databases, tables, partitions, indexes, etc.). There are two common choices:
Derby: Hive ships with an embedded database called Apache Derby (Java DB). It is useful for development, but has limitations: it allows only one connection at a time, there is no global database management, and it uses the current directory as the database location.
MySQL: For production deployments, where many developers connect to the Hadoop cluster concurrently, we can store the metadata in a full-featured RDBMS such as MySQL or Oracle. MySQL is the preferred option here for our metadata repository.
We have the following options for running the MySQL server:
 Run MySQL on the local machine

 Run MySQL on a dedicated (remote) machine
In the remote setup, MySQL runs on one machine and all Hive clients connect to the shared metadata stored in that MySQL installation. This communication happens via the metastore service, which uses the Thrift protocol; we have to run the Hive metastore service for this communication.

Step 6: Installing and configuring MySQL.
sudo apt-get install mysql-server
During installation, you will be asked to set a password for the "root" user. Choose a password that you won't forget.
After the install, make sure your MySQL server is running by issuing the command below:
sudo netstat -tap | grep mysql
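You should see a line showing mysqld listening (MySQL's default port is 3306). If the daemon is not running, you can start it on Ubuntu with:
sudo service mysql start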

Step 7: Download the MySQL Java connector from http://dev.mysql.com/downloads/connector/j/#downloads and place the connector jar file (obtained after extracting the tar.gz) in $HIVE_HOME/lib so that it gets added to the classpath.
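For example, with the connector version I happened to download (substitute whatever version you grab):
tar -xzf mysql-connector-java-5.1.26.tar.gz
cp mysql-connector-java-5.1.26/mysql-connector-java-5.1.26-bin.jar $HIVE_HOME/lib/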

Step 8: Create a MySQL user for Hive.
Enter the MySQL shell using the command:
mysql -u root -p
mysql> create user 'hive'@'hostname' identified by 'password';
I used localhost as my hostname here.

Step 9: Create a database for hive metastore:
mysql> create database hivemetastore;

Step 10: Grant all privileges to the hive user on the hivemetastore database:
mysql> grant all on hivemetastore.* to 'hive'@'hostname';
mysql> flush privileges;
mysql> exit;
Again, the hostname is localhost here.
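To confirm the privileges were applied, you can check from the MySQL shell:
mysql> show grants for 'hive'@'localhost';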

Step 11: Copy hive-default.xml.template to hive-site.xml in the $HIVE_HOME/conf directory. Now edit the hive-site.xml file to set the following properties:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hostname:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore service. The metastore's default port is 9083 (port 10000 is used by the Hive server). I used localhost as the hostname here.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hostname:3306/hivemetastore</value>
  <description>JDBC connection string for the metastore database. I used localhost as the hostname here.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>YourPassword</value>
</property>
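Before starting the Hive services, it is worth verifying that the hive user can actually reach the metastore database with the credentials you just configured; if this login fails, Hive will fail with a JDBC connection error as well:
mysql -u hive -p -h localhost hivemetastore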

Step 12: Start the Hive metastore service by issuing the command below (it should be up before anything tries to connect to it):
hive --service metastore &

Step 13: Start the Thrift Hive server with the command below:
hive --service hiveserver &
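Both services run as Java processes. A quick sanity check is to confirm they are listening on their default ports (9083 for the metastore, 10000 for the Hive server):
sudo netstat -tlnp | grep -E '9083|10000'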

Now Hive is up and running on your client. You can enter the Hive shell by issuing the command hive. Let's start testing and exploring Hive features by issuing the create database command.
hive> create database nyse;
Whenever a database is created, Hive creates a directory for it in the warehouse location with a .db extension.
We can override the location, and add a description, using the following syntax:
hive> create database databaseName comment 'some comments' location '/our own path';
hive> describe database nyse;
Now let's create our stocks table to hold the daily price data. Let's make this a partitioned external table, partitioned by the first character of the stock symbol.
For this, create the stocks table as below:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange STRING,
  symbol STRING,
  state STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT)
PARTITIONED BY (symbol_char STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
Now let's create a temp table to temporarily hold the stock price data from the file we already have in HDFS:
CREATE EXTERNAL TABLE IF NOT EXISTS stockstemp (
  exchange STRING,
  symbol STRING,
  state STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
LOAD DATA INPATH 'NYSE/prices/' OVERWRITE INTO TABLE stockstemp;
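Note that LOAD DATA INPATH moves (rather than copies) the files from NYSE/prices/ into the table's directory. A quick sanity check that the load worked:
hive> SELECT * FROM stockstemp LIMIT 5;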
Now insert data into the main stocks table from the stockstemp table, loading partition by partition based on the symbol character:
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='A') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'A';
Now let's put all the other queries, loading the remaining partitions from B to Z, in a separate file with a .q extension:
mkdir -p $HIVE_HOME/scripts
cd $HIVE_HOME/scripts
vim loadStocks.q
The Hive script loadStocks.q holds the HQL statements below:
use nyse;
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='B') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'B';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='C') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'C';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='D') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'D';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='E') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'E';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='F') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'F';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='G') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'G';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='H') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'H';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='I') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'I';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='J') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'J';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='K') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'K';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='L') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'L';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='M') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'M';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='N') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'N';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='O') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'O';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='P') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'P';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='Q') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'Q';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='R') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'R';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='S') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'S';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='T') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'T';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='U') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'U';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='V') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'V';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='W') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'W';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='X') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'X';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='Y') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'Y';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='Z') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'Z';

Now run this HQL script:
hive -f loadStocks.q
We can see that a directory is created for each of the above partitions, and the data is stored in files inside these directories. You can now run SELECT queries on the whole stocks table and explore the more advanced features supported by Hive. Programming Hive from O'Reilly is a good book to continue learning Hive.
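You can list the loaded partitions with:
hive> SHOW PARTITIONS stocks;
As a side note, instead of the 25 static-partition INSERT statements above, Hive also supports dynamic partition inserts, which load all partitions in a single statement. A minimal sketch, assuming the same stocks and stockstemp tables as above (on newer Hive versions the exchange column name may need backticks, as it later became a reserved word):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char)
SELECT exchange, symbol, state, price_open, price_high, price_low, price_close, volume, price_adj_close, SUBSTR(symbol,1,1) AS symbol_char
FROM stockstemp;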

