icon_CloudMgmt icon_DollarSign icon_Globe icon_ITAuto icon_ITOps icon_ITSMgmt icon_Mainframe icon_MyIT icon_Ribbon icon_Star icon_User icon_Users icon_VideoPlay icon_Workload icon_caution icon_close s-chevronLeft s-chevronRight s-chevronThinRight s-chevronThinRight s-chevronThinLeft s-chevronThinLeft s-trophy s-chevronDown
BMC Contact Options

Select the link below that best matches your interest.

For questions about BMC products, solutions, and services, you can also phone the number below:



Hive is very similar to Apache Pig. What it does is let you create tables and load external files into tables using SQL. Then it creates MapReduce jobs in Java.  Java is a very wordy language so using Pig and Hive is simpler.

Some have said that Hive is a data warehouse tool (Bluntly put, that means an RDBMS used to do analytics before Hadoop was invented.). In fact you can use Apache Sqoop to load data into Hive or Hadoop from a relational database.


In this document we will introduce Hive mainly by using examples. You need to install Apache Hadoop first. Then you can install MySQL to store metadata. (Or you could use Derby. Here we use MySQL). Hive will create data in Hadoop.

So first install Hadoop. Then execute these steps:

apt-get install mysql-server
apt-get install libmysql-java
Download and extract Hive to /usr/local/hive. Then in .bashrc set:

export HIVE_HOME=/usr/local/hive/apache-hive-2.0.1-bin
export PATH=$PATH:$HIVE_HOME/bin

Make this softlink:

ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar

Then open MySQL to create the schema,etc:

mysql -u root -p
mysql> CREATE DATABASE metastore;
mysql> USE metastore;

Here use the schema that most nearly matches the version you are using. (In this tutorial we are using Hive 2.0.1 but there is no 2.0.1 schema. So we used 2.0.0.)

mysql> SOURCE /usr/local/hive/apache-hive-2.0.1-bin/scripts/metastore/upgrade/mysql/hive-schema-2.0.0.mysql.sql

mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY password';

mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'password';
mysql>  flush privileges;

Now edit $HIVE_HOME/conf/hive-site.xml:

      <description>metadata is stored in a MySQL server</description>
      <description>MySQL JDBC driver class</description>
      <description>user name for connecting to mysql server</description>
      <description>password for connecting to mysql server</description>

If you have set up Hadoop as a cluster you might need to run the next command. You will know if that is the case if you get an error about that.

$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave

Also if you have installed Spark you could choose to execute this to get rid of a warning message-


Now run the Hive shell:


Load external CSV File

Much of what you do with Hive is load external files stored in Hadoop so that you can use SQL to work with them. So here is an example.

Download this file (It is an ssh log.) then open it with Google sheets or Excel or use sed and vi to replace the spaces with commas so that it will be comma-delimited.

Then the top of the file looks like this:

Now copy the file to Hadoop:

hadoop fs -put /home/walker/Downloads/ssh.csv /data/ssh/

Notice that you put the directory name and not the name of the file in Hadoop.

Then we load the file and create the schema all in one step:

sshtime STRING,
sshkey STRING,
sourceIP STRING,
socket STRING,
targetIP STRING,
port STRING,
status STRING,
direction STRING,
software STRING,
device STRING,
junk1 STRING,
junk2 STRING,
junk3 STRING,
Junk5 STRING )
   COMMENT 'ssh log'
    location '/data/ssh/';

Python User Defined Function (UDF)

Before we show some SQL commands (Technically they are called HQL, Hive Query Language.) we will show how to write a Python UDF:

Create the Python program ssh.py. Below we explain how it works.

import sys
	import datetime
	import re

def bi(s):
    if g:
        return bin(int(g.group(1))) + bin(int(g.group(2))) + bin(int(g.group(3)))

for line in sys.stdin:
	x, y, z = line.strip().split('\t')
	print x,bi(y),bi(z)

Add it to make it available for Hive to execute this command in the Hive shell.

add file /home/walker/Downloads/ssh.py

Then run this transform operation calling the Python UDF:

SELECT TRANSFORM (sshtime, sourceip, targetip) USING 'python /home/walker/Downloads/ssh.py' as (sshtime, sourceIP, targetip) FROM sshlog;

1332017778 0b110000000b101010000b11001010 0b110000000b101010000b10101	NULL	NULL
1332017793 0b110000000b101010000b11001010 0b110000000b101010000b10101	NULL	NULL

The Python function works off stdin and stdout. We split the input into a time and then two IP address fields. Then we run a function to convert the IP address to bit format. The TRANSFORM statement means pass those three values to the UDF.

Exercise: why does it return 2 null columns?

Create table insert data

No we go back and do something simpler - create a table and insert some data into it to illustrate HQL functions.

First create a table. People who know SQL will see that it is almost the same syntax.

create table students (student string, age int);

Then add some data into it:

insert into table students values('Walker', 33);
insert into table students values('Sam', 33);
insert into table students values('Sally', 33);
insert into table students values('Sue', 72);
insert into table students values('George', 56);
insert into table students values('William', 64);
insert into table students values('Ellen', 24);
insert into table students values('Jose', 72);
insert into table students values('Li', 56);
insert into table students values('Chris', 64);
insert into table students values('Ellen', 24);
insert into table students values('Sue', 72);
insert into table students values('Ricardo', 56);
insert into table students values('Wolfgang', 64);
insert into table students values('Melanie', 24);
insert into table students values('Monica', 36);


Now we can run regular SQL commands over that. Remember this is not a relational database. Hadoop as you will recall does not allow updating files: only adding and deleting them. So it creates new files for every operation.

select count(*) from students;

Here we can find all students whose birthday is a multiple of 3 by using the modulo (is divisible by) function. (Hive has many math functions.)

select age,pmod(age,3) from students where pmod(age,3) = 0;

Now, without discriminating against older people, we create two new tables: one with people older than 45 and one for people younger than that:

create table old as select * from students where age > 45;

eate table young as select * from students where age <= 45;


Now we can make the intersection of two tables by showing students from the old and new table whose name is 3 letters long:

select young.student, old.student from young join old on (length(young.student) = 3) and (length(old.student) = 3);

It responds:

Sam	Sue
Sam	Sue


Hive supports maps, structs, and array complex types. But it does not support the ability to add data to those with SQL yet. So we show the sort of awkward way of doing that below.

First, create a table with a map column, meaning a (key->value) column:

create table prices(product map);

Now we use the students table we created above to stand in as a proxy for this insert operation. You can use any table for that

insert into prices select map("abc", 1) from students limit 1;

Now you can see the data we just inserted:

select * from prices;

Where to go next

From here there are many areas where you could focus your learning as Hive has many features. For example you can learn about partitions, the decimal data type, working with dates, using Hive on Amazon, and using Hive with Apache Spark. You could learn about Beeline which is a newer Hive command line interface. (Beeline will replace the Hive cli in the future.) And you can dig into architectural like topics like SerDe, which is the Hive serializer-deserializer and Hive file storage formats.

Explore other BMC Guides ›

Share on Social Media

BMC Solutions for Hadoop & Big Data

Control-M for Big Data

Realize the benefits of big data faster

Learn More | Free Trial

Control-M Application Integrator

Automate the critical workloads that power your business

Learn More

TrueSight Capacity Optimization for Hadoop

Plan and optimize Hadoop IT infrastructure resources

Learn More | Free Trial

BMC Discovery (formerly ADDM)

Identify and model your complete big data infrastructure

Learn More | Free Trial