Apache Hive is a tool that runs on Hadoop systems and allows querying data stored in HDFS as if it were a relational SQL database. Hive is a high-level abstraction on top of MapReduce that lets us generate jobs using statements in a SQL-like language called HiveQL. Using Hive is much faster and easier than writing your own MapReduce jobs in Java, with similar performance. Furthermore, it is possible to set Spark as its execution engine instead of MapReduce, improving performance.

We have to keep in mind that Hive does not transform HDFS data into a relational database; it only provides access to it through HiveQL queries.

How to use Hive?

We can make use of Hive from a web browser by using Hue (Hadoop User Experience), through a command line using the Beeline tool, or programmatically using the appropriate JDBC or ODBC connector for it. In any case, it will be necessary to indicate the IP address, port, and access credentials to Hive.

Hive operates with tables, so the first thing to do is to define these tables in Hive. To create a table, we have to give it a name and indicate the name and type of each column (as we would in a SQL database). In addition, we must define the directory in which the data to be loaded will be located, and in what format it will be stored (by default it will be in the directory “/user/hive/warehouse/<tablename>”). We must be careful that the structure of the fields we define when creating the table matches the fields of the data in our files.

In this way, Hive creates metadata indicating the structure of the data and where this data is stored, so we can query the data as we wish. As in a relational database, we can perform complex queries with joins between tables and some of the SQL functions.
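As a minimal sketch of this workflow, the table definition and query below are illustrative only (the `movies` table, its columns, and the comma delimiter are assumptions, not from the original):

```sql
-- Hypothetical example: a comma-delimited text file of movies in HDFS
CREATE TABLE movies (
    id INT,
    title STRING,
    year INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Once the metadata exists, we can query the files as a regular SQL table
SELECT title, year
FROM movies
WHERE year >= 2000;
```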

List of useful commands for Hive

Below, I compile a list of useful commands for Hive that I have extracted from the manuals of the official Cloudera courses: “Big Data Developer for Apache Spark and Hadoop” and “Designing and Building Big Data Applications”. If you need more information, you can always check the official Hive documentation.

Init Beeline

beeline -n <usr> -p <pwd> \
  -u jdbc:hive2://<host>:<port>/<db>

Launch HiveQL query with Beeline

beeline -n <usr> -p <pwd> \
  -u jdbc:hive2://<host>:<port>/<db> \
  -e "<query>"

Launch HiveQL query from a file with Beeline

beeline -n <usr> -p <pwd> \
  -u jdbc:hive2://<host>:<port>/<db> \
  -f <file.hql>

HiveQL Commands

Show tables

SHOW TABLES;

Show table description

DESCRIBE <table>;

Comments (only Hue and Scripts)

Using “--”

Display list of Hive functions

SHOW FUNCTIONS;

Show function description

DESCRIBE FUNCTION <function>;

Show detailed function description

DESCRIBE FUNCTION EXTENDED <function>;

Common functions

ROUND(<number>,<n-decimals>)

CEIL(<number>)

FLOOR(<number>)

YEAR(<timestamp>)

SUBSTRING(<string>,<start>,<length>)

TO_UTC_TIMESTAMP(<timestamp>,<timezone>)

CAST(<val> AS <type>)
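As an illustration only, the functions above could be combined in a single query like the following (the `orders` table and its columns are hypothetical):

```sql
-- Hypothetical example combining common Hive built-in functions
SELECT
    ROUND(price, 2)                AS rounded_price,   -- two decimals
    CEIL(price)                    AS price_up,
    FLOOR(price)                   AS price_down,
    YEAR(order_ts)                 AS order_year,
    SUBSTRING(product_name, 1, 10) AS short_name,      -- first 10 characters
    CAST(quantity AS BIGINT)       AS quantity_big
FROM orders;
```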

Aggregation functions for GROUP BY

COUNT(*)

COUNT(<col>)

COUNT(DISTINCT <col>)

MAX(<col-number>)

MIN(<col-number>)

SUM(<col-number>)

AVG(<col-number>)
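A sketch of how these aggregations fit together in a GROUP BY query (table and column names are assumptions for illustration):

```sql
-- Hypothetical example: aggregate order statistics per customer
SELECT
    customer_id,
    COUNT(*)                   AS num_orders,
    COUNT(DISTINCT product_id) AS distinct_products,
    MAX(total)                 AS max_total,
    MIN(total)                 AS min_total,
    SUM(total)                 AS sum_total,
    AVG(total)                 AS avg_total
FROM orders
GROUP BY customer_id;
```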

Column types in Hive

STRING

BOOLEAN

TIMESTAMP

INT

BIGINT

FLOAT

DOUBLE

Create a table in Hive (TextFile format by default)

CREATE TABLE <table> (
    <name> <type>,
    ...
);

Create a table with a different field delimiter

CREATE TABLE <table> (
    <name> <type>,
    ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<symb>';

Create a table stored as SequenceFile

CREATE TABLE <table> (
    <name> <type>,
    ...
)
STORED AS SEQUENCEFILE;

Create a table stored as Avro

CREATE TABLE <table> (
    <name> <type>,
    ...
)
STORED AS AVRO;

Drop table (BE CAREFUL! removes both the metadata and the files in HDFS)

DROP TABLE <table>;

Drop table if exists

DROP TABLE IF EXISTS <table>;

Create a table indicating an HDFS directory

CREATE TABLE <table> (
    <name> <type>,
    ...
)
LOCATION '<dir>';

Create an external table (files are not deleted when the table is dropped)

CREATE EXTERNAL TABLE <table> (
    <name> <type>,
    ...
)
LOCATION '<dir>';
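A concrete sketch of an external table (the `web_logs` name, columns, delimiter, and directory are hypothetical): because the table is EXTERNAL, dropping it removes only the Hive metadata, not the files under the given location.

```sql
-- Hypothetical example: external table over log files that Hive should not own
CREATE EXTERNAL TABLE web_logs (
    ip STRING,
    request STRING,
    status INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';
```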

Populate a table by moving files into the table directory

hdfs dfs -mv <file> <table-dir>

Populate a table with a HiveQL command (note: LOAD DATA INPATH moves the file into the table directory, it does not copy it)

LOAD DATA INPATH '<file>' INTO TABLE <table>;
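For example, with a hypothetical file path and table name (both assumptions, not from the original):

```sql
-- Hypothetical path and table; after this, the file no longer exists
-- at its original HDFS location, since LOAD DATA INPATH moves it
LOAD DATA INPATH '/user/alice/sales.csv' INTO TABLE sales;
```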

Add rows to an existing table

INSERT INTO TABLE <table>
<query>;

Create a table and populate (CTAS)

CREATE TABLE <table> AS
<query>;

Create a table and populate in a specified format

CREATE TABLE <table>
STORED AS <format>
AS
<query>;
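Putting the CTAS pattern together, a hedged example (table names, columns, and the choice of Avro are all illustrative):

```sql
-- Hypothetical CTAS: materialize an aggregation, stored as Avro
CREATE TABLE top_customers
STORED AS AVRO
AS
SELECT customer_id, SUM(total) AS total_spent
FROM orders
GROUP BY customer_id;
```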