Cloudera Impala is another tool that lets you query data stored in Hadoop file systems with a language very similar to SQL. It is designed to return results with low latency, which makes it ideal for interactive queries.

It is very similar to Hive since, in essence, they have the same purpose: retrieving HDFS data with SQL statements. However, there are certain differences that make Impala more efficient on certain occasions.

Impala or Hive? When should we use each one?

On the one hand, Hive works on top of MapReduce, which makes it a very robust product. This gives it fault tolerance, a feature that Impala lacks, and also makes it better suited to complex queries that require a lot of computational resources.

On the other hand, Impala is written in C++ and optimized for performance: it can return results in a fraction of a second, which makes it ideal for interactive queries. But, as I mentioned before, it does not provide fault tolerance, and it also has certain memory limitations that make it an inappropriate tool for heavy queries.

In summary, Impala should be used in those cases where we need intensive querying and can tolerate some level of failure in the results, such as simple queries issued from the front end of a web application or from Business Intelligence tools. On the other hand, Hive should be used to execute queries that have a higher computational cost and are not run intensively, or for those occasions in which failure is not acceptable.

How to use Impala?

As in the case of Hive, we can use Impala from Hue, from the command line with impala-shell, or programmatically by using the appropriate JDBC or ODBC connector.

The way of operating with Impala is practically identical to Hive. Impala also uses tables to perform queries on the data stored in HDFS, so we have to create the tables in Impala to be able to access that data. The language Impala uses is the same as Hive's, with the exception of some functions.

The relationship with Hive is so strong that tables created with one tool can be accessed from the other! This is because they share their metadata. However, it is important to note that if the data in a table is modified from Hive or in any other external way, Impala's metadata must be refreshed manually for the changes to become visible in Impala.
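As an illustrative sketch (the table name sales is hypothetical), the refresh would be issued from Impala after the external change. Impala distinguishes two cases: REFRESH reloads the file metadata of an existing table, while INVALIDATE METADATA is needed when the table itself was created or altered outside Impala:

```sql
-- In Impala, after new data files were added to an existing table from Hive:
REFRESH sales;

-- In Impala, after the table itself was created or altered outside Impala:
INVALIDATE METADATA sales;
```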

List of useful commands for Impala

Below, I compile a list of useful commands for Impala that I have extracted from the manuals of the official Cloudera courses: “Big Data Developer for Apache Spark and Hadoop” and “Designing and Building Big Data Applications”. If you need more information, you can always check the official Impala documentation here.

Launch Impala Shell

impala-shell

Connect to Impala Shell in another server

impala-shell -i <host>:<port>

Launch query from file

impala-shell -f <file.sql>

Launch query from CLI

impala-shell -q '<query>'

Launch query and write output in a file

impala-shell -f <file.sql> \
--delimited \
--output_delimiter='<symbol>' \
-o <file-dir>

Insert row in a table

INSERT INTO <table> VALUES(<val1>,<val2>,...,<valN>);

Comments

/* */ anywhere; -- for single-line comments

Show Impala functions

SHOW FUNCTIONS;

Update all Impala metadata

INVALIDATE METADATA;

Update table metadata

INVALIDATE METADATA <table>;

Create a table in Impala

CREATE EXTERNAL TABLE <table>(
    <name> <type>,
    ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<symbol>'
LOCATION '<dir>';
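As an illustrative sketch (the table name, columns, delimiter, and HDFS path are all hypothetical), a definition following this template could look like:

```sql
-- Hypothetical external table over comma-separated files in HDFS;
-- the data stays in place, only the table definition is created.
CREATE EXTERNAL TABLE customers(
    id INT,
    name STRING,
    city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/customers';
```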

Create a table from a query (CTAS):

CREATE TABLE <table>
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<symbol>'
AS
<query>;
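For example, a hypothetical CTAS (table and column names are illustrative) that materializes an aggregate as a new tab-delimited table:

```sql
-- Hypothetical CTAS: the new table's schema is inferred from the query
CREATE TABLE sales_by_city
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
AS
SELECT city, SUM(amount) AS total
FROM sales
GROUP BY city;
```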

Supported file formats

TEXTFILE

SEQUENCEFILE

AVRO

RCFILE

PARQUET

Create a table from file with a specific format

CREATE TABLE <table>(
    <name> <type>,
    ...
)
STORED AS <file-format>;
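As a sketch with hypothetical names, a table stored in Parquet (a column-oriented format that Impala handles particularly well) would look like:

```sql
-- Hypothetical table stored as Parquet files
CREATE TABLE events_parquet(
    event_id BIGINT,
    event_type STRING,
    ts TIMESTAMP
)
STORED AS PARQUET;
```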