When working with Big Data in Hadoop environments, a very useful command-line tool is Apache Sqoop. It allows us to import data stored in relational databases into HDFS, as well as to export data from HDFS back to relational databases. The name comes from SQL + Hadoop, and under the hood Sqoop runs MapReduce jobs to do its work.

An important point about Sqoop is that data extracted from a database loses its “relational” aspect; that is, we are extracting the raw information and storing it in whatever file format we consider appropriate, whether plain text, comma-separated values, tab-separated values, or even special formats such as Avro or Parquet.

When exporting data from HDFS to a database, we also have to take into account that the fields of our source files must match the columns of the target table.

If we later want to work with the imported data in HDFS in a “relational” way, we will have to use other tools such as Hive or Impala.

How to use Sqoop?

In general, for each action we want to perform with Sqoop, we connect to the desired database through a JDBC connector, specifying the host (or IP address), port, and access credentials. Next, we add the parameters that configure our request.
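
For example, a filled-in connection looks like the following. This is only a sketch: the host, port, database name, and credentials are hypothetical placeholders that you would replace with your own.

sqoop list-tables \
  --connect jdbc:mysql://dbhost.example.com:3306/loudacre \
  --username training --password training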

Below, I have compiled a list of useful Sqoop commands extracted from the manuals of the official Cloudera courses “Big Data Developer for Apache Spark and Hadoop” and “Designing and Building Big Data Applications”. If you need more information, you can always check the official Sqoop documentation here.

Basic Syntax

Sqoop is a command-line utility with several subcommands, called tools. There are tools for import, export, listing database contents, and more.

List all tools

sqoop help

Basic syntax of an invocation

sqoop <tool-name> <options>

Exploring a Database

List all tables in a Database

sqoop list-tables \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd>

Run a query against a database

sqoop eval \
  --query "<query>" \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd>
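
As a quick hedged example (the table name and connection details are hypothetical), sqoop eval prints the query result to the console without importing anything, which is handy for inspecting a table before a full import:

sqoop eval \
  --query "SELECT * FROM accounts LIMIT 10" \
  --connect jdbc:mysql://dbhost/loudacre \
  --username training --password training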

Importing Data

Imports are performed using MapReduce jobs. Sqoop begins by examining the table to be imported: it determines the primary key, if possible, runs a boundary query to see how many records will be imported, and divides the result of the boundary query by the number of tasks (mappers). Sqoop also generates a Java source file for each table being imported; the file remains after the import completes.
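
To make the split mechanics concrete, here is a hedged sketch in which the host, database, table, column, and credentials are all hypothetical. Importing the accounts table on its primary key acct_num with four mappers makes Sqoop run a boundary query of the form SELECT MIN(acct_num), MAX(acct_num) FROM accounts, split that range into four roughly equal intervals (one per map task), and write the results to HDFS as files part-m-00000 through part-m-00003.

sqoop import --table accounts \
  --connect jdbc:mysql://dbhost/loudacre \
  --username training --password training \
  --split-by acct_num \
  -m 4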

Importing an Entire Database

Import all tables from a database

sqoop import-all-tables \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd>

Import all tables from a database into an alternate base directory in HDFS

sqoop import-all-tables \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --warehouse-dir <hdfsdir>

Importing a Single Table

Import a single table

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd>

Importing Partial Tables

Import only specified columns from a table

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --columns "<col1>,<col2>,..,<colN>"

Import only matching rows from a table

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --where "<conditions>"

Import free-form query result

sqoop import \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --target-dir <dir> \
  --split-by <primarykey> \
  --query '<query>'
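
Note that for a free-form query import, Sqoop expects the literal token $CONDITIONS in the WHERE clause so that each mapper can append its own split predicate, and the query is usually wrapped in single quotes so the shell does not expand it. A hedged example in which the table, columns, target directory, and connection details are hypothetical:

sqoop import \
  --connect jdbc:mysql://dbhost/loudacre \
  --username training --password training \
  --target-dir /loudacre/accounts_subset \
  --split-by acct_num \
  --query 'SELECT acct_num, first_name, last_name FROM accounts WHERE $CONDITIONS'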

Importing File Options

Specifying a File Location

Import a table into an alternate directory (by default, data lands in a subdirectory of your HDFS home directory)

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --target-dir <dir>

Import a table with an alternate field delimiter (comma-delimited by default)

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --fields-terminated-by "<symb>"
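
For instance, a hedged sketch that writes tab-separated output (the table and connection details are hypothetical):

sqoop import --table accounts \
  --connect jdbc:mysql://dbhost/loudacre \
  --username training --password training \
  --fields-terminated-by "\t"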

Using Compression

Import table in a compressed file (Gzip format)

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  -z

Import table in a compressed file (Snappy format)

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec

Storing Data in Other Formats

Import table in Parquetfile format

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --as-parquetfile

Import a table in SequenceFile format

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --as-sequencefile

Import a table in Avro format

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --as-avrodatafile

Import table incrementally (import only new records)

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --incremental append \
  --check-column <column-name> \
  --last-value <value>
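
A hedged example (the column name, last value, and connection details are hypothetical): if a previous import stopped at acct_num 10000, the following brings in only the rows added since then, and at the end of the job Sqoop reports the new --last-value to use on the next run. For tables with a timestamp column that is updated on every change, there is also an --incremental lastmodified mode.

sqoop import --table accounts \
  --connect jdbc:mysql://dbhost/loudacre \
  --username training --password training \
  --incremental append \
  --check-column acct_num \
  --last-value 10000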

Import a table, handling null values (writing nulls as \N matches Hive's default text representation)

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --null-string "\\N" \
  --null-non-string "\\N"

Suggest the number of map tasks (degree of parallelism) for a table import (the default is 4)

sqoop import --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  -m <number>

Exporting Data

Export a table to a database (the target table must be created in the database beforehand)

sqoop export --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --export-dir <dir> \
  --update-mode <allowinsert|updateonly>
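
When updating existing rows, Sqoop also needs to know which column identifies a row, which is what --update-key is for. A hedged example (the table, key column, export directory, and connection details are hypothetical) that updates existing accounts and inserts new ones:

sqoop export --table accounts \
  --connect jdbc:mysql://dbhost/loudacre \
  --username training --password training \
  --export-dir /loudacre/accounts_updated \
  --update-key acct_num \
  --update-mode allowinsert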

Export a table to a database, handling null values

sqoop export --table <table> \
  --connect jdbc:mysql://<host>/<db> \
  --username <usr> --password <pwd> \
  --export-dir <dir> \
  --update-mode <allowinsert|updateonly> \
  --null-string "\\N" \
  --null-non-string "\\N"