RDB to HDFS

# Table with a primary key: Sqoop can split on it, so --split-by is not needed (a single mapper is used here)
sqoop import --connect jdbc:mysql://localhost/sqoopdb --username hiveuser --password hivepassword --table emp -m 1
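
To check where the import landed, we can list the files on HDFS (a quick sketch; the path assumes the OS user is hiveuser and the default /user/<username>/emp location):

hdfs dfs -ls /user/hiveuser/emp
# Each mapper writes a part file; with -m 1 there is a single part-m-00000
hdfs dfs -cat /user/hiveuser/emp/part-m-00000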
 
# --append adds new data to the existing HDFS directory
# --split-by is needed when more than one mapper is used
sqoop import --connect jdbc:mysql://localhost/sqoopdb \
--username hiveuser --password hivepassword --table emp \
--append --split-by empno -m 3

Incremental Loading of Data

sqoop-import --connect jdbc:mysql://localhost/sqoopdb \
--username hiveuser --password hivepassword --table emp -m 1 \
--incremental append --check-column empno --last-value 10
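
To avoid updating --last-value by hand, the incremental import can be saved as a Sqoop job; Sqoop then records the last value in its metastore and advances it on every run (a sketch; the job name empjob is just an example):

# Create a saved job (note the lone -- before the tool name)
sqoop job --create empjob -- import --connect jdbc:mysql://localhost/sqoopdb \
--username hiveuser --password hivepassword --table emp -m 1 \
--incremental append --check-column empno --last-value 0

# Run it; later runs import only rows added since the previous run
sqoop job --exec empjob

# List the saved jobs
sqoop job --list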

Load only specific columns from a table

sqoop import --connect jdbc:mysql://localhost/sqoopdb \
--username hiveuser --password hivepassword --table emp \
--columns "empno,name" --append -m 1

Import all tables from the RDB (all tables should have a primary key)

sqoop import-all-tables --connect jdbc:mysql://localhost/userdb --username root
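
If some tables should be skipped (for example ones without a primary key), they can be left out with --exclude-tables, and --warehouse-dir puts all the imports under one parent directory (a sketch; the table names and directory are examples):

sqoop import-all-tables --connect jdbc:mysql://localhost/userdb --username root -P \
--warehouse-dir /user/hiveuser/userdb --exclude-tables emp_logs,emp_temp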

NOTE

  • When using a text column for --split-by, an extra parameter needs to be enabled: -Dorg.apache.sqoop.splitter.allow_text_splitter=true (see the example after this list)
  • The data is stored on HDFS under /user/<username> (the Sqoop staging area)
  • We can save the data to other locations, but it is always loaded into the staging area first
  • Sqoop can only load data into a target once (the existing data needs to be dropped to reload the entire dataset)
  • For tables that don’t have a primary key we need to specify the number of mappers (default mappers: 4)
  • Mappers specify the number of parallel processes used to load the data
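
As noted in the first bullet above, splitting on a text column needs the allow_text_splitter property; the -D option must come right after the tool name, before the tool-specific arguments (a sketch; splitting on a text column called ename is an assumption):

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://localhost/sqoopdb --username hiveuser --password hivepassword \
--table emp --split-by ename --append -m 3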

HDFS to RDB

# Create the target table in MySQL first (run in the mysql client)
create table patmysql(pid int, pname varchar(10), drug varchar(10), gender varchar(10), amt int);
 
sqoop export --connect jdbc:mysql://localhost/sqoopdb \
--username hiveuser --password hivepassword --table patmysql \
--export-dir /user/hive/warehouse/lti871.db/patient -m 1 \
--input-fields-terminated-by ','

IMPORTANT

Before exporting data, make sure the target table already exists in the database; if it doesn't, an error will be thrown.
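
If the MySQL table already holds rows, a plain export simply inserts everything again. One option is an update export: with --update-key the matching rows are updated, and --update-mode allowinsert also inserts rows that don't exist yet (a sketch based on the command above; matching on pid is an assumption):

sqoop export --connect jdbc:mysql://localhost/sqoopdb \
--username hiveuser --password hivepassword --table patmysql \
--export-dir /user/hive/warehouse/lti871.db/patient -m 1 \
--input-fields-terminated-by ',' \
--update-key pid --update-mode allowinsert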

RDB to Hive

sqoop-import --connect jdbc:mysql://localhost/sqoopdb --username hiveuser --password hivepassword \
--table emp --create-hive-table --hive-table lti871.employeehive --hive-import \
--fields-terminated-by ',' -m 1

We don’t have to create the table in Hive; it can be created automatically. The data is first written into the staging area and then loaded into Hive. If the same data already exists in the staging area, the job will fail.
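
To reload the same table without the job failing, one option is --hive-overwrite, which replaces the data in the existing Hive table, together with --delete-target-dir to clear a leftover staging directory (a sketch based on the command above):

sqoop import --connect jdbc:mysql://localhost/sqoopdb --username hiveuser --password hivepassword \
--table emp --hive-import --hive-table lti871.employeehive --hive-overwrite \
--delete-target-dir --fields-terminated-by ',' -m 1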

Options

Location to save data
--target-dir <hdfs-dir>

Filter Data
--where "ename='David'"

Incremental loading from RDB into HDFS
--incremental append|lastmodified --check-column <column-name> --last-value <value>

Instead of hardcoding the password, a prompt will be shown
-P

Creates a new managed table in Hive
--create-hive-table
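
A sketch combining several of these options (the filter value, target directories, and the updated_at timestamp column are assumptions):

# Prompt for the password, filter rows, and write to a chosen directory
sqoop import --connect jdbc:mysql://localhost/sqoopdb --username hiveuser -P \
--table emp --where "ename='David'" --target-dir /user/hiveuser/emp_david -m 1

# lastmodified mode uses a timestamp column instead of an increasing id;
# --append (or --merge-key) is needed when the target directory already exists
sqoop import --connect jdbc:mysql://localhost/sqoopdb --username hiveuser -P \
--table emp --incremental lastmodified --check-column updated_at \
--last-value "2024-01-01 00:00:00" --target-dir /user/hiveuser/emp_lm --append -m 1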
