Sqoop – Hive Job Failed When Executing With Query Option


In this article, experts from big data solutions companies discuss the two options available for importing data into Hive from an RDBMS. Read the full story to learn about the Sqoop-Hive failure when executing with the query option.

The promise of data-driven decision-making is now broadly recognized, and there is growing enthusiasm for the notion of "Big Data". While the promise of Big Data is real (for example, it is estimated that Google alone contributed 54 billion dollars to the US economy in 2009), there is currently a wide gap between its potential and its realization.

Among the many phases of a Big Data analysis pipeline, data ingestion is one of the earliest, and it comes with challenges of its own. This article focuses on one such challenge.

Technology:  

Sqoop and Oozie are two major components of the Hadoop ecosystem, used for data ingestion and job scheduling respectively, but there are certain loopholes when we integrate the two. We encountered one such issue and thought of sharing it with you.

Use-case:

Below is the sample data in the MySQL table sqoop_test1, which we are going to load into a Hive table with a Sqoop command.

[Screenshot 1: sample data in the MySQL table sqoop_test1]

When you want to import the whole MySQL table into a Hive table through Sqoop, use the below Sqoop command to achieve the result.

Specify the MySQL table name and the hive-import clause along with the Hive table name.

[Screenshot 2: Sqoop import command with the hive-import clause]
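The full-table import command shown in the screenshot can be sketched as below. This is a reconstruction, not the exact command from the screenshot: the connection string, credentials, and single-mapper setting are assumed from the query-based commands later in this article.

```shell
# Import the whole MySQL table sqoop_test1 into the Hive table hive_sqoop1
# (connection details assumed from the later examples in this article)
sqoop import \
--connect jdbc:mysql://localhost/mysql_db \
--username root \
--password cloudera \
--table sqoop_test1 \
--hive-import --hive-table hive_sqoop1 \
--m 1
```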

We can see the expected output in the Hive table named hive_sqoop1.

Note: The Hive table will be created automatically by the Sqoop command if it does not exist already; if the table is present, Sqoop will load the data into it.

[Screenshot 3: output data in the Hive table hive_sqoop1]

But when we executed a similar Sqoop command to select data from the MySQL table based on certain conditions (in the form of a query), it failed.

Below is the Sqoop command which we used to achieve our result.

sqoop import \
--connect jdbc:mysql://localhost/mysql_db \
--username root \
--password cloudera \
--query 'select id, name from sqoop_test1 where id > 3 and $CONDITIONS' \
--hive-import --hive-table hive_sqoop2 \
--m 1

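Note the $CONDITIONS token in the --query clause: Sqoop requires it in every free-form query and replaces it with a predicate of its own, first with a condition matching no rows to fetch only the column metadata, and then with per-mapper split conditions (or a condition matching every row when a single mapper is used). A rough illustration of that textual substitution, done here with sed purely for demonstration (the real substitution happens inside Sqoop):

```shell
# The query as passed to Sqoop; single quotes keep $CONDITIONS literal
QUERY='select id, name from sqoop_test1 where id > 3 and $CONDITIONS'

# Metadata pass: a predicate matching no rows, only column types are read
echo "$QUERY" | sed 's/\$CONDITIONS/(1 = 0)/'

# Single-mapper import pass: a predicate matching every row
echo "$QUERY" | sed 's/\$CONDITIONS/(1 = 1)/'
```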

Issue:

When we replaced the command we had used earlier (to load the whole MySQL table into a Hive table) with the query-option command, it failed with an error saying that it is mandatory to specify the destination directory in HDFS, and that we cannot directly give only the Hive table name with the hive-import option.

Resolution:

There are two ways to handle this issue, as discussed below.

  1. After some hit and trial, we came to the conclusion that we can specify the target-dir clause along with the hive-import option, i.e. we can set the target directory in HDFS to the default directory path for the Hive table (/user/hive/warehouse/) and specify the same table name in the hive-import option.

In our case the directory name will be /user/hive/warehouse/hive_sqoop2 and the Hive table name will be hive_sqoop2.

[Screenshot 5: Sqoop command with the target-dir clause and hive-import option]

sqoop import \
--connect jdbc:mysql://localhost/mysql_db \
--username root \
--password cloudera \
--query 'select id, name from sqoop_test1 where id > 3 and $CONDITIONS' \
--target-dir /user/hive/warehouse/hive_sqoop2 \
--hive-import --hive-table hive_sqoop2 \
--m 1


Below is an important INFO message which Sqoop throws at the time of command execution. Have a look at it.

[Screenshot 6: INFO message during Sqoop command execution]

Now the command runs successfully, and you can see the expected output in the Hive table hive_sqoop2.

[Screenshot 7: output data in the Hive table hive_sqoop2]

Below are the screenshots of the backend HDFS directory where the data has been loaded.

[Screenshots 8 and 9: backend HDFS directory with the loaded data]

  2. We can directly import the data to some location in HDFS and then create a Hive external table with the Location clause on top of it.

In our case we have loaded the data into /training/input_data/hive_sqoop2 in HDFS, as you can see in the below command.

sqoop import \
--connect jdbc:mysql://localhost/mysql_db \
--username root \
--password cloudera \
--query 'select id, name from sqoop_test1 where id > 3 and $CONDITIONS' \
--target-dir /training/input_data/hive_sqoop2 \
--m 1

One can see the output of the command below, in the specified directory.

[Screenshots 10 and 11: imported data in /training/input_data/hive_sqoop2]
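The same check can be done from the HDFS shell. The part-file name below is illustrative (with --m 1 there is a single part-m-00000 file; more mappers produce more part files):

```shell
# List the import target directory and inspect the single part file
hdfs dfs -ls /training/input_data/hive_sqoop2
hdfs dfs -cat /training/input_data/hive_sqoop2/part-m-00000
```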

We can now create a Hive external table on top of the output directory (/training/input_data/hive_sqoop2) and see the desired output.
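A minimal sketch of such an external table follows. The two columns come from the query above, but the column types and the comma field delimiter (Sqoop's default for text imports) are assumptions here, not taken from the original table definition:

```shell
# Create an external Hive table over the directory Sqoop wrote to;
# dropping this table later will not delete the underlying data
hive -e "
CREATE EXTERNAL TABLE hive_sqoop2 (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/training/input_data/hive_sqoop2';
"
```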

Although the second approach adds one more step to the whole process, it helps with data refresh: since the Hive table is external, any new files landing in the directory become visible to the table without reloading.

Conclusion:

We can opt for either of the two options discussed above to achieve the result, i.e. to import data into Hive from an RDBMS based on certain conditions.

Hopefully the experts of big data solutions companies have made the options clear that you can use to import data into Hive from an RDBMS. If anything remains unclear, contact the experts today.



Vijay is a compulsive blogger who likes to educate like-minded people on various new technologies and trends. He works with Aegis SoftTech as a software developer and has been developing software for years. Stay Connected to him on Facebook and Google+.