This post is shared by big data and analytics solutions provider to explain the limitation of hive. You will get to know about the external table partitioning issue, which is still under process.
Data is a lifeline of any organization. Without data, an organizations cannot think of functioning. The rate at which data is expanding is unpredictable and now everyone facing big data challenges too, it’s not about large volumes, but data that has varieties, its continuously changing with multiple sources.
Opportunities in Big Data for attaining powerful new insights by molding and analyzing the multiple data sets is enormous, at the same time, there are multiple challenges with big data such as to store, protect and handle data appropriately and to prevent it from becoming a liability. With big data opportunities comes big data challenges.
When we see that ecosystems which are used to tackle Big Data challenges are improving themselves, there are still few issues which act as a residual component, one such issue we have faced when processing data with Hive, thought of sharing with you.
As we generally use partition to improve the process time when implementing Hive queries, however when dealing with partitions in Hive external table we come up with a strange behavior of hive.
Suppose we have data as below under 2 files (hive_temp1 and hive_temp2) under path ‘training/hive/’, in 1st file we have person with year birth as ‘1988’ and in 2nd file with ‘1987’.
We will load both the files into below created Hive-internal table (user_1) with Year of birth (YOB) as partition column.
Loading both the files into respective partitions (hive_temp1 into partition “birth_yr=’1988’” and hive_temp2 into partition “birth_yr=’1987’”.
When we select the data from user_1, we can see that both the files are loaded successfully and data has been correctly captured in hive-internal table.
Verify the same in backend of hive table i.e. in HDFS.
Automatically, hive has created directory named birth_yr=1987 and birth_yr=1988 under table directory and both the files have been loaded into respective file locations.
Now, when we try to implement the same stuff with external table, hive behaves differently.
We have created hive-external table named ‘user_3’ with partition column as ‘birth_yr’ and location pointing towards ‘/test/input_data’.
Creating directory with names ‘birth_yr=1988’ and ‘birth_yr=1987’ under the path mentioned in location clause i.e. ‘/test/input_data’.
Loading data into birth_yr=1988 and birth_yr=1987 directory (in HDFS) from local UNIX box (hive_temp1 and hive_temp2) respectively.
By loading the data into each directory at backend and created an external table, one assumes that while querying the hive table they will expect the same output as hive internal table.
But when we select the data from table ‘user_3’, we cannot see the output as expected. We cannot even see a single row.
This is the limitation of hive and it is still under process that hive doesn’t capture the data in partition column when we create external table, even it didn’t detect the partition directory also.
Delete both the partition directories which we have created with names ‘birth_yr=1988’ and ‘birth_yr=1987’.
Now, Alter the table and add the partition into it with birth_yr=1988 and birth_yr=1987.
This will create the directory in backend (in HDFS) and then we can load the data into respective directories as shown below.
Now we can see the output the in table user_3 with partition column as birth_yr=1988 and birth_yr=1987
Currently this is limitation of hive and Jira case has been opened against it, till then we have to alter the external table and add the partition manually under it whenever one wants to load the data into partitions of hive external table.
If you have anything to ask from big data and analytics solutions provider related to Hive, make comment below. You can even share your views about this limitation of hive in comments with other readers.