Amazon’s Elastic Data Pipeline does a fine job of scheduling data processing activities. It spawns a cluster and executes Hive scripts when the data becomes available, and after all the jobs have completed, the pipeline shuts down the EMR resource and exits. Since the cluster is created and in use only while the scripts are running, you save money by not keeping a dedicated cluster around. But this also creates a problem.
Problem:
Within our team we use EDP to schedule periodic pipelines that execute one or more Hive scripts. Our pipelines do not run on a dedicated cluster; rather, each pipeline creates its own EMR resource. Because each run starts with a fresh cluster, none of our tables exist yet, so we need to add a CREATE EXTERNAL TABLE command to each of our Hive scripts.
This is a recipe for disaster: if multiple scripts use the same table, each one carries its own copy of the CREATE EXTERNAL TABLE command. If the schema of that table were to change (not an uncommon event), we would need to go and update every script that uses the table. And experience has taught me that it’s foolish to rely on a person executing such repetitive manual tasks without error! After all, we are all human.
Ideally, we should have one Hive script that defines the schema for a table, and every Hive script that needs the table should include that schema-defining script. Then, if the schema of any table changes, we only need to update one script.
Solution:
If you run hive -h you get the following:
Usage: hive [-hiveconf x=y]* [<-i filename>]* [<-f filename>|<-e query-string>] [-S]
-i <filename> Initialization Sql from file (executed automatically and silently before any other commands)
-e 'quoted query string' Sql from command line
-f <filename> Sql from file
-S Silent mode in interactive shell where only data is emitted
-hiveconf x=y Use this to set hive/hadoop configuration variables.
-e and -f cannot be specified together. In the absence of these options, interactive shell is started. However, -i can be used with any other options.
To see this usage help, run hive -h
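As a quick illustration of these options before we get to -i (show_tables.hql and some_table here are just hypothetical names), each of the following is a valid way to run a query:
# run an inline query, printing only the result rows
hive -S -e 'SHOW TABLES;'
# run the same query from a file
hive -f show_tables.hql
# set a Hive configuration variable for this invocation
hive -hiveconf hive.cli.print.header=true -e 'SELECT * FROM some_table;'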
The -i option is used to specify initialization files, and that is exactly what we will use to solve our problem. I’ll demonstrate the solution with an example. Say we have a table called abc_employees. Let’s create a script that will initialize the database and create this table for us so that we can use it later, and call this file create_employees_table.hql.
CREATE EXTERNAL TABLE ${COMPANY}_employees (
    first_name string,
    last_name string
) LOCATION 's3://<>';
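For reference, once we pass COMPANY=abc on the command line (as we do below), Hive’s variable substitution expands this into:
CREATE EXTERNAL TABLE abc_employees (
    first_name string,
    last_name string
) LOCATION 's3://<>';
If there is any chance the initialization script will run against a cluster where the table already exists, CREATE EXTERNAL TABLE IF NOT EXISTS is a safe variant; with a fresh EMR cluster per pipeline it is not strictly needed.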
Now let’s create a test file that simply does a SELECT * on the employees table, and call it test.hql.
SELECT * FROM ${COMPANY}_employees;
Notice how we have separated the create command and the analytics query into their own files. We can now run the test.hql script with:
hive -i create_employees_table.hql -f test.hql -d COMPANY=abc -v
[Please note that you need to use -hiveconf instead of -d to set variables if you are not using AWS infrastructure.]
When the above command is executed, Hive first processes the scripts specified by the -i option. So the create_employees_table.hql script executes first and creates the abc_employees table on our cluster. Then, when the test.hql script executes, it runs without any error because the abc_employees table has already been created in the initialization step.
The Hive CLI can take multiple -i options, so you don’t have to put all your create commands in one file.
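For example, if test.hql also needed a (hypothetical) departments table defined in its own create_departments_table.hql, the invocation would simply list both initialization files:
hive -i create_employees_table.hql -i create_departments_table.hql -f test.hql -d COMPANY=abc -v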