Tag: hive
-
Writing Into Dynamic Partitions Using Spark
Hive has a wonderful feature called partitioning: a way of dividing a table into related parts based on the values of certain columns. Using partitions, it's easy to query just a portion of the data, and Hive optimizes data load operations based on the partitions. Writing data into partitions is very easy. You have two options:…
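The post's two options are cut off above; a common pair is static versus dynamic partition inserts. A minimal sketch, with made-up table and column names (the same statements can also be submitted from Spark via spark.sql()):

```sql
-- Dynamic partitioning must be enabled before the second option works.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Option A: static partition, the target partition is named in the query.
INSERT OVERWRITE TABLE events PARTITION (dt = '2015-06-01')
SELECT user_id, action
FROM staging_events
WHERE dt = '2015-06-01';

-- Option B: dynamic partition, Hive derives the partition value from the
-- last column of the SELECT list.
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT user_id, action, dt
FROM staging_events;
```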
-
Parse Json in Hive Using Hive JSON Serde
In an earlier post I wrote a custom UDF to read JSON into my table. Since then, I have also learnt about and used the Hive-JSON-Serde. Using the same example as before, the Hive-JSON-Serde lets you parse that JSON record directly. This is really great! I can now parse more…
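A minimal sketch of what such a table definition can look like (the jar path, table name, and schema below are placeholders, not the example from the post):

```sql
-- Register the Hive-JSON-Serde jar (path is illustrative) and map each line
-- of JSON onto typed columns.
ADD JAR /path/to/json-serde-with-dependencies.jar;

CREATE EXTERNAL TABLE json_events (
  id      BIGINT,
  action  STRING,
  payload STRUCT<client_id:STRING, amount:DOUBLE>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/data/json_events/';

-- Nested fields become addressable with dot notation.
SELECT id, payload.client_id, payload.amount
FROM json_events
LIMIT 10;
```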
-
Writing UDF To Parse JSON In Hive
Sometimes we need to perform data transformations in ways that are too complicated for SQL (even with the UDFs provided by Hive). Let's take JSON manipulation as an example. JSON is widely used to store and transfer data. Hive comes with a built-in json_tuple() function that can extract values for multiple keys at once. But if…
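For reference, json_tuple() is typically used through a LATERAL VIEW; a small sketch with made-up table and key names:

```sql
-- json_tuple extracts several top-level keys from a JSON string in one pass,
-- but it cannot reach into nested objects on its own.
SELECT e.event_id, j.client_id, j.amount
FROM raw_events e
LATERAL VIEW json_tuple(e.json_body, 'client_id', 'amount') j
  AS client_id, amount;
```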
-
Always Specify Region When Calling DynamoDb from Hive
DynamoDB is a key-value store. One can query DynamoDB tables from Hive using the DynamoDBStorageHandler, and it's super easy to set up. Let's say we have built a platform that collects data for various clients, processes the data, and outputs the processed data per client. For our example, let's say each client can be identified by…
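A sketch of such a mapping (table names and the column mapping are invented; the region table property is assumed to be dynamodb.region, which is the part that is easy to leave out):

```sql
-- Column mapping ties Hive columns to DynamoDB attributes; the region
-- property pins the storage handler to the right endpoint.
CREATE EXTERNAL TABLE client_events (
  client_id STRING,
  event_ts  BIGINT
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  'dynamodb.table.name'     = 'ClientEvents',
  'dynamodb.column.mapping' = 'client_id:clientId,event_ts:eventTs',
  'dynamodb.region'         = 'us-west-2'
);
```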
-
CamelCase Partition Column is a Bad Idea in Hive
Outside Java code I prefer snake_case over camelCase. This is mostly a preference without any particularly strong reason: without a proper IDE I find snake_case words easier to read than camelCase words, and Python's naming convention uses snake_case for variable names, reserving CamelCase for class names. Languages like MySQL, Hive, etc. convert everything…
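A small illustration of the kind of mismatch this causes (names invented): Hive lowercases identifiers, so a camelCase partition column quietly loses its casing.

```sql
-- Declared as clientId, but the metastore keeps it as clientid, and Hive
-- writes partition directories as .../clientid=.../
CREATE TABLE events (action STRING)
PARTITIONED BY (clientId STRING);

-- DESCRIBE shows the lowercased name, which will not line up with externally
-- written .../clientId=.../ paths on a case-sensitive store.
DESCRIBE events;
```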
-
Reusing Hive Scripts
Amazon's Data Pipeline does a fine job of scheduling data processing activities. It spawns a cluster and executes a Hive script when the data becomes available, and after all the jobs have completed, the pipeline shuts down the EMR cluster and exits. Since the cluster is only created and in use while the scripts are…
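The post's approach is cut off above; one common way to make a single Hive script reusable across such pipeline runs is hivevar substitution, sketched here with invented names:

```sql
-- Invoked as, for example:
--   hive --hivevar client=acme --hivevar dt=2015-06-01 -f process_client.hql
INSERT OVERWRITE TABLE processed_events PARTITION (dt = '${hivevar:dt}')
SELECT event_id, action
FROM raw_events
WHERE client = '${hivevar:client}'
  AND dt = '${hivevar:dt}';
```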