DynamoDB is a key-value store. You can query DynamoDB tables from Hive using the DynamoDBStorageHandler, and it’s super easy to set up.
Let’s say we have built a platform that collects data for various clients, processes it, and outputs the processed data per client. For our example, each client is identified by a unique tag, so our platform can attribute every order to a tag (that is, to a client). Let’s say we have stored the client mapping in a DynamoDB table called client_mapping. To access this table from Hive, all we need to do is:
DROP TABLE IF EXISTS client_mapping;
-- Load the client-tag mapping from DynamoDB
CREATE EXTERNAL TABLE client_mapping (
  tag_value string,
  client_name string,
  country string
) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "client_mapping",
  "dynamodb.throughput.read.percent" = "0.8",
  "dynamodb.throughput.write.percent" = "0.8",
  "dynamodb.column.mapping" = "tag_value:tag_value,client_name:client_name,country:country"
);
Now you can write the usual Hive queries against the client_mapping table, whose columns are tag_value, client_name, and country.
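For example (a sketch, assuming a hypothetical orders table in Hive that stores one row per order along with the client’s tag), you could count orders per client like this:
-- Hypothetical example: count orders per client via the mapping table
SELECT cm.client_name, cm.country, COUNT(*) AS order_count
FROM orders o
JOIN client_mapping cm ON o.tag_value = cm.tag_value
GROUP BY cm.client_name, cm.country;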
Easy, right? I thought so too, until I stumbled across an issue with our production data. Let me explain.
I created the client_mapping table in two AWS regions, us-east-1 and eu-west-1. The idea was to take advantage of network regionalization: store US and Canadian client data in us-east-1 and European client data in eu-west-1. The Hive script was scheduled using EDP in the respective regions. But I found that the Hive script in eu-west-1 was fetching client_mapping data from us-east-1. Weird!
The fix is to include the dynamodb.region key in TBLPROPERTIES. This ensures that the DynamoDBStorageHandler reads from the table in the correct region. If this key is not present, the default region used is us-east-1.
DROP TABLE IF EXISTS client_mapping;
-- Load the client-tag mapping from DynamoDB
CREATE EXTERNAL TABLE client_mapping (
  tag_value string,
  client_name string,
  country string
) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "client_mapping",
  "dynamodb.region" = "${AWS_REGION}",
  "dynamodb.throughput.read.percent" = "0.8",
  "dynamodb.throughput.write.percent" = "0.8",
  "dynamodb.column.mapping" = "tag_value:tag_value,client_name:client_name,country:country"
);
That’s it. Now we can pass the proper AWS_REGION as an input to the Hive script execution to make sure we read data from the correct region.
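For example (the script name here is made up), Hive’s variable substitution lets us supply the region on the command line, and ${AWS_REGION} in the script picks it up:
hive -hivevar AWS_REGION=eu-west-1 -f load_client_mapping.hql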
Oh, and have I mentioned that all the keys in TBLPROPERTIES should be lowercase? Remember this always!
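That is, the first of these two properties is recognized while the second is not:
-- Recognized by the storage handler:
"dynamodb.region" = "eu-west-1"
-- NOT recognized (mixed case), so the default region would be used:
"dynamodb.Region" = "eu-west-1"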