Creating Hive tables is a common experience to all of us that use Hadoop. It enables us to mix and merge datasets into unique, customized tables. And, there are many ways to do it.
We have some recommended tips for Hive table creation that can increase your query speeds and optimize and reduce the storage space of your tables. And it’s simpler than you might think.
For analysts and data scientists, implementing these methods means you can spend more time learning from the data than learning how to use it, or waiting for queries to return. For engineers and developers, this means you can lower your block count and file sizes, and make your analysts and data scientists happy at the same time.
In a nutshell, the steps are:
Create your table in ORC format
This is the standard way of creating a basic Hive table. The only difference? Instead of using the default storage format of TEXT, this table uses ORC, a columnar file format in Hive/Hadoop that uses compression, indexing, and separated-column storage to optimize your Hive queries and data storage. With this created, data can be freely inserted into it, and data will be converted to this ORC format on-the-fly!Create table statement
CREATE TABLE my_database.my_table ( column_1 string, column_2 int, column_3 double ) STORED AS ORC TBLPROPERTIES('ORC.COMPRESS'='SNAPPY'); -- ensure SNAPPY is uppercase, lowercase triggers a nasty bug in Hive (fixed in later versions)
You can even do it through a convenient CTAS statement. The CTAS method will not work if you want to create a partitioned, external, or list-bucketing table. Use the CTAS method for simple datasets like lookups, or quick and dirty temporary tables, not enormous ones you plan to keep — even small tables can gain from ORC’s benefits.Create table as SELECT
CREATE TABLE my_database.my_table STORED AS ORC TBLPROPERTIES('ORC.COMPRESS'='SNAPPY') as SELECT * FROM my_database.my_other_table WHERE YEAR=2017 AND MONTH=11 AND DAY=30;
Partition your table
Partitioning your tables is a fantastic way to improve the processing times of queries on your table. This is because of how Hive scans partitions to execute job tasks in parallel — partitioning your data logically assists the job planner in that process.Partitioning a table
CREATE TABLE my_database.my_table ( column_1 string, column_2 int, column_3 double ) PARTITIONED BY ( year int, month smallint, day smallint, hour smallint )
If you want a turbo-boost to your queries, use partitioning and the ORC format on your tables. It’s highly recommended. If we had a Hadoopsters Approves stamp, we’d use it here.Partitioning + ORC
CREATE TABLE my_database.my_table ( column_1 string, column_2 int, column_3 double ) PARTITIONED BY ( year int, month smallint, day smallint, hour smallint ) STORED AS ORC TBLPROPERTIES('ORC.COMPRESS'='SNAPPY'); -- ensure SNAPPY is uppercase, lowercase triggers a nasty bug in Hive (fixed in later versions)
Analyze your table when you make changes to it
Analyzing a table (also known as computing statistics) is a built-in Hive operation that you can execute to collect metadata on your table. This can vastly improve query times on the table because it collects the row count, file count, and file size (bytes) that make up the data in the table and gives that to the query planner before execution. By running this query, you collect that information and store it in the Hive Metastore (metadata store), which will make future queries on this table more optimal. Every time the contents of the table change, stats should be recollected.Analyze a table
ANALYZE TABLE my_database.my_table compute statistics;
You can (and really should) analyze partitioned tables as well. In fact, it’s particularly helpful for tables with partitions, as, again, it assists in query planning and optimization when querying the table.Analyze a partitioned table
ANALYZE TABLE my_database.my_table PARTITION (YEAR=2017, MONTH=11, DAY=30) compute statistics;You can also analyze the columns of your table and/or partitions. This is a more intense stat-collecting function that collects metadata on columns you specify, and stores that information in the Hive Metastore for query optimization. That information includes, per column, The number of distinct values, The number of NULL values, Minimum or maximum K values where K could be given by a user, Histogram: frequency and height balanced, Average size of the column, Average or sum of all values in the column if their type is numerical, and Percentiles of the value.Analyzing table columns
ANALYZE TABLE my_database.my_table compute statistics for column1, column2, column3; -- column stats for non-partitioned table ANALYZE TABLE my_database.my_table PARTITION (YEAR=2017, MONTH=11, DAY=30, HOUR=0) compute statistics for column1, column2, column3; -- column stats for single hour of partitioned table ANALYZE TABLE my_database.my_table PARTITION (YEAR=2017, MONTH=11, DAY=30, HOUR) compute statistics for column1, column2, column3; -- column stats for a single day of partitioned tableMake sure (when you’re ready to query your table) you have these Hive settings enabled to ensure you make the most use out of your calculated stats!
set hive.compute.query.using.stats=true; set hive.stats.fetch.column.stats=true;
Use ORC, partitioning, and analyzing for a powerful combo
- Build your table with partitions, ORC format, and SNAPPY compression.
- Analyze your table when you make changes or add a partition, and analyze the partition.
- Analyze the columns you use most often (or all of them) at the partition level when you add a partition.
This article was written by Landon Robinson, senior software engineer at SpotX.