Optimize Hive queries for Tableau – Five Tips

Posted on Posted in Business Intelligence

 

1. Denormalize your data.

In Hive a good practice is to denormalize your data.  With De-normalisation, the data is present in the same table so there is no need for any joins, hence the selects are very fast.

 

2. Execution Engine

Using your execution engine as tez or spark will optimize hive queries.
Optimize Hive

3. Usage of proper file format

Using proper file format based on your data will drastically optimize hive query performance. Generally ORC file format is best suitable for increasing your query performance.  PARQUET format is another alternative.  PARQUT format can be read by both Impala and Hive.

Optimize Hive

4. Optimize Hive Query by Vectorization

Vectorization improves operations like scans, aggregations, filters and joins, by performing them in batches of 1024 rows at once instead of single row each time.  Operations are performed on the entire column vector, which improves the instruction pipelines and cache usage.  This can optimize hive queries.

Optimize Hive
Optimize Hive

 

5. Cost based optimization

Optimize hive query’s logical and physical execution plan before submitting for final execution. These optimizations are not based on the cost of the query.

To collect statistics for the Hive query optimizer, set two parameters:

Optimize Hive

and then within Hive, enter the command:

Hive Optimization

 

There are a number of techniques available for improving the performance of Tableau visualizations and dashboards built from Hive. While Hadoop is a batch-oriented system, optimization hints listed above could reduce latency.

 

Visit our Tableau Dashboard Gallery.