Abstract

This paper investigates job-level performance optimization of Hive and Spark SQL and compares the two systems. First, we compare Hive and Spark SQL on ten SQL queries. By analyzing how different file formats and compression strategies affect performance across query types, we find that Spark SQL supports Parquet well, whereas Parquet does not yield a comparable advantage in Hive. Snappy is more effective for compressing intermediate data, and Parquet combined with Snappy outperforms ORC. Second, we tune Hive beyond its default configuration: we adjust the number of map and reduce tasks, optimize the join strategy, and eliminate the effects of data skew, improving Hive performance by 10% to 75% or more depending on the workload type. We also optimize Spark SQL by increasing parallelism and refining its join methods.
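
To make the tuning dimensions mentioned above concrete, the statements below sketch the kinds of settings involved (Parquet with Snappy compression, intermediate-data compression, reducer count, map-side and skew joins, and Spark SQL shuffle parallelism). The table name and the specific values are illustrative assumptions, not the exact parameters used in the experiments.

    -- Store a table as Parquet with Snappy compression (table name is hypothetical)
    CREATE TABLE lineitem_parquet
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
    AS SELECT * FROM lineitem;

    -- Hive: compress intermediate (map output) data with Snappy
    SET hive.exec.compress.intermediate = true;
    SET mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

    -- Hive: control the number of reduce tasks, enable map-side and skew joins
    SET hive.exec.reducers.bytes.per.reducer = 268435456;
    SET hive.auto.convert.join = true;
    SET hive.optimize.skewjoin = true;

    -- Spark SQL: raise shuffle parallelism and the broadcast-join threshold
    SET spark.sql.shuffle.partitions = 400;
    SET spark.sql.autoBroadcastJoinThreshold = 104857600;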

