Abstract

This paper investigates job-level performance optimization of Hive and Spark SQL and compares the two systems. First, we compare Hive and Spark SQL on ten SQL queries. By analyzing how different file formats and compression strategies affect performance across query types, we find that Spark SQL supports Parquet well, whereas Parquet does not yield a comparable advantage in Hive. Snappy is more effective for compressing intermediate data, and Parquet combined with Snappy outperforms ORC. Second, we tune Hive beyond its default configuration: we adjust the number of map and reduce tasks, optimize the join strategy, and eliminate the effects of data skew, improving Hive performance by 10% to 75% or more depending on the workload type. We also optimize Spark SQL by increasing parallelism and refining its join methods.
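
To make the tuning dimensions mentioned above concrete, the statements below sketch the kinds of settings involved (Parquet with Snappy compression, intermediate-data compression, reducer count, map-side and skew joins, and Spark SQL shuffle parallelism). The table name and the specific values are illustrative assumptions, not the exact parameters used in the experiments.

    -- Store a table as Parquet with Snappy compression (table name is hypothetical)
    CREATE TABLE lineitem_parquet
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
    AS SELECT * FROM lineitem;

    -- Hive: compress intermediate (map output) data with Snappy
    SET hive.exec.compress.intermediate = true;
    SET mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

    -- Hive: control the number of reduce tasks, enable map-side and skew joins
    SET hive.exec.reducers.bytes.per.reducer = 268435456;
    SET hive.auto.convert.join = true;
    SET hive.optimize.skewjoin = true;

    -- Spark SQL: raise shuffle parallelism and the broadcast-join threshold
    SET spark.sql.shuffle.partitions = 400;
    SET spark.sql.autoBroadcastJoinThreshold = 104857600;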

