Performance Comparison of Hive, Impala and Spark SQL

Xiaopeng Li, Wenli Zhou

doi:10.1109/ihmsc.2015.95

Abstract

Fast queries over big data are important for mining valuable information and improving system performance. To this end, research institutions and internet companies have developed three kinds of script query tools: Hive, based on MapReduce; Spark SQL, based on RDDs; and Impala, based on a distributed query engine. In this paper, we compare these three query tools in several ways. First, we analyze the impact of file format on query time and conclude that compression reduces the amount of data and thereby shortens query time. RCFile compressed with Snappy is the best choice for Hive, and Parquet is the best choice for Impala. Moreover, Impala has the fastest query speed of the three tools. Second, we discuss the impact of file format on CPU and memory usage. Impala with Parquet consumes the least CPU and memory. Because Impala with Parquet performs well on both counts, we evaluate this combination further and find that Parquet files generated by different query tools yield different performance. Finally, we find that Impala is fastest when it queries Parquet files created by Spark SQL. Consequently, Impala is more suitable than Hive or Spark SQL for fast queries.

Full Text