Abstract

Fast querying of big data is important for mining valuable information that can improve system performance. To this end, research institutions and internet companies have developed three types of script query tools: Hive, based on MapReduce; Spark SQL, based on RDDs; and Impala, based on a distributed query engine. In this paper, we compare these three query tools in several ways. First, we analyze the impact of the file format on query time and conclude that compression reduces the amount of data and thereby shortens query time. RCFile compressed with Snappy is the best choice for Hive, and Parquet is the best choice for Impala. Moreover, Impala has the fastest query speed of the three tools. Second, we discuss the impact of the file format on CPU and memory usage. Impala with Parquet consumes the least CPU and memory. Because Impala with Parquet performs well, we evaluate this combination further and find that Parquet files generated by different query tools perform differently. Finally, we find that Impala is fastest when querying Parquet files created by Spark SQL. Consequently, Impala is better suited to fast queries than Hive or Spark SQL.
