Performance analysis of shared-nothing SQL-on-Hadoop frameworks based on columnar database systems

Awais Mehmood, Yasmeen Khaliq, Muhammad Khaleeq, Muhammad Iqbal

doi:10.1109/intech.2016.7845097

Abstract

Hadoop is a Java-based programming framework that enterprises use to manage and analyze large-scale data originating from heterogeneous sources. To support such analysis, various SQL-on-Hadoop systems are in use because they are easy to adopt for people familiar with SQL. This study presents a comparative analysis of SQL-on-Hadoop systems, comparing their performance under various hardware and software parameters. The performance of three SQL-on-Hadoop systems, namely Hive, Impala, and Tajo, is analyzed using the TPC-H benchmark. The experiments use two major and widely used file formats for columnar databases, ORC and Parquet. This work also investigates the characteristics of the ORC and Parquet file formats and their performance impact on Hive, Impala, and Tajo. The results show that Impala outperforms Hive and Tajo by 5x to 10x when the workload dataset fits in its memory.

Full Text