Abstract

Hadoop is a Java-based programming framework that enterprises use to manage and analyze large-scale data originating from heterogeneous sources. To support this analysis, various SQL-on-Hadoop systems are in use because they are easy to adopt for people already familiar with SQL. This study presents a comparative analysis of SQL-on-Hadoop systems, comparing their performance under various hardware and software parameters. The performance of three such systems, namely Hive, Impala, and Tajo, is analyzed using the TPC-H benchmark. The experiments use two major, widely used columnar file formats, ORC and Parquet. This work also investigates the characteristics of the ORC and Parquet file formats and their performance impact on Hive, Impala, and Tajo. The results show that Impala outperforms Hive and Tajo by 5x to 10x when the workload dataset fits in its memory.

