Abstract

Big Data systems have become increasingly popular for handling the massive amounts of data that are continuously generated in our digital world. Although the Hadoop framework pioneered the area of Big Data processing systems, it has clear performance limitations when processing massive amounts of structured data. In addition, many users of Big Data systems struggle in practice with the APIs and low-level programming abstractions of these systems. They would prefer to use SQL (in which they are more proficient) as a high-level declarative language to express their tasks, leaving all execution optimization details to the backend engine. Several systems have therefore been designed and implemented to tackle these challenges by providing scalable query execution engines that process massive structured data while supporting SQL interfaces. In this article, we present an extensive experimental study of four popular systems in this domain, namely Apache Hive, Spark SQL, Apache Impala, and PrestoDB. In particular, we report and analyze the performance characteristics of these systems using three different benchmarks, namely TPC-H, TPC-DS, and TPCx-BB. Finally, we report a set of insights and important lessons learned from conducting our experiments.
