Abstract
Apache Spark is one of the most widely used frameworks for cluster computing, in which data are processed in parallel across a cluster of unreliable commodity machines. It processes large volumes of data considerably faster than the MapReduce framework. To provide optimized and fast SQL query processing, Apache Spark includes a module named Spark SQL, which allows users to combine relational processing and functional programming in one place. Spark SQL performs many optimizations by leveraging the Spark core through its query optimizer, called the Catalyst optimizer, which applies a large set of rules to optimize queries for efficient execution. In this paper, we discuss a scenario in which the Catalyst optimizer is unable to optimize a query competently for a specific case, leading to inefficient memory usage and increased query execution time in Spark SQL. To address this issue, we propose a solution that optimizes the query further, significantly reducing the time and memory consumed by the shuffling process.
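To make the setting concrete, the following is a minimal sketch, not taken from the paper, of how Catalyst's optimization stages and shuffle operators can be inspected in Spark SQL. The table and column names (orders, customers, custId, amount) are hypothetical stand-ins for the paper's workload; `explain(true)` prints the parsed, analyzed, Catalyst-optimized, and physical plans, where shuffles appear as Exchange operators.

```scala
import org.apache.spark.sql.SparkSession

object CatalystPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-plan-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical tables standing in for the paper's workload.
    val orders = Seq((1, 100), (2, 200), (1, 300)).toDF("custId", "amount")
    val customers = Seq((1, "Alice"), (2, "Bob")).toDF("custId", "name")
    orders.createOrReplaceTempView("orders")
    customers.createOrReplaceTempView("customers")

    // A join followed by an aggregation; each can trigger a shuffle
    // (an Exchange node) in the physical plan that Catalyst produces.
    val result = spark.sql(
      """SELECT c.name, SUM(o.amount) AS total
        |FROM orders o JOIN customers c ON o.custId = c.custId
        |GROUP BY c.name""".stripMargin)

    // Prints all four plan stages; shuffle cost shows up as Exchange
    // operators in the physical plan.
    result.explain(true)

    spark.stop()
  }
}
```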