Abstract

Join operations in Spark SQL consume large amounts of memory, network bandwidth, and disk I/O, so they strongly affect overall Spark SQL performance. Improving Join performance has therefore become an urgent problem. In its latest release, Spark SQL uses Catalyst as its query optimizer. Catalyst implements both a rule-based optimization (RBO) strategy and a cost-based optimization (CBO) strategy. The Catalyst CBO module has some shortcomings. First, it does not fully account for Spark's in-memory computing characteristics. Second, its cost estimation for network transfer and disk I/O is insufficient. To solve these problems and improve the performance of Spark SQL, this study proposes a cost estimation model for the Join operator that considers cost from four aspects: time complexity, space complexity, network transfer, and disk I/O. The most cost-efficient plan is then selected, using the analytic hierarchy process (AHP), from the equivalent physical plans generated by Spark SQL. The experimental results show that the total amount of network transmission is reduced and processor utilization is increased, and thus the performance of Spark SQL is improved.
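As a rough illustration of how an AHP-based selection over the four cost aspects could work, the sketch below derives weights from a pairwise comparison matrix and picks the plan with the lowest weighted cost. It is not the paper's model: the comparison matrix, the normalized plan costs, and the names `ahp_weights` and `pick_plan` are all hypothetical, and the row geometric-mean method is only one common way to approximate AHP priorities.

```python
import math

# Hypothetical pairwise comparison matrix over the four cost aspects.
# Entry [i][j] says how much more important aspect i is than aspect j.
# The values are illustrative only and do not come from the paper.
CRITERIA = ["time", "space", "network", "disk_io"]
PAIRWISE = [
    [1.0, 2.0, 0.5, 1.0],
    [0.5, 1.0, 0.25, 0.5],
    [2.0, 4.0, 1.0, 2.0],
    [1.0, 2.0, 0.5, 1.0],
]


def ahp_weights(matrix):
    """Approximate AHP priority weights by the row geometric-mean method."""
    geo_means = [math.prod(row) ** (1.0 / len(row)) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def pick_plan(plans, weights):
    """Return the name of the plan with the lowest weighted cost.

    `plans` maps a plan name to costs already normalized to [0, 1],
    listed in the same order as CRITERIA.
    """
    def score(costs):
        return sum(w * c for w, c in zip(weights, costs))

    return min(plans, key=lambda name: score(plans[name]))


WEIGHTS = ahp_weights(PAIRWISE)

# Made-up normalized costs for three candidate physical plans.
PLANS = {
    "broadcast_hash_join": [0.2, 0.6, 0.3, 0.1],
    "shuffle_hash_join": [0.4, 0.5, 0.7, 0.4],
    "sort_merge_join": [0.6, 0.3, 0.7, 0.5],
}
BEST = pick_plan(PLANS, WEIGHTS)
```

With this made-up matrix, network transfer receives the largest weight, so plans that move less data across the network score better. A real comparison matrix should also be checked for consistency before its weights are trusted.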

Highlights

  • Spark SQL analyzes SQL and generates execution plans using the Catalyst query optimizer

  • Catalyst analyzes a SQL statement from an abstract syntax tree (AST) into a logical plan and optimizes the logical plan at the same time

  • Shuffle-Hash-Join relies on the Spark Shuffle and HashJoin operations; Broadcast-Hash-Join relies only on the HashJoin operation; Sort-Merge-Join is a sort-based join operation; Broadcast-Nested-Loop-Join is a Join operation that applies to LeftJoin, RightJoin and OuterJoin
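
To make the difference between the hash-based and sort-based strategies concrete, here is a minimal single-machine sketch of an inner equi-join done both ways. It shows only the local algorithms, not Spark's implementation. In Spark, the data would first be broadcast or shuffled so that matching keys meet on the same executor. The function names and sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows):
    """Inner equi-join: hash the (smaller) build side, then probe it."""
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    # Emit (key, build_value, probe_value) for every matching pair.
    return [(key, b, p) for key, p in probe_rows for b in table.get(key, [])]


def sort_merge_join(left_rows, right_rows):
    """Inner equi-join: sort both sides by key, then merge matching runs."""
    left = sorted(left_rows, key=lambda r: r[0])
    right = sorted(right_rows, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against the whole right run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


# Illustrative rows of the form (join_key, payload).
SMALL = [(1, "a"), (2, "b"), (2, "c")]
LARGE = [(2, "x"), (3, "y"), (1, "z"), (2, "w")]
```

Both functions return the same set of joined rows. The hash join needs memory for the build-side table, while the sort-merge join pays for sorting but then streams through both inputs, which is the kind of trade-off a cost model has to weigh.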


Summary

Introduction

Spark SQL analyzes SQL and generates execution plans using the Catalyst query optimizer. The Join operator is one of the most complex operators in Spark SQL. Cost-based optimization first requires statistics about the corresponding tables and columns. Using these statistics, we can estimate the cost of the different operators in a SQL statement and select the optimal execution plan according to the estimates. A cost-based optimization strategy was implemented in the Spark 2.2 release, but its cost estimation model needs improvement, especially for complex operators such as the Join operator. 1. Cost estimation models for each kind of Join operator implementation were proposed. 2. A physical plan selection strategy based on the cost estimation model was proposed.
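
The way table and column statistics feed a cost estimate can be sketched as follows. This is not the paper's model: it uses the textbook equi-join cardinality estimate (row counts multiplied, divided by the larger distinct count of the join key) plus two deliberately simple network-volume estimates. The `ColumnStats` class, the function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    row_count: int        # number of rows in the table
    distinct_count: int   # number of distinct values in the join column
    avg_row_bytes: int    # average row width in bytes


def estimate_join_rows(left, right):
    """Textbook equi-join estimate: |L| * |R| / max(ndv(L.k), ndv(R.k))."""
    denom = max(left.distinct_count, right.distinct_count, 1)
    return left.row_count * right.row_count // denom


def estimate_shuffle_bytes(left, right):
    """Bytes moved if both sides are repartitioned by the join key."""
    return left.row_count * left.avg_row_bytes + right.row_count * right.avg_row_bytes


def estimate_broadcast_bytes(small, num_executors):
    """Bytes moved if the small side is copied to every executor."""
    return small.row_count * small.avg_row_bytes * num_executors


# Made-up statistics for two tables joined on a customer id.
ORDERS = ColumnStats(row_count=1_000_000, distinct_count=10_000, avg_row_bytes=100)
CUSTOMERS = ColumnStats(row_count=10_000, distinct_count=10_000, avg_row_bytes=200)

JOIN_ROWS = estimate_join_rows(ORDERS, CUSTOMERS)
SHUFFLE_BYTES = estimate_shuffle_bytes(ORDERS, CUSTOMERS)
BROADCAST_BYTES = estimate_broadcast_bytes(CUSTOMERS, num_executors=8)
```

With these made-up figures, broadcasting the small table moves far fewer bytes than shuffling both sides, so a network-aware cost model would favor a broadcast-based plan. In Spark itself, such statistics are collected with `ANALYZE TABLE ... COMPUTE STATISTICS` and used when `spark.sql.cbo.enabled` is set.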

Related Works
Related Definitions
Cost Model of Join Operators
Cost Model of Shuffle Operator
Cost Model of HashJoin Operator
Cost Model of Broadcast-Hash-Join
Cost Model of Shuffle-Hash-Join
Cost Model of Sort-Merge-Join
Cost Estimation Method using AHP
Optimal Physical Plan Choice Algorithm
Experimental Results
Conclusions
