The Optimization of Cost-Model for Join Operator on Spark SQL Platform

Xin Lian, Tianyu Zhang, J Heled, A Yuan

doi:10.1051/matecconf/201817301015

Open Access

https://doi.org/10.1051/matecconf/201817301015

Abstract

When Spark SQL executes a Join operation, Spark consumes large amounts of memory, network, and disk I/O resources, so the Join operation strongly affects Spark SQL performance. Improving Join performance has therefore become an urgent problem. In its latest release, Spark SQL uses Catalyst as its query optimizer, and Catalyst implements both a rule-based optimization strategy (RBO) and a cost-based optimization strategy (CBO). The Catalyst CBO module has two problems. First, it does not fully consider Spark's in-memory computing characteristics. Second, its cost estimation for network transfer and disk I/O is insufficient. To solve these problems and improve the performance of Spark SQL, this study proposes a cost estimation model for the Join operator that accounts for cost from four aspects: time complexity, space complexity, network transfer, and disk I/O. The most cost-efficient plan is then selected, using the analytic hierarchy process (AHP), from the equivalent physical plans generated by Spark SQL. The experimental results show that the total amount of network transmission is reduced and processor utilization is increased, which improves the performance of Spark SQL.
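To make the selection step concrete, the following is a minimal sketch of how AHP-style weights over the four cost aspects could rank candidate physical plans. It is not the paper's model: the pairwise comparison matrix, the per-plan cost figures, and the names `ahp_weights`, `normalize_costs` and `pick_plan` are all hypothetical, and the priority vector is approximated with row geometric means rather than an exact eigenvector computation.

```python
import math

# The four cost aspects named in the abstract.
CRITERIA = ["time", "space", "network", "disk_io"]

# Hypothetical reciprocal pairwise comparison matrix (row vs. column),
# e.g. network transfer is judged 3x as important as time complexity.
PAIRWISE = [
    [1.0, 2.0, 1 / 3, 1 / 2],  # time
    [1 / 2, 1.0, 1 / 4, 1 / 3],  # space
    [3.0, 4.0, 1.0, 2.0],  # network
    [2.0, 3.0, 1 / 2, 1.0],  # disk_io
]

# Hypothetical raw cost estimates per plan, in arbitrary units.
PLANS = {
    "BroadcastHashJoin": {"time": 40, "space": 80, "network": 10, "disk_io": 5},
    "ShuffleHashJoin": {"time": 60, "space": 50, "network": 70, "disk_io": 30},
    "SortMergeJoin": {"time": 90, "space": 30, "network": 70, "disk_io": 60},
}


def ahp_weights(pairwise):
    """Approximate the AHP priority vector with normalized row geometric means."""
    n = len(pairwise)
    geo = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo)
    return [g / total for g in geo]


def normalize_costs(plans):
    """Divide each criterion by its maximum over all plans so units are comparable."""
    maxima = [max(p[c] for p in plans.values()) or 1.0 for c in CRITERIA]
    return {
        name: [p[c] / m for c, m in zip(CRITERIA, maxima)]
        for name, p in plans.items()
    }


def pick_plan(plans, weights):
    """Return the plan with the lowest weighted cost, plus all scores."""
    scaled = normalize_costs(plans)
    scores = {
        name: sum(w * v for w, v in zip(weights, values))
        for name, values in scaled.items()
    }
    return min(scores, key=scores.get), scores


weights = ahp_weights(PAIRWISE)
best, scores = pick_plan(PLANS, weights)
print(dict(zip(CRITERIA, weights)))
print(best, scores)
```

With these made-up inputs the network criterion receives the largest weight, so the plan with the least network traffic ranks best. A different comparison matrix could reverse the ranking, which is the point of making the weights explicit.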

Highlights

Spark SQL analyzes SQL and generates execution plans by using the Catalyst query optimizer.
Catalyst analyzes a SQL statement from an abstract syntax tree (AST) into a logical plan and optimizes the logical plan at the same time.
Shuffle-Hash-Join relies on the Spark Shuffle and HashJoin operations; Broadcast-Hash-Join relies only on the HashJoin operation; Sort-Merge-Join is a sort-based join operation; Broadcast-Nested-Loop-Join is a Join operation that applies to LeftJoin, RightJoin and OuterJoin.
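The HashJoin operation that both Shuffle-Hash-Join and Broadcast-Hash-Join rely on can be illustrated with a generic build-and-probe hash join. This is a single-process sketch in plain Python, not Spark's implementation; the function name `hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join: hash the (smaller) build side, then stream the probe side."""
    # Build phase: group build-side rows by their join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: look up each probe-side row and emit merged matches.
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            yield {**match, **row}


# Hypothetical sample data: customers is the small build side.
customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bo"}]
orders = [{"oid": 10, "cid": 1}, {"oid": 11, "cid": 1}, {"oid": 12, "cid": 3}]

joined = list(hash_join(customers, orders, "cid", "cid"))
print(joined)
```

In a distributed setting the difference between the two strategies lies in how rows reach the build phase: a broadcast join ships the whole small table to every executor, while a shuffle join first repartitions both sides by the join key.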

Summary

Introduction

Spark SQL analyzes SQL and generates execution plans by using the Catalyst query optimizer. The Join operator is one of the most complex operators in Spark SQL. Cost-based optimization first requires statistics about the corresponding tables and columns. Using these statistics, we can estimate the cost of the different operators in a SQL statement and select the optimal execution plan according to the estimation result. A cost-based optimization strategy was implemented in the Spark 2.2 release, but its cost estimation model needs to be improved, especially for complex operators such as the Join operator. This work proposes: 1. Cost estimation models for each kind of Join operator implementation. 2. A physical plan selection strategy based on the cost estimation model.
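As an illustration of how table and column statistics feed a cost estimate, the sketch below applies the common textbook rule for an inner equi-join, where the output row count is approximately |R| x |S| / max(ndv(R.k), ndv(S.k)). This is a generic estimate under stated assumptions, not the model proposed in the paper; the `TableStats` fields, the function name and the sample numbers are hypothetical. In Spark itself, such statistics are typically collected with `ANALYZE TABLE ... COMPUTE STATISTICS`, and the cost-based optimizer is switched on through the `spark.sql.cbo.enabled` setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    row_count: int      # number of rows in the table
    avg_row_bytes: int  # average row size in bytes
    distinct_keys: int  # number of distinct values in the join column


def estimate_inner_join(left, right):
    """Return (estimated rows, estimated bytes) for an inner equi-join."""
    # Textbook rule: divide the cross-product size by the larger distinct count.
    ndv = max(left.distinct_keys, right.distinct_keys, 1)
    rows = left.row_count * right.row_count // ndv
    # Assume an output row carries the columns of both inputs.
    row_bytes = left.avg_row_bytes + right.avg_row_bytes
    return rows, rows * row_bytes


# Hypothetical statistics for a fact table and a dimension table.
orders = TableStats(row_count=1_000_000, avg_row_bytes=100, distinct_keys=50_000)
customers = TableStats(row_count=50_000, avg_row_bytes=200, distinct_keys=50_000)

est_rows, est_bytes = estimate_inner_join(orders, customers)
print(est_rows, est_bytes)
```

Estimates like these are the inputs an optimizer needs before it can compare alternatives, for example deciding whether one side is small enough to broadcast.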

Related Works

Related Definitions

Cost Model of Join Operators

Cost Model of Shuffle Operator

Cost Model of HashJoin Operator

Cost Model of Broadcast-Hash-Join

Cost Model of Shuffle-Hash-Join

Cost Model of Sort-Merge-Join

Cost Estimation Method using AHP

Optimal Physical Plan Choice Algorithm

Experimental Results

Conclusions

Journal: MATEC Web of Conferences
Publication Date: Jan 1, 2018
Citations: 2
License type: CC BY 4.0
