Abstract

The multi-way join query has attracted considerable attention from the research community because of its importance in many big data analytics applications. For multi-round multi-way join algorithms on distributed data-parallel platforms, the main bottleneck is the huge communication cost of shuffling large intermediate results over the network. The one-round multi-way join algorithm processes the join query in a single communication round, which can significantly reduce the communication cost of complex queries, including cyclic queries. However, the one-round method is not always superior to the multi-round method, because the intermediate results of the multi-round method may be much smaller than the data shuffled in the one-round method. It is therefore challenging to choose the best multi-way join algorithm in practice. To solve this problem, we present AutoMJ, an efficient framework for multi-way join queries. In AutoMJ, we propose a novel automatic join strategy selection model based on size estimation of intermediate join results: AutoMJ chooses the multi-way join strategy with the minimal shuffle data size. In addition, we propose an optimized HyperCube algorithm for the one-round multi-way join. We have implemented a prototype of AutoMJ on the widely used distributed data-parallel platform Apache Spark. Experiments show that, for multi-way join queries with large intermediate results, the one-round join strategy can be 1.2× to 159.3× faster than the multi-round join strategy built into Spark SQL. In contrast, for queries with small intermediate results, the multi-round join strategy is 2.1× to 6.2× faster than the one-round method. Experiments also show that the relative error of size estimation can be within 0.1 for the Twitter dataset and 0.25 for the Wikidata dataset. Furthermore, experiments verify that the automatic join strategy selection model is effective at choosing the optimal multi-way join algorithm.
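
The selection rule described above (pick the strategy with the smaller estimated shuffle size) can be illustrated with a minimal sketch. This is not AutoMJ's implementation: the function names, the input layout, and both cost formulas are assumptions made for illustration. The one-round cost uses the textbook HyperCube replication rule (a tuple is copied once for every combination of shares on attributes its relation lacks), and the multi-round cost assumes each input relation and each intermediate result is shuffled exactly once. The intermediate sizes are taken as given estimates.

```python
from math import prod

def hypercube_shuffle_size(relations, shares):
    """Estimated tuples shuffled by a one-round HyperCube join (assumed model).

    relations: dict name -> (tuple of attribute names, relation size)
    shares:    dict attribute -> number of partitions along that attribute
    """
    total = 0
    for attrs, size in relations.values():
        # A tuple is replicated along every attribute its relation does not contain.
        replication = prod(s for a, s in shares.items() if a not in attrs)
        total += size * replication
    return total

def multi_round_shuffle_size(relations, intermediate_sizes):
    """Estimated tuples shuffled by a multi-round plan (assumed model):
    every input and every estimated intermediate result is shuffled once."""
    return sum(size for _, size in relations.values()) + sum(intermediate_sizes)

def choose_strategy(relations, shares, intermediate_sizes):
    """Return the strategy name with the smaller estimated shuffle size."""
    one_round = hypercube_shuffle_size(relations, shares)
    multi_round = multi_round_shuffle_size(relations, intermediate_sizes)
    if one_round <= multi_round:
        return "one-round", one_round
    return "multi-round", multi_round

# Hypothetical triangle query R(a,b) JOIN S(b,c) JOIN T(a,c) on 2*2*2 = 8 workers.
triangle = {
    "R": (("a", "b"), 1000),
    "S": (("b", "c"), 1000),
    "T": (("a", "c"), 1000),
}
shares = {"a": 2, "b": 2, "c": 2}

large = choose_strategy(triangle, shares, intermediate_sizes=[50000])
small = choose_strategy(triangle, shares, intermediate_sizes=[500])
```

With these made-up numbers, a large estimated intermediate result favors the one-round plan and a small one favors the multi-round plan, which mirrors the trade-off the abstract describes.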
