Query Optimization Algorithm of Replication Join Based on Sampling Partition

Xin Lü, Junchao Yang, Jiao Yuan, Kun Fu, Xun Wang, Ke Yang

doi:10.1088/1742-6596/1693/1/012074

Abstract

To address the low efficiency of join queries in the traditional MapReduce partition join algorithm under data skew, a replication join optimization algorithm based on sampling partition is proposed. Using sampled statistics of the join attribute, the algorithm divides each dataset participating in the join into a skewed data subset and a non-skewed data subset, and processes the join on each separately to optimize query performance. For the non-skewed subsets, an improved consistent hash function partitions the data so that the join processing load is balanced across nodes. For the skewed subsets, the smaller skewed subset is distributed to every node, while the larger skewed subset is partitioned on its non-skewed fields; the skewed subsets are then joined in the Reduce stage. Experiments show that the algorithm improves join query performance under different data skew rates and achieves efficient join processing on large datasets.
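The flow described in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation: the names `find_skewed_keys`, `HashRing`, and `skew_aware_join`, the sampling rate, and the skew threshold are all hypothetical, rows are assumed to be `(join_key, value)` pairs, and an ordinary consistent hash ring with virtual nodes stands in for the paper's improved hash function, whose details the abstract does not give.

```python
import hashlib
import random
from bisect import bisect_right
from collections import Counter, defaultdict


def stable_hash(value):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(str(value).encode("utf-8")).hexdigest(), 16)


def find_skewed_keys(rows, sample_rate, threshold, seed=0):
    """Sample join keys and flag those whose sampled share exceeds threshold."""
    rng = random.Random(seed)
    sample = [key for key, _ in rows if rng.random() < sample_rate]
    if not sample:
        return set()
    counts = Counter(sample)
    return {key for key, n in counts.items() if n / len(sample) > threshold}


class HashRing:
    """Plain consistent hashing with virtual nodes (an assumed stand-in)."""

    def __init__(self, nodes, vnodes=64):
        ring = sorted((stable_hash(f"{n}#{v}"), n)
                      for n in range(nodes) for v in range(vnodes))
        self._points = [point for point, _ in ring]
        self._owners = [owner for _, owner in ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect_right(self._points, stable_hash(key)) % len(self._points)
        return self._owners[idx]


def skew_aware_join(left, right, nodes=4, sample_rate=0.5, threshold=0.2):
    skewed = (find_skewed_keys(left, sample_rate, threshold)
              | find_skewed_keys(right, sample_rate, threshold))
    ring = HashRing(nodes)
    # The side with fewer skewed rows is replicated; the other is partitioned.
    left_skew = sum(1 for key, _ in left if key in skewed)
    right_skew = sum(1 for key, _ in right if key in skewed)
    replicate_left = left_skew <= right_skew

    def route(rows, replicate):
        parts = defaultdict(list)
        for key, value in rows:
            if key not in skewed:
                targets = [ring.node_for(key)]            # non-skewed: hash ring
            elif replicate:
                targets = range(nodes)                    # smaller skewed subset
            else:
                targets = [stable_hash(value) % nodes]    # partition on non-key field
            for node in targets:
                parts[node].append((key, value))
        return parts

    left_parts = route(left, replicate_left)
    right_parts = route(right, not replicate_left)

    # "Reduce" stage: each node runs a local hash join on what it received.
    result = []
    for node in range(nodes):
        index = defaultdict(list)
        for key, value in right_parts[node]:
            index[key].append(value)
        for key, lvalue in left_parts[node]:
            for rvalue in index.get(key, []):
                result.append((key, lvalue, rvalue))
    return result
```

The join result does not depend on which keys the sample happens to flag, because both inputs are routed with the same skewed-key set; the sampling quality affects only how evenly the load spreads across nodes.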

Full Text