Abstract

To address the low efficiency of join queries under data skew in the traditional MapReduce partition join algorithm, a replication join optimization algorithm based on sampling partition is proposed. Using sampled statistics of the join attribute, the algorithm divides each dataset participating in the join into a skewed data subset and a non-skewed data subset, and processes the join query for each separately to optimize query performance. For join queries over the non-skewed subsets, an improved consistent hash function partitions the data so that the join processing load is balanced across nodes. For join queries over the skewed subsets, the smaller skewed subset is distributed to every node, and the larger skewed subset is partitioned on its non-skewed fields; these skewed subsets are then joined in the Reduce stage. Experiments show that the algorithm improves join query performance under different data skew rates and achieves efficient join query processing on large datasets.
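
The flow described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the algorithm as implemented in the paper: the names `find_skewed_keys`, `route`, and `sampled_replication_join`, the sampling size, and the frequency threshold are all hypothetical, and a plain CRC32 modulo partitioner stands in for the improved consistent hash function, whose details the abstract does not give. Rows are modeled as `(join_key, payload)` pairs, and the payload plays the role of the non-skewed field.

```python
import random
import zlib
from collections import Counter, defaultdict


def stable_hash(value):
    # Deterministic hash (Python's built-in hash() is salted for strings).
    return zlib.crc32(repr(value).encode("utf-8"))


def find_skewed_keys(rows, sample_size, threshold, seed=0):
    # Assumption: a key is "skewed" if its share of the sample exceeds threshold.
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    counts = Counter(key for key, _ in sample)
    return {key for key, n in counts.items() if n / len(sample) > threshold}


def route(r_rows, s_rows, skewed, num_reducers):
    # "Map" side: decide which simulated reducer(s) receive each row.
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    r_skew = [row for row in r_rows if row[0] in skewed]
    s_skew = [row for row in s_rows if row[0] in skewed]

    # Non-skewed rows: partition both sides on the join key.
    # (Stand-in for the paper's improved consistent hash function.)
    for rows, parts in ((r_rows, r_parts), (s_rows, s_parts)):
        for row in rows:
            if row[0] not in skewed:
                parts[stable_hash(row[0]) % num_reducers].append(row)

    # Skewed rows: replicate the smaller side, spread the larger side
    # by a non-skewed field (here, the payload).
    if len(r_skew) <= len(s_skew):
        small, small_parts, large, large_parts = r_skew, r_parts, s_skew, s_parts
    else:
        small, small_parts, large, large_parts = s_skew, s_parts, r_skew, r_parts
    for row in small:
        for reducer in range(num_reducers):
            small_parts[reducer].append(row)
    for row in large:
        large_parts[stable_hash(row[1]) % num_reducers].append(row)
    return r_parts, s_parts


def sampled_replication_join(r_rows, s_rows, num_reducers,
                             sample_size=1000, threshold=0.1):
    # Assumption: skewed keys are detected on either input relation.
    skewed = (find_skewed_keys(r_rows, sample_size, threshold)
              | find_skewed_keys(s_rows, sample_size, threshold))
    r_parts, s_parts = route(r_rows, s_rows, skewed, num_reducers)

    # "Reduce" side: each reducer runs a local hash join on its partition.
    out = []
    for reducer in range(num_reducers):
        index = defaultdict(list)
        for key, r_payload in r_parts[reducer]:
            index[key].append(r_payload)
        for key, s_payload in s_parts[reducer]:
            for r_payload in index.get(key, []):
                out.append((key, r_payload, s_payload))
    return out
```

Because each row of the larger skewed subset lands on exactly one reducer while the smaller skewed subset is present on all of them, every matching pair is produced exactly once, and the work for a hot key is spread over all reducers instead of being concentrated on the single reducer that a key-based partitioner would choose.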
