Abstract

Join operations on data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters such as Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins performed over the network, a novel MapReduce join algorithm with dynamic partition strategies, called dynamic partition join (DPJ), is proposed. DPJ leverages the changes of entropy in the partitions of data sets during the Map and Reduce stages to revise the logical partitions, changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the proposed entropy-based measures capture the entropy changes of join operations. Moreover, the DPJ variants achieved lower entropy than the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.

Highlights

  • Join algorithms between two data sets in stand-alone relational databases have been optimized for years; the increasing needs of big data analysis result in the emergence of various types of parallel join algorithms [1]

  • To empirically evaluate the aforementioned join algorithms, several experiments are conducted in a Hadoop cluster to join two data sets with different settings of sizes. This section firstly introduces the detailed environment settings and the data set configurations in our experiments. Then, eight types of join algorithms on Hadoop have been defined and used in performance evaluation

  • Our experiments are performed in the distributed Hadoop architecture. The cluster consists of one master node and 14 data nodes to support parallel computing


Summary

Introduction

Join algorithms between two data sets in stand-alone relational databases have been optimized for years; the increasing needs of big data analysis have led to the emergence of various types of parallel join algorithms [1]. In the era of big data, such join operations on large data sets should be performed in existing distributed computing architectures, such as Apache Hadoop; that is, efficient joins must follow the scheme of the programming model and require the extended revision of conventional joins for these architectures [2]. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters such as Hadoop. Since Shannon's theory indicates that information is a measurable commodity, the MapReduce stages and the transmission of data sets over Hadoop clusters can be treated as a message channel between senders (mappers) and receivers (reducers). A data set to be joined is considered a memoryless message source that contains a set of messages, each with its own probability, to be sent to the receivers.
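To make this message-source framing concrete, the sketch below (ours, not the authors' implementation; the function name and sample data are hypothetical) estimates the Shannon entropy H = -Σ p(k) log2 p(k) of a partition's join-key distribution, where p(k) is the relative frequency of key k in the partition. A skewed partition yields lower entropy, which is the kind of change the entropy-based measures are intended to capture across the Map and Reduce stages.

```python
from collections import Counter
from math import log2

def key_entropy(records, key_index=0):
    """Shannon entropy (in bits) of the join-key distribution of one partition.

    The partition is treated as a memoryless message source whose messages
    are the join keys; p(k) is estimated from each key's relative frequency.
    """
    counts = Counter(rec[key_index] for rec in records)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical partitions: uniform keys carry more information per record.
uniform = [(f"k{i}", "v") for i in range(8)]   # 8 distinct keys -> 3.0 bits
skewed = [("k0", "v")] * 7 + [("k1", "v")]     # heavily skewed  -> ~0.54 bits
print(key_entropy(uniform), key_entropy(skewed))
```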
