An Optimal Reduce Placement Algorithm for Data Skew Based on Sampling

Zhuo Tang, Rui Li, Wen Ma, Keqin Li, Kenli Li

doi:10.1007/978-3-319-29006-5_8

Abstract

Because of frequent disk I/O and large data transfers across racks and physical nodes, intermediate data communication has become the biggest performance bottleneck in most running Hadoop systems. This paper proposes a reduce placement algorithm called CORP that schedules related map and reduce tasks on nearby nodes, clusters, or racks to improve data locality. Since the number of keys cannot be counted until the input data have been processed by map tasks, the paper first provides a sampling algorithm based on reservoir sampling to estimate the distribution of keys in the intermediate data. By calculating the distance and cost matrices for cross-node communication, related map and reduce tasks can then be scheduled onto relatively close physical nodes. Experimental results show that CORP not only improves the balance of reduce tasks effectively, but also decreases job execution time because of the lower internal data communication.
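As a rough illustration of the sampling idea mentioned in the abstract, the following is a minimal sketch of classic reservoir sampling (Algorithm R) used to estimate key frequencies from a stream of unknown length. It is not the paper's sampling algorithm: the function names `reservoir_sample` and `estimate_key_distribution`, the sample size, and the toy key stream are all assumptions made for the example.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Classic Algorithm R: the stream length does not need to be known
    in advance, and only k items are held in memory at any time.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_key_distribution(keys, k, rng=None):
    """Estimate the relative frequency of each key from a reservoir sample."""
    sample = reservoir_sample(keys, k, rng)
    if not sample:
        return {}
    counts = Counter(sample)
    return {key: count / len(sample) for key, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical skewed key stream: key "a" dominates.
    keys = ["a"] * 900 + ["b"] * 100
    print(estimate_key_distribution(keys, k=200, rng=random.Random(0)))
```

An estimate of this kind could indicate which keys are likely to produce the heaviest reduce partitions before all map output is available, which is the information a placement decision would need.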

Full Text