Improving MapReduce Performance via Heterogeneity-Load-Aware Partition Function

Huifeng Sun, Junliang Chen, Zhi Yang, Nan Yu, Chuanchang Liu, Zibin Zheng

doi:10.1109/cluster.2011.68

Abstract

MapReduce is an important programming model for large-scale data-intensive applications such as web indexing, scientific simulation, and data mining. Hadoop is a widely adopted open-source implementation of MapReduce. The partition function is an important component of Hadoop: it splits the outputs of map tasks into partitions that become the input data of reduce tasks. The default partition function distributes intermediate keys across reduces on the assumption that cluster nodes are homogeneous and perform work at roughly the same rate. In practice, however, this homogeneity assumption seldom holds, and cluster nodes usually perform work at different rates. In this paper, we design a heterogeneity-load-aware partition function named the proportional partition function (PPF). In addition to the dynamic load of cluster nodes, PPF considers differences in node capacity, such as CPU processing speed and disk write speed.
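The contrast between a uniform split and a proportional split can be illustrated with a minimal Python sketch. This is not the PPF algorithm itself, which the abstract does not specify. It only assumes that each reduce has a hypothetical capacity weight (derived, for instance, from CPU speed, disk write speed, and current load) and that keys are assigned in proportion to that weight. The names `stable_hash`, `hash_partition`, `proportional_partition`, and `weights` are illustrative and do not come from the paper or from Hadoop.

```python
import bisect
import zlib
from itertools import accumulate
from typing import Sequence


def stable_hash(key: str) -> int:
    """Deterministic non-negative hash (Python's built-in str hash is salted per process)."""
    return zlib.crc32(key.encode("utf-8"))


def hash_partition(key: str, num_reduces: int) -> int:
    """Uniform split: every reduce gets roughly the same share of keys."""
    return stable_hash(key) % num_reduces


def proportional_partition(key: str, weights: Sequence[float]) -> int:
    """Assign a key to a reduce with a share proportional to its weight.

    `weights` is an assumed per-reduce capacity score, not a quantity
    defined in the abstract.
    """
    bounds = list(accumulate(weights))  # cumulative weight boundaries
    # Map the hash to a point in [0, total weight), then find its interval.
    point = (stable_hash(key) % 10_000) / 10_000 * bounds[-1]
    return bisect.bisect_right(bounds, point)


if __name__ == "__main__":
    keys = [f"key-{i}" for i in range(20_000)]
    weights = [1.0, 1.0, 2.0]  # assumed: the third reduce is twice as capable
    counts = [0] * len(weights)
    for k in keys:
        counts[proportional_partition(k, weights)] += 1
    print(counts)
```

With equal weights the proportional split behaves like a uniform split; with unequal weights, reduces with a higher weight receive a correspondingly larger share of the intermediate keys.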

Full Text