Improving the Shuffle of Hadoop MapReduce

Jingui Li,Yue Ye,Xuelian Lin,Xiaolong Cui

doi:10.1109/cloudcom.2013.42

Abstract

As an efficient parallel computing system based on MapReduce model, Hadoop is widely used for large-scale data analysis such as data mining, machine learning and scientific simulation. However, there are still some performance problems in MapReduce, especially the situation in the shuffle phase. In order to solve these problems, in this paper, a lightweight individual shuffle service component with more efficient I/O policy was proposed rather than the existing shuffle phase in MapReduce. We also describe how to implement the shuffle service in three steps: extract shuffle from reduce task as a shuffle task, reconstruct the shuffle task as a service and improve I/O scheduling policy on Map sides. Furthermore both simulated experiments and MapReduce job comparative studies are conducted to evaluate the performance of our improvements. The result reveals that our approach can decrease the whole job's execution time and make full use of cluster resources.

Full Text