ESAMR: An Enhanced Self-Adaptive MapReduce Scheduling Algorithm

Xiaoyu Sun,Chen He,Ying Lu

doi:10.1109/icpads.2012.30

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Hadoop is an open-source implementation of Map Reduce, enjoying wide adoption, and is used not only for batch jobs but also for short jobs where low response time is critical. However, Hadoop's performance is currently limited by its default task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumption does not always hold. Longest Approximate Time to End (LATE) is a Map Reduce scheduling algorithm that takes heterogeneous environments into consideration. It, however, adopts a static method to compute the progress of tasks. As a result neither Hadoop default nor LATE schedulers perform well in a heterogeneous environment. Self-adaptive Map Reduce Scheduling Algorithm (SAMR) uses historical information to adjust stage weights of map and reduce tasks when estimating task execution times. However, SAMR does not consider the fact that for different types of jobs their map and reduce stage weights may be different. Even for the same type of jobs, different datasets may lead to different weights. To this end, we propose ESAMR: an Enhanced Self-Adaptive Map Reduce scheduling algorithm to improve the speculative re-execution of slow tasks in Map Reduce. In ESAMR, in order to identify slow tasks accurately, we differentiate historical stage weights information on each node and divide them into k clusters using a k-means clustering algorithm and when executing a job's tasks on a node, ESAMR classifies the tasks into one of the clusters and uses the cluster's weights to estimate the execution time of the job's tasks on the node. Experimental results show that among the aforementioned algorithms, ESAMR leads to the smallest error in task execution time estimation and identifies slow tasks most accurately.

Full Text