Abstract

Cloud computing has emerged as a new way of sharing resources. MapReduce has become the de facto standard for data-intensive parallel computation in the cloud, and Hadoop is an open-source framework that implements MapReduce on clusters of commodity hardware. An environment that mixes different generations of commodity hardware (nodes) introduces heterogeneity into the Hadoop environment, and such heterogeneity has become common in industry as well as in research centers. Hadoop's current implementation assumes that the nodes in the environment are homogeneous and distributes the workload evenly among them. This homogeneity assumption creates load imbalance among the nodes of a heterogeneous Hadoop environment, which in turn leads to stragglers: nodes that are available in the environment but whose performance is abysmal. This paper proposes a Historical Data Based Data Placement (HDBDP) policy that balances the workload among heterogeneous nodes according to their computing capabilities, in order to improve Map task data locality and reduce job turnaround time in a heterogeneous Hadoop environment. The approach introduces an agent that measures each node's computing capability from job history information and helps the NameNode decide the block count for each node. The proposed policy is evaluated on Hadoop's most popular benchmark suite, HiBench. Compared to Hadoop's default data placement policy and other policies, the proposed HDBDP policy reduces job turnaround time for several workloads by an average of 14–26% and improves Map task data locality by nearly 27% in a heterogeneous Hadoop environment.
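The capability-proportional placement idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the agent summarizes each node's computing capability as the inverse of its average historical Map task time, and that the NameNode splits the total block count in proportion to those capabilities. The function and variable names (compute_block_counts, avg_map_time_s) are hypothetical.

```python
# Minimal sketch of capability-proportional block placement (hypothetical names,
# not the paper's implementation). The agent is assumed to derive a per-node
# capability score from job-history Map task times; the NameNode then splits
# the total block count in proportion to those scores.

def compute_block_counts(avg_map_time_s, total_blocks):
    """avg_map_time_s: {node: average historical Map task time in seconds}
    Returns {node: number of blocks to place on that node}."""
    # A faster node (smaller average task time) gets a larger capability score.
    capability = {node: 1.0 / t for node, t in avg_map_time_s.items()}
    total_capability = sum(capability.values())

    # Proportional split, rounded down; any remainder goes to the fastest nodes.
    counts = {node: int(total_blocks * c / total_capability)
              for node, c in capability.items()}
    remainder = total_blocks - sum(counts.values())
    for node in sorted(capability, key=capability.get, reverse=True)[:remainder]:
        counts[node] += 1
    return counts


if __name__ == "__main__":
    # Example: three heterogeneous nodes with different historical Map task times.
    history = {"node-a": 20.0, "node-b": 40.0, "node-c": 80.0}
    print(compute_block_counts(history, 140))
    # node-a (the fastest) receives the largest share of the 140 blocks.
```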
