Abstract

MapReduce is an essential, recently proposed framework for distributed storage and parallel processing of large-scale data-intensive jobs. Hadoop's default scheduler assumes a homogeneous environment. This assumption does not always hold in practice, and it limits the performance of MapReduce. In heterogeneous environments, job completion times are not synchronized. Data locality essentially means moving computation closer to the input data for faster access. Fundamentally, MapReduce does not always consider heterogeneity from a data locality perspective. Improving data locality in the MapReduce framework is therefore important for enhancing the performance of heterogeneous Hadoop clusters. Learning-based scheduling decisions can potentially help to significantly reduce the overall job execution time. In this paper, we provide an overview of the taxonomy of MapReduce schedulers. We also propose a novel hybrid scheduler that uses a reinforcement-learning-based approach. The proposed scheduler identifies the true straggler tasks and schedules them on fast processing nodes in a heterogeneous Hadoop cluster, taking data locality into account.
