Abstract

Big data refers to large, complex datasets in many forms that require specialized computational platforms for analysis. Hadoop is one of the most popular frameworks for big data analytics. In Hadoop, a large job is split into many small tasks, which MapReduce distributes to worker nodes and runs in parallel to speed up computation. Improving throughput is therefore important. MapReduce jobs require quick responses from the worker nodes in order to complete before their deadlines. Existing Hadoop scheduling schemes, such as the FIFO, fair, and capacity schedulers, cannot guarantee the quick response needed to satisfy a given deadline. Thus, the Hadoop system needs to improve response time and completion time for heterogeneous MapReduce jobs. In this paper, we propose an efficient preemptive deadline-constrained scheduler based on least slack time and data locality. To achieve better task allocation and load balancing, we first analyze the task scheduling behavior of the Hadoop platform. Based on that analysis, we propose a novel preemptive approach that considers the remaining execution time of the running job when deciding whether to preempt it. The experimental results show that the proposed scheme significantly reduces job execution time and queue waiting time compared to existing schemes.
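The least-slack-time idea in the abstract can be illustrated with a minimal sketch. The slack of a job is the time left until its deadline minus its estimated remaining execution time. The `Job` fields, the function names, and in particular the rule in `should_preempt` are assumptions made for illustration only; they are not the scheduler evaluated in the paper, and data locality is not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    deadline: float   # absolute deadline, in seconds
    remaining: float  # estimated remaining execution time, in seconds


def slack(job: Job, now: float) -> float:
    """Slack = time left until the deadline minus the work still to do."""
    return job.deadline - now - job.remaining


def pick_least_slack(queue: List[Job], now: float) -> Job:
    """Return the waiting job with the smallest slack (the most urgent one)."""
    return min(queue, key=lambda j: slack(j, now))


def should_preempt(running: Job, candidate: Job, now: float) -> bool:
    """Hypothetical preemption rule (an assumption, not the paper's exact rule).

    Preempt only if the candidate is more urgent, would miss its deadline
    by waiting, and the running job can still meet its own deadline after
    being delayed by the candidate.
    """
    if slack(candidate, now) >= slack(running, now):
        return False  # candidate is not more urgent than the running job
    # Whichever job runs second finishes at this time.
    finish_of_second = now + running.remaining + candidate.remaining
    if finish_of_second <= candidate.deadline:
        return False  # candidate can afford to wait; no preemption needed
    return finish_of_second <= running.deadline
```

In this sketch, a running job that is close to finishing is rarely preempted, because the candidate can usually afford to wait for it. This mirrors the abstract's point that the remaining execution time of the running job should inform the preemption decision.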

Highlights

  • In recent years, cloud computing and big data have attracted researchers’ attention

  • We present a preemptive approach that schedules jobs by least slack time so that their total completion time is reduced under the given deadlines

  • We present the existing schedulers for Hadoop, which schedule submitted MapReduce jobs based on their requirements and the available resources in a computing cluster. The fair scheduler [27] was proposed to give all jobs, on average, an equal share of resources over time

Summary

INTRODUCTION

Cloud computing and big data have attracted researchers’ attention. Hadoop is a distributed computing framework that runs applications on a cluster of many inexpensive commodity computing nodes. It is based on the MapReduce model, which Google introduced in 2004 to handle big data applications through parallel processing. The scheme proposed in this paper attempts to solve these issues by focusing on meeting job deadlines in a shared computing environment. This requires accurate estimation of the map and reduce task computation times. The scheme provides dynamic workload scheduling with queue-wise preemption, based on job priority, to maximize the resource utilization of a Hadoop cluster. We also develop a multi-server queuing model applicable to the proposed scheme to improve the schedulability of MapReduce jobs under different constraints and requirements.
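To make the role of task time estimation concrete, the following is a rough sketch of a wave-based completion time estimate and a deadline check. It assumes that map tasks run in waves limited by the number of map slots, that reduce tasks start only after all map tasks finish, and that average task durations are known. These simplifications, and all function and parameter names, are illustrative assumptions rather than the estimation model used in the paper.

```python
import math


def estimate_job_time(n_map: int, n_reduce: int,
                      map_slots: int, reduce_slots: int,
                      avg_map_s: float, avg_reduce_s: float) -> float:
    """Estimate job completion time in seconds using a simple wave model."""
    if map_slots <= 0 or reduce_slots <= 0:
        raise ValueError("slot counts must be positive")
    # Each wave runs at most `slots` tasks in parallel.
    map_waves = math.ceil(n_map / map_slots)
    reduce_waves = math.ceil(n_reduce / reduce_slots)
    return map_waves * avg_map_s + reduce_waves * avg_reduce_s


def meets_deadline(start_time: float, deadline: float, estimate: float) -> bool:
    """Return True if a job started at start_time is expected to finish in time."""
    return start_time + estimate <= deadline
```

For example, a job with 10 map tasks and 4 reduce tasks on 4 map slots and 2 reduce slots needs 3 map waves and 2 reduce waves under this model. An estimate like this can feed the slack calculation of a deadline-aware scheduler, and its accuracy directly affects whether deadlines are met.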

RELATED WORK
HADOOP SCHEDULERS
PERFORMANCE EVALUATION
Findings
CONCLUSION