Abstract

With the rise of big data, more and more users launch computing systems to process large volumes of data in various applications. The scheduling algorithm is crucial to the performance of these processing platforms, especially when they execute a batch of jobs concurrently. Such jobs usually consist of multiple stages, where each stage produces intermediate data that is piped to the next stage for further processing. However, the scheduling problem in a big data computing system differs from the traditional multi-stage job scheduling problem: for any two consecutive stages, the later stage usually starts before the former stage finishes in order to "shuffle" the intermediate data. In this paper, we consider MapReduce/Hadoop as a representative computing system and develop a new strategy named OMO, Optimize MapReduce Overlap with a Good Start (Reduce) and a Good Finish (Map). A MapReduce job contains two consecutive phases, map and reduce, and our general target is to optimize the overlap between them. Our solution includes two new techniques, lazy start of reduce tasks and batch finish of map tasks, which aim to achieve an effective alignment of the two phases based on the characteristics of the MapReduce process. OMO has been implemented on the Hadoop system and evaluated with extensive experiments. The results show that OMO's performance is superior in terms of the total completion time (i.e., makespan) of a batch of jobs.

Highlights

  • In the past few years, we have all witnessed the rise of big data and various processing platforms such as Hadoop [1], Mesos [2] and Spark [3], which have been widely adopted in both academia and industry for various applications

  • This paper aims to establish an efficient scheduling scheme for big data computing systems that improves resource utilization and reduces the makespan

  • This paper studies the scheduling problem in a big data computing system with multiple internal stages, especially in a Hadoop cluster serving a batch of MapReduce jobs

Summary

INTRODUCTION

In the past few years, we have all witnessed the rise of big data and various processing platforms such as Hadoop [1], Mesos [2] and Spark [3], which have been widely adopted in both academia and industry for various applications. This work advances a novel technique, called OMO, that targets optimizing the overlap between the map and reduce stages. This overlapping period plays an essential part in MapReduce processing when the map stage produces large quantities of intermediate data for shuffling. OMO consists of two new strategies: lazy start of reduce tasks and batch finish of map tasks. The former strategy attempts to find the optimal time to launch reduce tasks, ensuring that enough time is allocated for the reduce tasks to shuffle the intermediate data while containers can still be assigned to serve map tasks as much as possible.
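The lazy-start idea above can be illustrated with a small sketch. This is a hypothetical decision rule under an assumed linear timing model (constant map progress rate and shuffle bandwidth); the function name, parameters, and the model itself are illustrative assumptions, not the paper's actual implementation:

```python
def should_start_reduce(remaining_map_work, map_rate,
                        pending_shuffle_bytes, shuffle_rate):
    """Decide whether to launch reduce tasks now (lazy start).

    Under the assumed model, reduce tasks are started only when the
    time still needed to shuffle the produced intermediate data is at
    least the time left in the map phase. Before that point, containers
    keep serving map tasks; starting at that point, the shuffle can
    still finish by the time the map phase ends.
    """
    remaining_map_time = remaining_map_work / map_rate
    remaining_shuffle_time = pending_shuffle_bytes / shuffle_rate
    return remaining_shuffle_time >= remaining_map_time
```

For example, if the map phase needs 10 more time units but shuffling the pending data would take 20 units, the rule starts the reduce tasks; if the shuffle would take only 5 units, it keeps the containers on map tasks.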

RELATED WORK
OUR SOLUTION
1) MOTIVATION
COMBINATION OF THE TWO TECHNIQUES
PERFORMANCE EVALUATION
Findings
CONCLUSION
