Abstract

Apache Hadoop is one of the most popular MapReduce frameworks for parallel processing of large data sets. As its job scheduler and resource manager, YARN plays a central role. Schedulers on YARN are designed to minimize the makespan of MapReduce jobs. The performance of a YARN scheduler depends not only on whether the resource capacity of the worker nodes is fully utilized, but also on the dependencies among the tasks, which makes an optimal solution very difficult to achieve. This paper proposes a new Hadoop YARN scheduling algorithm. The algorithm formalizes scheduling as a multiple knapsack problem that takes into account the resource cost and time cost of each task as well as the dependencies between tasks. The Artificial Fish Swarm Algorithm is adopted to solve the knapsack optimization problem. The algorithm was implemented as a pluggable scheduler on the most recent version of Hadoop YARN and evaluated with several MapReduce benchmarks. The experimental results show that our scheduler can reduce the makespan of Hadoop jobs by 30% compared with some existing scheduling policies.
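
To make the multiple knapsack view concrete, the following is a minimal sketch and not the paper's formulation: worker nodes play the role of knapsacks with a resource capacity, and tasks are items with a resource cost, a time cost, and dependencies. All names here (Task, ready_tasks, is_feasible, packed_resource) are hypothetical and chosen for illustration. The sketch only shows the constraints a candidate assignment would have to satisfy; the objective function and the Artificial Fish Swarm search itself are not specified in the abstract and are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task model: one item to be packed into a node."""
    name: str
    resource: int            # resource cost, e.g. memory units requested
    duration: float          # time cost, e.g. estimated run time in seconds
    depends_on: tuple = ()   # names of tasks that must finish first


def ready_tasks(tasks, finished):
    """Return unfinished tasks whose dependencies have all finished."""
    return [
        t for t in tasks
        if t.name not in finished and all(d in finished for d in t.depends_on)
    ]


def is_feasible(assignment, tasks_by_name, capacity):
    """Check that no node is loaded beyond its capacity.

    assignment maps task name -> node name; capacity maps node name -> units.
    A node missing from capacity is treated as having zero capacity.
    """
    used = {}
    for task_name, node in assignment.items():
        used[node] = used.get(node, 0) + tasks_by_name[task_name].resource
    return all(load <= capacity.get(node, 0) for node, load in used.items())


def packed_resource(assignment, tasks_by_name):
    """Total resource placed by an assignment (one possible utilization score)."""
    return sum(tasks_by_name[name].resource for name in assignment)


# Illustrative data: two map tasks and a reduce task that depends on both.
tasks = [
    Task("map1", resource=2, duration=10.0),
    Task("map2", resource=3, duration=12.0),
    Task("reduce1", resource=4, duration=20.0, depends_on=("map1", "map2")),
]
tasks_by_name = {t.name: t for t in tasks}
capacity = {"node1": 4, "node2": 4}
```

Under these assumptions, a search procedure would propose assignments of the currently ready tasks to nodes, discard those that fail is_feasible, and rank the rest by a score that reflects utilization and estimated completion time.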
