Abstract

Big data analytics has become increasingly vital in many modern enterprise applications, such as user profiling and business process optimization. Today’s big data processing systems, such as Hadoop MapReduce, Spark, and Hive, treat big data applications as batches of jobs to be scheduled. Existing schedulers in production systems typically maintain fair allocation but do not consider application performance and resource utilization at the same time. Scheduling jobs in big data systems to achieve both low turnaround time and high resource utilization is challenging because data processing logic is highly complex and workloads vary dynamically. In this article, we propose a performance-aware scheduler, referred to as PAS, which dynamically schedules big data jobs in Hadoop YARN and Spark and autonomously adjusts scheduling policies to improve application performance and resource utilization. Specifically, PAS schedules multiple concurrent jobs using different policies based on the predicted job completion time, and it employs a greedy approach and a one-step lookahead strategy to opportunistically maximize the average job performance while still maintaining a satisfactory level of resource utilization. We implement PAS in Hadoop YARN and evaluate its performance with HiBench, a well-known big data processing benchmark. Experimental results show that PAS reduces the average turnaround time by 25% and the makespan by 15% in comparison with four state-of-the-art schedulers.
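The interplay between a greedy choice and a one-step lookahead can be illustrated with a small sketch. This is not the PAS algorithm itself, which the abstract does not specify in detail. It is a toy model under stated assumptions: each job carries a predicted completion time and a container demand, the cluster has a single pool of containers, and "satisfactory utilization" is modeled as a hypothetical threshold `min_util`. The names `Job`, `greedy_pick`, and `lookahead_pick` are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Job:
    name: str
    predicted_time: float  # predicted completion time (assumed to be given)
    demand: int            # number of containers requested (toy resource model)


def greedy_pick(waiting: List[Job], free: int) -> Optional[Job]:
    """Greedy rule: among jobs that fit, take the shortest predicted one."""
    fitting = [j for j in waiting if j.demand <= free]
    return min(fitting, key=lambda j: j.predicted_time, default=None)


def lookahead_pick(waiting: List[Job], capacity: int, free: int,
                   min_util: float = 0.8) -> Optional[Job]:
    """One-step lookahead: for each candidate, simulate one greedy follow-up
    and estimate the resulting utilization. Prefer the shortest candidate
    whose estimated utilization reaches min_util (a hypothetical threshold);
    if none does, fall back to the candidate with the highest utilization."""
    scored = []
    for job in waiting:
        if job.demand > free:
            continue
        rest = [j for j in waiting if j is not job]
        follow = greedy_pick(rest, free - job.demand)
        used = (capacity - free) + job.demand + (follow.demand if follow else 0)
        scored.append((job, used / capacity))
    if not scored:
        return None
    good = [(j, u) for j, u in scored if u >= min_util]
    if good:
        return min(good, key=lambda pair: pair[0].predicted_time)[0]
    return max(scored, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    jobs = [Job("A", 10.0, 6), Job("B", 20.0, 5), Job("C", 30.0, 5)]
    print(greedy_pick(jobs, free=10).name)
    print(lookahead_pick(jobs, capacity=10, free=10).name)
```

In this toy workload, the purely greedy rule launches the shortest job A, after which neither B nor C fits and four of the ten containers sit idle. The lookahead rule instead launches B, because B can be followed by C to fill the cluster. A real scheduler would have to account for multiple resource types, job arrivals, and prediction error, none of which are modeled here.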
