Abstract

The Spark computing framework provides an efficient solution to the major requirements of big data processing, but data partitioning and job scheduling are two major bottlenecks that limit Spark's performance. In the Spark Shuffle phase, data skew caused by unbalanced data partitioning increases job completion time. To address this problem, this article proposes a balanced partitioning strategy for intermediate data that takes the characteristics of the intermediate data into account, establishes a data skew model, and introduces a dynamic partitioning algorithm. In heterogeneous Spark clusters, differences in node performance and task requirements prevent the default task scheduling algorithm from scheduling efficiently, which lowers the system's task processing efficiency. To address this problem, this article proposes an efficient job scheduling strategy that integrates node performance and task requirements by means of a task scheduling algorithm based on a greedy strategy. The experimental results show that the proposed dynamic partitioning algorithm for intermediate data effectively alleviates the drop in task processing efficiency caused by data skew and shortens overall task completion time. The proposed job scheduling strategy efficiently schedules jobs on heterogeneous clusters, distributes jobs across nodes in a balanced manner, decreases overall job completion time, and increases system resource utilization.
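
To make the partitioning idea concrete, the following is a minimal sketch in Scala, not the paper's actual algorithm: a skew-aware Spark Partitioner that estimates key weights from the intermediate data and greedily assigns the heaviest keys to the currently least-loaded partition, with unseen keys falling back to hash partitioning. The names SkewAwarePartitioner and keyWeights are illustrative assumptions.

// Minimal sketch (assumption, not the paper's method): greedy skew-aware partitioning.
import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession
import scala.collection.mutable

class SkewAwarePartitioner(numParts: Int, keyWeights: Map[String, Long]) extends Partitioner {
  override def numPartitions: Int = numParts

  // Greedy assignment: walk keys from heaviest to lightest and place each one
  // on the partition with the smallest accumulated weight so far.
  private val assignment: Map[String, Int] = {
    val loads = mutable.ArrayBuffer.fill(numParts)(0L)
    keyWeights.toSeq.sortBy(-_._2).map { case (key, weight) =>
      val target = loads.indices.minBy(i => loads(i))
      loads(target) += weight
      key -> target
    }.toMap
  }

  // Keys not seen when the weights were estimated fall back to hash partitioning.
  override def getPartition(key: Any): Int =
    assignment.getOrElse(key.toString, math.abs(key.hashCode % numParts))
}

object SkewAwarePartitionerDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("skew-demo").getOrCreate()
    val sc = spark.sparkContext

    // A deliberately skewed key distribution: "hot" dominates the intermediate data.
    val data = sc.parallelize(Seq.fill(1000)(("hot", 1)) ++ Seq.fill(10)(("cold", 1)))

    // Estimate per-key weights (a real system would use a sample of the shuffle data).
    val weights = data.countByKey().map { case (k, v) => k -> v }.toMap

    val partitioned = data.partitionBy(new SkewAwarePartitioner(4, weights))
    partitioned
      .mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
      .collect()
      .foreach { case (i, n) => println(s"partition $i -> $n records") }

    spark.stop()
  }
}

The same greedy "heaviest item to least-loaded bin" choice also illustrates the scheduling side of the paper, where tasks take the place of keys and heterogeneous nodes take the place of partitions.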
