Abstract

Directed acyclic graph (DAG)-aware task scheduling algorithms have been studied extensively in recent years, and they have achieved significant performance improvements in data-parallel analytics platforms. However, current DAG-aware task scheduling algorithms, among which HEFT and GRAPHENE are notable, pay little attention to the cache management policy, which plays a vital role in in-memory data-parallel systems such as Spark. Cache management policies designed for Spark perform poorly under DAG-aware task scheduling, leading to cache misses and performance degradation. In this study, we propose a new cache management policy, Long-Running Stage Set First (LSF), which makes full use of task dependencies to optimize cache management under DAG-aware scheduling algorithms. LSF calculates the caching and prefetching priorities of resilient distributed datasets according to their unprocessed workloads and their significance in parallel scheduling, which are key factors in DAG-aware scheduling algorithms. Moreover, we present a cache-aware task scheduling algorithm based on LSF to reduce resource fragmentation during computation. Experiments demonstrate that, compared to DAG-aware scheduling algorithms with LRU and MRD, the same algorithms with LSF reduce the job completion time (JCT) by up to 42% and 30%, respectively. The proposed cache-aware scheduling algorithm also achieves about a 12% reduction in average JCT compared to GRAPHENE with LSF.

Highlights

  • Spark is an in-memory data analytics framework that is used extensively in iterative data processing with low latency [1,2,3,4].

  • We present a cache management policy with data eviction and prefetching in Spark, known as Long-Running Stage Set First (LSF).

  • LSF calculates the caching and prefetching priority of the resilient distributed datasets (RDDs) according to their unprocessed workloads and their significance in parallel scheduling, which are key factors in directed acyclic graph (DAG)-aware scheduling algorithms (a minimal sketch of one such priority computation follows this list).
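
The following is a minimal, hypothetical sketch of how such a priority could be computed. The StageInfo fields, the product-based score, and all numbers are illustrative assumptions for this page, not the paper's exact definitions.

```scala
// Hypothetical sketch of an LSF-style priority computation.
// The fields and the scoring formula are illustrative assumptions,
// not the paper's exact definitions.
case class StageInfo(
  id: Int,
  unprocessedWork: Double, // estimated remaining computation that reads this RDD
  downstreamStages: Int    // stages that depend on this RDD in the DAG
)

object LsfPriority {
  // Rank RDDs so that data feeding long-running stage sets with many
  // dependent stages is cached (or prefetched) first.
  def cachingOrder(stages: Seq[StageInfo]): Seq[StageInfo] =
    stages.sortBy(s => -(s.unprocessedWork * s.downstreamStages))

  def main(args: Array[String]): Unit = {
    val stages = Seq(
      StageInfo(1, unprocessedWork = 40.0, downstreamStages = 3),
      StageInfo(2, unprocessedWork = 90.0, downstreamStages = 1),
      StageInfo(3, unprocessedWork = 25.0, downstreamStages = 5)
    )
    cachingOrder(stages).foreach(s => println(s"cache RDD of stage ${s.id} first"))
  }
}
```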

Summary

Introduction

Spark is an in-memory data analytics framework that is used extensively in iterative data processing with low latency [1,2,3,4]. It uses resilient distributed datasets (RDDs) to cache and compute parallel data, which results in significant performance improvements compared to traditional disk-based frameworks [5]. DAG-aware scheduling should be optimized jointly with the cache policy to improve resource utilization and to achieve a better job completion time (JCT) for the workflow. We investigate the impact of the cache policy on scheduling and propose a cache-aware scheduling method in Spark, which increases task parallelism and makes better use of cluster resources.
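
As a concrete illustration of the RDD caching that this work builds on, here is a small self-contained Spark (Scala) example; the app name, data, and storage level are arbitrary choices, not taken from the paper.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-cache")     // illustrative app name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD reused by several downstream stages: caching it in memory
    // avoids recomputing the map lineage on every action.
    val base = sc.parallelize(1 to 1000000).map(x => x * x)
    base.persist(StorageLevel.MEMORY_ONLY)

    println(base.count())                     // first action materializes and caches the RDD
    println(base.filter(_ % 2 == 0).count()) // reuses the cached partitions

    spark.stop()
  }
}
```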

Background and Motivation
RDD and Data Dependency
Scheduling
Cache-Oblivious
System Design
Cache Management Policy for DAG-Aware Task Scheduling
Cache-Aware Scheduling Algorithm
Spark Implementation
Evaluations
Performance of LSF
Performance of Cache-Aware Scheduling
Findings
Conclusions