Abstract

Directed acyclic graph (DAG)-aware task scheduling algorithms have been studied extensively in recent years, and they have achieved significant performance improvements in data-parallel analytics platforms. However, current DAG-aware task scheduling algorithms, among which HEFT and GRAPHENE are notable, pay little attention to the cache management policy, which plays a vital role in in-memory data-parallel systems such as Spark. Cache management policies designed for Spark perform poorly under DAG-aware task scheduling, leading to cache misses and performance degradation. In this study, we propose a new cache management policy, Long-Running Stage Set First (LSF), which makes full use of task dependencies to optimize cache management under DAG-aware scheduling algorithms. LSF calculates the caching and prefetching priorities of resilient distributed datasets according to their unprocessed workloads and their significance in parallel scheduling, which are key factors in DAG-aware scheduling algorithms. Moreover, we present a cache-aware task scheduling algorithm based on LSF to reduce resource fragmentation during computation. Experiments demonstrate that, compared to DAG-aware scheduling algorithms with LRU and MRD, the same algorithms with LSF reduce the job completion time (JCT) by up to 42% and 30%, respectively. The proposed cache-aware scheduling algorithm also achieves about a 12% reduction in average JCT compared to GRAPHENE with LSF.

Highlights

  • Spark is an in-memory data analytics framework that is used extensively in iterative data processing with low latency [1,2,3,4].

  • We present a cache management policy with data eviction and prefetching in Spark, known as Long-Running Stage Set First (LSF).

  • LSF calculates the caching and prefetching priority of the resilient distributed datasets (RDDs) according to their unprocessed workloads and their significance in parallel scheduling, which are key factors in directed acyclic graph (DAG)-aware scheduling algorithms (a minimal sketch of one such priority computation follows this list).
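
The following is a minimal, hypothetical sketch of how such a priority could be computed. The StageInfo fields, the product-based score, and all numbers are illustrative assumptions for this page, not the paper's exact definitions.

```scala
// Hypothetical sketch of an LSF-style priority computation.
// The fields and the scoring formula are illustrative assumptions,
// not the paper's exact definitions.
case class StageInfo(
  id: Int,
  unprocessedWork: Double, // estimated remaining computation that reads this RDD
  downstreamStages: Int    // stages that depend on this RDD in the DAG
)

object LsfPriority {
  // Rank RDDs so that data feeding long-running stage sets with many
  // dependent stages is cached (or prefetched) first.
  def cachingOrder(stages: Seq[StageInfo]): Seq[StageInfo] =
    stages.sortBy(s => -(s.unprocessedWork * s.downstreamStages))

  def main(args: Array[String]): Unit = {
    val stages = Seq(
      StageInfo(1, unprocessedWork = 40.0, downstreamStages = 3),
      StageInfo(2, unprocessedWork = 90.0, downstreamStages = 1),
      StageInfo(3, unprocessedWork = 25.0, downstreamStages = 5)
    )
    cachingOrder(stages).foreach(s => println(s"cache RDD of stage ${s.id} first"))
  }
}
```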

Summary

Introduction

Spark is an in-memory data analytics framework that is used extensively in iterative data processing with low latency [1,2,3,4]. It uses resilient distributed datasets (RDDs) to cache and compute parallel data, which results in significant performance improvements compared to traditional disk-based frameworks [5]. DAG-aware scheduling should be optimized jointly with the cache policy to improve resource utilization and to achieve a better job completion time (JCT) for the workflow. We investigate the impact of the cache policy on scheduling and propose a cache-aware scheduling method in Spark, which increases task parallelism and makes better use of cluster resources.
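
As a concrete illustration of the RDD caching that this work builds on, here is a small self-contained Spark (Scala) example; the app name, data, and storage level are arbitrary choices, not taken from the paper.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-cache")     // illustrative app name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD reused by several downstream stages: caching it in memory
    // avoids recomputing the map lineage on every action.
    val base = sc.parallelize(1 to 1000000).map(x => x * x)
    base.persist(StorageLevel.MEMORY_ONLY)

    println(base.count())                     // first action materializes and caches the RDD
    println(base.filter(_ % 2 == 0).count()) // reuses the cached partitions

    spark.stop()
  }
}
```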

Background and Motivation
RDD and Data Dependency
Scheduling
Cache-Oblivious
System Design
Cache Management Policy for DAG-Aware Task Scheduling
Cache-Aware Scheduling Algorithm
Spark Implementation
Evaluations
Performance of LSF
Performance of Cache-Aware Scheduling
Findings
Conclusions