Abstract

For many data-parallel computing systems like Spark, a job usually consists of multiple computation stages and inter-stage communication (i.e., coflows). Many efforts have been made to schedule coflows and jobs independently. Simply combining coflow scheduling with job scheduling, however, prolongs the average job completion time (JCT) because the two schedulers conflict. For this reason, we propose a new scheduling unit, named coBranch, which takes the dependency between computation stages and coflows into account so that coflows and jobs can be scheduled jointly. Moreover, mainstream coflow schedulers are order-preserving, i.e., all coflows of a high-priority job are prioritized over those of a low-priority job. We observe that this order-preserving constraint incurs low inter-job parallelism. To overcome the problem, we employ an urgency-based mechanism to schedule coBranches, which decreases the average JCT by enhancing inter-job parallelism. We implement the urgency-based coBranch Scheduling (BS) method on Apache Spark, conduct prototype-based experiments, and evaluate the performance of our method against the shortest-job-first critical-path method and the FIFO method. Results show that our method achieves around 10 and 15 percent reduction in the average JCT, respectively. Large-scale simulations based on the Google trace show that our method performs even better there, reducing the average JCT by 23 and 35 percent, respectively.
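The coBranch abstraction itself is not formalized in this summary. As a rough illustration only, the sketch below (in Python, with hypothetical field names such as `input_coflow` and `est_compute_time`) shows one plausible way a coBranch could bundle a computation stage together with the coflow it depends on, so that both can be handed to a single scheduler.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Coflow:
    """A group of flows carrying inter-stage data (sizes in bytes)."""
    coflow_id: str
    flow_sizes: List[int] = field(default_factory=list)

    @property
    def total_bytes(self) -> int:
        return sum(self.flow_sizes)


@dataclass
class CoBranch:
    """Illustrative scheduling unit: a computation stage plus the coflow
    that must finish before the stage can start. Field names are
    assumptions, not the paper's definitions."""
    job_id: str
    stage_id: str
    input_coflow: Coflow       # inter-stage communication the stage waits on
    est_compute_time: float    # estimated stage execution time (seconds)
```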

Highlights

  • To accelerate big data analytics, data-parallel frameworks such as Dryad [2], Hadoop [3] and Spark [4] partition large input data so that multiple computers process different data partitions concurrently

  • We propose the urgency-based coBranch Scheduling (BS) method to coordinately schedule the transmission of coflows with the execution of jobs, decreasing the average job completion time (JCT)

  • Simulation on the average JCT: We evaluate the performance of the three methods via trace replay, with 5000 jobs submitted within 600 s and assigned to 500 machines


Summary

INTRODUCTION

To accelerate big data analytics, data-parallel frameworks such as Dryad [2], Hadoop [3] and Spark [4] partition large input data so that multiple computers process different data partitions concurrently. Under order-preserving coflow scheduling, the computation tasks of a job can be executed only after the high-priority coflows of other jobs have been transmitted, which prolongs that job's JCT. Under the urgency-based scheduling mechanism, coBranches with high urgency should be prioritized even though they may belong to different jobs. We propose the distributed flow scheduling (DFS) method to coordinate the transmission of coflows with the execution of coBranches. Given a batch of jobs, the online BS method does not determine the priorities of all coBranches at once; instead, it updates the time-varying urgency during the execution of jobs and continuously makes scheduling decisions.
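The paper's actual urgency definition is not reproduced in this summary. The sketch below is a minimal Python illustration of the online pattern described above: each scheduling round recomputes a time-varying urgency for every ready coBranch (possibly from different jobs) and serves the most urgent first. The `urgency` formula and the dictionary keys (`remaining_work`, `submit_time`, `stage_id`) are stand-in assumptions, not the BS method itself.

```python
import heapq
from typing import List


def urgency(remaining_work: float, elapsed: float) -> float:
    """Hypothetical urgency score: grows as a coBranch waits longer
    relative to its remaining work, so long-waiting work rises in
    priority. The paper's time-varying definition may differ."""
    return elapsed / max(remaining_work, 1e-9)


def schedule_round(ready: List[dict], now: float) -> List[dict]:
    """One online scheduling round: re-rank all ready coBranches by
    their current urgency and return them in service order."""
    heap = []
    for cb in ready:
        u = urgency(cb["remaining_work"], now - cb["submit_time"])
        heapq.heappush(heap, (-u, cb["stage_id"], cb))  # max-heap via negation
    order = []
    while heap:
        _, _, cb = heapq.heappop(heap)
        order.append(cb)
    return order
```

This stand-in only illustrates that priorities are recomputed each round rather than fixed per job up front; the real urgency model and the DFS bandwidth allocation are covered in the method sections listed below.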

Job scheduling for data-parallel jobs
Coflow scheduling
Motivation of the coordinative scheduling mechanism
Problem formulation
Result
Model of the coBranch duration
URGENCY-BASED COBRANCH SCHEDULING METHOD
Motivation of urgency-based coBranch scheduling
Time-varying coBranch urgency
Exceeding time minimization
Overview
Distributed flow scheduling
Approximation ratio of the BS method
Performance improvement of the DFS method
Online BS method
IMPLEMENTATION
PERFORMANCE EVALUATION
Experiment evaluation on the performance
Experiment evaluation on the JCT improvement
Details
Experiment evaluation on system overheads
Experiment evaluation on the online performance
Evaluation on the prediction accuracy
Simulation on Google cluster data
Analysis
Simulations
Simulation on Facebook coflow data
CONCLUSION
