Abstract

In recent years, many companies have developed distributed computation frameworks for processing machine learning (ML) jobs in clusters. Networking is a well-known bottleneck for ML systems, and clusters demand efficient scheduling of the heavy traffic (up to 1 GB per flow) generated by ML jobs. Coflow has proven to be an effective abstraction for scheduling the flows of such data-parallel applications. However, coflow scheduling policies are difficult to implement when coflow characteristics are unknown a priori, and when TCP congestion control misinterprets congestion signals and delivers low throughput. Fortunately, the traffic patterns of some ML jobs make it possible to speculate complete coflow characteristics from limited information. This paper therefore characterizes the coflows of such ML jobs as self-similar coflows and proposes Cicada, a decentralized self-similar coflow scheduler. Cicada assigns each coflow a probe flow to speculate its characteristics during transmission and employs Shortest Job First (SJF) to separate coflows into strict priority queues based on the speculation result. To achieve full bandwidth for throughput-sensitive ML jobs and to guarantee enforcement of the scheduling policy, Cicada introduces an elastic transport-layer rate control that outperforms prior designs. Large-scale simulations show that Cicada completes coflows 2.08x faster than state-of-the-art schemes in the information-agnostic scenario.
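To make the SJF-based queueing concrete, the sketch below shows one way a scheduler could map a coflow's speculated size to a strict priority queue: smaller speculated coflows land in higher-priority queues, approximating SJF ordering. The thresholds, queue count, and function name are illustrative assumptions, not details taken from the paper.

    import bisect

    # Hypothetical queue thresholds in bytes; the specific values are
    # illustrative assumptions, not parameters from the paper.
    QUEUE_THRESHOLDS = [10 * 2**20, 100 * 2**20, 2**30]  # 10 MiB, 100 MiB, 1 GiB

    def assign_priority_queue(speculated_size: int) -> int:
        """Map a coflow's speculated size to a strict priority queue.

        Queue 0 is the highest priority; coflows with larger speculated
        sizes are demoted to lower-priority queues.
        """
        return bisect.bisect_right(QUEUE_THRESHOLDS, speculated_size)

    # Example: a coflow speculated at 50 MiB falls into queue 1.
    print(assign_priority_queue(50 * 2**20))  # -> 1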
