Abstract

Nowadays, most leading IT companies host a variety of distributed machine learning (ML) workloads in ML clusters to support AI-driven services, such as speech recognition, machine translation, and image processing. While multiple jobs are executed concurrently in a shared cluster to improve resource utilization, interference among co-located ML jobs can lead to significant performance degradation. Existing cluster schedulers, such as YARN and Mesos, are interference-agnostic in their job placement, leading to suboptimal resource efficiency and usage. Prior work has studied interference-aware job placement policies, but relies on detailed workload profiling and interference modeling, which does not generalize well. In this work, we present Harmony, a deep learning-driven ML cluster scheduler that places heterogeneous training jobs (with either a parameter server or an all-reduce architecture) in a manner that minimizes interference and maximizes performance, i.e., minimizes training completion time. The design of Harmony is based on a carefully designed deep reinforcement learning (DRL) framework enhanced with reward modeling. The DRL framework integrates a dynamic sequence-to-sequence model with state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action space exploration, multi-head attention, and experience replay. In view of the common lack of reward samples corresponding to different placement decisions, we build an auxiliary sequence-to-sequence reward prediction model, which is trained with historical samples and used to produce rewards for unseen placements. Experiments using real ML workloads in a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 16%–42% in terms of average job completion time.
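To make the described components more concrete, the sketch below shows one plausible way to wire together a sequence-to-sequence encoder with attention, an actor-critic placement policy, and an auxiliary reward predictor for unseen placements. This is not the authors' implementation; all module names, layer sizes, feature dimensions, and the single update step are illustrative assumptions.

```python
# Minimal sketch (assumed, not Harmony's actual code): an actor-critic placement
# policy over an encoded job sequence, plus an auxiliary reward prediction model
# trained on historical (placement -> reward) samples.
import torch
import torch.nn as nn

class PlacementPolicy(nn.Module):
    def __init__(self, feat_dim=16, hidden=64, num_servers=6, heads=4):
        super().__init__()
        # Encode the variable-length sequence of job/task features.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Multi-head attention over the encoded sequence.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.actor = nn.Linear(hidden, num_servers)   # per-task placement logits
        self.critic = nn.Linear(hidden, 1)             # state-value estimate

    def forward(self, jobs):                            # jobs: (batch, tasks, feat_dim)
        enc, _ = self.encoder(jobs)
        ctx, _ = self.attn(enc, enc, enc)
        return self.actor(ctx), self.critic(ctx.mean(dim=1))

class RewardPredictor(nn.Module):
    """Auxiliary model fitted to historical placement/reward traces and used to
    supply a reward signal for placements never observed in the history."""
    def __init__(self, feat_dim=16, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, placed_jobs):
        _, (h, _) = self.encoder(placed_jobs)
        return self.head(h[-1])                         # predicted reward (e.g., training speed)

# One hypothetical actor-critic update using a predicted reward.
policy, reward_model = PlacementPolicy(), RewardPredictor()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
state = torch.randn(1, 5, 16)                           # 5 concurrent tasks, 16 features each
logits, value = policy(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                                  # a server index for each task
reward = reward_model(state).detach()                   # stand-in reward for an unseen placement
advantage = reward - value
loss = -(dist.log_prob(action).sum() * advantage.detach()).mean() + advantage.pow(2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```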
