Abstract

The increasing demand for learning from massive datasets is restructuring our economy. Effective learning, however, requires nontrivial computing resources. Most businesses rely on commercial infrastructure providers (e.g., AWS) to host their computing clusters in the cloud, where various jobs compete for available resources. While cloud resource management is a fruitful research field that has produced many advances in production systems, such as Kubernetes and YARN, few efforts have focused on further optimizing system performance, especially for Deep Learning (DL) training jobs in a container cluster. This work introduces FlowCon, a system that monitors the individual evaluation functions of DL jobs at runtime and uses them to make placement decisions and resource allocations elastically. We present a detailed design and implementation of FlowCon and conduct intensive experiments over various DL models. The results demonstrate that FlowCon significantly improves DL job completion time and resource utilization efficiency compared to default systems. According to the results, FlowCon can improve completion time by up to 68.8% while reducing makespan by 18.0% in the presence of various DL job workloads.
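To make the core idea concrete, the following is a minimal sketch (not FlowCon's actual implementation) of progress-aware elastic allocation: each DL job's evaluation function is approximated by the recent rate of loss decrease, and container CPU shares are divided in proportion to that progress. The `Job` class, the window size, and the share budget are all illustrative assumptions.

```python
# Toy sketch of progress-aware resource allocation, inspired by the abstract's
# description of monitoring DL jobs' evaluation functions at runtime.
# All names and parameters here are hypothetical, not from the FlowCon paper.

from collections import deque


class Job:
    def __init__(self, name, window=5):
        self.name = name
        self.losses = deque(maxlen=window)  # sliding window of recent losses

    def record_loss(self, loss):
        self.losses.append(loss)

    def progress(self):
        """Average per-step loss decrease over the window (0 if flat or rising)."""
        if len(self.losses) < 2:
            return 0.0
        drop = self.losses[0] - self.losses[-1]
        return max(drop / (len(self.losses) - 1), 0.0)


def allocate_shares(jobs, total_shares=1024):
    """Split a fixed CPU-share budget proportionally to each job's progress;
    fall back to an equal split when no job is making measurable progress."""
    weights = [j.progress() for j in jobs]
    total = sum(weights)
    if total == 0:
        return {j.name: total_shares // len(jobs) for j in jobs}
    return {j.name: int(total_shares * w / total) for j, w in zip(jobs, weights)}
```

In this sketch, a fast-converging job receives a larger share of the budget, while a plateaued job is throttled, which mirrors the elastic-allocation behavior the abstract describes at a high level.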
