Cooperative Distributed GPU Power Capping for Deep Learning Clusters

Dong-Ki Kang, Yun-Gi Ha, Limei Peng, Chan-Hyun Youn

doi:10.1109/tie.2021.3095790

Abstract

Recent GPU-based clusters that handle deep learning (DL) tasks are characterized by heterogeneous GPU devices, a wide variety of deep neural network (DNN) models, and high computational complexity. As a result, traditional power capping methods designed for CPU-based clusters or small-scale GPU devices cannot be applied to GPU-based clusters handling DL tasks. This article develops a cooperative distributed GPU power capping (CD-GPC) system for GPU-based clusters that aims to minimize the training completion time of invoked DL tasks without exceeding a limited power budget. Specifically, we first design a frequency scaling (FS) approach that uses online model estimation based on the recursive least squares (RLS) method. This approach accurately tunes the DL task training time and the power usage of GPU devices without requiring offline profiling. We then formulate the FS problem as a Lagrangian dual decomposition-based economic model predictive control problem for large-scale heterogeneous GPU clusters. We evaluate performance through both lab-scale experiments on real NVIDIA GPUs and simulation experiments driven by real job traces. Experimental results show that the proposed system achieves a power capping mean absolute error of less than 1% and reduces the deadline violation ratio of invoked DL tasks by 21.5% compared with other recent counterparts.
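The abstract combines two mechanisms: online estimation of a GPU's power and training-time model with recursive least squares, and a Lagrangian dual decomposition that splits one cluster-wide power budget into per-GPU frequency decisions. The Python sketch below illustrates both ideas under assumptions that are not taken from the paper: power is modeled as P(f) = a·f + b and per-iteration time as T(f) = c/f + d for core frequency f, the allocation is a single static step rather than the paper's economic model predictive control, and the coordinator updates the dual price μ by bisection instead of a subgradient step. The class and function names (RLS, local_frequency, total_power, allocate), the GPU dictionary keys, and all numbers are hypothetical.

```python
"""Hypothetical sketch: RLS model estimation plus dual-decomposition power capping.

Assumptions (not from the paper): GPU power P(f) = a*f + b and per-iteration
time T(f) = c/f + d, with f the core frequency in GHz and a > 0, c > 0.
The constant d does not change the optimal frequency, so it is omitted.
"""
import math


class RLS:
    """Recursive least squares for y ~ theta . phi with forgetting factor lam."""

    def __init__(self, n, lam=0.98, delta=1000.0):
        self.lam = lam
        self.theta = [0.0] * n
        # Covariance initialised to delta * I; a large delta means low initial confidence.
        self.P = [[delta if i == j else 0.0 for j in range(n)] for i in range(n)]

    def update(self, phi, y):
        n = len(phi)
        p_phi = [sum(self.P[i][j] * phi[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(phi[i] * p_phi[i] for i in range(n))
        gain = [v / denom for v in p_phi]
        err = y - sum(t * x for t, x in zip(self.theta, phi))  # a-priori prediction error
        self.theta = [t + g * err for t, g in zip(self.theta, gain)]
        # P <- (P - gain * phi^T P) / lam; P is symmetric, so phi^T P equals p_phi^T.
        self.P = [[(self.P[i][j] - gain[i] * p_phi[j]) / self.lam for j in range(n)]
                  for i in range(n)]
        return self.theta


def local_frequency(gpu, mu):
    """GPU-local subproblem: argmin over [f_min, f_max] of c/f + mu*a*f."""
    if mu <= 0.0:
        return gpu["f_max"]  # power is free, so run as fast as allowed
    f_star = math.sqrt(gpu["c"] / (mu * gpu["a"]))  # stationary point of the convex cost
    return min(gpu["f_max"], max(gpu["f_min"], f_star))


def total_power(gpus, mu):
    """Cluster power when every GPU answers the price mu with its local optimum."""
    return sum(g["a"] * local_frequency(g, mu) + g["b"] for g in gpus)


def allocate(gpus, budget, iters=60):
    """Coordinator: find the price mu so that total power meets the budget.

    Total power is nonincreasing in mu, so bisection replaces a subgradient step.
    Returns (frequencies, mu).
    """
    if sum(g["a"] * g["f_min"] + g["b"] for g in gpus) > budget:
        return [g["f_min"] for g in gpus], math.inf  # infeasible: cap everything at f_min
    if total_power(gpus, 0.0) <= budget:
        mu = 0.0  # budget is not binding
    else:
        lo, hi = 0.0, 1.0
        while total_power(gpus, hi) > budget:
            hi *= 2.0  # widen the bracket until the budget is met
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if total_power(gpus, mid) > budget:
                lo = mid
            else:
                hi = mid
        mu = hi  # upper end of the bracket is always the feasible side
    return [local_frequency(g, mu) for g in gpus], mu


if __name__ == "__main__":
    # Fit one GPU's power model from noise-free samples of P = 120 f + 60 (made-up numbers).
    est = RLS(n=2, lam=1.0)
    for f in [0.8, 1.0, 1.2, 1.4, 1.6]:
        est.update([f, 1.0], 120.0 * f + 60.0)
    print("estimated [a, b]:", [round(v, 2) for v in est.theta])

    gpus = [
        {"a": 120.0, "b": 60.0, "c": 1.0, "f_min": 0.6, "f_max": 1.6},
        {"a": 80.0, "b": 40.0, "c": 2.0, "f_min": 0.6, "f_max": 1.4},
    ]
    freqs, mu = allocate(gpus, budget=300.0)
    print("frequencies (GHz):", [round(f, 3) for f in freqs], "price:", mu)
```

Each GPU needs only the scalar price μ to solve its own subproblem, which is what lets the scheme run in a distributed way; bisection is a convenient coordinator update here because total power is nonincreasing in μ, so no step size has to be tuned.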

Journal: IEEE Transactions on Industrial Electronics
Publication Date: Jul 1, 2022
Citations: 6
License type: CC BY-NC-ND 4.0

Similar Papers

RT-mDL
Neiwen Ling ... Guoliang Xing
15 Nov 2021

Accelerating Deep Learning Tasks with Optimized GPU-assisted Image Decoding
Lipeng Wang ... Shengen Yan
01 Dec 2020

Characterizing Resource Heterogeneity in Edge Devices for Deep Learning Inferences
Jianwei Hao ... In Kee Kim
21 Jun 2020

Joint DNN Partition and Resource Allocation for Task Offloading in Edge–Cloud-Assisted IoT Environments
Wenhao Fan ... Fan Wu
IEEE Internet of Things Journal | VOL. 10
15 Jun 2023
