Abstract

Distributed Machine Learning (DML) is proposed not only to accelerate the training of machine learning models, but also to handle training datasets too large for a single machine. It employs multiple computing nodes in a data center that work collaboratively in parallel, at the cost of high communication overhead. Gradient Compression (GC) reduces this communication overhead by reducing the number of gradients synchronized among computing nodes. However, existing GC solutions suffer under varying network congestion: when some computing nodes experience high congestion, their gradient transmission can be significantly delayed, slowing down the entire training process. To solve this problem, we propose FLASH, a congestion-aware GC solution for DML. FLASH accelerates training by jointly considering the iterative approximation nature of machine learning and dynamic network congestion. It maintains good training performance by adaptively adjusting and scheduling the number of gradients synchronized among computing nodes. We evaluate the effectiveness of FLASH using AlexNet and ResNet18 under different network congestion scenarios. Simulation results show that, for the same number of training epochs, FLASH reduces training time by 22-71% while maintaining good accuracy and low loss, compared with the existing memory top-K GC solution.
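For readers unfamiliar with the baseline the abstract compares against, below is a minimal sketch of memory (error-feedback) top-K gradient compression: each worker sends only the k largest-magnitude gradient coordinates per iteration and accumulates the dropped remainder locally for later rounds. The function name `topk_compress_with_memory` and the `k_ratio` default are illustrative assumptions, not definitions from the paper, and FLASH's congestion-aware adjustment of the synchronized gradient count is not shown here.

```python
import numpy as np

def topk_compress_with_memory(grad, memory, k_ratio=0.01):
    """Memory (error-feedback) top-K gradient compression sketch.

    grad:    flat gradient vector computed by one worker this iteration
    memory:  residual of gradients dropped in previous iterations
    k_ratio: fraction of coordinates to synchronize (illustrative value)
    """
    # Add back the residual so dropped gradients are not lost permanently.
    corrected = grad + memory

    # Keep only the k largest-magnitude coordinates.
    k = max(1, int(k_ratio * corrected.size))
    idx = np.argpartition(np.abs(corrected), -k)[-k:]

    values = corrected[idx]        # sparse payload synchronized with other nodes
    new_memory = corrected.copy()
    new_memory[idx] = 0.0          # dropped remainder stays in local memory
    return idx, values, new_memory


# Usage: compress a random gradient, synchronizing 1% of its coordinates.
rng = np.random.default_rng(0)
g = rng.standard_normal(10_000)
mem = np.zeros_like(g)
idx, vals, mem = topk_compress_with_memory(g, mem, k_ratio=0.01)
print(f"sent {idx.size} of {g.size} gradient coordinates")
```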
