Abstract
The recent success of Deep Learning (DL) in a broad range of AI services has led to a surging amount of DL workloads in production clusters. To support DL jobs at scale, the parameter server (PS) architecture is the most popular approach for distributing the computation across a compute cluster. Concurrent DL jobs, each consisting of PS tasks and worker tasks, are typically launched on available compute nodes by a cluster resource manager to ensure high cluster resource utilization. Because a PS must distribute model updates to every remote worker, its communication has a very large fan-out. We observe that network contention among colocated PSes can cause stragglers among workers, resulting in application performance degradation and resource under-utilization. To mitigate the straggler effect, we propose TensorLights, which introduces traffic prioritization at host NICs to manage traffic contention among PSes. We evaluate TensorLights experimentally and show that it effectively mitigates stragglers, improves the average completion time of DL applications by up to 31%, and increases resource utilization. TensorLights is highly practical as it provides these benefits without requiring changes to the DL software stack.
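The contention effect described above can be illustrated with a toy model. The following sketch is not the paper's implementation: it simply models two colocated PS flows sharing one host NIC, comparing fair (processor-sharing) bandwidth allocation against strict prioritization, the kind of host-level traffic prioritization TensorLights proposes. All function names, flow sizes, and the bandwidth figure are illustrative assumptions.

```python
# Toy model (illustrative assumption, not TensorLights code): two colocated
# PS push flows contend for one host NIC. Under fair sharing, both flows
# slow each other down, so both jobs' workers straggle; under strict
# priority, the higher-priority flow finishes as if it had the NIC alone.

def fair_share(sizes, bandwidth):
    """Processor-sharing completion times for concurrent flows.

    Each active flow gets an equal fraction of the NIC bandwidth;
    the share grows as flows finish.
    """
    remaining = dict(enumerate(sizes))
    t, done = 0.0, {}
    while remaining:
        rate = bandwidth / len(remaining)      # equal share per active flow
        _, smallest = min(remaining.items(), key=lambda kv: kv[1])
        dt = smallest / rate                   # time until next flow finishes
        t += dt
        for j in list(remaining):
            remaining[j] -= rate * dt
            if remaining[j] <= 1e-9:           # flow j completed
                done[j] = t
                del remaining[j]
    return [done[i] for i in range(len(sizes))]

def strict_priority(sizes, bandwidth):
    """Higher-priority flows (lower index) get the full NIC first."""
    t, out = 0.0, []
    for s in sizes:
        t += s / bandwidth
        out.append(t)
    return out

sizes = [100.0, 100.0]   # MB of model updates pushed by each colocated PS
bw = 10.0                # MB/s of host NIC bandwidth

print(fair_share(sizes, bw))       # [20.0, 20.0] -> both jobs straggle
print(strict_priority(sizes, bw))  # [10.0, 20.0] -> job 1 unblocked at 10 s
```

In this toy setting, prioritization leaves the last flow's completion time unchanged (20 s) but lets the first job's workers proceed at 10 s instead of 20 s, so the average completion time drops; this mirrors the intuition behind the reported improvement, though the paper's actual gains come from real cluster experiments.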