Semi-dynamic load balancing

Chen Chen, Bo Li, Qizhen Weng, Wei Wang, Baochun Li

doi:10.1145/3419111.3421299

Abstract

Machine learning (ML) models are increasingly trained in clusters with non-dedicated workers possessing heterogeneous resources. In such scenarios, model training efficiency can be negatively affected by stragglers, that is, workers that run much slower than others. Efficient model training requires eliminating such stragglers, yet for modern ML workloads, existing load balancing strategies are inefficient or even infeasible. In this paper, we propose a novel strategy called semi-dynamic load balancing to eliminate stragglers in distributed ML workloads. The key insight is that ML workers should be load-balanced at iteration boundaries, so that balancing is non-intrusive to intra-iteration execution. Based on this insight we develop LB-BSP, an integrated worker coordination mechanism that adapts each worker's load to its instantaneous processing capability by right-sizing the sample batches at the synchronization barriers. We have custom-designed the batch sizing algorithm separately for CPU and GPU clusters based on their respective characteristics. LB-BSP has been implemented as a Python module for ML frameworks like TensorFlow and PyTorch. Our EC2 deployment confirms that LB-BSP is practical, effective and light-weight, and is able to accelerate distributed training by up to $54\%$.
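To make the idea of right-sizing batches at synchronization barriers concrete, here is a minimal sketch. It is not the LB-BSP algorithm from the paper (the abstract does not give its details, and the paper uses separate designs for CPU and GPU clusters). It assumes a simple proportional rule: each worker's next batch is sized in proportion to a throughput estimate (samples per second) measured during the previous iteration, while the global batch size stays fixed. The names `rebalance_batches` and `iteration_time` are hypothetical.

```python
from typing import List


def rebalance_batches(throughputs: List[float], global_batch: int) -> List[int]:
    """Split a fixed global batch across workers in proportion to throughput.

    Assumption: throughputs[i] is worker i's measured samples/second from the
    previous iteration. This is an illustrative rule, not the paper's algorithm.
    """
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("throughputs must be non-empty and positive")
    total = sum(throughputs)
    ideal = [global_batch * t / total for t in throughputs]
    sizes = [int(x) for x in ideal]  # round down first
    leftover = global_batch - sum(sizes)
    # Give the remaining samples to the workers with the largest fractional
    # parts, so the sizes always sum to exactly global_batch.
    order = sorted(range(len(ideal)), key=lambda i: ideal[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


def iteration_time(sizes: List[int], throughputs: List[float]) -> float:
    """Under a synchronization barrier, the slowest worker sets the pace."""
    return max(s / t for s, t in zip(sizes, throughputs))


# Hypothetical cluster: two fast workers and one straggler at half speed.
throughputs = [100.0, 100.0, 50.0]
equal_split = [100, 100, 100]
balanced = rebalance_batches(throughputs, global_batch=300)

print(iteration_time(equal_split, throughputs))  # straggler needs 100/50 = 2.0 s
print(balanced, iteration_time(balanced, throughputs))
```

In this toy setting the equal split is bound by the straggler, whereas the proportional split lets all three workers reach the barrier at about the same time. Because the sizes are recomputed only between iterations, nothing inside an iteration is interrupted, which is the sense in which the balancing is described as semi-dynamic.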

Similar Papers

Accelerating Distributed Learning in Non-Dedicated Environments
Chen Chen ... Qizhen Weng
IEEE Transactions on Cloud Computing | VOL. 11
01 Jan 2023

Optimizing Machine Learning Workloads in Collaborative Environments
Behrouz Derakhshan ... Ziawasch Abedjan
31 May 2020

Data Management and Visual Information Processing using Machine Learning
Arhath Kumar ... Shaik Vaseem Akram
14 Dec 2022

GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows
Tim Hegeman ... Animesh Trivedi
19 Apr 2021
