A Comprehensive Inspection of the Straggler Problem

Qihua Zhou, Minyi Guo, Song Guo, Haodong Lu, Kun Wang, Yanfei Sun, Li Li

doi:10.1109/mc.2021.3099211

Abstract

The parameter server is a popular paradigm for running distributed deep learning (DL) applications. As a growing number of DL models are trained on shared clusters, machines face a heterogeneous environment, which gives rise to stragglers: tasks that process unexpectedly slowly. Addressing stragglers is a crucial issue in distributed DL applications, since they significantly hamper system performance. Although many techniques have been deployed to mitigate stragglers, they may fall short in the presence of heterogeneity, where systems take much longer to reach DL training convergence than in a homogeneous environment, as evidenced by our experimental study. Using the methodology of straggler projection and an abstraction of parallelism, a new synchronization mechanism called elastic parallelism synchronous parallel (EPSP) is proposed, which exploits the iteration acceleration of stale synchronous parallel and overcomes the time wasted at barriers in bulk synchronous parallel. More precisely, EPSP supports both enforced and slack synchronization by adjusting the staleness parameter.
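
To make the role of the staleness parameter concrete, the following is a minimal sketch of the generic staleness bound used in stale synchronous parallel. It is not EPSP's actual mechanism, which the abstract does not detail. The function name `may_proceed`, the worker names, and the clock values are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def may_proceed(clocks: Dict[str, int], worker: str, staleness: int) -> bool:
    """Return True if `worker` may start its next iteration.

    Generic stale-synchronous-parallel rule (an assumption here, not taken
    from the paper): a worker may run ahead of the slowest worker by at
    most `staleness` completed iterations.
    """
    slowest = min(clocks.values())
    return clocks[worker] - slowest <= staleness


# Hypothetical cluster state: number of iterations each worker has completed.
# w2 is the straggler.
clocks = {"w0": 5, "w1": 5, "w2": 3}

# staleness = 0 behaves like a bulk-synchronous barrier (enforced sync):
# fast workers wait until the straggler catches up.
enforced = {w: may_proceed(clocks, w, staleness=0) for w in clocks}

# staleness = 2 is slack synchronization: fast workers keep going.
slack = {w: may_proceed(clocks, w, staleness=2) for w in clocks}
```

With a staleness of 0, only the straggler `w2` is allowed to proceed while the others wait at the barrier; with a staleness of 2, all three workers may continue. Adjusting this one parameter moves the system between the two regimes the abstract refers to.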

Similar Papers

Revisiting Resource Management for Deep Learning Framework
Erci Xu ... Shanshan Li
Electronics | VOL. 8
16 Mar 2019

Distributed Machine Learning based Mitigating Straggler in Big Data Environment
Haodong Lu ... Kun Wang
01 Jun 2021

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning
Xing Zhao ... Bao Xin Chen
01 Jul 2019

SHAT: A Novel Asynchronous Training Algorithm That Provides Fast Model Convergence in Distributed Deep Learning
Yunyong Ko ... Sang-Wook Kim
Applied Sciences | VOL. 12
29 Dec 2021

Journal: Computer
Publication Date: Oct 1, 2021
Citations: 3
