Petrel: Community-aware Synchronous Parallel for Heterogeneous Parameter Server

Qihua Zhou,Minyi Guo,Li Li,Peng Li,Yanfei Sun,Song Guo,Kun Wang

doi:10.1109/icdcs47774.2020.00132

Abstract

To address the impact of heterogeneity in distributed Deep Learning (DL) systems, most previous approaches prioritize the contribution of fast workers and reduce the involvement of slow workers, which leads to workload imbalance and computation inefficiency. We reveal that grouping workers into communities, an abstraction we propose, and handling parameter synchronization at the community level can overcome these limitations and accelerate training convergence. The idea of a community comes from our exploration of prior knowledge about the similarity between workers, which previous work often neglects. These observations motivate us to propose a new synchronization mechanism named Community-aware Synchronous Parallel (CSP), which uses Asynchronous Advantage Actor-Critic (A3C), a Reinforcement Learning (RL) based algorithm, to determine the community configuration and improve synchronization performance. The whole idea has been implemented in a system called Petrel, which achieves a good balance between convergence efficiency and communication overhead. Evaluation under different benchmarks demonstrates that our approach can effectively accelerate training convergence and reduce synchronization traffic.
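The community abstraction can be illustrated with a toy sketch. This is not Petrel's implementation: the abstract says the community configuration is determined by A3C, whereas the sketch below swaps in a fixed gap threshold on per-worker iteration time, and it assumes that synchronization within a community means averaging the members' gradients. The names `group_into_communities`, `community_average`, and `max_gap`, and all sample values, are hypothetical.

```python
from statistics import mean


def group_into_communities(iter_times, max_gap=0.2):
    """Group worker ids whose iteration times are similar.

    Hypothetical heuristic (not the A3C policy described in the abstract):
    sort workers by iteration time and start a new community whenever the
    relative gap to the previous worker exceeds max_gap.
    """
    ordered = sorted(iter_times, key=iter_times.get)
    if not ordered:
        return []
    communities = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if iter_times[cur] - iter_times[prev] > max_gap * iter_times[prev]:
            communities.append([cur])
        else:
            communities[-1].append(cur)
    return communities


def community_average(gradients, community):
    """Average gradients element-wise over the members of one community."""
    return [mean(vals) for vals in zip(*(gradients[w] for w in community))]


# Made-up iteration times (seconds) and two-element gradients per worker.
iter_times = {"w0": 1.0, "w1": 1.1, "w2": 2.0, "w3": 2.1}
gradients = {
    "w0": [1.0, 2.0],
    "w1": [3.0, 4.0],
    "w2": [5.0, 6.0],
    "w3": [7.0, 8.0],
}

communities = group_into_communities(iter_times)
updates = [community_average(gradients, c) for c in communities]
print(communities)
print(updates)
```

With these sample values the two fast workers and the two slow workers land in separate communities, so each community only waits for members of similar speed before it produces its averaged update.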
