Abstract

Training models on large-scale GPU-accelerated clusters is becoming commonplace due to the increasing complexity and size of deep learning models. One of the main challenges in distributed training is the collective communication overhead for large message sizes, up to hundreds of MB. In this paper, we propose two hierarchical distributed-memory multileader AllReduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab, in which the GPUs inside a compute node perform an intra-node communication phase to gather and store locally reduced values on designated GPUs (the node leaders). The node leaders then act as inter-node communicators: each leader exchanges one part of the reduced values with the leaders of the other nodes in parallel. Hence, we significantly reduce the time needed to inject data into the inter-node network. We also overlap the inter-node and intra-node communication by implementing our proposal in a pipelined manner. We evaluate these algorithms with the discrete-event simulator SimGrid. We show that our algorithms, lr_lr and lr_rab, cut down the execution time of an AllReduce microbenchmark that uses the logical ring algorithm (lr) by up to 45% and 51%, respectively. With the pipelined implementation, lr_lr_pipe achieves a 15% performance improvement over lr_lr. In addition, the simulation results project power savings for the network devices of up to 23% and 32%.
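
The following is a minimal sketch of the two-phase hierarchical AllReduce idea described above, written with mpi4py and NumPy. It is an illustrative assumption of one possible structure, not the authors' lr_lr or lr_rab implementation: here every GPU on a node acts as the leader for one chunk of the buffer, and the inter-node reduction among leaders is left to MPI's built-in Allreduce rather than an explicit logical ring or Rabenseifner schedule.

# Hypothetical sketch of a multileader hierarchical AllReduce (not the paper's code).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
# Intra-node communicator: all ranks (GPUs) placed on the same compute node.
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
# Leader communicator: ranks with the same local index across nodes, so each
# local rank serves as the leader for one chunk of the buffer.
leader_comm = comm.Split(color=node_comm.Get_rank(), key=comm.Get_rank())

def multileader_allreduce(sendbuf):
    """Hierarchical AllReduce: intra-node reduce-scatter to the leaders,
    inter-node AllReduce among leaders, then intra-node allgather.
    Assumes len(sendbuf) is divisible by the number of GPUs per node."""
    local_size = node_comm.Get_size()
    chunk = np.empty(len(sendbuf) // local_size, dtype=sendbuf.dtype)
    # Phase 1 (intra-node): each GPU ends up with the node-local sum of one chunk.
    node_comm.Reduce_scatter_block(sendbuf, chunk, op=MPI.SUM)
    # Phase 2 (inter-node): leaders holding the same chunk index on every node
    # reduce their chunks in parallel across the cluster.
    leader_comm.Allreduce(MPI.IN_PLACE, chunk, op=MPI.SUM)
    # Phase 3 (intra-node): gather all reduced chunks so every GPU has the result.
    recvbuf = np.empty_like(sendbuf)
    node_comm.Allgather(chunk, recvbuf)
    return recvbuf

if __name__ == "__main__":
    data = np.ones(1 << 20, dtype=np.float32)  # 4 MB per rank, for example
    result = multileader_allreduce(data)
    assert np.allclose(result, comm.Get_size())

Because each leader injects only its own chunk into the inter-node network, the per-link message size shrinks by a factor equal to the number of leaders per node, which is the effect the lr_lr and lr_rab algorithms exploit.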
