Collective Communication Performance Evaluation for Distributed Deep Learning Training

Sookwang Lee, Jaehwan Lee

doi:10.3390/app14125100

Abstract

In distributed deep learning, the improper use of the collective communication library can lead to a decline in deep learning performance due to increased communication time. Representative collective communication libraries such as MPI, GLOO, and NCCL exhibit varying performance based on server environment and communication architecture. In this study, we investigate three key aspects to evaluate the performance of the collective communication libraries in a distributed deep learning setting in an intra-node environment. First, we conduct a comparison and analysis of collective communication library performance within common distributed deep learning architectures, such as parameter servers and ring all-reduce methods. Second, we evaluate the performance of these libraries in different environments, including various container platforms and bare metal setups, considering the scalability and flexibility advantages offered by cloud virtualization. Last, to ensure practicality, we assess the libraries’ performance in a Linux shell and within the PyTorch framework. In the cross-docker virtualization environment, NCCL shows up to 213% higher latency compared to single docker, while GLOO exhibits 36% lower latency in single docker than in cross docker, and NCCL achieves up to 345% lower execution time in all-reduce operations compared to other libraries (MPI and GLOO). These findings will inform the selection of an appropriate collective communication library for designing effective distributed deep learning environments.
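To make the ring all-reduce pattern mentioned in the abstract concrete, the following is a minimal pure-Python simulation of a sum all-reduce over a logical ring. It is an illustrative sketch only: it is not how MPI, GLOO, or NCCL implement the operation, it runs in a single process with no real communication, and the function name `ring_all_reduce` and its list-of-lists input are assumptions made for this example.

```python
def ring_all_reduce(worker_data):
    """Simulate a sum all-reduce over a ring (hypothetical helper).

    worker_data: one equal-length list of numbers per simulated worker.
    Returns the per-worker buffers after the operation; every buffer
    ends up holding the element-wise sum of all inputs.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    if any(len(v) != length for v in worker_data):
        raise ValueError("all workers must hold vectors of the same length")

    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    bufs = [list(v) for v in worker_data]  # copy so inputs are not mutated

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it into its own buffer.
    for s in range(n - 1):
        # Snapshot all outgoing messages first to model simultaneous sends.
        msgs = [((r - s) % n, [bufs[r][i] for i in chunk((r - s) % n)])
                for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                bufs[dst][i] += v
    # Now worker r holds the fully reduced chunk (r + 1) % n.

    # Phase 2: all-gather. In step s, worker r forwards chunk (r + 1 - s) % n
    # to its right neighbour, which overwrites its copy with the final values.
    for s in range(n - 1):
        msgs = [((r + 1 - s) % n, [bufs[r][i] for i in chunk((r + 1 - s) % n)])
                for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                bufs[dst][i] = v

    return bufs


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
    for rank, buf in enumerate(ring_all_reduce(grads)):
        print(rank, buf)
```

In this sketch each worker sends only one chunk per step and exchanges data only with its ring neighbour, in contrast to a parameter-server layout where every worker sends its full vector to a central node.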

Journal: Applied Sciences
Publication Date: Jun 12, 2024
License type: CC BY 4.0

Similar Papers

Revisiting Resource Management for Deep Learning Framework
Erci Xu ... Shanshan Li
Electronics | VOL. 8
16 Mar 2019

Instance segmentation on distributed deep learning big data cluster
Mohammed Elhmadany ... Hossam E Abdelmunim
Journal of Big Data | VOL. 11
02 Jan 2024

NetShield: An in-network architecture against byzantine failures in distributed deep learning
Qingqing Ren ... Yujun Zhang
Computer Networks | VOL. 237
31 Oct 2023

Distributed Deep Reinforcement Learning: A Survey and a Multi-player Multi-agent Learning Toolbox
Qiyue Yin ... Kaiqi Huang
Machine Intelligence Research | VOL. 21
11 Jan 2024
