Effective Scheduler for Distributed DNN Training Based on MapReduce and GPU Cluster

Jie Xu, Qi Qi, Jianxin Liao, Di Yang, Jingyu Wang, Haifeng Sun

doi:10.1007/s10723-021-09550-6

Abstract

Parallel training accelerates Deep Neural Network (DNN) training by using multiple GPUs in parallel. However, when the GPUs are distributed across different nodes, in-memory data transfers become cross-node network transfers, which lengthens the training time. Most research addresses this by reducing the amount of data sent over network links, but ignores the factor of network distance. In this paper, we construct a distributed DNN training architecture based on MapReduce. A customized scheduler is designed to place the computation nodes that perform the training closer to the nodes that store the data. At the same time, the parallel training models are synchronized by adjusting the data transmission time. The experimental results show that the shortened network distance reduces network traffic usage. The resulting shorter data transmission time decreases the training time by at least 50% and guarantees synchronization for the parallel training.
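
The idea of placing computation close to the stored data can be illustrated with a minimal sketch. This is not the scheduler from the paper, whose details the abstract does not give. It is a simple greedy placement under stated assumptions: the hop counts (0 for the same node, 2 for the same rack, 4 across racks), the names `hop_distance` and `schedule`, and the node and task labels are all hypothetical.

```python
from typing import Dict, List


def hop_distance(a: str, b: str, rack_of: Dict[str, str]) -> int:
    """Assumed network distance: 0 on the same node, 2 within a rack, 4 across racks."""
    if a == b:
        return 0
    return 2 if rack_of[a] == rack_of[b] else 4


def schedule(
    tasks: Dict[str, List[str]],   # task name -> nodes holding a replica of its data
    free_slots: Dict[str, int],    # GPU node -> number of free training slots
    rack_of: Dict[str, str],       # node -> rack identifier
) -> Dict[str, str]:
    """Greedily assign each task to the free GPU node nearest to its data."""
    slots = dict(free_slots)       # copy so the caller's dict is not modified
    assignment: Dict[str, str] = {}
    for task, replicas in tasks.items():
        candidates = [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError(f"no free GPU slot for task {task!r}")
        # Pick the node with the smallest distance to any replica;
        # ties are broken by node name to keep the result deterministic.
        best = min(
            candidates,
            key=lambda n: (min(hop_distance(n, r, rack_of) for r in replicas), n),
        )
        assignment[task] = best
        slots[best] -= 1
    return assignment
```

For example, with nodes `n1` and `n2` in one rack, `n3` in another, one free slot per node, and two tasks whose data lives on `n1`, the first task runs on `n1` and the second falls back to `n2` in the same rack rather than crossing racks to `n3`.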


Journal: Journal of Grid Computing
Publication Date: Feb 22, 2021
Citations: 3

Similar Papers

A Framework for Distributed Deep Neural Network Training with Heterogeneous Computing Platforms
Bontak Gu ... Arslan Munir
01 Dec 2019

SoftMemoryBox II: A Scalable, Shared Memory Buffer Framework for Accelerating Distributed Training of Large-Scale Deep Neural Networks
Shinyoung Ahn ... Eunji Lim
IEEE Access | VOL. 8
01 Jan 2020

Priority-based parameter propagation for distributed deep neural network training
01 Jan 2019

Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
Fei Dai ... Haibo Zhang
21 Feb 2023
