Modeling the Training Iteration Time for Heterogeneous Distributed Deep Learning Systems

Yifu Zeng, Kenli Li, Guo Chen, Pulin Pan, Bowei Chen

doi:10.1155/2023/2663115

Abstract

Distributed deep learning systems have become an effective response to the growing demand for large-scale data processing in recent years. However, building such systems with powerful computing nodes requires significant investment and places a heavy financial burden on developers and researchers. It would therefore be valuable to predict the precise benefit, i.e., the speedup over training on a single machine (or a few machines), before actually building such a large learning system. To address this problem, this paper presents a novel performance model of training iteration time for heterogeneous distributed deep learning systems, based on the characteristics of the parameter server (PS) architecture with bulk synchronous parallel (BSP) synchronization. The accuracy of our performance model is demonstrated by comparing its predictions with real measurements on TensorFlow when training different neural networks on various hardware testbeds: the prediction accuracy is higher than 90% in most cases.
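
To make the quantities named in the abstract concrete, here is a minimal Python sketch of how a BSP iteration time, a speedup, and a prediction accuracy could be computed. It is not the paper's model, which the abstract does not spell out. It only encodes the general property of BSP that every worker waits at a barrier, so the slowest worker sets the iteration time. The `Worker` class, its fields, the throughput-based speedup definition (each worker processes one mini-batch of the same size as the single machine), and the relative-error definition of accuracy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical per-worker timings (seconds) for one training iteration.
    compute_s: float  # forward + backward pass on this worker's mini-batch
    comm_s: float     # pushing gradients to and pulling parameters from the PS


def bsp_iteration_time(workers):
    """Under BSP all workers synchronize at a barrier, so the slowest one sets the pace."""
    return max(w.compute_s + w.comm_s for w in workers)


def speedup(single_machine_s, workers):
    """Assumed throughput speedup: N mini-batches finish per distributed iteration."""
    return len(workers) * single_machine_s / bsp_iteration_time(workers)


def prediction_accuracy(predicted_s, measured_s):
    """Assumed definition: one minus the relative error of the predicted time."""
    return 1.0 - abs(predicted_s - measured_s) / measured_s


# Two heterogeneous workers: a fast node and a slower node (made-up numbers).
workers = [Worker(compute_s=0.10, comm_s=0.02), Worker(compute_s=0.20, comm_s=0.05)]
print(bsp_iteration_time(workers))
print(speedup(0.20, workers))
print(prediction_accuracy(0.9, 1.0))
```

In this sketch the slower node dominates the iteration time, which is why heterogeneity matters for the expected speedup: adding a fast node to a cluster gated by a slow one increases throughput less than the node count alone suggests.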

Journal: International Journal of Intelligent Systems
Publication Date: Feb 21, 2023
License type: CC BY 4.0

Similar Papers

Optimizing the Distributed Learning System with Accuracy Driven Dynamic Communication Frequency
Fengyuan Yang ... Tao Wei
24 Apr 2021

Self-Organizing Democratized Learning: Toward Large-Scale Distributed Learning Systems.
Minh N H Nguyen ... Choong Seon Hong
IEEE Transactions on Neural Networks and Learning Systems | VOL. 34
01 Dec 2023

Optimal Number of Edge Devices in Distributed Learning Over Wireless Channels
Jaeyoung Song ... Marios Kountouris
01 May 2020

An Autonomous Mobile Agent-Based Distributed Learning Architecture-A Proposal and Analytical
The Turkish Online Journal of Distance Education | VOL. 6
01 Jan 2004
