Abstract
Distributed stochastic gradient descent (SGD) algorithms are becoming popular for speeding up deep learning model training by employing multiple computational devices (called workers) in parallel. Top-k sparsification, a mechanism in which each worker communicates only a small number of the largest gradients (by absolute value) and accumulates the rest locally, is one of the most basic and prominent practices for reducing communication overhead. However, the theoretical implementation (Global Top-k SGD), which ignores the layer-wise structure of neural networks, has low training efficiency, since the top-k operation requires the full gradient and thus impedes the parallelism of computation and communication. The practical implementation (Layer-wise Top-k SGD) solves the parallelism problem, but hurts the performance of the trained model due to its deviation from the theoretically optimal solution. In this paper, we resolve this contradiction by introducing a Dynamic Layer-wise Sparsification (DLS) mechanism and its extensions, DLS(s). DLS(s) efficiently adjusts the sparsity ratios of the layers so that the upload threshold of each layer automatically tends toward the unified global one, thereby retaining the good performance of Global Top-k SGD and the high efficiency of Layer-wise Top-k SGD. The experimental results show that DLS(s) outperforms Layer-wise Top-k SGD in performance and performs close to Global Top-k SGD while requiring much less training time.
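To make the contrast between the two baselines concrete, the sketch below illustrates Top-k sparsification with local residual accumulation in both its global and layer-wise forms, as described in the abstract. It is a minimal illustration only: the function names (`topk_mask`, `global_topk`, `layerwise_topk`) and the use of NumPy are assumptions for exposition and are not taken from the paper or its implementation.

```python
# Minimal sketch of Top-k sparsification with local residual accumulation,
# contrasting the global and layer-wise variants. Names are illustrative.
import numpy as np

def topk_mask(values, k):
    """Boolean mask selecting the k largest entries by absolute value."""
    if k >= values.size:
        return np.ones(values.size, dtype=bool)
    idx = np.argpartition(np.abs(values), -k)[-k:]
    mask = np.zeros(values.size, dtype=bool)
    mask[idx] = True
    return mask

def global_topk(grads, residuals, k):
    """Global Top-k: pick k entries over the concatenation of all layers.
    Needs the whole gradient before communication can start."""
    corrected = [g + r for g, r in zip(grads, residuals)]
    flat = np.concatenate([c.ravel() for c in corrected])
    mask = topk_mask(flat, k)
    sparse, new_residuals, offset = [], [], 0
    for c in corrected:
        m = mask[offset:offset + c.size].reshape(c.shape)
        sparse.append(np.where(m, c, 0.0))         # communicated part
        new_residuals.append(np.where(m, 0.0, c))  # accumulated locally
        offset += c.size
    return sparse, new_residuals

def layerwise_topk(grads, residuals, ratio):
    """Layer-wise Top-k: pick a fixed fraction of entries per layer,
    so each layer can be sparsified and sent as soon as it is ready."""
    sparse, new_residuals = [], []
    for g, r in zip(grads, residuals):
        c = g + r
        k = max(1, int(ratio * c.size))
        m = topk_mask(c.ravel(), k).reshape(c.shape)
        sparse.append(np.where(m, c, 0.0))
        new_residuals.append(np.where(m, 0.0, c))
    return sparse, new_residuals
```

In this reading, the layer-wise variant fixes the sparsity ratio per layer, which implicitly imposes a different selection threshold on each layer; DLS(s), as summarized above, instead adapts the per-layer ratios so that these thresholds converge toward the single global one used by Global Top-k SGD.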