Abstract
Distributed stochastic gradient descent (SGD) algorithms are becoming popular for speeding up deep learning model training by employing multiple computational devices (called workers) in parallel. Top-k sparsification, a mechanism in which each worker communicates only a small number of the largest gradients (by absolute value) and accumulates the rest locally, is one of the most basic and prominent practices for reducing communication overhead. However, the theoretical implementation (Global Top-k SGD), which ignores the layer-wise structure of neural networks, has low training efficiency, since the top-k operation requires the full gradient and thus impedes the parallelism of computation and communication. The practical implementation (Layer-wise Top-k SGD) solves the parallelism problem, but hurts the performance of the trained model due to its deviation from the theoretically optimal solution. In this paper, we resolve this contradiction by introducing a Dynamic Layer-wise Sparsification (DLS) mechanism and its extensions, DLS(s). DLS(s) efficiently adjusts the sparsity ratios of the layers so that the upload threshold of each layer automatically tends toward the unified global one, thereby retaining the good performance of Global Top-k SGD and the high efficiency of Layer-wise Top-k SGD. The experimental results show that DLS(s) outperforms Layer-wise Top-k SGD in performance and performs close to Global Top-k SGD while requiring much less training time.
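To make the contrast between the two baselines concrete, the sketch below illustrates Top-k sparsification with local residual accumulation in both its global and layer-wise forms, as described in the abstract. It is a minimal illustration only: the function names (`topk_mask`, `global_topk`, `layerwise_topk`) and the use of NumPy are assumptions for exposition and are not taken from the paper or its implementation.

```python
# Minimal sketch of Top-k sparsification with local residual accumulation,
# contrasting the global and layer-wise variants. Names are illustrative.
import numpy as np

def topk_mask(values, k):
    """Boolean mask selecting the k largest entries by absolute value."""
    if k >= values.size:
        return np.ones(values.size, dtype=bool)
    idx = np.argpartition(np.abs(values), -k)[-k:]
    mask = np.zeros(values.size, dtype=bool)
    mask[idx] = True
    return mask

def global_topk(grads, residuals, k):
    """Global Top-k: pick k entries over the concatenation of all layers.
    Needs the whole gradient before communication can start."""
    corrected = [g + r for g, r in zip(grads, residuals)]
    flat = np.concatenate([c.ravel() for c in corrected])
    mask = topk_mask(flat, k)
    sparse, new_residuals, offset = [], [], 0
    for c in corrected:
        m = mask[offset:offset + c.size].reshape(c.shape)
        sparse.append(np.where(m, c, 0.0))         # communicated part
        new_residuals.append(np.where(m, 0.0, c))  # accumulated locally
        offset += c.size
    return sparse, new_residuals

def layerwise_topk(grads, residuals, ratio):
    """Layer-wise Top-k: pick a fixed fraction of entries per layer,
    so each layer can be sparsified and sent as soon as it is ready."""
    sparse, new_residuals = [], []
    for g, r in zip(grads, residuals):
        c = g + r
        k = max(1, int(ratio * c.size))
        m = topk_mask(c.ravel(), k).reshape(c.shape)
        sparse.append(np.where(m, c, 0.0))
        new_residuals.append(np.where(m, 0.0, c))
    return sparse, new_residuals
```

In this reading, the layer-wise variant fixes the sparsity ratio per layer, which implicitly imposes a different selection threshold on each layer; DLS(s), as summarized above, instead adapts the per-layer ratios so that these thresholds converge toward the single global one used by Global Top-k SGD.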