Abstract

Distributed deep learning trains large-scale neural network models on massive datasets using multiple workers. Since workers must frequently communicate with each other to exchange gradients for parameter updates, communication overhead is a major challenge in distributed deep learning. To cope with this challenge, gradient compression has been used to reduce the amount of data exchanged. However, existing compression methods, including both gradient quantization and gradient sparsification, either hurt model performance significantly or compress inefficiently. In this paper, we propose a novel approach, called Standard Deviation based Adaptive Gradient Compression (SDAGC), which simultaneously achieves low communication overhead and high model performance in synchronous training. SDAGC uses the standard deviation of the gradients in each layer of the neural network to dynamically compute a suitable threshold as training progresses. Moreover, several associated techniques, including residual gradient accumulation, local gradient clipping, adaptive learning rate revision, and momentum compensation, are integrated to guarantee model convergence. We verify the performance of SDAGC on various machine learning tasks: image classification, language modeling, and speech recognition. The experimental results show that, compared with other existing works, SDAGC achieves gradient compression ratios from 433× to 2021× with similar or even better accuracy.
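The abstract does not state the exact threshold rule, so the following is only a minimal sketch of the core idea under the assumption that each layer's threshold is a multiple k of the standard deviation of the residual-accumulated gradient; the names sdagc_sparsify and k are illustrative, not from the paper:

```python
import torch

def sdagc_sparsify(grad: torch.Tensor, residual: torch.Tensor, k: float):
    """Sketch of std-based threshold sparsification with residual accumulation.

    `k` is a hypothetical scaling factor; the paper derives its threshold
    adaptively from the training process, which the abstract does not specify.
    """
    # Add back the residual: gradients withheld in earlier iterations.
    acc = grad + residual
    # Per-layer adaptive threshold based on the standard deviation of the
    # accumulated gradients, per the abstract's description.
    threshold = k * acc.std()
    mask = acc.abs() >= threshold
    sparse_grad = acc * mask        # large gradients, communicated this step
    new_residual = acc * ~mask      # small gradients, accumulated locally
    return sparse_grad, new_residual

# Example: sparsify one layer's gradient across two steps.
g = torch.randn(1024)
r = torch.zeros_like(g)
sent, r = sdagc_sparsify(g, r, k=2.0)
```

In this reading, only gradients above the adaptive threshold are exchanged, while the rest accumulate in the residual until they grow large enough to be sent, which is how sparsification can reach high compression ratios without discarding small gradients outright.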
