Abstract

Asynchronous training based on the parameter server architecture is widely used to scale DNN training to large datasets and models. Communication has been identified as the major bottleneck when deploying DNN training on large-scale distributed deep learning systems. Recent studies try to reduce the communication traffic through gradient sparsification and quantization. We identify three limitations in previous studies. First, their fundamental guideline for gradient sparsification is the magnitude of the gradient. However, a gradient's magnitude represents the current optimization direction but does not indicate the significance of the corresponding parameter, which can delay updates to significant parameters. Second, their gradient quantization methods, applied to the entire model, often lead to error accumulation during gradient aggregation, since gradients from different layers of the DNN model follow different distributions. Third, previous quantization approaches are CPU intensive, which imposes a heavy overhead on the server. We propose MIPD, an adaptive, layer-wise gradient sparsification framework that compresses gradients based on model interpretability and the probability distribution of the gradients. MIPD compresses each gradient according to the significance of its parameter, as defined by model interpretability. An exponential smoothing method is also proposed to compensate for the dropped gradients on the server and reduce the gradient error. MIPD updates half of the parameters at each training step to reduce the CPU overhead of the server, and encodes gradients based on their probability distribution, thereby minimizing the approximation error. Extensive experiments on a GPU cluster show that the proposed framework improves the training performance of DNNs by up to 36.2% while maintaining high accuracy compared to state-of-the-art solutions. Accordingly, the CPU and network usage of the server drop by up to 42.0% and 32.7%, respectively.
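To make the sparsify-and-compensate idea concrete, the sketch below illustrates one plausible reading of the abstract: a single layer's gradients are kept or dropped according to a per-parameter significance score, and the dropped portion is folded into an exponentially smoothed residual that the server can apply later. This is a minimal illustration only; the function names, the `importance` array, the `keep_ratio`, and the smoothing factor `alpha` are assumptions for demonstration and are not taken from the MIPD paper.

```python
# Minimal sketch (not the authors' implementation) of layer-wise gradient
# sparsification with exponential-smoothing compensation for dropped gradients.
import numpy as np


def sparsify_layer(grad, importance, keep_ratio=0.5):
    """Keep the gradients of the most significant parameters in one layer.

    `importance` stands in for a per-parameter significance score (the paper
    derives significance from model interpretability); here it is just an
    input array supplied by the caller.
    """
    k = max(1, int(keep_ratio * grad.size))
    top_idx = np.argsort(importance.ravel())[-k:]       # indices of the top-k parameters
    mask = np.zeros(grad.size, dtype=bool)
    mask[top_idx] = True
    kept = np.where(mask.reshape(grad.shape), grad, 0.0)
    dropped = grad - kept                               # gradients that were not transmitted
    return kept, dropped


def smooth_residual(prev_residual, dropped, alpha=0.9):
    """Exponentially smooth the dropped gradients so they can be
    compensated for in later aggregation steps."""
    return alpha * prev_residual + (1.0 - alpha) * dropped


# Usage example on a single synthetic layer.
rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 4))
importance = rng.uniform(size=(4, 4))
residual = np.zeros_like(grad)

kept, dropped = sparsify_layer(grad, importance, keep_ratio=0.5)
residual = smooth_residual(residual, dropped)
update = kept + residual        # the compensated update the server would aggregate
print(update)
```

Because the mask is computed per layer, each layer's gradients can be compressed against their own distribution rather than a single model-wide threshold, which is the layer-wise aspect the abstract emphasizes.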
