Abstract

Gradient quantization has been widely used in distributed training of deep neural network (DNN) models to reduce communication costs. However, existing quantization methods overlook the fact that gradients follow a nonuniform distribution that changes over time, which can lead to significant gradient variance and therefore requires more quantization bits (and consequently higher communication cost) to keep validation accuracy as high as that of stochastic gradient descent (SGD). In this paper, we propose Cluster-Aware Sketch Quantization (CASQ), a novel sketch-based gradient quantization method for SGD. CASQ models the nonuniform distribution of gradients via clustering and adaptively allocates appropriate numbers of hash buckets to clusters, based on their statistics, to compress gradients. Extensive evaluation shows that, compared to existing quantization methods, CASQ-based SGD (i) achieves the same validation accuracy while decreasing the quantization level from 3 bits to 2 bits, and (ii) reduces the training time to convergence by up to 43% for the same training loss.
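
To make the clustering-plus-sketching idea concrete, the following is a minimal illustrative sketch in NumPy, not the paper's algorithm: it stands in quantile-based magnitude clustering for CASQ's clustering step, allocates count-sketch buckets to clusters in proportion to an assumed per-cluster second-moment statistic, and compresses each cluster with a plain count sketch. All function names (`casq_like_compress`, `casq_like_decompress`) and the specific clustering/allocation heuristics are hypothetical placeholders for the method described above.

```python
import numpy as np

def casq_like_compress(grad, num_clusters=4, total_buckets=256, seed=0):
    """Illustrative cluster-aware sketch compression (not the paper's exact algorithm).

    1. Cluster gradient coordinates by magnitude (simple quantile clustering here).
    2. Allocate hash buckets to clusters in proportion to each cluster's second
       moment (an assumed stand-in for the paper's statistics-based allocation).
    3. Compress each cluster with a count sketch (random bucket + random sign).
    """
    rng = np.random.default_rng(seed)
    flat = grad.ravel()
    mags = np.abs(flat)

    # Step 1: quantile-based "clusters" over gradient magnitudes.
    edges = np.quantile(mags, np.linspace(0, 1, num_clusters + 1)[1:-1])
    labels = np.digitize(mags, edges)

    # Step 2: bucket allocation proportional to per-cluster second moment.
    moments = np.array([np.sum(flat[labels == c] ** 2) + 1e-12
                        for c in range(num_clusters)])
    buckets = np.maximum(1, np.round(total_buckets * moments / moments.sum())).astype(int)

    # Step 3: count-sketch each cluster separately. In a real system the hash
    # functions would be derived from a shared seed, so only the bucket values
    # need to be communicated; here we keep them in `meta` for simplicity.
    sketches, meta = [], []
    for c in range(num_clusters):
        idx = np.flatnonzero(labels == c)
        b = buckets[c]
        h = rng.integers(0, b, size=idx.size)        # bucket hash
        s = rng.choice([-1.0, 1.0], size=idx.size)   # sign hash
        sk = np.zeros(b)
        np.add.at(sk, h, s * flat[idx])              # scatter-add signed values
        sketches.append(sk)
        meta.append((idx, h, s))
    return sketches, meta, flat.shape

def casq_like_decompress(sketches, meta, shape):
    """Unsketch: each coordinate reads back its signed bucket value."""
    out = np.zeros(shape)
    for sk, (idx, h, s) in zip(sketches, meta):
        out[idx] = s * sk[h]
    return out

# Usage: compress a synthetic heavy-tailed "gradient" and measure distortion.
g = np.random.standard_t(df=3, size=10_000)
sks, meta, shape = casq_like_compress(g)
g_hat = casq_like_decompress(sks, meta, shape)
print("relative L2 error:", np.linalg.norm(g - g_hat) / np.linalg.norm(g))
```

The point of the per-cluster allocation is that clusters holding most of the gradient energy receive more buckets, so their coordinates collide less often in the sketch and reconstruction variance is reduced where it matters most; the paper's actual clustering and allocation rules are defined in the full text.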
