As the size of deep neural network (DNN) models and datasets grows, distributed training has become popular for reducing training time. However, a severe communication bottleneck limits its scalability. Many methods aim to relieve this bottleneck by reducing communication traffic, such as gradient sparsification and quantization, but they either sacrifice model accuracy or introduce substantial computation overhead. We observe that the data distributions across layers of a neural network model are similar. We therefore propose a model parameter prediction method (MP2) to accelerate distributed DNN training under the parameter server (PS) framework: workers push only a subset of model parameters to the PS, and the remaining parameters are predicted locally on the PS by an already-trained deep neural network model. Two key challenges are addressed. First, we build a hierarchical parameter dataset by randomly sampling subsets of model parameters from normal distributed training runs. Second, we build a neural network model with a "convolution + channel attention + max pooling" structure for predicting model parameters, evaluated with a prediction-result-based method. For VGGNet, ResNet, and AlexNet on the CIFAR10 and CIFAR100 datasets, compared with Baseline, Top-k, deep gradient compression (DGC), and weight nowcaster network (WNN), MP2 reduces traffic by up to 88.98% and accelerates training by up to 47.32% without losing model accuracy. MP2 also shows good generalization.
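To make the "convolution + channel attention + max pooling" predictor concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the class names (`ChannelAttention`, `ParamPredictor`), the chunk length, channel counts, and input/output shapes are illustrative assumptions, since the abstract does not specify them.

```python
# Sketch of a "convolution + channel attention + max pooling" parameter predictor.
# All layer sizes and shapes below are assumptions for illustration only.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed variant)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),                     # squeeze over the length dimension
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length); reweight each channel by its attention score
        w = self.fc(x).unsqueeze(-1)
        return x * w


class ParamPredictor(nn.Module):
    """Convolution + channel attention + max pooling over a flattened parameter chunk."""

    def __init__(self, chunk_len: int = 1024, channels: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            ChannelAttention(channels),
            nn.MaxPool1d(kernel_size=2),
        )
        self.head = nn.Linear(channels * (chunk_len // 2), chunk_len)

    def forward(self, params_chunk: torch.Tensor) -> torch.Tensor:
        # params_chunk: (batch, chunk_len) slice of observed model parameters
        x = params_chunk.unsqueeze(1)          # -> (batch, 1, chunk_len)
        x = self.features(x)
        return self.head(x.flatten(1))         # predicted parameters for the residual chunk


# Usage: predict a 1024-parameter chunk on the PS from an observed chunk pushed by a worker.
model = ParamPredictor()
observed = torch.randn(8, 1024)
predicted = model(observed)                    # shape: (8, 1024)
```

In this reading, the predictor runs on the PS: workers push only the observed chunks, and the PS fills in the remaining parameters from the predictor's output, which is the traffic-reduction mechanism the abstract describes.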