Abstract

Building a distributed deep learning (DDL) system on HPC clusters that guarantees both convergence speed and scalability for DNN training is challenging. An HPC cluster consisting of multiple high-density multi-GPU servers connected by an InfiniBand network (HDGib) compresses the computing and communication time of distributed DNN training, but it also brings new challenges: the convergence time remains far from linearly scalable with respect to the number of workers. We therefore analyze the optimization process and identify three key issues that cause this scalability degradation. First, the high-frequency parameter updates enabled by the compressed computing and communication times exacerbate the stale gradient problem, which slows down convergence. Second, previous methods for constraining the gradient noise (stochastic error) of SGD are outdated; the InfiniBand connections of HDGib clusters can support stricter constraints that further bound the stochastic error. Third, using the same learning rate for all workers is inefficient because each worker is at a different training stage. We thus propose a momentum-driven adaptive synchronization model that addresses these issues and accelerates training on HDGib clusters. Our adaptive k-synchronization algorithm uses the momentum term to absorb stale gradients and adaptively bounds the stochastic error, providing an approximately optimal descent direction for distributed SGD. The model also includes an individual dynamic learning-rate search method for each worker to further improve training performance; compared with previous linear and exponential decay methods, it provides a more precise descent distance for distributed SGD at each training stage. Extensive experimental results indicate that the proposed model effectively improves the training performance of CNNs, retaining high accuracy with speed-ups of up to 57.76% and 125.3% on CPU-based and GPU-based clusters, respectively.
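The abstract does not spell out the update rule, but the idea of a momentum term absorbing stale gradients and a per-worker learning rate can be illustrated with a minimal sketch. The following Python snippet is an assumption-laden toy (a plain momentum-SGD step on a quadratic loss, with hypothetical per-worker learning rates and an artificially stale gradient), not the authors' adaptive k-synchronization algorithm.

import numpy as np

def momentum_step(param, grad, velocity, lr, beta=0.9):
    """One momentum-SGD update; the velocity buffer averages the current
    (possibly stale) gradient with past directions, damping its noise."""
    velocity = beta * velocity + (1.0 - beta) * grad
    return param - lr * velocity, velocity

# Toy quadratic loss f(w) = 0.5 * ||w||^2, so the true gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
worker_lrs = [0.10, 0.05]            # hypothetical per-worker learning rates
stale_grad = np.array([0.9, -1.8])   # gradient computed from an older copy of w

for step in range(6):
    lr = worker_lrs[step % len(worker_lrs)]
    grad = w if step % 2 == 0 else stale_grad  # every other step applies a stale gradient
    w, v = momentum_step(w, grad, v, lr)
    print(step, w)

The point of the sketch is only that the momentum buffer blends stale and fresh gradients into a smoother descent direction, which is the intuition the abstract attributes to the proposed method.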
