Abstract
Synchronous data-parallel strategies are widely used for distributed training of Deep Neural Networks (DNNs), largely because they are easy to implement and still perform well. In these strategies, workers with different computational capabilities must wait for one another at each gradient or weight synchronization. High-performance workers therefore sit idle while waiting for slower workers, which lowers the efficiency of the cluster. In this paper, we propose a Dynamic Load Balance (DLB) strategy for distributed DNN training to address this issue. Specifically, the performance of each worker is first evaluated from its behavior during the previous training epochs, and the batch size and dataset partition are then adaptively adjusted according to the workers' current performance. As a result, the waiting cost among workers is eliminated and cluster utilization is substantially improved. We also provide a theoretical analysis to justify the convergence of the proposed algorithm. Extensive experiments are conducted on the CIFAR10 and CIFAR100 benchmark datasets with four state-of-the-art DNN models. The experimental results indicate that the proposed algorithm significantly improves the utilization of the distributed cluster. In addition, the proposed algorithm keeps the load balance of distributed DNN training robust to disturbances and can be used flexibly in conjunction with other synchronous distributed DNN training methods.
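The adjustment step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so the code below assumes a simple proportional scheme: each worker's speed is estimated from its batch size and wall-clock time in the previous epoch, the fixed global batch size is split in proportion to those speeds, and the dataset is partitioned in proportion to the new batch sizes so that every worker runs the same number of iterations. The function names (`proportional_split`, `rebalance`) and the largest-remainder rounding are illustrative assumptions, not the authors' implementation.

```python
def proportional_split(total, weights):
    """Split an integer `total` into integer shares proportional to `weights`.

    Uses largest-remainder rounding so the shares always sum to `total`.
    """
    weight_sum = sum(weights)
    raw = [total * w / weight_sum for w in weights]
    shares = [int(r) for r in raw]  # floor of each ideal share
    leftover = total - sum(shares)
    # Hand the leftover units to the workers with the largest fractional parts.
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - shares[i], reverse=True)
    for i in by_fraction[:leftover]:
        shares[i] += 1
    return shares


def rebalance(prev_batch_sizes, prev_epoch_times, global_batch_size, dataset_size):
    """Assumed proportional rule: faster workers get larger batches and partitions.

    With the same number of iterations per worker, epoch time scales as
    batch_size / speed, so batch_size / epoch_time is a relative speed estimate.
    Note: a very slow worker can receive a batch size of 0 under this rule;
    a real system would likely enforce a minimum.
    """
    speeds = [b / t for b, t in zip(prev_batch_sizes, prev_epoch_times)]
    batch_sizes = proportional_split(global_batch_size, speeds)
    partition_sizes = proportional_split(dataset_size, batch_sizes)
    return batch_sizes, partition_sizes


if __name__ == "__main__":
    # Two workers with equal batches; the second took three times as long.
    batches, partitions = rebalance([64, 64], [10.0, 30.0], 128, 50000)
    print(batches, partitions)
```

Keeping the global batch size fixed is a deliberate choice in this sketch: it leaves the effective optimization step unchanged while only the per-worker shares move.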