Abstract

Large-scale distributed deep learning is of great importance in a wide range of applications. In data-parallel distributed training systems, limited hardware resources (e.g., GPU memory and interconnection bandwidth) often become a performance bottleneck, so multiple resources must be fully utilized at the same time, especially for extreme-scale deep neural networks. Although two different types of strategies, based on memory management and on sparse communication, have been proposed to reduce resource usage, a naive combination of the two is impractical, since they cannot coexist successfully. We therefore consider collaborative optimization of both system memory and bandwidth resources, and propose LaySA, a layer-centric memory-efficient distributed sparse communication mechanism. First, to tackle the memory ballooning caused by sparse communication, the existing memory reuse strategy is refined, and the data objects targeted by the memory optimization are augmented and redefined. Second, a mirror weight update mechanism is proposed to resolve the contradiction between memory management and sparse communication optimization for weight gradients. Our scheme, which deeply integrates and collaboratively executes these two types of strategies, fills the gap in multi-resource optimization for distributed GPU-based training systems. Our experimental results show that the proposed collaborative optimization significantly alleviates the memory pressure on the computing nodes and improves both the resource utilization and the throughput of distributed training systems. Compared with baseline systems that use only a single strategy, LaySA reduces system memory usage by up to 80.5% and shortens the overall training time of neural network models on a single GPU by about 12.25%. Furthermore, LaySA allows the training batch size to be scaled up substantially during distributed training, with the overall throughput increasing by more than 150%, so our approach outperforms current systems that use memory or communication optimization mechanisms alone.
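
The combination described above, sparse gradient communication coexisting with memory-aware weight handling, can be pictured with a small sketch. Below is a minimal, illustrative example of top-k gradient sparsification with a local residual buffer plus a dense "mirror" copy of the weights that absorbs the exchanged sparse updates; it is not LaySA's actual implementation, and the names `sparsify_gradient`, `apply_sparse_update`, and `top_k_ratio` are hypothetical.

```python
import numpy as np

# Illustrative sketch only: top-k sparsified gradients with a local residual
# buffer, plus a dense "mirror" weight copy that absorbs the sparse updates.
# Shapes, names, and hyperparameters are hypothetical, not from the paper.

def sparsify_gradient(grad, residual, top_k_ratio=0.01):
    """Keep only the largest-magnitude entries; accumulate the rest locally."""
    grad = grad + residual                       # add back previously dropped mass
    k = max(1, int(top_k_ratio * grad.size))
    idx = np.argpartition(np.abs(grad).ravel(), -k)[-k:]
    values = grad.ravel()[idx]
    new_residual = grad.copy()
    new_residual.ravel()[idx] = 0.0              # dropped entries stay in the residual
    return idx, values, new_residual

def apply_sparse_update(mirror_weight, idx, values, lr=0.1):
    """Apply the exchanged sparse gradient to a dense mirror of the weights."""
    flat = mirror_weight.ravel()                 # view into the dense mirror copy
    flat[idx] -= lr * values
    return mirror_weight

# Toy usage: one layer's weight gradient on a single worker.
rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 64))
grad = rng.normal(size=(64, 64))
residual = np.zeros_like(grad)
idx, values, residual = sparsify_gradient(grad, residual)
weight = apply_sparse_update(weight, idx, values)
```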

Highlights

  • Deep Learning (DL) has become one of the most promising techniques in the field of machine learning

  • Intermediate data characteristics: since this work focuses on improving training efficiency while preserving the representation ability of the Deep Neural Network (DNN) model, the components of the intermediate data generated during training are analyzed (a rough per-layer accounting is sketched after this list)

  • Design of LaySA: the main objectives of LaySA are to resolve the contradiction between memory management and sparse communication optimization, and to improve both system resource utilization and training efficiency
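
The second highlight refers to analyzing the intermediate data; below is a rough, illustrative per-layer accounting of that footprint, assuming the four kinds of intermediate data are weights, weight gradients, feature maps, and feature-map gradients (only the weight definition appears in this extract, so the taxonomy, layer shapes, and batch size here are assumptions, not measurements from the paper).

```python
# Illustrative accounting of per-layer intermediate data kept during training.
# Assumes four kinds of intermediate data: weights, weight gradients, feature
# maps, and feature-map gradients; all sizes below are made up.

BYTES_PER_FLOAT = 4  # assuming FP32 storage

def layer_footprint(batch, in_features, out_features):
    weights       = in_features * out_features
    weight_grads  = weights                     # same shape as the weights
    feature_map   = batch * out_features        # activations kept for backprop
    feature_grads = feature_map                 # gradients w.r.t. activations
    return (weights + weight_grads + feature_map + feature_grads) * BYTES_PER_FLOAT

# Stacking identical layers: the footprint grows linearly with depth.
batch, width, depth = 64, 1024, 50
total_bytes = sum(layer_footprint(batch, width, width) for _ in range(depth))
print(f"approx. intermediate data: {total_bytes / 2**20:.1f} MiB for {depth} layers")
```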

Summary

Introduction

Deep Learning (DL) has become one of the most promising techniques in the field of machine learning. Data parallelism, a common choice in distributed DL, has each node execute a training sub-task: completing the feed-forward pass, computing its local gradients, and updating the model parameters uniformly by exchanging gradients with the other nodes. This type of workflow requires the accelerators to provide the necessary hardware resources.
- Weight: This is a parameter within a neural network that transforms input data, and represents the strength of the connection between layers.
In a multi-layer neural network, every layer (except for the lowest input layer and the highest objective-function layer) generates these four types of intermediate data and stores them during the feed-forward process. The memory overhead of this intermediate data increases linearly with the number of layers.
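
The data-parallel workflow summarized above can be made concrete with a small simulation. The sketch below, assuming a plain least-squares model and an in-process averaging step that stands in for the real all-reduce exchange, shows one synchronous step: each simulated node computes a local gradient on its shard, the gradients are averaged, and every replica applies the same update. Worker count, shapes, and function names are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal sketch of one synchronous data-parallel training step.  The
# "all-reduce" is simulated by averaging in-process; no real communication
# library is used, and all names and shapes are illustrative.

rng = np.random.default_rng(0)

def local_gradient(w, x, y):
    """Least-squares gradient computed on this node's shard of the batch."""
    err = x @ w - y
    return x.T @ err / len(x)

# The model is replicated; each simulated node holds its own data shard.
w = rng.normal(size=(8, 1))
shards = [(rng.normal(size=(16, 8)), rng.normal(size=(16, 1))) for _ in range(2)]

for step in range(3):
    grads = [local_gradient(w, x, y) for x, y in shards]  # per-node backward pass
    avg_grad = sum(grads) / len(grads)                    # simulated all-reduce
    w -= 0.1 * avg_grad                                   # identical update on every replica
```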
