Abstract

Graphics Processing Units (GPUs) have evolved into powerful accelerators for the development of Convolutional Neural Network (CNN) models. Most existing GPU-based frameworks adopt a kernel-based execution approach and focus only on optimizing individual kernels for better performance and resource utilization. Under this approach, the kernels involved are launched sequentially, which may leave GPU resources underutilized because the optimization space of a single kernel is limited. In this paper, we propose an efficient software parallelization framework, called HGP4CNN, to accelerate the training of CNN models by considering the characteristics of workloads from both the same layer and adjacent layers, as well as newer GPU features such as concurrent kernel execution. At the intra-layer level, to improve the training performance of a single network layer, we design a novel model-based lightweight parallelization module that makes better use of the concurrent kernel execution feature on modern GPUs. An asynchronous resource tracker collects kernel information at runtime, and a kernel analyzer computes the number of kernels that can be dispatched concurrently. Moreover, to avoid consuming excessive CPU thread or process resources, we integrate a runtime scheduler module for kernel launch and a pool-based stream manager for GPU work queue management. At the inter-layer level, we present a pipeline execution strategy that overlaps the processing of workloads from adjacent layers; the number of samples processed by a single pipeline stage is determined using the analysis result from the intra-layer module. Finally, we implement a prototype of the proposed framework on top of Caffe, a well-known deep learning framework, and conduct experiments with four off-the-shelf CNN models on three NVIDIA GPUs.
Results show that HGP4CNN achieves better performance than the original implementation while preserving the convergence properties of the networks. We achieve speedups of up to 6.X for a single convolutional layer and 2.X for multiple layers executed in pipelines within a network model.
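The kernel analyzer described above decides how many kernels may be dispatched concurrently by comparing each kernel's runtime resource footprint against the GPU's capacity. A minimal sketch of that idea, assuming a simple per-resource budget model (the function name, resource names, and all numbers below are hypothetical, not taken from the paper's implementation):

```python
# Illustrative sketch of a kernel analyzer: the concurrency bound is the
# minimum, over all tracked resources, of (resource budget // per-kernel demand).
# All identifiers and figures here are hypothetical.

def max_concurrent_kernels(gpu_limits, kernel_demand):
    """Return how many copies of a kernel fit within the GPU's resource limits.

    gpu_limits maps a resource name (e.g. registers, shared memory, blocks)
    to its available budget; kernel_demand maps the same names to one
    kernel launch's usage of that resource.
    """
    counts = []
    for resource, limit in gpu_limits.items():
        demand = kernel_demand.get(resource, 0)
        if demand > 0:
            counts.append(limit // demand)
    # If no tracked resource constrains the kernel, fall back to one launch.
    return min(counts) if counts else 1

# Hypothetical budget and per-kernel footprint:
limits = {"registers": 65536, "shared_mem_bytes": 98304, "blocks": 32}
demand = {"registers": 16384, "shared_mem_bytes": 24576, "blocks": 4}
print(max_concurrent_kernels(limits, demand))  # -> 4 (registers and shared memory both bind)
```

In this sketch the result would then tell the runtime scheduler how many streams from the pool-based stream manager to use for concurrent launches; the actual analyzer in HGP4CNN additionally relies on kernel information gathered at runtime by the asynchronous resource tracker.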
