Abstract

Convolutional neural networks (CNNs) extract features from large-scale data through a multilayer network structure. Owing to this effectiveness, CNNs have achieved great success in fields such as computer vision and speech analysis. However, CNN training is challenging because computing the gradients through multiple layers is time consuming. In this paper, we propose to accelerate the gradient computation of the convolutional layer with a CPU+MIC heterogeneous computing technique. Specifically, we profile the time cost of computing the gradients of every layer in the Caffe framework and find that the convolutional layer dominates the overall computational overhead. Based on this observation, we implement the intensive matrix operations of the convolutional layers on the MIC coprocessor with OpenMP within the Caffe framework. To fully utilize the threads provided by the MIC, we employ two types of threads, data threads and MKL threads, and give a thread-setting strategy supported by both theoretical and empirical analysis. We evaluate our acceleration method on several typical CNN models trained on the ImageNet dataset, and show that it speeds up the computation of the convolutional layer by about 6.8 times, and the overall training by about 5.8 times, compared with a single CPU.
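The "intensive matrix manipulations" the abstract refers to come from the way Caffe lowers convolution to a general matrix multiplication (GEMM) via im2col, which is exactly the kind of workload MKL can parallelize on the MIC coprocessor. A minimal pure-Python sketch of this lowering (single channel, stride 1, no padding, cross-correlation as in deep-learning frameworks; not code from the paper):

```python
# Illustrative sketch, not Caffe source: lowering a 2-D convolution to
# im2col + GEMM, the matrix form that runs on the MIC coprocessor.

def im2col(x, k):
    """Unroll every k x k patch of the 2-D input x into one row of a matrix."""
    h, w = len(x), len(x[0])
    return [[x[i + di][j + dj] for di in range(k) for dj in range(k)]
            for i in range(h - k + 1)
            for j in range(w - k + 1)]

def matmul(a, b):
    """Naive GEMM (in Caffe this call is dispatched to MKL's sgemm)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def conv2d(x, kernel):
    """Convolution expressed as im2col followed by a matrix product."""
    k = len(kernel)
    cols = im2col(x, k)                                        # (out_h*out_w) x (k*k)
    w = [[kernel[i][j]] for i in range(k) for j in range(k)]   # (k*k) x 1
    flat = matmul(cols, w)
    out_w = len(x[0]) - k + 1
    return [[flat[r * out_w + c][0] for c in range(out_w)]
            for r in range(len(x) - k + 1)]
```

Because both the forward pass and the gradient computations of the convolutional layer reduce to such matrix products, offloading the GEMM to the coprocessor accelerates precisely the step that dominates training time.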
