Abstract

Deep convolutional neural networks (CNNs) are widely adopted in intelligent systems with unprecedented accuracy, but at the cost of substantial data movement. Although the emerging processing-in-memory (PIM) architecture seeks to minimize data movement by placing processing elements near memory, memory remains the major bottleneck in the entire system. The hyper-parameters selected when training CNN applications demand hundreds of kilobytes of cache capacity for the concurrent processing of convolutions. How to jointly exploit the computational capability of the PIM architecture and the highly parallel nature of neural networks remains a critical issue. This paper presents Para-Net, which exploits Parallelism for deterministic convolutional neural Networks on the PIM architecture. Para-Net achieves data-level parallelism for convolutions by fully utilizing the on-chip processing engines (PEs) in PIM. The objective is to capture the characteristics of neural networks and present a hardware-independent design that jointly optimizes the scheduling of both intermediate results and computation tasks. We formulate this data allocation problem as a dynamic programming model and obtain an optimal solution. To demonstrate the viability of Para-Net, we conduct a set of experiments on a variety of realistic CNN applications, with graph abstractions obtained from the deep learning framework Caffe. Experimental results show that Para-Net significantly reduces processing time and improves cache efficiency compared with representative schemes.
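The abstract does not spell out the dynamic programming model, but a data allocation problem of this kind can often be cast as a knapsack-style dynamic program: given a limited on-chip cache, choose which intermediate results to keep near the PEs so that the data movement avoided is maximized. The sketch below is a minimal illustration of that idea; the function name, the cost model, and all sizes are illustrative assumptions, not Para-Net's actual formulation.

# Hypothetical sketch: casting on-chip data allocation as a 0/1-knapsack
# dynamic program. All names and numbers are illustrative assumptions;
# the paper's actual Para-Net model may differ.

def allocate(sizes, savings, capacity):
    """Pick which intermediate results to keep in the on-chip cache.

    sizes[i]   -- cache units (here KB) needed to keep result i on-chip
    savings[i] -- data movement avoided if result i stays on-chip
    capacity   -- total on-chip cache capacity (same unit as sizes)

    Returns (best_savings, kept), where kept lists the chosen results.
    """
    n = len(sizes)
    dp = [0] * (capacity + 1)     # dp[c]: best savings using c units of cache
    took = [[False] * (capacity + 1) for _ in range(n)]
    for i in range(n):
        # Iterate capacities downwards so each result is kept at most once.
        for c in range(capacity, sizes[i] - 1, -1):
            if dp[c - sizes[i]] + savings[i] > dp[c]:
                dp[c] = dp[c - sizes[i]] + savings[i]
                took[i][c] = True
    kept, c = [], capacity        # backtrack to recover the optimal choices
    for i in range(n - 1, -1, -1):
        if took[i][c]:
            kept.append(i)
            c -= sizes[i]
    return dp[capacity], sorted(kept)

if __name__ == "__main__":
    # Three intermediate feature maps (96, 128, 64 KB) and a 256 KB cache:
    # keeping maps 0 and 1 fits and avoids the most data movement.
    print(allocate([96, 128, 64], [40, 70, 30], 256))   # -> (110, [0, 1])

Note that this knapsack view only covers the placement half of the problem; Para-Net's model also schedules the computation tasks jointly with the intermediate results, which this sketch does not attempt.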
