Abstract

Deep convolutional neural networks (CNNs) are rapidly becoming the dominant approach to computer vision and a major component of many other pervasive machine learning tasks, such as speech recognition, natural language processing, and fraud detection. As a result, accelerators for efficiently evaluating CNNs are rapidly growing in popularity. The conventional approach to designing such CNN accelerators is to process the CNN layers iteratively, one layer at a time. However, by processing each layer to completion, these designs must use off-chip memory to store intermediate data between layers, because the intermediate data are too large to fit on chip. In this work, we observe that a previously unexplored dimension exists in the design space of CNN accelerators: the dataflow across convolutional layers. We find that we can fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip, enabling the caching of intermediate data between the evaluation of adjacent CNN layers. We demonstrate the effectiveness of our approach by constructing a fused-layer CNN accelerator for the first five convolutional layers of the VGGNet-E network and comparing it to the state-of-the-art accelerator implemented on a Xilinx Virtex-7 FPGA. We find that, by using 362KB of on-chip storage, our fused-layer accelerator minimizes off-chip feature map data transfer, reducing the total transfer by 95%, from 77MB to 3.6MB per image.
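To make the fusion idea concrete, below is a minimal single-channel software sketch, not the paper's hardware design. It fuses two valid convolutions over one output tile: only the input region (the "pyramid") that feeds that tile is loaded, and the layer-1 intermediate tile stays in simulated on-chip storage instead of being written to off-chip memory. The function names, tile size, and NumPy implementation are illustrative assumptions.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2D convolution of a single-channel image x with kernel w."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
    return out

def fused_two_layer_tile(image, w1, w2, tile, ti, tj):
    """Compute one tile x tile block of the layer-2 output starting at
    (ti, tj), loading only the input region that feeds it. The layer-1
    intermediate tile is kept 'on chip' (a local array) and never
    materialized as a full feature map."""
    k1, k2 = w1.shape[0], w2.shape[0]
    # A tile x tile layer-2 output needs a (tile+k2-1)^2 region of the
    # layer-1 output, which needs a (tile+k2-1+k1-1)^2 input region.
    in_size = tile + (k2 - 1) + (k1 - 1)
    region = image[ti:ti+in_size, tj:tj+in_size]  # input pyramid base
    inter = conv2d(region, w1)   # intermediate tile, cached on chip
    return conv2d(inter, w2)     # layer-2 output tile
```

A quick check under these assumptions: a fused tile should match the corresponding block of the unfused, layer-by-layer result.

```python
img = np.random.rand(16, 16)
w1, w2 = np.random.rand(3, 3), np.random.rand(3, 3)
full = conv2d(conv2d(img, w1), w2)          # layer-at-a-time reference
t = fused_two_layer_tile(img, w1, w2, tile=4, ti=0, tj=0)
assert np.allclose(t, full[0:4, 0:4])
```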
