Abstract

Recently, FPGAs have been widely used to implement hardware accelerators for Convolutional Neural Networks (CNNs), especially on mobile and embedded devices. However, most existing accelerators are designed with the same concept as their ASIC counterparts: operations from all CNN layers are mapped to the same hardware units and executed in a time-multiplexed fashion. Although this approach improves the generality of these accelerators, it does not take full advantage of the reconfigurability and customizability of FPGAs, leading to a loss of computational efficiency that is even more pronounced on embedded platforms. In this paper, we propose an FPGA-based CNN accelerator in which every layer is mapped to its own on-chip unit, with all units working concurrently as a pipeline. We propose a strategy that finds an optimized parallelization scheme for each layer, eliminating pipeline stalls and achieving high resource utilization. In addition, a balanced pruning-based method is applied to the fully connected (FC) layers to reduce computational redundancy. As a case study, we implement a widely used CNN model, LeNet-5, on an embedded FPGA device, the Xilinx Zedboard. It achieves a peak performance of 39.78 GOP/s and a power efficiency of 19.6 GOP/s/W, outperforming previous approaches.
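The per-layer parallelization strategy mentioned above can be illustrated with a minimal sketch. Since the paper's actual optimization procedure is not given in the abstract, the greedy allocation below is an assumption: it assigns compute units to whichever layer is currently the pipeline bottleneck, so that all stages end up with similar latency and no stage stalls the others. The layer operation counts are made up for illustration and are not taken from the paper.

```python
# Hypothetical sketch: balance a layer-wise pipeline by choosing a parallelism
# factor per layer so that stage latencies (ops / units) are roughly equal.
# Layer op counts are illustrative, NOT the paper's LeNet-5 figures.

def balance_pipeline(layer_ops, total_units):
    """Greedily assign compute units to layers to minimize the slowest stage.

    layer_ops   -- operations (e.g. MACs) per layer
    total_units -- total parallel compute units available on the FPGA
    Returns a list of per-layer parallelism factors summing to total_units.
    """
    factors = [1] * len(layer_ops)          # every layer needs at least one unit
    remaining = total_units - len(layer_ops)
    while remaining > 0:
        # Current stage latencies; the pipeline runs at the slowest stage's rate.
        latencies = [ops / f for ops, f in zip(layer_ops, factors)]
        bottleneck = latencies.index(max(latencies))
        factors[bottleneck] += 1            # give the bottleneck one more unit
        remaining -= 1
    return factors

if __name__ == "__main__":
    ops = [240_000, 1_600_000, 48_000, 10_000]  # made-up per-layer MAC counts
    factors = balance_pipeline(ops, 32)
    print(factors)
```

Under this scheme the largest layer receives the most units, so every stage finishes in roughly the same time and the pipeline never waits on a single slow layer.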
