Abstract

Convolutional neural networks (CNNs) are widely used in various computer vision applications. Recently, many studies have investigated FPGA-based CNN accelerators to achieve high performance and power efficiency. Most of them target CNN-based object detection algorithms, whereas research on image super-resolution has rarely been conducted. Fast super-resolution CNN (FSRCNN), a well-known CNN-based super-resolution algorithm, is a combination of multiple convolutional layers and a single deconvolutional layer. Since the deconvolutional layer generates high-resolution (HR) output feature maps from low-resolution (LR) input feature maps, it requires more execution cycles than a convolutional layer. In this paper, we propose a novel FPGA-based CNN accelerator architecture with efficient parallelization. We develop a method of transforming a deconvolutional layer into a convolutional layer (TDC), a new methodology for deconvolutional neural networks (DCNNs). The deconvolutional layer contains a massive source of parallelism: multiple outputs within the same output feature map are computed from the same inputs. When this parallelization technique is applied, the deconvolutional layer produces LR output feature maps in the same manner as a convolutional layer. The accelerator's performance therefore increases without any additional hardware resources, because the kernel size required to generate each LR output feature map is smaller. In addition, when the deconvolutional layer suffers from DSP underutilization, i.e., some of the processing elements sit idle, the proposed method resolves the problem by processing more output feature maps in parallel. Experimental results show that the proposed TDC method achieves up to 81 times higher throughput than the state-of-the-art DCNN accelerator with the same hardware resources. We also improve speed by 7.8 times by processing all layers of the hourglass-shaped FSRCNN with inter-layer parallelism, without additional DSP usage.
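To make the deconvolution-to-convolution idea concrete, the following is a minimal NumPy sketch, assuming a single channel, a square input, and no padding. The function names and shapes are illustrative choices, not the paper's FPGA implementation: it only checks that a stride-s deconvolution can be computed as s x s small LR-sized convolutions whose outputs interleave into the HR grid.

```python
import numpy as np

def deconv2d(x, k, s):
    """Stride-s transposed convolution (deconvolution) of a single-channel
    LR input x with kernel k, no padding: each input pixel scatters a
    scaled copy of the kernel onto the HR output grid."""
    n, kk = x.shape[0], k.shape[0]
    out = np.zeros(((n - 1) * s + kk, (n - 1) * s + kk))
    for m in range(n):
        for l in range(n):
            out[m * s:m * s + kk, l * s:l * s + kk] += x[m, l] * k
    return out

def tdc_deconv2d(x, k, s):
    """Same result computed as s*s small convolutions over the LR input:
    the phase-(p, q) sub-kernel k[p::s, q::s] produces exactly the HR
    output pixels whose coordinates are congruent to (p, q) mod s."""
    n, kk = x.shape[0], k.shape[0]
    out = np.zeros(((n - 1) * s + kk, (n - 1) * s + kk))
    for p in range(s):
        for q in range(s):
            sub = k[p::s, q::s]          # sub-kernel for this output phase
            h, w = sub.shape
            lr = np.zeros((n + h - 1, n + w - 1))
            for t1 in range(h):          # LR-sized "full" convolution
                for t2 in range(w):
                    lr[t1:t1 + n, t2:t2 + n] += sub[t1, t2] * x
            out[p::s, q::s] = lr         # interleave phases into HR grid
    return out

# Sanity check: both paths agree, e.g. for a 9x9 deconv kernel at scale 3
# (the FSRCNN deconvolutional layer uses a 9x9 kernel).
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
k = rng.standard_normal((9, 9))
assert np.allclose(deconv2d(x, k, 3), tdc_deconv2d(x, k, 3))
```

The point of the decomposition is that each of the s x s convolutions operates entirely at LR and uses only a disjoint fraction of the deconvolution kernel's weights, so the same arithmetic can be scheduled like ordinary convolutional layers with smaller kernels, which is the source of the parallelism exploited above.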
