Abstract

The rapid development of optical sensing and imaging technology, coupled with impressive improvements in machine learning algorithms, has increased our ability to understand and extract information from scenic events. In most cases, convolutional neural networks (CNNs) are adopted to infer knowledge because of their success in automation, surveillance, and many other application domains. However, the overwhelming computational demand of the convolution operations has limited their use in remote sensing edge devices. On these platforms, real-time processing remains challenging because of tight constraints on resources and power, and the transfer and processing of non-relevant image pixels act as a bottleneck on the entire system. This bottleneck can be overcome by exploiting the high bandwidth available at the sensor interface and designing a CNN inference architecture near the sensor. This paper presents an attention-based pixel processing architecture to facilitate CNN inference near the image sensor. We propose an efficient computation method that reduces dynamic power by decreasing the overall computation of the convolution operations. The proposed method reduces redundancies through a hierarchical optimization approach: it exploits the spatio-temporal redundancies found in the incoming feature maps and performs computations only on selected regions based on their relevance score, thereby minimizing the power consumed by convolution operations. The proposed design addresses problems related to the mapping of computations onto an array of processing elements (PEs) and introduces a suitable network structure for communication. The PEs are highly optimized to provide low latency and low power for CNN applications. While designing the model, we exploit concepts from biological vision systems to reduce computation and energy. We prototype the model on a Virtex UltraScale+ FPGA and implement it as an Application Specific Integrated Circuit (ASIC) using the TSMC 90 nm technology library. The results suggest that the proposed architecture significantly reduces dynamic power consumption and achieves a speedup surpassing the computational capabilities of existing embedded processors.
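The core idea of attention-based, relevance-gated convolution can be sketched as follows. This is a minimal illustration only; the region size, the temporal-difference relevance score, and the threshold are assumptions made for the example, not the architecture's actual relevance computation.

```python
import numpy as np

def relevance_gated_conv(frame, prev_frame, weights, region=16, threshold=0.1):
    """Illustrative sketch: convolve only regions whose relevance score
    (here, mean absolute temporal difference) exceeds a threshold.
    Region size, scoring rule, and threshold are assumed values."""
    H, W = frame.shape
    k = weights.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for ry in range(0, H, region):
        for rx in range(0, W, region):
            tile = frame[ry:ry + region, rx:rx + region]
            prev = prev_frame[ry:ry + region, rx:rx + region]
            # Relevance score: how much this region changed since the last frame.
            if np.mean(np.abs(tile - prev)) < threshold:
                continue  # skip convolution for static / non-relevant regions
            # Convolve only within the selected region (borders ignored for brevity).
            for y in range(ry, min(ry + region, H - k + 1)):
                for x in range(rx, min(rx + region, W - k + 1)):
                    out[y, x] = np.sum(frame[y:y + k, x:x + k] * weights)
    return out
```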

Highlights

  • As convolutional neural networks (CNNs) find their way into an ever wider range of vision-based applications, there has been a significant focus on realizing low-power custom hardware accelerators to bring their services to edge/remote devices [1,2,3,4]

  • We report a maximum frequency of 350 MHz and 380 MHz in FPGA and Application Specific Integrated Circuit (ASIC), respectively

  • Comparing our design to a design without the relevance computation layer (RCL), we found that our architecture saves energy in convolution operations whenever the region of interest (ROI) is smaller than the full image frame (see the sketch after this list)
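As a rough, back-of-the-envelope illustration of that condition (the frame size, ROI size, kernel size, and relevance-scoring overhead below are assumed numbers, not results from the paper), the following sketch compares multiply-accumulate (MAC) counts with and without ROI gating:

```python
def conv_macs(height, width, k=3, channels=1, filters=1):
    """Approximate MAC count for a 'same'-size convolution layer."""
    return height * width * k * k * channels * filters

# Illustrative numbers only: 256x256 frame, 64x64 ROI, 3x3 kernel.
full_frame = conv_macs(256, 256)      # convolve everything
roi_only = conv_macs(64, 64)          # convolve the ROI only
rcl_overhead = 256 * 256              # assumed per-pixel relevance-scoring cost
print(full_frame, roi_only + rcl_overhead)
# The RCL pays off whenever roi_only + rcl_overhead < full_frame,
# i.e. when the ROI is sufficiently smaller than the full frame.
```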


Summary

Introduction

As convolutional neural networks (CNNs) find their way into an ever wider range of vision-based applications, there has been a significant focus on realizing low-power custom hardware accelerators to bring their services to edge/remote devices [1,2,3,4]. A major challenge in deploying CNNs at the edge is the high data volume produced by image sensors, which strains the channel bandwidth from the sensor interface to the embedded processor [6]. For an input image frame $I$ with $C$ channels and an $m \times m$ kernel $W_f$, the convolution operation of a CNN layer is

$$S_{f,y,z} = \sum_{c=1}^{C} \sum_{i=1}^{m} \sum_{j=1}^{m} W_{f,c,i,j}\, I_{c,\,y+i,\,z+j} \qquad (1)$$

It is possible to exploit the parallelism available within the convolution operation and map the pixels onto an array of processing elements to achieve fast computation. In our architecture, we map the input frame onto an array of processing elements (PEs), where each processor performs the operation shown in Equation (1) in parallel and generates the output for the layer. PEs within a region are designed to perform the regional convolution operation and facilitate optimum execution.
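As a minimal sketch of how one output element of Equation (1) could be assigned to a single PE (the function and the mapping below are illustrative assumptions, not the paper's actual hardware description), consider:

```python
import numpy as np

def pe_output(I, W_f, y, z):
    """Work assigned to one processing element (PE) in this sketch:
    a single output pixel S[f, y, z] of Equation (1).
    I: input feature map of shape (C, H, W); W_f: kernel of shape (C, m, m)."""
    C, m, _ = W_f.shape
    acc = 0.0
    for c in range(C):
        for i in range(m):
            for j in range(m):
                acc += W_f[c, i, j] * I[c, y + i, z + j]
    return acc

# Each (y, z) position can be mapped to a different PE and evaluated in parallel;
# here the mapping is emulated sequentially.
I = np.random.rand(3, 8, 8)
W_f = np.random.rand(3, 3, 3)
S_f = np.array([[pe_output(I, W_f, y, z) for z in range(6)] for y in range(6)])
```

Because each output position depends only on its own input window, all (y, z) positions can be evaluated by different PEs concurrently, which is the parallelism the architecture exploits.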
