Abstract

Convolutional neural networks (CNNs) are a vital approach in machine learning. However, their high complexity and energy consumption make them difficult to embed in mobile edge applications that require real-time processing, such as smartphones. To meet the real-time constraints of edge devices, recently proposed custom hardware CNN accelerators exploit parallel processing elements (PEs) to increase throughput. However, straightforward parallelization of PEs demands high memory bandwidth and extensive data movement, leading to large energy consumption. As a result, only a limited number of PEs can be instantiated when designing bandwidth-limited custom accelerators for edge devices. While most bandwidth-limited designs claim a peak performance of a few hundred giga operations per second, their average runtime performance on state-of-the-art CNNs such as AlexNet, VGGNet, and ResNet falls substantially below their roofline, owing to low resource utilization and low arithmetic intensity. In this work, we propose a zero-activation-skipping convolutional accelerator (ZASCA) that avoids noncontributory multiplications with zero-valued activations. ZASCA employs a dataflow that minimizes the gap between its average and peak performances while maximizing its arithmetic intensity for both sparse and dense representations of activations, targeting the bandwidth-limited edge-computing scenario. More precisely, ZASCA achieves a performance efficiency of up to 94 percent over a set of state-of-the-art CNNs for image classification with dense activations, where performance efficiency is the ratio of average runtime performance to peak performance. Using its zero-skipping feature, ZASCA further improves the performance efficiency of these CNNs by up to 1.9×, depending on the degree of sparsity in their activations. Implementation results in 65-nm TSMC CMOS technology show that, compared to the most energy-efficient prior accelerator, ZASCA processes convolutions 5.5× to 17.5× faster and is 2.1× to 4.5× more energy efficient while occupying 2.1× less silicon area.
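To make the zero-skipping idea concrete, the following minimal Python sketch (an illustration under simplified assumptions, not the actual ZASCA dataflow or hardware) shows how multiply-accumulate operations with zero-valued activations, which are common after ReLU, can be skipped in a 1-D convolution without changing the result. All function and variable names here are hypothetical.

    # Illustrative sketch, not the ZASCA design: skip MACs for zero activations.
    def conv1d_zero_skip(activations, weights):
        """Valid-mode 1-D convolution that issues a MAC only for nonzero activations."""
        out_len = len(activations) - len(weights) + 1
        out = [0.0] * out_len
        for i, a in enumerate(activations):
            if a == 0.0:
                continue  # the zero-skipping step: no MAC is issued
            # A nonzero activation contributes to every output window it overlaps.
            for k, w in enumerate(weights):
                o = i - k
                if 0 <= o < out_len:
                    out[o] += a * w
        return out

    acts = [0.0, 1.5, 0.0, 0.0, 2.0, 0.0]  # sparse activations, e.g. post-ReLU
    ws = [0.5, -1.0, 0.25]
    print(conv1d_zero_skip(acts, ws))  # [-1.5, 0.75, 0.5, -2.0]

In this toy example, four of the six activations are zero, so two thirds of the candidate MACs are never issued; the paper's hardware exploits the same property to push the average runtime performance of its PEs toward their peak.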
