Abstract

Over the past decade, deep convolutional neural networks (CNNs) have been widely embraced in visual recognition applications owing to their extraordinary accuracy. However, their high computational complexity and large data storage requirements pose two challenges for CNN hardware design. In this paper, we propose an energy-aware bit-serial streaming deep CNN accelerator to tackle these challenges. By using a ring streaming dataflow and an output reuse strategy to decrease data access, the external DRAM access for the convolutional layers of AlexNet is reduced by 357.26x compared with the case of no output reuse. We optimize hardware utilization and avoid unnecessary computation by applying the loop tiling technique and by mapping the strides of the convolutional layers to unit strides, enhancing computational performance. In addition, the bit-serial processing element (PE) is designed to use fewer bits for the weights, which reduces both the amount of computation and the external memory access. We evaluate our design with the well-known roofline model, exploring the design space to find the solution with the best computational performance and computation-to-communication (CTC) ratio. Our design achieves a 1.36x speedup and reduces the energy consumption of external memory access by 41% compared with the design in [1]. The hardware implementation of our PE array architecture reaches an operating frequency of 119 MHz, occupies 68 k gates, and consumes 10.08 mW in TSMC 90-nm technology. Compared with the 15.4 MB of external memory access required by Eyeriss [2] for the convolutional layers of AlexNet, our method requires only 4.36 MB, dramatically reducing the costliest component of power consumption.
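The roofline-based design-space exploration described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual evaluation: the numeric roofs, CTC values, and tiling candidates are hypothetical assumptions.

```python
def attainable_performance(compute_roof, bandwidth_roof, ctc_ratio):
    """Attainable throughput (ops/s) under the roofline model.

    compute_roof:   peak computational performance of the accelerator (ops/s)
    bandwidth_roof: peak external-memory bandwidth (bytes/s)
    ctc_ratio:      computation-to-communication ratio
                    (ops per byte of external DRAM traffic)
    """
    # Below the ridge point the design is memory-bound (bandwidth * CTC);
    # above it, performance is capped by the compute roof.
    return min(compute_roof, bandwidth_roof * ctc_ratio)


# Illustrative exploration: choose the candidate with the best
# (attainable performance, CTC ratio) pair. Candidates are hypothetical
# tiling configurations, each with a different amount of output reuse.
candidates = [
    {"tiling": "small tiles", "ctc": 4.0},   # less reuse, more DRAM traffic
    {"tiling": "large tiles", "ctc": 12.0},  # better output reuse
]
best = max(
    candidates,
    key=lambda c: (attainable_performance(100e9, 10e9, c["ctc"]), c["ctc"]),
)
```

With the assumed 100 Gops/s compute roof and 10 GB/s bandwidth roof, the small-tile candidate is memory-bound while the large-tile candidate reaches the compute roof, so the exploration selects the configuration with higher reuse.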
