Abstract

Running Deep Learning workloads on Internet-of-Things devices with stringent power budgets remains a challenge. This paper presents a low-power accelerator for processing Convolutional Neural Networks (CNNs) on embedded devices. Power is reduced by exploiting data reuse in three respects: convolution, filters, and input features. A systolic-like dataflow is proposed and applied to rows of Processing Elements (PEs), which facilitates data reuse during convolution. Reuse of input features and filters is achieved by arranging the PE array in a 3D tiled architecture of dimension 3 × 14 × 4. Local storage within the PEs is thereby reduced to only 17.75 kB, 20% of the state of the art. With dedicated delay chains in each PE, the accelerator is reconfigurable to suit various parameter settings of convolutional layers. Evaluated in a UMC 65 nm low-leakage process, the accelerator reaches a peak performance of 84 GOPS while consuming only 136 mW at 250 MHz.
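
To make the three reuse dimensions concrete, the sketch below emulates a row-wise systolic convolution in software: each PE row keeps one filter row resident (filter reuse), each streamed input row is shared across all overlapping output windows (input-feature reuse), and partial sums accumulate across PE rows (convolutional reuse). This is a minimal illustration of the general technique; the mapping, names, and Python rendering are assumptions, since the abstract does not specify the RTL-level schedule.

```python
import numpy as np

def systolic_conv_rows(ifmap, kernel):
    """Software sketch of a row-wise systolic convolution (valid padding).

    Each 'PE row' r holds one row of the kernel (filter reuse); each
    input row is broadcast along that PE row and shared by all
    overlapping windows (input-feature reuse); partial sums accumulate
    across PE rows (convolutional reuse).
    """
    K = kernel.shape[0]                  # e.g. 3, matching 3 PE rows
    H, W = ifmap.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for r in range(K):                   # one PE row per kernel row
        krow = kernel[r]                 # filter row stays resident in the PE row
        for y in range(out.shape[0]):
            irow = ifmap[y + r]          # input row streamed past the PE row
            for x in range(out.shape[1]):
                # the partial sum accumulates as it flows down the PE rows
                out[y, x] += np.dot(krow, irow[x:x + K])
    return out

# Quick check against a direct sliding-window convolution
ifmap = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3))
print(systolic_conv_rows(ifmap, kernel))
```

Under this reading, the 3 in the 3 × 14 × 4 tiling could correspond to the three rows of a 3 × 3 kernel, 14 to output columns computed in parallel, and 4 to filters processed concurrently, although the abstract leaves the exact mapping open.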
