Abstract
With the growing demand for deploying deep learning models to the "edge", it is paramount to develop techniques that allow the execution of state-of-the-art models within very tight resource constraints. In this work we propose a software-hardware optimization paradigm for obtaining a highly efficient execution engine for deep neural networks (DNNs) that are based on fully-connected layers. The approach is centred around compression as a means of reducing both the area and the power requirements of, concretely, multilayer perceptrons (MLPs) with high predictive performance. Firstly, we design a novel hardware architecture named FantastIC4, which (1) supports the efficient on-chip execution of multiple compact representations of fully-connected layers and (2) minimizes the required number of multipliers for inference down to only 4 (thus the name). Moreover, in order to make the models amenable to efficient execution on FantastIC4, we introduce a novel entropy-constrained training method that renders them robust to 4-bit quantization and, simultaneously, highly compressible in size. The experimental results show that we can achieve throughputs of 2.45 TOPS with a total power consumption of 3.6 W on a Virtex UltraScale FPGA XCVU440 device implementation, and a total power efficiency of 20.17 TOPS/W on a 22 nm process ASIC version. When compared to other state-of-the-art accelerators designed for the Google Speech Commands (GSC) dataset, FantastIC4 is better by 51× in terms of throughput and 145× in terms of area efficiency (GOPS/mm²).
Highlights
In recent years, the topic of "edge" computing has gained significant attention due to the benefits of processing data directly at its source of collection [1]
By implementing an accumulate-then-multiply computational (ACM) paradigm, we significantly reduce computational resource utilization compared to the usual multiply-accumulate (MAC) paradigm: fewer multiplications are performed in total, data movement of the activations is improved for multilayer perceptron (MLP) models, and the area and power required for computation are reduced (see the sketch after these highlights)
We evaluate FantastIC4 on the FC layers of popular deep neural network (DNN) models, as well as on custom multilayer perceptrons (MLPs) trained on hand-gesture and speech recognition tasks
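To make the ACM idea concrete, the following is a minimal NumPy sketch (the variable names and the 16-entry codebook are illustrative assumptions, not the paper's implementation): because 4-bit quantized weights take at most 16 distinct values, activations sharing the same weight value can be accumulated first, and each partial sum is multiplied by its codebook value exactly once, replacing one multiply per weight with at most 16 multiplies per output.

```python
import numpy as np

def acm_dot(weight_idx, activations, codebook):
    """Accumulate-then-multiply (ACM) dot product sketch.

    weight_idx  : integer indices into `codebook` (4-bit quantized weights)
    activations : input activations of the fully-connected layer
    codebook    : the (at most 16) distinct weight values
    """
    partial_sums = np.zeros(len(codebook))
    for idx, a in zip(weight_idx, activations):
        partial_sums[idx] += a                      # accumulate phase: additions only
    return float(np.dot(partial_sums, codebook))    # multiply phase: <= 16 multiplies

# Hypothetical usage: 1024 weights quantized to a 16-entry (4-bit) codebook
rng = np.random.default_rng(0)
codebook = rng.standard_normal(16)
w_idx = rng.integers(0, 16, size=1024)
x = rng.standard_normal(1024)
assert np.isclose(acm_dot(w_idx, x, codebook), np.dot(codebook[w_idx], x))
```

The result matches the conventional MAC dot product, but the multiply count no longer scales with the number of weights, which is what allows the hardware to get by with only a handful of multipliers.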
Summary
In recent years, the topic of "edge" computing has gained significant attention due to the benefits of processing data directly at its source of collection [1]. Processing a high number of parameters usually requires expensive hardware components such as large memory units and, if high throughput and low latency are desired, a high number of multipliers for parallel processing. This comes at the expense of significant power consumption and chip area, greatly limiting applicability in use cases with tight area and power budgets such as IoT devices or wearables. State-of-the-art compression techniques require complex decoding prior to performing arithmetic operations, which can offset the savings attained from compression, especially when the hardware is not tailored to such decoding algorithms. This motivates a hardware-software co-design paradigm where, on the one hand, novel training techniques that make DNNs highly compressible are proposed and, on the other hand, novel hardware architectures are designed that support the efficient, on-chip execution of compressed representations.
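As an illustration of the general principle behind such training techniques (a hedged sketch, not the paper's exact entropy-constrained formulation; the soft-assignment scheme and names below are assumptions), one can penalize the first-order entropy of the distribution of quantized weight values: low entropy means a few centroids dominate, which makes the weights highly compressible by a subsequent entropy coder such as Huffman coding.

```python
import torch

def entropy_penalty(weights, centroids, temperature=1.0):
    """Sketch of an entropy-constrained regularizer.

    Weights are softly assigned to the 4-bit centroids via a
    distance-based softmax, and the first-order entropy (in bits)
    of the resulting centroid-usage distribution is returned.
    """
    d = -(weights.reshape(-1, 1) - centroids.reshape(1, -1)) ** 2 / temperature
    assign = torch.softmax(d, dim=1)           # (num_weights, num_centroids)
    p = assign.mean(dim=0)                     # empirical centroid usage
    return -(p * torch.log2(p + 1e-12)).sum()  # entropy in bits

# Hypothetical usage during training:
# loss = task_loss + lam * entropy_penalty(layer.weight, centroids)
```

Minimizing this penalty alongside the task loss pushes the network toward weight distributions that remain accurate under 4-bit quantization while shrinking the entropy-coded model size.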