Abstract
With the growing demand for deploying deep learning models to the "edge", it is paramount to develop techniques that allow the execution of state-of-the-art models within very tight resource constraints. In this work we propose a software-hardware optimization paradigm for obtaining a highly efficient execution engine for deep neural networks (DNNs) that are based on fully-connected layers. The approach is centred around compression as a means of reducing both the area and the power requirements of, concretely, multilayer perceptrons (MLPs) with high predictive performance. Firstly, we design a novel hardware architecture named FantastIC4, which (1) supports the efficient on-chip execution of multiple compact representations of fully-connected layers and (2) minimizes the required number of multipliers for inference down to only 4 (thus the name). Moreover, in order to make the models amenable to efficient execution on FantastIC4, we introduce a novel entropy-constrained training method that renders them robust to 4-bit quantization and, simultaneously, highly compressible in size. The experimental results show that we can achieve throughputs of 2.45 TOPS with a total power consumption of 3.6 W on a Virtex UltraScale FPGA XCVU440 device implementation, and a total power efficiency of 20.17 TOPS/W on a 22 nm process ASIC version. When compared to other state-of-the-art accelerators designed for the Google Speech Commands (GSC) dataset, FantastIC4 is better by 51× in terms of throughput and 145× in terms of area efficiency (GOPS/mm²).
Highlights
In recent years, the topic of "edge" computing has gained significant attention due to the benefits of processing data directly at its source of collection [1]
By implementing an accumulate-then-multiply computational (ACM) paradigm, we significantly reduce computational resource utilization compared to the usual multiply-accumulate (MAC) paradigm: fewer multiplications are performed in total, data movement of the activations is improved for multilayer perceptron (MLP) models, and the area and power required for computation are reduced (see the sketch after these highlights)
We evaluate FantastIC4 on the FC layers of popular deep neural network (DNN) models, as well as on custom multilayer perceptrons (MLPs) trained on hand-gesture and speech recognition tasks
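To make the ACM idea concrete, the following is a minimal NumPy sketch (the variable names and the 16-entry codebook are illustrative assumptions, not the paper's implementation): because 4-bit quantized weights take at most 16 distinct values, activations sharing the same weight value can be accumulated first, and each partial sum is multiplied by its codebook value exactly once, replacing one multiply per weight with at most 16 multiplies per output.

```python
import numpy as np

def acm_dot(weight_idx, activations, codebook):
    """Accumulate-then-multiply (ACM) dot product sketch.

    weight_idx  : integer indices into `codebook` (4-bit quantized weights)
    activations : input activations of the fully-connected layer
    codebook    : the (at most 16) distinct weight values
    """
    partial_sums = np.zeros(len(codebook))
    for idx, a in zip(weight_idx, activations):
        partial_sums[idx] += a                      # accumulate phase: additions only
    return float(np.dot(partial_sums, codebook))    # multiply phase: <= 16 multiplies

# Hypothetical usage: 1024 weights quantized to a 16-entry (4-bit) codebook
rng = np.random.default_rng(0)
codebook = rng.standard_normal(16)
w_idx = rng.integers(0, 16, size=1024)
x = rng.standard_normal(1024)
assert np.isclose(acm_dot(w_idx, x, codebook), np.dot(codebook[w_idx], x))
```

The result matches the conventional MAC dot product, but the multiply count no longer scales with the number of weights, which is what allows the hardware to get by with only a handful of multipliers.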
Summary
In recent years, the topic of "edge" computing has gained significant attention due to the benefits of processing data directly at its source of collection [1]. Processing a high number of parameters usually requires expensive hardware components such as large memory units and, if high throughput and low latency are desired, a high number of multipliers for parallel processing. This comes at the expense of significant power consumption and chip area, greatly limiting applicability in use cases with tight area and power budgets such as IoT devices or wearables. State-of-the-art compression techniques require complex decoding prior to performing arithmetic operations, which can offset the savings attained from compression, especially when the hardware is not tailored to such decoding algorithms. This motivates a hardware-software co-design paradigm where, on the one hand, novel training techniques that make DNNs highly compressible are proposed and, on the other hand, novel hardware architectures are designed that support the efficient, on-chip execution of compressed representations.
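As an illustration of the general principle behind such training techniques (a hedged sketch, not the paper's exact entropy-constrained formulation; the soft-assignment scheme and names below are assumptions), one can penalize the first-order entropy of the distribution of quantized weight values: low entropy means a few centroids dominate, which makes the weights highly compressible by a subsequent entropy coder such as Huffman coding.

```python
import torch

def entropy_penalty(weights, centroids, temperature=1.0):
    """Sketch of an entropy-constrained regularizer.

    Weights are softly assigned to the 4-bit centroids via a
    distance-based softmax, and the first-order entropy (in bits)
    of the resulting centroid-usage distribution is returned.
    """
    d = -(weights.reshape(-1, 1) - centroids.reshape(1, -1)) ** 2 / temperature
    assign = torch.softmax(d, dim=1)           # (num_weights, num_centroids)
    p = assign.mean(dim=0)                     # empirical centroid usage
    return -(p * torch.log2(p + 1e-12)).sum()  # entropy in bits

# Hypothetical usage during training:
# loss = task_loss + lam * entropy_penalty(layer.weight, centroids)
```

Minimizing this penalty alongside the task loss pushes the network toward weight distributions that remain accurate under 4-bit quantization while shrinking the entropy-coded model size.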