Abstract

In this paper, we present hardware accelerators created with high-level synthesis techniques for sparse and dense matrix multiplication operations. The cores can operate at different precisions and are designed to be integrated into a heterogeneous CPU-FPGA system for Edge AI applications. The methodology involves quantization- and sparsity-aware training and is applied to a human activity classification case study. We first investigate the effects of quantization and sparsity on the accuracy of neural networks with convolutional, dense, and recurrent layers, observing better tolerance to pruning when recurrent layers are present. We then propose hardware accelerators that can switch precision at run time and work with any matrix size up to a maximum configured at compile time. We compare the performance of these accelerators across precision and sparsity levels and create a performance model to enable workload balancing. The results show that the proposed sparse matrix multipliers outperform dense multipliers when sparsity exceeds 70%, and the improvement is more pronounced when higher-precision arithmetic or structural pruning is used. Additionally, sparsity levels as high as 99% can maintain the accuracy required by the network, especially when recurrent layers are deployed. Overall, the balance between sparse and dense performance depends on matrix shape, precision, structural pruning, and sparsity level, and performance modelling can be used to balance concurrent execution in a heterogeneous configuration.
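The abstract's crossover claim (sparse multipliers winning only above roughly 70% sparsity) can be illustrated with a simple operation-count sketch. This is not the paper's measured performance model; the layer shape and the MAC-counting helpers (`dense_macs`, `sparse_macs`) are illustrative assumptions.

```python
def dense_macs(m, k, n):
    # A dense GEMM performs m*k*n multiply-accumulates regardless of content.
    return m * k * n

def sparse_macs(m, k, n, sparsity):
    # A CSR-style SpMM only multiplies the non-zero entries of the
    # (m x k) weight matrix against the (k x n) activation matrix.
    nnz = int(round(m * k * (1.0 - sparsity)))
    return nnz * n

# Hypothetical layer shape: 256x256 weight matrix, batch of 64 activations.
m, k, n = 256, 256, 64
for s in (0.5, 0.7, 0.9, 0.99):
    ratio = dense_macs(m, k, n) / sparse_macs(m, k, n, s)
    print(f"sparsity {s:.0%}: dense/sparse MAC ratio = {ratio:.1f}x")
```

In practice the break-even point sits well above the naive MAC ratio, because each non-zero in a sparse format carries index-decoding overhead; this is one reason the paper reports gains only beyond 70% sparsity.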

Highlights

  • Over the last few years, novel hardware for deep learning in AI from well-known companies and start-ups has entered the market, focusing on high energy efficiency, high performance, and low cost

  • Real-time inference of deep neural networks (DNNs) on custom hardware has become increasingly relevant with low-precision arithmetic and training frameworks such as Google's 8-bit EdgeTPU devices and TensorFlow Lite [3]

  • We investigate the effects of deep quantization and pruning on accuracy with convolutional and recurrent layers targeting a motion detection application

Summary

Introduction

Over the last few years, novel hardware for deep learning in AI from well-known companies and start-ups has entered the market, focusing on high energy efficiency, high performance, and low cost. Matrix multiplication acceleration that combines sparse and dense arithmetic with multi-precision support, as proposed in this research, makes it possible to select the optimal hardware configuration for each task. Motivated by these observations, this paper reviews mixed- and arbitrary-precision approaches, which are better suited to reconfigurable hardware, and concludes that sparse operators remain an area open to new research.
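The idea of selecting the optimal hardware configuration per task can be sketched as a dispatch rule over per-core cycle estimates. This is a minimal sketch under assumed cost models; the lane count, index overhead factor, and helper names (`predict_cycles_dense`, `predict_cycles_sparse`, `choose_core`) are illustrative, not the paper's calibrated model.

```python
def predict_cycles_dense(m, k, n, lanes=8):
    # Illustrative dense-GEMM model: a fixed throughput of `lanes` MACs/cycle.
    return (m * k * n) / lanes

def predict_cycles_sparse(m, k, n, sparsity, lanes=8, index_overhead=2.0):
    # Illustrative SpMM model: work scales with the non-zeros, but each
    # non-zero costs extra cycles for index decoding.
    nnz = m * k * (1.0 - sparsity)
    return (nnz * n * index_overhead) / lanes

def choose_core(m, k, n, sparsity):
    # Dispatch each layer to whichever core the model predicts is faster.
    dense = predict_cycles_dense(m, k, n)
    sparse = predict_cycles_sparse(m, k, n, sparsity)
    return "sparse" if sparse < dense else "dense"

print(choose_core(256, 256, 64, 0.9))  # high sparsity favours the SpMM core
```

The same per-layer estimates could feed a workload balancer that splits concurrent layers between the CPU and FPGA cores in a heterogeneous system, as the abstract suggests.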

Background and related work
Sub-byte precision hardware
Methodology and case study
Pruning and quantization accuracy analysis
GEMM hardware
SPMM hardware
Performance and complexity analysis
Structural pruning optimization
Performance modelling
GEMM model
SPMM model
Conclusions and future work
