Abstract

Deep learning techniques are widely adopted in many computer vision tasks and are deployed in real-world systems. Deep learning is a promising approach to machine learning in which complex functions can be learned directly from data. Deep learning algorithms provide high accuracy for recognition tasks, but consume significantly more computational resources than conventional algorithms. State-of-the-art deep learning networks have hundreds of millions of parameters, which makes them less suitable for adoption in devices with constrained memory and power budgets, such as edge computing and mobile devices. Techniques like quantization and sparsity induction aim to reduce the total number of computations needed for deep learning inference. For portable embedded applications, general-purpose computing hardware such as GPPs (General Purpose Processors) and GPUs (Graphics Processing Units) is not preferred, as these applications demand high energy efficiency. Field Programmable Gate Arrays (FPGAs) are suitable hardware solutions for edge computing, offering lower power consumption together with increased flexibility. In this paper, we propose a bandwidth-efficient sparse matrix-vector (SpMV) multiplier architecture for faster deep learning inference. Our architecture also reduces the memory-bandwidth bottleneck present in hardware realizations of deep learning algorithms. The proposed architecture can exploit the maximum available memory bandwidth of the computing device through multiple MAC (multiply-accumulate) channels. It has been implemented on a Kintex-7 FPGA at an operating frequency of 270 MHz and shows a significant performance gain over existing implementations.
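To make the core operation concrete, the following is a minimal software sketch of SpMV over a matrix in CSR (compressed sparse row) format, the computation such an architecture accelerates. This is an illustration only: the function and variable names are hypothetical and do not come from the paper, and the hardware design would parallelize the multiply-accumulate loop across its MAC channels rather than execute it sequentially.

```c
#include <stdio.h>

/* Illustrative CSR SpMV: y = A * x.
 * Only stored nonzeros are fetched and multiplied, so both the
 * MAC count and the memory traffic scale with the number of
 * nonzeros rather than with the full matrix size. */
void spmv_csr(int n_rows,
              const int *row_ptr,   /* n_rows + 1 entries      */
              const int *col_idx,   /* one entry per nonzero   */
              const float *vals,    /* nonzero values of A     */
              const float *x,       /* dense input vector      */
              float *y)             /* dense output vector     */
{
    for (int i = 0; i < n_rows; i++) {
        float acc = 0.0f;
        /* Each iteration is one multiply-accumulate (MAC);
         * a hardware realization would distribute these across
         * parallel MAC channels. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += vals[k] * x[col_idx[k]];
        y[i] = acc;
    }
}

int main(void)
{
    /* 3x3 example with 4 nonzeros:
     * A = [2 0 0; 0 0 3; 1 0 4], x = [1, 2, 3] */
    int row_ptr[] = {0, 1, 2, 4};
    int col_idx[] = {0, 2, 0, 2};
    float vals[]  = {2.0f, 3.0f, 1.0f, 4.0f};
    float x[] = {1.0f, 2.0f, 3.0f};
    float y[3];

    spmv_csr(3, row_ptr, col_idx, vals, x, y);
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]); /* [2, 9, 13] */
    return 0;
}
```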
