An Empirical Approach to Enhance Performance for Scalable CORDIC-Based Deep Neural Networks

Gopal Raut,Santosh Kumar Vishvakarma,Saurabh Karkun

doi:10.1145/3596220

Abstract

Practical implementation of deep neural networks (DNNs) demands significant hardware resources, necessitating high computational power and memory bandwidth. While existing field-programmable gate array (FPGA)–based DNN accelerators are primarily optimized for fast single-task performance, cost, energy efficiency, and overall throughput are crucial considerations for their practical use in various applications. This article proposes a performance-centric pipeline Coordinate Rotation Digital Computer (CORDIC)–based MAC unit and implements a scalable CORDIC-based DNN architecture that is area- and power-efficient and has high throughput. The CORDIC-based neuron engine uses bit-rounding to maintain input-output precision and minimal hardware resource overhead. The results demonstrate the versatility of the proposed pipelined MAC, which operates at 460 MHz and allows for higher network throughput. A software-based implementation platform evaluates the proposed MAC operation’s accuracy for more extensive neural networks and complex datasets. The DNN accelerator with parameterized and modular layer-multiplexed architecture is designed. Empirical evaluation through Pareto analysis is used to improve the efficiency of DNN implementations by fixing the arithmetic precision and optimal pipeline stages. The proposed architecture utilizes layer-multiplexing, a technique that effectively reuses a single DNN layer to enhance efficiency while maintaining modularity and adaptability for integrating various network configurations. The proposed CORDIC MAC-based DNN architecture is scalable for any bit-precision network size, and the DNN accelerator is prototyped using the Xilinx Virtex-7 VC707 FPGA board, operating at 66 MHz. The proposed design does not use any Xilinx macros, making it easily adaptable for ASIC implementation. Compared with state-of-the-art designs, the proposed design reduces resource use by 45% and power consumption by 4× without sacrificing performance. The accelerator is validated using the MNIST dataset, achieving 95.06% accuracy, only 0.35% less than other cutting-edge implementations.

Full Text