O⁴-DNN: A Hybrid DSP-LUT-Based Processing Unit With Operation Packing and Out-of-Order Execution for Efficient Realization of Convolutional Neural Networks on FPGA Devices

Pouya Haghi, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram

doi:10.1109/tcsi.2020.2986350

Abstract

In this paper, we propose O⁴-DNN, a high-performance FPGA-based architecture for convolutional neural network (CNN) accelerators that relies on operation packing and out-of-order (OoO) execution for DSP blocks augmented with LUT-based glue logic. The high-level architecture comprises a systolic array of processing elements (PEs) supporting an output-stationary dataflow. In this architecture, the computational unit of each PE is realized with one DSP block and a small number of LUTs. Given the limited number of DSP blocks in FPGAs, this combination (a DSP block plus some LUTs) extracts more computational power from each DSP block. The proposed computational unit performs eight convolutional operations on five input operands: one 8-bit weight and four 8-bit input feature (IF) maps. In addition, to improve the energy efficiency of the proposed computational unit, we present an approximate form of the unit suitable for neural network applications. To reduce the memory bandwidth and increase the utilization of the computational units, a data reuse technique based on weight sharing is also presented. To further improve the performance of the proposed computational unit, an addressing approach for computing the partial sums out of order is proposed. The efficacy of the architecture is assessed using two FPGA devices executing four state-of-the-art neural networks. Experimental results show that this architecture achieves, on average (up to), $2.5\times$ ($3.44\times$) higher throughput than a baseline structure. In addition, O⁴-DNN improves energy efficiency by 12% on average (40% at most) compared to the baseline structure.
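The general idea behind operation packing, sharing one weight across several input features in a single wide multiplication, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's DSP/LUT partitioning: operands are treated as unsigned, Python's arbitrary-width integers stand in for the hardware datapath, and the names `pack_inputs`, `packed_multiply`, and `accumulate` are hypothetical. Reading the "eight operations" as four multiplications plus four accumulations is also an assumption.

```python
# Illustrative sketch only: unsigned 8-bit operands, arbitrary-width integers.
# It does not model the real DSP block width or the LUT-based glue logic.

FIELD_BITS = 16                      # an 8-bit x 8-bit unsigned product fits in 16 bits
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack_inputs(ifs):
    """Place each 8-bit input feature in its own 16-bit field of one wide word."""
    packed = 0
    for i, value in enumerate(ifs):
        if not 0 <= value <= 0xFF:
            raise ValueError("input features must be unsigned 8-bit values")
        packed |= value << (i * FIELD_BITS)
    return packed


def packed_multiply(weight, ifs):
    """Multiply one shared weight by all packed inputs with a single multiplication."""
    if not 0 <= weight <= 0xFF:
        raise ValueError("weight must be an unsigned 8-bit value")
    wide = weight * pack_inputs(ifs)  # one multiply yields every product
    # Each product stays inside its own field, so the fields never overlap.
    return [(wide >> (i * FIELD_BITS)) & FIELD_MASK for i in range(len(ifs))]


def accumulate(partial_sums, weight, ifs):
    """Add the packed products to running partial sums (output-stationary style)."""
    products = packed_multiply(weight, ifs)
    return [p + q for p, q in zip(partial_sums, products)]


print(packed_multiply(3, [1, 2, 3, 4]))  # [3, 6, 9, 12]
```

Because addition is commutative, the partial sums produced by `accumulate` do not depend on the order in which weight and input groups arrive, which is the property an out-of-order accumulation scheme relies on. Signed operands would need correction terms that this sketch omits.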

Journal: IEEE Transactions on Circuits and Systems I: Regular Papers
Publication Date: Sep 1, 2020
Citations: 24
License type: publisher-specific, author manuscript

Similar Papers

NAND-SPIN-based processing-in-MRAM architecture for convolutional neural network acceleration
Yinglin Zhao ... Xingzhou Cheng
Science China Information Sciences | VOL. 66
09 Feb 2023

An Uninterrupted Processing Technique-Based High-Throughput and Energy-Efficient Hardware Accelerator for Convolutional Neural Networks
Md Najrul Islam ... Rahul Shrestha
IEEE Transactions on Very Large Scale Integration (VLSI) Systems | VOL. 30
01 Dec 2022

Improving the Performance of CNN Accelerator Architecture under the Impact of Process Variations
Jingweijia Tan ... Maodi Ma
ACM Transactions on Design Automation of Electronic Systems | VOL. 28
09 Sep 2023

Process Variation Mitigation on Convolutional Neural Network Accelerator Architecture
Maodi Ma ... Jingweijia Tan
01 Nov 2019
