Abstract
Many existing studies on accelerating convolutional neural networks (CNNs) use parallel data-operation schemes to increase throughput. This study proposes area-efficient parallel multiplication unit (PMU) designs for a CNN accelerator that parallelizes over the output channels of a CNN layer, multiplying a common feature-map pixel by multiple CNN kernel weights in parallel. First, tailored PMU designs are proposed for CNNs with specific low-precision 3-to-8-bit weights. Second, the proposed 5-to-8-bit PMU designs are extended with two-clock-cycle operations to develop PMUs whose weight precision is scalable to 10/12/14/16 bits. Compared to 16-path PMUs that directly use carry-save-adder array multipliers, our PMU designs achieve area reductions of 28.19%–56.09% and 22.10%–30.71% for 3-to-8-bit and 10-/12-/14-/16-bit weights, respectively. Moreover, a resultant 16-path 16-bit-weight PMU is verified through a system-on-chip (SoC) field-programmable gate array (FPGA) implementation to demonstrate CNN inference.
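To make the output-channel parallelization scheme concrete, the following minimal C sketch models the data flow of a 16-path PMU as described in the abstract: one shared feature-map pixel is multiplied by one weight from each of 16 output channels in the same cycle. The function name, the NUM_PATHS constant, and the fixed-point widths are illustrative assumptions, not details taken from the paper; the actual hardware uses dedicated multiplier structures rather than a loop.

```c
#include <stdint.h>

#define NUM_PATHS 16  /* assumed number of output-channel paths */

/* Conceptual model of a 16-path parallel multiplication unit (PMU):
 * one shared feature-map pixel is multiplied by one weight from each
 * of NUM_PATHS output channels. In hardware these products are formed
 * by NUM_PATHS multipliers operating in parallel in a single cycle;
 * the loop here only models the data flow, not the circuit. */
void pmu_cycle(int16_t pixel,
               const int16_t weights[NUM_PATHS],
               int32_t products[NUM_PATHS])
{
    for (int p = 0; p < NUM_PATHS; ++p) {
        products[p] = (int32_t)pixel * (int32_t)weights[p];
    }
}
```

For the precision-scalable case described above, a 10-to-16-bit weight would be handled over two clock cycles by the extended 5-to-8-bit PMU designs, with the partial products combined afterwards; that recombination step is omitted from this sketch.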