Abstract

Many existing studies on accelerating convolutional neural networks (CNNs) use parallel data-operation schemes to increase throughput. This study proposes area-efficient parallel multiplication unit (PMU) designs for a CNN accelerator that exploits parallelism across the output channels of a CNN layer, multiplying a common feature-map pixel by multiple CNN kernel weights in parallel. First, tailored PMU designs are proposed for CNNs with specific low-precision 3-to-8-bit weights. Second, the proposed 5-to-8-bit PMU designs are extended with two-clock-cycle operations to develop PMUs whose weight precision is scalable to 10/12/14/16 bits. Compared with 16-path PMUs that directly use carry-save-adder array multipliers, our PMU designs achieve area reductions of 28.19%–56.09% for 3-to-8-bit weights and 22.10%–30.71% for 10-/12-/14-/16-bit weights. Moreover, a resultant 16-path, 16-bit-weight PMU is verified through a system-on-chip (SoC) field-programmable gate array (FPGA) implementation to demonstrate CNN inference.
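To make the output-channel-parallel scheme concrete, the following behavioral sketch in C illustrates the two ideas the abstract describes: one shared feature-map pixel multiplied by several weights per cycle, and a 16-bit-weight multiply folded into two 8-bit sub-multiplications over two clock cycles. This is only an illustrative software model under assumed conventions (the PATHS constant, function names, and the byte-split recombination are assumptions), not the paper's hardware design.

```c
#include <stdint.h>

#define PATHS 16  /* hypothetical path count, mirroring the 16-path PMU */

/* Single-cycle behavioral model: multiply one shared feature-map pixel
 * by PATHS low-precision weights in parallel (output-channel parallelism). */
void pmu_1cycle(int16_t pixel, const int8_t w[PATHS], int32_t out[PATHS])
{
    for (int p = 0; p < PATHS; ++p)
        out[p] = (int32_t)pixel * w[p];  /* every path reuses the same pixel */
}

/* Two-cycle behavioral model for a 16-bit weight: cycle 1 multiplies the
 * unsigned low byte, cycle 2 the signed high byte, and the two partial
 * products are recombined with a shift-add (x256 is equivalent to << 8). */
void pmu_2cycle(int16_t pixel, const int16_t w[PATHS], int32_t out[PATHS])
{
    for (int p = 0; p < PATHS; ++p) {
        uint8_t lo = (uint8_t)(w[p] & 0xFF);  /* cycle 1: low byte  */
        int8_t  hi = (int8_t)(w[p] >> 8);     /* cycle 2: high byte */
        out[p] = (int32_t)pixel * lo + (int32_t)pixel * hi * 256;
    }
}
```

Since w = hi*256 + lo, the recombination yields pixel*lo + pixel*hi*256 = pixel*w, so the two-cycle path reuses the same 8-bit multiplier datapath for the higher-precision weight, which is the reuse that the two-clock-cycle extension trades latency for area.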
