Abstract

Batch normalization (BN) has been established as a highly effective component in deep learning, largely because it accelerates the convergence of deep neural network (DNN) training. Nevertheless, its hardware architecture has received little attention in the field of DNN on-device training processors. Several previous designs incur either high off-chip memory traffic or high circuit complexity, and hence fall short in hardware efficiency and performance. This article proposes approximately calculated BN (ACBN) to achieve a much better tradeoff between hardware efficiency and performance for DNN on-device training processors. The accuracy and convergence rate of the proposed ACBN have been extensively evaluated using four typical DNN models. Compared with the state-of-the-art reference design, hardware simulation results show that the proposed ACBN reduces floating-point operations by at least 22.2% and external memory access by 33.3% on average. Moreover, the proposed ACBN introduces 63.6% data sparsity on average in the backward propagation of the BN layers of VGG16. To the best of our knowledge, we are the first to introduce data sparsity into the backward propagation of BN layers. The ACBN module is implemented on a Zynq UltraScale+ ZCU102 system-on-chip (SoC) field-programmable gate array (FPGA), and the results show that the ACBN hardware module saves 33.9% of look-up tables (LUTs), 49.4% of flip-flops (FFs), and 75% of digital signal processors (DSPs), and reduces power by 12.4% compared with the reference design while achieving better performance.
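For context only, the sketch below shows standard (exact) batch normalization in plain NumPy, not the proposed ACBN: the exact forward pass computes per-batch statistics and must cache normalized activations for the backward pass, and the exact gradient uses every element of that cache, which is the source of the memory traffic and the lack of backward-pass sparsity that the article addresses. Function names and the 2-D (N, C) layout are illustrative assumptions.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Exact BN over the batch axis, per feature.

    x: (N, C) activations; gamma, beta: (C,) learnable scale/shift.
    Returns the output and the cache needed by the backward pass;
    buffering this cache is what drives activation memory traffic
    in on-device training.
    """
    mu = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalized activations
    out = gamma * x_hat + beta
    cache = (x_hat, gamma, var, eps)
    return out, cache

def bn_backward(dout, cache):
    """Exact BN gradients; every element of dout and x_hat is used,
    so there is no inherent sparsity to exploit in this baseline."""
    x_hat, gamma, var, eps = cache
    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx_hat = dout * gamma
    dx = (dx_hat - dx_hat.mean(axis=0)
          - x_hat * np.mean(dx_hat * x_hat, axis=0)) / np.sqrt(var + eps)
    return dx, dgamma, dbeta
```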
