Abstract

Training deep and convolutional neural networks (DNNs/CNNs) requires a large amount of memory and iterative computation, which necessitates speedup and energy reduction, especially for edge devices with resource and energy constraints. In this work, we present an 8-bit floating-point (FP8) training processor which implements: 1) highly parallel tensor cores (fused multiply–add trees) that maintain high utilization throughout the forward propagation (FP), backward propagation (BP), and weight update (WU) phases of training; 2) hardware-efficient channel gating for dynamic output-activation sparsity; 3) dynamic weight sparsity (WS) based on group Lasso; and 4) gradient skipping based on the FP prediction error. We develop a custom instruction set architecture (ISA) to flexibly support different CNN topologies and training parameters. The 28-nm prototype chip demonstrates large improvements in floating-point-operation (FLOP) reduction (7.3$\times$), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7$\times$), for both supervised and self-supervised training tasks.
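To make the group-Lasso-based dynamic weight sparsity concrete, the following is a minimal software sketch, not the chip's actual mechanism: the abstract does not specify the grouping, regularization strength, or pruning threshold used on the processor, so the function names, array shapes, and parameter values below are illustrative assumptions only. The sketch treats each output channel of a convolutional layer as one group, penalizes the sum of group L2 norms, and zeroes out channels whose norm falls below a threshold, which is the kind of structured sparsity that hardware can exploit by skipping the corresponding multiply–accumulate operations.

```python
import numpy as np

def group_lasso_penalty(weights, lam=1e-4):
    """Group-Lasso regularizer over the output channels of a conv layer.

    weights: array of shape (out_channels, in_channels, kH, kW).
    Each output channel's filter is one group; the penalty is
    lam * sum_g ||w_g||_2, which drives whole channels toward zero.
    (lam is an illustrative value, not one reported in the paper.)
    """
    group_norms = np.sqrt((weights ** 2).reshape(weights.shape[0], -1).sum(axis=1))
    return lam * group_norms.sum()

def prune_channels(weights, threshold=1e-3):
    """Zero out entire output channels whose group norm is below a threshold,
    producing channel-level (structured) weight sparsity."""
    group_norms = np.sqrt((weights ** 2).reshape(weights.shape[0], -1).sum(axis=1))
    mask = group_norms >= threshold                     # keep only strong channels
    pruned = weights * mask[:, None, None, None]        # broadcast mask over filters
    return pruned, mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=(64, 32, 3, 3))     # hypothetical conv layer
    print("group-Lasso penalty:", group_lasso_penalty(w))
    pruned, mask = prune_channels(w, threshold=0.05)
    print("channels kept:", int(mask.sum()), "of", w.shape[0])
```

In training, the penalty term would be added to the task loss so that gradient descent shrinks unimportant channel groups, and the pruning step then converts the resulting small-norm groups into exact zeros that the sparse datapath can skip.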
