Abstract

Training deep neural networks (DNNs) on edge devices has attracted increasing attention in real-world applications for domain adaptation and privacy protection. However, deploying DNN training on resource-limited edge devices is challenging because training involves massive computation and data movement. To address this issue, we propose THETA, an energy-efficient training accelerator that employs a hybrid compression strategy. Various data redundancies are fully exploited, and true triple-side sparsity is achieved, so the computational complexity is drastically reduced with negligible accuracy loss across a range of transfer learning tasks. To facilitate triple-side zero-skipping operations during different training stages, we first present a novel sparse data representation and a triple-sparsity index-matching scheme. Second, a sparse tensor processing unit (STPU) arranged in a hierarchical structure is developed, enabling a flexible dataflow that processes convolutional (Conv) and fully connected (FC) layers with diverse computational patterns throughout the entire training. Third, an auxiliary processing unit (APU) is designed to execute postprocessing operations such as the rectified linear unit (ReLU) and on-the-fly pruning. Finally, the training accelerator is implemented in a Taiwan Semiconductor Manufacturing Company (TSMC) 28-nm process and evaluated on multiple benchmarks. The experimental results show that THETA achieves 7.28–22.32 tera operations per second (TOPS) and 45.24–133.70 TOPS/W in performance and energy efficiency, reducing training time by 40–72× and energy consumption by 19–63× compared with dense training. Compared with the prior art, our design offers 1.6× higher throughput and 1.9× higher energy efficiency.
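The abstract does not detail the index-matching hardware, but a minimal software sketch can illustrate the zero-skipping idea behind sparsity-aware computation: each operand is stored as its nonzero values plus their indices, and a multiply-accumulate is issued only when the indices of both operands match. This is only a hedged illustration; the representation and function names below are hypothetical and do not reflect the paper's actual sparse format, triple-sparsity index-matching scheme, or STPU dataflow.

```python
import numpy as np

def to_sparse(dense):
    """Store only nonzero values plus their positions (a generic value+index view)."""
    idx = np.flatnonzero(dense)
    return dense.flat[idx], idx

def sparse_dot(vals_a, idx_a, vals_b, idx_b):
    """Multiply-accumulate only where both operands hold a nonzero at the same
    position, i.e. index matching lets zeros on either side be skipped."""
    acc, i, j = 0.0, 0, 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:      # indices match -> useful work
            acc += vals_a[i] * vals_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:     # no partner -> the zero product is skipped
            i += 1
        else:
            j += 1
    return acc

# Tiny usage example with a sparse activation vector and a sparse weight vector
act = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0])
wgt = np.array([0.7, 0.0, 0.0, 0.0, 0.5, 0.0])
va, ia = to_sparse(act)
vw, iw = to_sparse(wgt)
assert np.isclose(sparse_dot(va, ia, vw, iw), act @ wgt)
```

In this toy form, the number of multiplications scales with the number of matched nonzero pairs rather than with the vector length, which is the software analogue of the compute savings the accelerator seeks from triple-side sparsity.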
