Abstract

Due to the increase in the size of Deep Neural Networks (DNNs), special-purpose hardware such as Google's Tensor Processing Unit (TPU) and Eyeriss has gained prominence for accelerating the forward pass of the network. The heart of these accelerators is a matrix multiplication unit based on a systolic array architecture. This array processor has a grid-like structure made of individual Processing Elements (PEs) that can be extended along rows and columns. A lot of work has been done in the past on the implementation of the computing array and its reliability concerns. However, its fault tolerance with respect to DNNs has not yet been fully characterized with a fault model. In this paper, we first present a fault model, i.e., the different sequences in which faults can occur on the array. We classify the fault modes into random, row, and column faults, and study their impact on the accuracy of DNNs, followed by the overheads of the mitigation strategies.

Pruning is the process of removing redundant parameters in a model to decrease the network size for efficient performance. Although several pruning techniques have been developed to reduce inference time on general-purpose and special-purpose systems, model compression (pruning) under faulty scenarios has not yet been explored. In the second part of our work, we co-design a Fault based and Array size based Pruning (FPAP) algorithm with the intent of bypassing faults and removing internal redundancy at the same time for efficient inference. We compare our method with recent pruning methods under different fault scenarios and array sizes. We achieve a mean speedup of 4.2x, where the baselines achieve 1.6x, on ConvNet, NiN, AlexNet, and VGG16 over Eyeriss in the case of random faults.
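To make the three fault modes concrete, the following minimal sketch (not taken from the paper; all names, the array size, and the fault-injection scheme are assumptions for illustration) marks faulty PEs on a systolic array under random, row, and column fault modes.

```python
# Illustrative sketch only: marking faulty PEs on a rows x cols systolic array
# under the three fault modes described in the abstract. The function name,
# parameters, and injection scheme are assumptions, not the paper's method.
import numpy as np

def inject_faults(rows, cols, mode, count, seed=0):
    """Return a boolean mask where True marks a faulty PE."""
    rng = np.random.default_rng(seed)
    faulty = np.zeros((rows, cols), dtype=bool)
    if mode == "random":
        # `count` individual PEs fail at arbitrary positions.
        idx = rng.choice(rows * cols, size=count, replace=False)
        faulty.flat[idx] = True
    elif mode == "row":
        # `count` entire rows of the array fail.
        faulty[rng.choice(rows, size=count, replace=False), :] = True
    elif mode == "column":
        # `count` entire columns of the array fail.
        faulty[:, rng.choice(cols, size=count, replace=False)] = True
    return faulty

# Example: a 12x14 PE grid (an Eyeriss-like size) with two faulty columns.
mask = inject_faults(12, 14, mode="column", count=2)
print(mask.sum(), "faulty PEs")
```

A mask like this can then drive either a mitigation strategy (remapping work away from faulty PEs) or a fault-aware pruning pass, which is the co-design direction the abstract describes.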
