Abstract

Deep Neural Networks (DNNs) have been widely deployed in safety-critical applications such as autonomous vehicles, healthcare, and space systems. While DNN models have long suffered from intrinsic algorithmic inaccuracies, the increasing rate of hardware transient faults in computer systems is raising safety and reliability concerns in these applications. This paper investigates the impact of DNN misclassifications caused by hardware transient faults and by intrinsic algorithmic inaccuracy in safety-critical applications. We first extend a state-of-the-art fault injector for TensorFlow applications, TensorFI, to support fault injection on modern DNN models in a scalable way, then characterize the outcome classes of the models and analyze them using safety-related metrics. Finally, we conduct a large-scale fault injection experiment to measure failures according to these metrics and study their impact on safety. We observe that failures caused by hardware transient faults can have a much more significant impact (up to 4 times higher probability) on safety-critical applications than DNN algorithmic inaccuracies, advocating the need to protect DNNs from hardware faults in safety-critical applications.
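The fault model behind injectors like TensorFI is typically a single bit-flip in a layer's output tensor. As a rough illustrative sketch (not TensorFI's actual implementation; the function names `flip_random_bit` and `inject_transient_fault` are hypothetical), the code below corrupts one randomly chosen element of an activation tensor and checks whether the top-1 classification changes:

```python
import random
import struct

import numpy as np

def flip_random_bit(value: np.float32) -> np.float32:
    """Flip one randomly chosen bit in the IEEE-754 encoding of a float32,
    emulating a single-bit transient hardware fault."""
    as_int = struct.unpack("<I", struct.pack("<f", value))[0]
    bit = random.randrange(32)
    return np.float32(struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))[0])

def inject_transient_fault(activations: np.ndarray) -> np.ndarray:
    """Corrupt a single randomly chosen element of a layer's output tensor."""
    faulty = activations.copy()
    idx = tuple(random.randrange(d) for d in faulty.shape)
    faulty[idx] = flip_random_bit(faulty[idx])
    return faulty

# Example: inject a fault into a hypothetical softmax output and check
# whether the prediction flips, i.e., whether the fault causes a
# misclassification rather than being masked.
probs = np.array([0.05, 0.85, 0.10], dtype=np.float32)
faulty_probs = inject_transient_fault(probs)
print("misclassified:", probs.argmax() != faulty_probs.argmax())
```

Repeating such injections over many inputs and layers, and classifying each outcome (masked, benign misclassification, safety-critical misclassification), is the kind of large-scale experiment the paper describes.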
