TC-SEPM: Characterizing soft error resilience of CNNs on Tensor Cores from program and microarchitecture perspectives

Xiaohui Wei, Changbao Zhou, Hengshan Yue, Joey Tianyi Zhou

doi:10.1016/j.sysarc.2023.103024

Abstract

Tensor Cores are architectural CNN accelerators integrated into NVIDIA’s GPUs, and existing research mainly focuses on improving their performance. However, the highly integrated Tensor Cores are vulnerable to transient faults (i.e., soft errors), which can cause catastrophic consequences in safety-critical applications such as autonomous driving. It is therefore imperative to estimate the reliability of CNNs on Tensor Cores. Yet obtaining a statistically significant resilience profile of CNNs on Tensor Cores with existing fault injection (FI)-based reliability estimation methods is expensive. To this end, we build TC-SEPM to predict the error resilience of CNNs on Tensor Cores in place of FI methods. To ensure the accuracy of TC-SEPM, we first investigate resilience-related features from the program and microarchitecture perspectives. Then, leveraging these heuristic features, we train machine learning models to learn the hidden relationship between error resilience and the investigated features, enabling us to predict the impact of soft errors in Tensor Cores on CNN output. Experimental results show that TC-SEPM achieves high accuracy for both individual soft error resilience prediction and overall program resilience estimation, while its overhead is only 1/27 that of FI methods. Additionally, TC-SEPM can provide valuable insights for programmers and architects designing more robust CNN models on Tensor Cores.

Full Text