Deep Neural Networks (DNNs) have permeated many applications, including cutting-edge safety-critical domains, which require substantial computational power, often provided by Graphics Processing Units (GPUs). GPUs are manufactured with advanced semiconductor technologies that can be affected by faults during the operational phase (e.g., due to wear-out, aging, or harsh environments). The effects of these faults may reach the DNN outputs, in some cases with catastrophic consequences. Hence, hardware-aware reliability assessment of DNNs is crucial for safety-critical systems, in line with the regulations and standards of each application domain. Application-level fault injection (FI) techniques (i.e., DNN parameter corruption) are often adopted for the reliability evaluation of DNNs; unfortunately, these approaches poorly represent the fault effects that originate in GPU hardware. This work proposes an FI strategy based on Hardware-Injection-Through-Program-Transformation (HITPT) that mimics the effect of permanent faults (PFs) at the GPU instruction level, enabling an effective assessment of the impact of PFs on DNN reliability. Our approach provides a good trade-off between the accuracy of the fault-effect evaluation and the required computational time. Using the proposed approach, we systematically assessed, for the first time, the effects of PFs in GPUs executing several sample DNNs. The results indicate that faults injected closer to the hardware with our evaluation strategy can produce a larger accuracy degradation than typical application-level FI, which modifies only the DNN parameters. Furthermore, the proposed FI methodology provides insight into the most suitable fault-tolerance solutions (e.g., selective hardening or design diversity) to apply at the thread level inside GPU kernels.
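
To make the contrast between the two injection levels concrete, the following is a minimal Python sketch, not the HITPT tool itself. It assumes a toy fault model (a single stuck-at bit in a float32 value) and a hypothetical `dot` function standing in for one neuron; the names `stuck_at`, `dot`, and the chosen bit position are illustrative assumptions, not taken from the work. Application-level FI corrupts one stored weight once, whereas the instruction-level variant corrupts the result of every multiplication, as a permanently faulty multiplier would.

```python
import struct


def float_to_bits(x):
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b):
    """Interpret a 32-bit unsigned int as an IEEE-754 float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def stuck_at(value, bit, stuck_value):
    """Force one bit of the float32 encoding of value to 0 or 1 (assumed fault model)."""
    bits = float_to_bits(value)
    mask = 1 << bit
    bits = (bits | mask) if stuck_value else (bits & ~mask)
    return bits_to_float(bits)


def dot(weights, inputs, fault=None):
    """Toy neuron: dot product of weights and inputs.

    If fault is a (bit, stuck_value) pair, every multiplication result is
    corrupted, mimicking a permanent fault in the multiplier.
    """
    acc = 0.0
    for w, x in zip(weights, inputs):
        prod = w * x
        if fault is not None:
            prod = stuck_at(prod, *fault)
        acc += prod
    return acc


weights = [0.5, -0.25, 1.0, 0.75]
inputs = [1.0, 2.0, -1.0, 4.0]
FAULT = (23, 1)  # hypothetical choice: exponent LSB stuck at 1

# Fault-free reference output.
golden = dot(weights, inputs)

# Application-level FI: corrupt a single stored parameter, then run normally.
corrupted_weights = list(weights)
corrupted_weights[0] = stuck_at(weights[0], *FAULT)
app_level = dot(corrupted_weights, inputs)

# Instruction-level FI: the same fault affects every multiply that executes.
instr_level = dot(weights, inputs, fault=FAULT)

print(golden, app_level, instr_level)
```

With these toy values the fault-free output is 2.0, the single corrupted weight yields 2.5, and the faulty multiplier yields 5.0. This only illustrates why a permanent hardware fault can touch far more operations than one corrupted parameter; it says nothing about the magnitude of the effect on a real DNN running on a GPU.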