Graphics Processing Units (GPUs) are increasingly adopted in several domains where reliability is fundamental, such as self-driving cars and autonomous systems. Unfortunately, GPU devices have been shown to have a high error rate, while the constraints imposed by real-time safety-critical applications make traditional (and costly) replication-based hardening solutions inadequate. This work proposes an effective methodology to identify the architectural vulnerable sites in GPUs modules, i.e., the locations that, if corrupted, most affect the correct instructions execution. We first identify, through an innovative method based on Register-Transfer Level (RTL) fault injection experiments, the architectural vulnerabilities of a GPU model. Then, we mitigate the fault impact via selective hardening applied to the flip-flops that have been identified as critical. We evaluate three hardening strategies: Triple Modular Redundancy (TMR), Triple Modular Redundancy against SETs ( <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\Delta $ </tex-math></inline-formula> TMR), and Dual Interlocked Storage Cells (Dice flip-flops). The results gathered on a publicly available GPU Model (FlexGripPlus) considering functional units, pipeline registers, and warp scheduler controller show that our method can tolerate from 85% to 99% of faults in the pipeline registers, from 50% to 100% of faults in the functional units and up to 10% of faults in the warp scheduler, with a reduced hardware overhead (in the range of 58% to 94% when compared with traditional TMR). Finally, we adapt the methodology to perform a complementary evaluation targeting permanent faults and identify critical sites prone to propagate fault effects across the GPU. We found that a considerable percentage (65% to 98%) of flip-flops that are critical for transient faults are also critical for permanent faults.
Read full abstract