Selective Fault Tolerance for Register Files of Graphics Processing Units

Marcio Goncalves, Ivan Lamb, Paolo Rech, Fernando Fernandes, Jose Rodrigo Azambuja

doi:10.1109/tns.2019.2903027

Abstract

The high computing efficiency of graphics processing units (GPUs) makes them attractive for both high-performance computing and safety-critical applications, such as those in the automotive and aerospace domains. In both application domains, reliability is a major concern. This paper aims to provide guidelines for improving the reliability of GPU register files without jeopardizing the device’s computing efficiency. We advance the knowledge of GPU reliability by investigating register file criticality, which is the probability that a fault in a register propagates and affects the computation. We then propose and validate selective fault-tolerance techniques for GPU register files that can be applied at the hardware or software level. Results show that both implementations are well suited to detecting faults that affect the computation. However, whereas hardware-implemented techniques are able to detect faults that trigger a crash, software-implemented techniques may not provide sufficient coverage for crashes.

Full Text