Graphics processing units (GPUs) are widely used for general-purpose high-performance computing (GPGPU), where applications require reliable execution in the presence of soft errors. To support massive thread-level parallelism, GPUs adopt a sizeable register file, which is highly vulnerable to soft errors. Although modern commercial GPUs protect the register file with a single-error-correction double-error-detection (SEC-DED) error correction code (ECC), this protection consumes a considerable amount of energy because of frequent register accesses and the leakage power of ECC storage. In this article, we propose to leverage the error sensitivity of instructions, the duplication of values in same-named registers, and the error sensitivity of data bits to build a unified energy-efficient ECC mechanism for the GPGPU register file (Eff-ECC), which consists of instruction-aware ECC (IA-ECC), duplication-aware ECC (DA-ECC), and bit-aware ECC (BA-ECC). Considering the error sensitivity of instructions, IA-ECC implements ECC only for the registers written by critical instructions. Observing that same-named registers across threads usually hold the same data, DA-ECC avoids unnecessary ECC generation and verification for duplicate register values. Leveraging the inherent error tolerance of programs, BA-ECC protects only the significant bits of registers against critical errors. Experimental results demonstrate that Eff-ECC reduces the energy consumption of traditional SEC-DED ECC by 86.46%.
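The three ideas can be illustrated with a toy model. The following Python sketch is not the Eff-ECC hardware design; it only counts how much ECC work and storage a warp-wide register write might need under simplified assumptions. The warp width of 32, the `critical` flag, the one-encoding-per-distinct-value policy, and the function names are all hypothetical choices made for illustration.

```python
from typing import List

WARP_SIZE = 32  # assumed number of threads sharing a same-named register


def secded_check_bits(data_bits: int) -> int:
    """Check bits of a Hamming SEC-DED code protecting `data_bits` data bits."""
    r = 0
    # Smallest r with 2^r >= data_bits + r + 1 gives single-error correction.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


def ecc_encodings_for_write(values: List[int], critical: bool) -> int:
    """Count ECC encodings for one warp-wide register write in this toy model.

    Instruction-aware idea: writes of non-critical instructions are skipped.
    Duplication-aware idea (assumed policy): one encoding per distinct value.
    """
    if not critical:
        return 0
    return len(set(values))


if __name__ == "__main__":
    uniform = [7] * WARP_SIZE           # every thread holds the same value
    divergent = list(range(WARP_SIZE))  # every thread holds a different value
    print(ecc_encodings_for_write(uniform, critical=True))
    print(ecc_encodings_for_write(divergent, critical=True))
    # Bit-aware idea: protecting only the upper 16 bits of a 32-bit register
    # needs fewer check bits than protecting all 32 bits.
    print(secded_check_bits(32), secded_check_bits(16))
```

In this model a uniform warp write needs a single encoding where a divergent one needs 32, and protecting only half of a 32-bit register lowers the SEC-DED check-bit count from 7 to 6. Real energy savings depend on access patterns and circuit-level costs that the sketch does not model.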