Characterization and remediation for soft error reliability on GPU

Fritz Gerald Previlon

doi:10.17760/d20323961

Abstract

Graphics Processing Units (GPUs) have become the accelerator of choice for improving the performance of many of the most demanding applications. While the performance of these devices continues to improve generation after generation, their reliability has not been studied rigorously. Several sources of errors can undermine the reliability of these devices, including radiation-induced transient faults, environmental perturbations, and process, temperature, or voltage variations. In particular, transient faults in GPU execution have become a significant threat to high-performance computing (HPC) and safety-critical applications. HPC systems experience transient faults every few tens of hours, and the trend is expected to worsen. A key point in the study of transient faults and their effects on user programs is that some faults do not cause undesirable results in the affected programs. This is especially true for GPU applications. Past efforts to study and understand GPU vulnerability to transient faults have demonstrated that reliability can vary greatly across applications. Some GPU applications are intrinsically resilient: transient faults are unlikely to affect the results they produce. Other applications are highly sensitive and are likely to crash or produce incorrect results when a transient fault strikes. While it is generally a good idea to protect the GPU hardware from transient faults, the penalties incurred in performance, power, and area are not always justifiable, depending on the applications using the hardware resources. Understanding the relationship between underlying program characteristics and their implications for vulnerability is crucial. The inherent resilience of applications should be carefully considered when designing mechanisms to protect against transient faults.
In this thesis, we focus on the program characteristics that contribute to vulnerability. We offer several methodologies that aim to alleviate the prohibitively expensive process of quantifying and estimating the vulnerability of GPU applications, as this is the first step toward improving the reliability of GPUs. Our analyses also demonstrate that, beyond the variability across applications, the vulnerability of a GPU application varies during its runtime and exhibits phase behavior. This phase behavior opens new opportunities not only for more efficient vulnerability estimation but also for more efficient fault mitigation. We demonstrate a methodology for application designers to reduce the cost of protection against transient faults.
