Abstract

Graphics Processing Units (GPUs) have become the accelerator of choice for improving the performance of many of the most demanding applications. While the performance of these devices continues to improve generation after generation, their reliability has not been studied rigorously. Several sources of errors can undermine the reliability of these devices, including radiation-induced transient faults, environmental perturbations, and process, temperature, or voltage variations. In particular, transient faults in GPU execution have become a significant threat to high-performance computing (HPC) and safety-critical applications. HPC systems experience transient faults every few tens of hours, and the trend is expected to worsen. A key point in the study of transient faults and their effects on user programs is that some faults do not cause undesirable results in the affected programs. This is especially true for GPU applications. Past efforts to study and understand GPU vulnerability to transient faults have demonstrated that reliability can vary greatly across applications. Some GPU applications are intrinsically resilient: transient faults are unlikely to affect the results they produce. Other applications are highly sensitive to transient faults and are likely to crash or produce incorrect results when a fault occurs. While it is generally a good idea to protect GPU hardware from transient faults, the penalties incurred in performance, power, and area are not always justifiable, depending on the applications that use the hardware. Understanding the relationship between the underlying program characteristics and their implications for vulnerability is therefore crucial. The inherent resilience of applications should be carefully considered when designing protection mechanisms against transient faults.
In this thesis, we focus on the program characteristics that contribute to the vulnerability of GPU applications. We offer several methodologies that aim to alleviate the prohibitively expensive process of quantifying and estimating the vulnerability of GPU applications, as this is the first step toward improving the reliability of GPUs. Our analyses also demonstrate that, beyond the variability in vulnerability across applications, the vulnerability of a GPU application varies during its execution and exhibits phase behavior. This phase behavior opens new opportunities not only for more efficient vulnerability estimation, but also for more efficient fault mitigation. We demonstrate a methodology that application designers can use to reduce the cost of protection against transient faults.
