Abstract

As GPU applications expand, GPU reliability is drawing more attention, since even reliability-critical applications now run on GPUs. Silent data corruption (SDC) has been widely studied in both irradiation and fault injection experiments. Detectable uncorrected errors (DUEs), in contrast, are not well studied. This work focuses on DUEs reported by the GPU driver and analyzes those observed in two kinds of experiments: neutron irradiation, and fault injection in which faults are injected into the control flow to change the program counter value unexpectedly. GPU engine exceptions, GPU memory page faults, and GPU processing stops are observed in both experiments. In contrast, the DUE that the GPU driver categorizes as an internal microcontroller halt is observed frequently under neutron irradiation but not at all in the fault injection experiment, suggesting the need to investigate failures that originate from faults in components invisible to programmers.
