Abstract
Resilient computation has been an emerging topic in the field of high-performance computing (HPC). In particular, studies show that tolerating faults on leadership-class supercomputers (such as exascale supercomputers) is expected to be one of the main challenges. In this paper, we utilize dynamic binary instrumentation and virtual machine based fault injection to emulate soft errors and study the soft errors' impact on the behavior of applications. We propose Chaser, a fine-grained, accountable, flexible, and efficient fault injection framework built on top of QEMU. Chaser offers just-in-time fault injection, the ability to trace fault propagation, and flexible and programable interfaces. In the case study, we demonstrate the usage of Chaser on Matvec and a real DOE mini MPI application
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.