Abstract

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline through the comparison of the final computation results of two duplicated computations, but this approach often introduces significant overhead. This paper presents Online-ABFT, a simple but efficient online soft error detection technique that can detect soft errors in the widely used Krylov subspace iterative methods in the middle of the program execution so that the computation efficiency can be improved through the termination of the corrupted computation in a timely manner soon after a soft error occurs. Based on a simple verification of orthogonality and residual, Online-ABFT is easy to implement and highly efficient. Experimental results demonstrate that, when this online error detection approach is used together with checkpointing, it improves the time to obtain correct results by up to several orders of magnitude over the traditional offline approach.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call