Abstract

It is well known that soft errors in linear algebra operations can be detected off-line at the end of the computation using algorithm-based fault tolerance (ABFT). However, traditional ABFT usually cannot correct errors in Cholesky, QR, and LU factorizations because an error in one matrix element propagates to many other matrix elements and hence causes too many errors to correct. Although tremendous progress has recently been made in correcting errors in LU and QR factorizations, these new techniques correct errors off-line at the end of the computation, after the errors have propagated and accumulated, which significantly complicates the error correction process and introduces overhead that grows at least quadratically with the number of errors. In this paper, we present the design and implementation of FT-ScaLAPACK, a fault-tolerant version of ScaLAPACK that is able to detect, locate, and correct errors in Cholesky, QR, and LU factorizations on-line, in the middle of the computation, before the errors propagate and accumulate. FT-ScaLAPACK has been validated with thousands of cores on Stampede at the Texas Advanced Computing Center. Experimental results demonstrate that FT-ScaLAPACK achieves performance and scalability comparable to those of the original ScaLAPACK.
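To make the detect, locate, and correct idea concrete, the following is a minimal sketch of the classic checksum-based ABFT scheme applied to a small matrix product. It is not the FT-ScaLAPACK implementation and does not cover factorizations; it only illustrates how row and column checksums can pinpoint and repair a single corrupted element. All function names are hypothetical and chosen for illustration, and the sketch assumes at most one corrupted data element.

```python
# Sketch of classic checksum-based ABFT on a matrix product.
# Assumption: at most one data element is corrupted. Names are hypothetical.

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append a column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(c_full, tol=1e-9):
    """Verify a full-checksum product and repair one corrupted data element.

    The last row and last column of c_full hold the checksums.
    Returns the (row, col) that was repaired, or None if all checks pass.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    # A data row is inconsistent if its elements do not add up to its checksum.
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    # Likewise for each data column.
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("mismatch cannot be attributed to a single data element")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the faulty element from the row checksum and the other row entries.
    c_full[i][j] = c_full[i][m] - sum(c_full[i][k] for k in range(m) if k != j)
    return (i, j)

# Demo: multiply checksum-augmented inputs, inject one error, then repair it.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
c_full[0][1] += 100.0                 # simulated soft error
repaired_at = locate_and_correct(c_full)
```

The intersection of the one inconsistent row and the one inconsistent column locates the error. Once an error has spread to several elements, as happens when a factorization continues past the fault, that intersection is no longer unique, which is the motivation for checking on-line rather than only at the end.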
