Extending checksum-based ABFT to tolerate soft errors online in iterative methods

Longxiang Chen,Zizhong Chen,Dingwen Tao,Panruo Wu

doi:10.1109/padsw.2014.7097827

Abstract

As the size and complexity of high performance computers increase, more soft errors will be encountered during computations. Algorithm-Based Fault Tolerance (ABFT) has been proved to be a highly efficient technique to detect soft errors in dense linear algebra operations including matrix multiplication, Cholesky and LU factorization. While ABFT can also be applied to a iterative sparse linear algebra algorithm via applying it to every individual matrix-vector multiplication in the algorithm, it often introduces considerable overhead. In this paper, we propose novel extensions to ABFT to not only reduce the overhead but also protect computations that can not be protected by existing ABFT. Instead of maintaining checksums in every individual matrix-vector multiplication, we modified the algorithms so that checksums established at the beginning of the algorithms can be maintained at every iterations throughout the algorithms. Because soft errors in most iterative sparse linear algebra algorithms will propagate from one iteration to another, we do not have to verify the correctness of the checksums at each iteration to detect errors. By reducing the frequency of verification, the fault tolerance overhead can be greatly reduced. Experimental results demonstrate that, when used with local diskless checkpoints together, our approach introduces much less overhead than the existing ABFT techniques.

Full Text