A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication

Valentin Le Fèvre, Yves Robert, Julien Langou, Thomas Herault

doi:10.1007/978-3-030-71593-9_24

Abstract

This paper compares several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by solving a small-size linear system of equations, or by recomputing corrupted coefficients. We show that both approaches can be used for RC. We provide a synthetic presentation of all methods before discussing their pros and cons. We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes.

Keywords: Resilience, Matrix-matrix multiplication, Algorithm-based fault tolerance (ABFT), Residual checking (RC), Silent errors
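To make the ABFT idea concrete, here is a minimal pure-Python sketch of the classic checksum encoding for matrix-matrix multiplication, with correction by recomputing the corrupted coefficient. It is not the authors' BLAS-based implementation. The function names (`abft_encode`, `abft_locate`, `abft_correct`), the tolerance `tol`, and the assumption that errors strike only the data part of the product (never the checksum row or column) are all illustrative choices.

```python
def matmul(A, B):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def abft_encode(A, B):
    """Append a column-checksum row to A and a row-checksum column to B."""
    Ac = A + [[sum(col) for col in zip(*A)]]   # extra last row: column sums of A
    Br = [row + [sum(row)] for row in B]       # extra last column: row sums of B
    return Ac, Br

def abft_locate(Cf, tol=1e-9):
    """Return indices of data rows and columns whose sums disagree with the checksums."""
    n, m = len(Cf) - 1, len(Cf[0]) - 1
    bad_rows = [i for i in range(n) if abs(sum(Cf[i][:m]) - Cf[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(Cf[i][j] for i in range(n)) - Cf[n][j]) > tol]
    return bad_rows, bad_cols

def abft_correct(A, B, Cf, bad_rows, bad_cols):
    """Recompute every coefficient at the intersection of a bad row and a bad column."""
    for i in bad_rows:
        for j in bad_cols:
            Cf[i][j] = sum(A[i][p] * B[p][j] for p in range(len(B)))
    return Cf

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
Ac, Br = abft_encode(A, B)
Cf = matmul(Ac, Br)          # full-checksum product: data block plus checksums
Cf[0][1] += 100.0            # inject a silent error into one data coefficient
rows, cols = abft_locate(Cf)
abft_correct(A, B, Cf, rows, cols)
```

In this sketch a single corrupted coefficient shows up as exactly one mismatching row and one mismatching column, so only one dot product has to be recomputed. With floating-point data the comparison needs a tolerance that accounts for rounding, which is why `tol` is a parameter rather than an exact equality test.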

Full Text