Abstract
General Matrix Multiplication (GEMM) is instrumental in a myriad of scientific, high-performance computing, and machine learning applications such as computer vision, recommendation models, and weather forecasting. Making GEMM fail-safe is vital in safety-critical and high-precision applications. Companies like Meta and Google have recently reported sporadic silent errors in GEMM computations traced to hardware sources. Silent errors are hard to detect and require specialized solutions. Hardware redundancy approaches such as double or triple modular redundancy effectively detect or correct such errors, but they incur a large area and power overhead. Algorithm-based Fault Tolerance (ABFT) has been shown to offer an effective alternative at a far lower overhead. Modern CPUs feature Advanced Vector Extensions (AVX) capable of executing SIMD instructions. This paper describes a new ABFT approach designed to take advantage of AVX. Our core algorithm relies on the classical tile-based outer-product approach but enhances the standard checksum calculation using a tile vector. The implementation parameters are fine-tuned to fit the available number of AVX registers. Our results indicate that we can achieve 100% error detection in GEMM at an overhead of just 0.21% for the integer data type. For floating-point data, however, addition is not associative because of rounding errors, which creates difficulties for ABFT. To mitigate the impact of rounding errors, we introduce the concept of relative error checking and perform error analysis for various error classes to show that the proposed approach eliminates false positives.
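To illustrate the checksum idea that ABFT builds on, the following is a minimal sketch of the classical column-checksum test for C = AB, which verifies that (eᵀA)B equals eᵀC, where e is the all-ones vector. It is not the tile-vector AVX algorithm described in this paper. The names `matmul`, `column_checksum_ok`, and `rel_tol` are hypothetical, and the tolerance rule (a relative bound with a floor of 1.0 on the scale) is an assumption chosen for illustration, not the paper's relative error check.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> Matrix:
    """Plain triple-loop GEMM, C = A * B (no tiling, no SIMD)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(m):
                c[i][j] += a_ip * b[p][j]
    return c


def column_checksum_ok(a, b, c, rel_tol: float = 0.0) -> bool:
    """Check the ABFT invariant (e^T A) B == e^T C, column by column.

    rel_tol = 0.0 demands exact equality, which suits integer data.
    A positive rel_tol accepts small rounding differences for floats
    (assumed rule: |expected - actual| <= rel_tol * max(|.|, |.|, 1)).
    """
    k, m = len(b), len(b[0])
    # e^T A: sum each column of A, giving a vector of length k.
    a_col_sums = [sum(row[p] for row in a) for p in range(k)]
    for j in range(m):
        expected = sum(a_col_sums[p] * b[p][j] for p in range(k))
        actual = sum(row[j] for row in c)  # column j of e^T C
        scale = max(abs(expected), abs(actual), 1.0)
        if abs(expected - actual) > rel_tol * scale:
            return False
    return True
```

With integer inputs and `rel_tol=0.0`, flipping a single bit in one element of `c` changes that column's sum and the check fails. With floating-point inputs, an exact comparison can reject a correct result because the two sides accumulate rounding differently, so a small positive `rel_tol` is needed to avoid false positives.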