Abstract

General Matrix Multiplication (GEMM) is a computationally expensive operation used in many applications, such as machine learning. Hardware accelerators are increasingly popular for speeding up GEMM computation, with Tiled Matrix Multiplication (TMUL) in recent Intel processors being one example. Unfortunately, the TMUL hardware is susceptible to errors, necessitating online error detection. Algorithm-based Error Detection (ABED) is a powerful technique for detecting errors in matrix multiplication. In this article, we consider an implementation of ABED that integrates seamlessly with the TMUL hardware to minimize performance overhead. However, rounding errors introduced by floating-point operations do not allow a straightforward implementation of ABED in TMUL. Previous work considered an error bound to address rounding errors in ABED. If the error detection threshold is set too tight, it triggers false alarms, whereas a loose bound allows errors to escape detection. In this article, we propose an adaptive error threshold that takes the TMUL input values into account to address the problem of false alarms and error escapes, and we provide a taxonomy of the various error classes. This threshold is obtained from theoretical error analysis but is not easy to implement in hardware. Consequently, we relax the threshold so that it can be computed easily in hardware. While ABED ensures error-free computation, it does not guarantee full coverage of all hardware faults. To address this problem, we propose an algorithmic pattern generation technique that ensures full coverage of all hardware faults. To evaluate the benefits of our proposed solution, we conducted fault injection experiments and show that our approach does not produce any false alarms or detection escapes for observable errors. We conducted additional fault injection experiments on a Deep Neural Network (DNN) model and find that a fault that goes undetected does not cause any misclassification.
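
The checksum idea behind ABED, combined with a threshold that scales with the input values, can be illustrated with a minimal sketch. It is not the article's threshold or its hardware implementation: the function names `matmul` and `abed_check`, the `safety` factor, and the bound `safety * (rows + inner) * u * sum(|A|)·|B|` are assumptions, chosen here from a textbook-style rounding-error bound for dot products in double precision. For C = A·B, the column sums of C should equal (column sums of A)·B up to rounding, so a mismatch larger than the threshold is flagged.

```python
import sys


def matmul(A, B):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def abed_check(A, B, C, safety=4.0):
    """Return the column indices of C whose checksum mismatch exceeds the threshold.

    The threshold is an illustrative assumption, not the article's bound.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    u = sys.float_info.epsilon / 2  # unit roundoff of double precision
    # Checksum row of A (sum over rows) and the same sums over |A|.
    a_sum = [sum(A[i][p] for i in range(rows)) for p in range(inner)]
    a_abs = [sum(abs(A[i][p]) for i in range(rows)) for p in range(inner)]
    flagged = []
    for j in range(cols):
        expected = sum(a_sum[p] * B[p][j] for p in range(inner))
        observed = sum(C[i][j] for i in range(rows))
        # Input-dependent magnitude: larger operands tolerate larger rounding error.
        magnitude = sum(a_abs[p] * abs(B[p][j]) for p in range(inner))
        threshold = safety * (rows + inner) * u * magnitude
        if abs(expected - observed) > threshold:
            flagged.append(j)
    return flagged


A = [[1.5, -2.0], [0.25, 4.0]]
B = [[3.0, 0.5], [-1.0, 2.0]]
C = matmul(A, B)
clean = abed_check(A, B, C)    # fault-free result: no column flagged
C[0][1] += 1e-3                # inject an error into one output element
faulty = abed_check(A, B, C)   # column 1 is flagged
print(clean, faulty)
```

Because the threshold grows with the magnitudes of the operands, the same check avoids false alarms on large inputs while still catching small corruptions on small inputs, which is the trade-off the abstract describes between tight and loose bounds.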

