A Novel Fault-Tolerant Architecture for Tiled Matrix Multiplication

Sandeep Bal, Victor Da Cruz Ferreira, Sandip Kundu, Chandra Sekhar Mummidi, Sudarshan Srinivasan

doi:10.23919/date56975.2023.10136985

Abstract

General matrix multiplication (GEMM) is common to many scientific and machine-learning applications. Convolution, the dominant computation in Convolutional Neural Networks (CNNs), can be formulated as a GEMM problem. Because GEMM is so widely used, a new generation of processors accelerates it in hardware. Intel recently announced the Advanced Matrix Extensions (AMX®) instruction set for GEMM, which is supported by 1 kB AMX registers and a Tile Matrix Multiply unit (TMUL) that multiplies tiles (sub-matrices) in hardware. Silent Data Corruption (SDC) is a well-known problem in which hardware produces corrupt output without signaling an error. Google and Meta recently reported SDC affecting GEMM in their data centers. Algorithm-Based Fault Tolerance (ABFT) is an efficient mechanism for detecting and correcting errors in GEMM, but classic ABFT solutions are not optimized for hardware acceleration. In this paper, we present a novel ABFT scheme implemented directly in hardware. Although the exact implementation of Intel's TMUL is not public, we propose two TMUL architectures that represent two design points in the area-power-performance spectrum and illustrate how ABFT can be incorporated directly into the TMUL hardware. This approach has two advantages: (i) errors can be detected concurrently at the tile level, rather than only after the full matrix multiplication has completed; and (ii) we demonstrate that performing ABFT at the hardware level has no performance impact and incurs only a small area, latency, and power overhead.
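The abstract does not spell out the checksum scheme, so the following is only a minimal software sketch of classic checksum-based ABFT applied to a single tile product, not the hardware design proposed in the paper. It assumes integer tiles (so checksums can be compared exactly) and uses hypothetical helper names (`matmul`, `tile_with_checksums`, `check_tile`). Row and column checksums of the result tile are predicted from the inputs, and a mismatch in one row and one column locates a single corrupted element.

```python
def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def tile_with_checksums(a, b):
    """Multiply one tile pair and predict the result's row/column sums."""
    # Column sums of A (a 1 x K vector) and row sums of B (a K x 1 vector).
    col_sum_a = [sum(row[k] for row in a) for k in range(len(a[0]))]
    row_sum_b = [sum(row) for row in b]
    c = matmul(a, b)
    # Column sums of C must equal (column sums of A) * B.
    expected_col = matmul([col_sum_a], b)[0]
    # Row sums of C must equal A * (row sums of B).
    expected_row = [r[0] for r in matmul(a, [[s] for s in row_sum_b])]
    return c, expected_row, expected_col


def check_tile(c, expected_row, expected_col):
    """Return indices of rows and columns whose sums disagree."""
    bad_rows = [i for i, row in enumerate(c) if sum(row) != expected_row[i]]
    bad_cols = [j for j in range(len(c[0]))
                if sum(r[j] for r in c) != expected_col[j]]
    return bad_rows, bad_cols


# Illustrative 2 x 2 tiles (assumed values, not taken from the paper).
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c, expected_row, expected_col = tile_with_checksums(a, b)
clean = check_tile(c, expected_row, expected_col)

# Inject a single fault into the result tile, then detect and locate it.
c[1][0] += 4
faulty = check_tile(c, expected_row, expected_col)

# Correct the located element using the row-checksum discrepancy.
i, j = faulty[0][0], faulty[1][0]
c[i][j] -= sum(c[i]) - expected_row[i]
```

Because the check runs per tile, a fault is flagged as soon as that tile's product is available. With floating-point data the equality tests would need a tolerance, which this sketch omits.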


Similar Papers

A Highly-Efficient Error Detection Technique for General Matrix Multiplication using Tiled Processing on SIMD Architecture
Chandra Sekhar Mummidi ... Sandip Kundu
01 Oct 2022

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
Shixun Wu ... Bryan Wong
21 Jun 2023

Evaluation of Algorithm-Based Fault Tolerance for Machine Learning and Computer Vision under Neutron Radiation
Seth Roffe ... Alan D. George
01 Mar 2020

An Approximate GEMM Unit for Energy-Efficient Object Detection
Ratko Pilipović ... Janko Božič
Sensors (Basel, Switzerland), vol. 21
18 Jun 2021
