Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores

Pierre Blanchard, Srikara Pranesh, Florent Lopez, Nicholas J. Higham, Theo Mary

doi:10.1137/19m1289546

Abstract

Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision, and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give detailed rounding error analyses of both kernels. An important application is to GMRES-based iterative refinement with block FMAs, about which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. For these we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which differ in the precision used for the output of the tensor cores. Our experiments on an NVIDIA V100 GPU confirm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, and they show that the accuracy boost is obtained with almost no performance loss.
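The difference between the TC16 and TC32 forms can be illustrated with a small CPU simulation. The following is a minimal sketch, not the authors' implementation and not a model of real tensor core hardware: it assumes fp16 inputs, forms each block product in fp32, and rounds the accumulator to either fp16 (TC16-like) or fp32 (TC32-like) after every block FMA. The function names `block_fma_matmul` and `max_relative_error`, the block size `b = 4`, and the test matrices are all hypothetical choices made for illustration.

```python
import numpy as np


def block_fma_matmul(A, B, b=4, out_dtype=np.float32):
    """Accumulate C = A @ B through a sequence of simulated block FMAs.

    Assumption: inputs are rounded to fp16, each b-wide block product is
    formed in fp32, and the accumulator is rounded to out_dtype after
    every step (np.float16 for TC16-like, np.float32 for TC32-like).
    """
    A16 = np.asarray(A).astype(np.float16)
    B16 = np.asarray(B).astype(np.float16)
    m, n = A16.shape
    C = np.zeros((m, B16.shape[1]), dtype=out_dtype)
    for k in range(0, n, b):
        Ak = A16[:, k:k + b].astype(np.float32)
        Bk = B16[k:k + b, :].astype(np.float32)
        # One simulated block FMA: C <- round(C + Ak @ Bk) in out_dtype.
        C = (C.astype(np.float32) + Ak @ Bk).astype(out_dtype)
    return C


def max_relative_error(C, A, B):
    """Error against a float64 product of the fp16-rounded inputs."""
    A64 = np.asarray(A).astype(np.float16).astype(np.float64)
    B64 = np.asarray(B).astype(np.float16).astype(np.float64)
    ref = A64 @ B64
    return np.max(np.abs(C.astype(np.float64) - ref)) / np.max(np.abs(ref))


# Hypothetical test data: uniform entries in [0, 1), inner dimension 256.
rng = np.random.default_rng(0)
A = rng.random((64, 256))
B = rng.random((256, 64))
err_tc16 = max_relative_error(block_fma_matmul(A, B, out_dtype=np.float16), A, B)
err_tc32 = max_relative_error(block_fma_matmul(A, B, out_dtype=np.float32), A, B)
print(f"TC16-like error: {err_tc16:.2e}")
print(f"TC32-like error: {err_tc32:.2e}")
```

The reference product uses the fp16-rounded inputs, so the reported errors measure accumulation error only and exclude the cost of converting the inputs to half precision.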

Highlights

A new development in high performance computing is the emergence of hardware supporting low precision floating-point formats such as the 16-bit IEEE half precision format and the 16-bit bfloat16 format [23].
With $u_{\mathrm{FMA}} = u_{\mathrm{low}}$, the bound is reduced by a factor of approximately $b$, while with $u_{\mathrm{FMA}} = u_{\mathrm{high}}$, the factor of improvement is even larger and equal to $\min(n/2, u_{\mathrm{low}}/u_{\mathrm{high}})$.
With $u = 0$, the bounds are even smaller: for $u_{\mathrm{FMA}} = u_{\mathrm{low}}$ the improvement is negligible, since it amounts to removal of the $n u_{\mathrm{high}}$ term, while for $u_{\mathrm{FMA}} = u_{\mathrm{high}}$ and $n u_{\mathrm{high}} \gg 2u_{\mathrm{low}}$ the bound is reduced by a factor of approximately $b$.
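To give a feel for the improvement factor min(n/2, u_low/u_high) quoted above, the sketch below evaluates it for a few problem sizes. It assumes that u_low is the unit roundoff of IEEE fp16 (2^-11) and u_high that of IEEE fp32 (2^-24), which matches the fp16/fp32 tensor core setting but is an assumption of this example; the function name `improvement_factor` is hypothetical.

```python
U_LOW = 2.0 ** -11   # unit roundoff of IEEE fp16 (assumed low precision)
U_HIGH = 2.0 ** -24  # unit roundoff of IEEE fp32 (assumed high precision)


def improvement_factor(n, u_low=U_LOW, u_high=U_HIGH):
    """Evaluate min(n/2, u_low/u_high) for inner dimension n."""
    return min(n / 2, u_low / u_high)


for n in (64, 1024, 16384, 10**6):
    print(n, improvement_factor(n))
```

Under these assumptions the ratio u_low/u_high is 2^13 = 8192, so the factor grows as n/2 until n reaches 16384 and stays at 8192 for larger n.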

Summary

Introduction

A new development in high performance computing is the emergence of hardware supporting low precision floating-point formats such as the 16-bit IEEE half precision format (fp16) and the 16-bit bfloat16 format [23]. We present algorithms for matrix multiplication and LU factorization with a block FMA and give detailed rounding error analyses of them.



Journal: SIAM Journal on Scientific Computing: a publication of the Society for Industrial and Applied Mathematics
Publication Date: Jan 1, 2020
Citations: 37
License type: cc-by
