Abstract

Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. The techniques we employ include multiprecision LU factorization, the preconditioned generalized minimal residual algorithm (GMRES), and scaling and auto-adaptive rounding to avoid overflow. We also show how to efficiently handle systems with multiple right-hand sides. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
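The core technique of the abstract can be illustrated with a minimal NumPy/SciPy sketch of mixed-precision iterative refinement: factorize A once in low precision (here FP32 on CPU, standing in for the FP16/FP32 Tensor Core factorization), then refine the solution with residuals computed in FP64. The function name and the well-conditioned test matrix are illustrative, not taken from the paper's implementation.

```python
# Illustrative mixed-precision iterative refinement (FP32 LU, FP64 refinement).
# This is a CPU sketch of the idea, not the paper's Tensor Core implementation.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iters=50):
    """Factorize once in FP32, then refine the FP64 solution iteratively."""
    lu_piv = lu_factor(A.astype(np.float32))           # low-precision LU (done once)
    x = lu_solve(lu_piv, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b - A @ x                                  # residual in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = lu_solve(lu_piv, r.astype(np.float32))     # cheap low-precision correction
        x = x + d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)        # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # FP64-level relative residual
```

Despite the FP32 factorization, the refined solution reaches an FP64-level residual, which is the numerical-stability claim the abstract makes for the Tensor Core version.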

Highlights

  • A fundamental requirement in scientific computing is the ability to solve a system of linear equations Ax = b, where A is a large, dense n × n non-singular matrix.

  • To address this problem on GPU Tensor Cores (TCs), we develop a number of innovations for mixed-precision computations as well as leverage building blocks from high-performance computing (HPC) numerical libraries such as CUSOLVER and MAGMA, which provide state-of-the-art, high-performance algorithms such as LU factorization—including a set of highly tuned mixed-precision iterative refinement algorithms using either FP32 or FP16 as the lower precision for the LU factorization (e.g. FP32 → FP64 and FP16 → FP64) [8,9].

  • We note that the number of iterations that we report is the number of generalized minimal residual algorithm (GMRES) iterations, which is totalled across all GMRES calls in the case of the iterative refinement with preconditioned GMRES (IRGM) solver.

Summary

Introduction

A fundamental requirement in scientific computing is the ability to solve a system of linear equations. A persistent challenge has been to redesign the techniques for new architectures and to develop highly tuned implementations that resolve computational issues such as inefficient parallelization, scaling and use of mixed-precision calculations. To address this problem on GPU TCs, we develop a number of innovations for mixed-precision computations (outlined in §3) as well as leverage building blocks from HPC numerical libraries such as CUSOLVER and MAGMA, which provide state-of-the-art, high-performance algorithms such as LU factorization—including a set of highly tuned mixed-precision iterative refinement algorithms using either FP32 or FP16 as the lower precision for the LU factorization (e.g. FP32 → FP64 and FP16 → FP64) [8,9].
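The scaling mentioned above can be illustrated with a hedged sketch in the spirit of two-sided diagonal scaling before rounding to FP16; this is not the paper's exact auto-adaptive algorithm. Row and column scalings pull all entries of A into the FP16 representable range, and a safety factor theta (an assumed parameter here) leaves headroom against growth during factorization.

```python
# Illustrative two-sided diagonal scaling before FP16 conversion.
# Assumes a dense non-singular A with no zero rows or columns.
import numpy as np

def scale_and_round_fp16(A, theta=0.1):
    """Scale so that mu * R * A * S fits in FP16, then round down (sketch)."""
    R = 1.0 / np.max(np.abs(A), axis=1)              # row scaling: each row max -> 1
    B = R[:, None] * A
    S = 1.0 / np.max(np.abs(B), axis=0)              # column scaling: each col max -> 1
    mu = theta * np.finfo(np.float16).max            # headroom below FP16 max (~65504)
    A16 = (mu * B * S[None, :]).astype(np.float16)
    return A16, R, S, mu                             # solve (mu R A S) y = mu R b, x = S y

A = np.array([[1e6, 2.0], [3.0, 4e-7]])             # entries outside the FP16 range
naive = A.astype(np.float16)                         # direct cast overflows to inf
A16, R, S, mu = scale_and_round_fp16(A)
print(np.isinf(naive).any(), np.isfinite(A16).all())  # True True
```

The direct cast overflows (FP16 cannot represent 1e6), whereas the scaled matrix rounds to finite FP16 values; the diagonal scalings are then folded back into the right-hand side and solution, as sketched in the return comment.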

2. Related work
3. Contributions
4. Iterative refinement solver: background
5. Correction
6. Multiprecision factorizations
7. Preconditioned GMRES
8. Scaling techniques
9. Multiple right-hand side optimizations
10. Sensitivity of performance to FP64 compute throughput
11. Energy efficient implementation
12. Performance analysis
13. Experimental set-up
14. Numerical behaviour
15. Performance
16. Energy efficiency
17. Conclusions and future directions