Abstract

Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. The techniques we employ include multiprecision LU factorization, the preconditioned generalized minimal residual algorithm (GMRES), and scaling and auto-adaptive rounding to avoid overflow. We also show how to efficiently handle systems with multiple right-hand sides. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
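The core technique of the abstract can be illustrated with a minimal NumPy/SciPy sketch of mixed-precision iterative refinement: factorize A once in low precision (here FP32 on CPU, standing in for the FP16/FP32 Tensor Core factorization), then refine the solution with residuals computed in FP64. The function name and the well-conditioned test matrix are illustrative, not taken from the paper's implementation.

```python
# Illustrative mixed-precision iterative refinement (FP32 LU, FP64 refinement).
# This is a CPU sketch of the idea, not the paper's Tensor Core implementation.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iters=50):
    """Factorize once in FP32, then refine the FP64 solution iteratively."""
    lu_piv = lu_factor(A.astype(np.float32))           # low-precision LU (done once)
    x = lu_solve(lu_piv, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b - A @ x                                  # residual in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = lu_solve(lu_piv, r.astype(np.float32))     # cheap low-precision correction
        x = x + d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)        # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # FP64-level relative residual
```

Despite the FP32 factorization, the refined solution reaches an FP64-level residual, which is the numerical-stability claim the abstract makes for the Tensor Core version.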

Highlights

  • A fundamental requirement in scientific computing is the ability to solve a system of linear equations Ax = b, where A is a large, dense n × n non-singular matrix.

  • To address this problem on GPU Tensor Cores (TCs), we develop a number of innovations for mixed-precision computations as well as leverage building blocks from high-performance computing (HPC) numerical libraries such as CUSOLVER and MAGMA, which provide state-of-the-art, high-performance algorithms such as LU factorization—including a set of highly tuned mixed-precision iterative refinement algorithms using either FP32 or FP16 as the lower precision for the LU factorization (e.g. FP32 → FP64 and FP16 → FP64) [8,9].

  • We note that the number of iterations that we report is the number of generalized minimal residual algorithm (GMRES) iterations, which is totalled across all GMRES calls in the case of the iterative refinement with preconditioned GMRES (IRGM) solver.

Summary

Introduction

A fundamental requirement in scientific computing is the ability to solve a system of linear equations. A persistent challenge has been to redesign the techniques for new architectures and to develop highly tuned implementations that resolve computational issues such as inefficient parallelization, scaling and use of mixed-precision calculations. To address this problem on GPU TCs, we develop a number of innovations for mixed-precision computations (outlined in §3) as well as leverage building blocks from HPC numerical libraries such as CUSOLVER and MAGMA, which provide state-of-the-art, high-performance algorithms such as LU factorization—including a set of highly tuned mixed-precision iterative refinement algorithms using either FP32 or FP16 as the lower precision for the LU factorization (e.g. FP32 → FP64 and FP16 → FP64) [8,9].
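The scaling mentioned above can be illustrated with a hedged sketch in the spirit of two-sided diagonal scaling before rounding to FP16; this is not the paper's exact auto-adaptive algorithm. Row and column scalings pull all entries of A into the FP16 representable range, and a safety factor theta (an assumed parameter here) leaves headroom against growth during factorization.

```python
# Illustrative two-sided diagonal scaling before FP16 conversion.
# Assumes a dense non-singular A with no zero rows or columns.
import numpy as np

def scale_and_round_fp16(A, theta=0.1):
    """Scale so that mu * R * A * S fits in FP16, then round down (sketch)."""
    R = 1.0 / np.max(np.abs(A), axis=1)              # row scaling: each row max -> 1
    B = R[:, None] * A
    S = 1.0 / np.max(np.abs(B), axis=0)              # column scaling: each col max -> 1
    mu = theta * np.finfo(np.float16).max            # headroom below FP16 max (~65504)
    A16 = (mu * B * S[None, :]).astype(np.float16)
    return A16, R, S, mu                             # solve (mu R A S) y = mu R b, x = S y

A = np.array([[1e6, 2.0], [3.0, 4e-7]])             # entries outside the FP16 range
naive = A.astype(np.float16)                         # direct cast overflows to inf
A16, R, S, mu = scale_and_round_fp16(A)
print(np.isinf(naive).any(), np.isfinite(A16).all())  # True True
```

The direct cast overflows (FP16 cannot represent 1e6), whereas the scaled matrix rounds to finite FP16 values; the diagonal scalings are then folded back into the right-hand side and solution, as sketched in the return comment.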

2. Related work
3. Contributions
4. Iterative refinement solver: background
5. Correction
6. Multiprecision factorizations
7. Preconditioned GMRES
8. Scaling techniques
9. Multiple right-hand side optimizations
10. Sensitivity of performance to FP64 compute throughput
11. Energy efficient implementation
12. Performance analysis
13. Experimental set-up
14. Numerical behaviour
15. Performance
16. Energy efficiency
17. Conclusions and future directions