Acceleration of Tensor-Product Operations with Tensor Cores

Cu Cui

doi:10.1145/3695466

Abstract

In this paper, we explore the acceleration of tensor-product operations in finite element methods using the Tensor Cores of the NVIDIA A100 GPU. We provide an accessible overview of the necessary mathematical background and discuss our implementation strategies. Our study focuses on two common programming approaches for NVIDIA Tensor Cores: the C++ Warp Matrix Functions in nvcuda::wmma and the inline Parallel Thread Execution (PTX) instructions mma.sync.aligned. A significant focus is placed on the versatile inline PTX instructions combined with a conflict-free shared memory access pattern, which proves key to achieving high performance. Compared with traditional CUDA Cores, our approach yields a 2.3-fold increase in double-precision performance, reaching 8 TFLOPS, or 45% of the theoretical peak. Furthermore, numerical experiments in half precision show a fourfold speedup in solving the 3D Poisson equation with the flexible GMRES (FGMRES) method preconditioned by a multigrid method, while maintaining the same discretization error as the double-precision computation. These results highlight the considerable benefits of using Tensor Cores for finite element operators with tensor-product structure, achieving a favorable balance between computational speed and precision.
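
To make "tensor-product operation" concrete, the sketch below applies a 1D matrix A (n x n) along each of the three directions of a 3D coefficient array, once directly and once by sum factorization, a technique commonly used to evaluate such finite element operators. Each sum-factorized pass is equivalent to one dense product of A with an n x n^2 reshaping of the data, which is the kind of small matrix-matrix work Tensor Cores execute. This is a plain CPU reference written under our own assumptions (row-major n x n x n layout, the same matrix in every direction, and hypothetical helper names idx, apply_naive, apply_sum_factorized and max_abs_diff); it is not the paper's GPU implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Flat index into an n x n x n array stored row-major (last index fastest).
inline std::size_t idx(std::size_t n, std::size_t i, std::size_t j, std::size_t k) {
    return (i * n + j) * n + k;
}

// Direct evaluation of the tensor-product operator (A x A x A) u:
//   v[a][b][c] = sum_{i,j,k} A[a][i] * A[b][j] * A[c][k] * u[i][j][k]
// Cost: n^3 outputs times n^3 terms each, i.e. O(n^6).
std::vector<double> apply_naive(const std::vector<double>& A,
                                const std::vector<double>& u, std::size_t n) {
    assert(A.size() == n * n && u.size() == n * n * n);
    std::vector<double> v(n * n * n, 0.0);
    for (std::size_t a = 0; a < n; ++a)
        for (std::size_t b = 0; b < n; ++b)
            for (std::size_t c = 0; c < n; ++c) {
                double s = 0.0;
                for (std::size_t i = 0; i < n; ++i)
                    for (std::size_t j = 0; j < n; ++j)
                        for (std::size_t k = 0; k < n; ++k)
                            s += A[a * n + i] * A[b * n + j] * A[c * n + k] * u[idx(n, i, j, k)];
                v[idx(n, a, b, c)] = s;
            }
    return v;
}

// Sum factorization: contract one direction at a time.
// Each pass costs n^4 multiply-adds, so 3 n^4 in total instead of n^6.
std::vector<double> apply_sum_factorized(const std::vector<double>& A,
                                         const std::vector<double>& u, std::size_t n) {
    assert(A.size() == n * n && u.size() == n * n * n);
    std::vector<double> t1(n * n * n, 0.0), t2(n * n * n, 0.0), v(n * n * n, 0.0);
    // Pass 1: t1[i][j][c] = sum_k A[c][k] * u[i][j][k]   (last direction)
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t c = 0; c < n; ++c)
                for (std::size_t k = 0; k < n; ++k)
                    t1[idx(n, i, j, c)] += A[c * n + k] * u[idx(n, i, j, k)];
    // Pass 2: t2[i][b][c] = sum_j A[b][j] * t1[i][j][c]  (middle direction)
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t b = 0; b < n; ++b)
            for (std::size_t c = 0; c < n; ++c)
                for (std::size_t j = 0; j < n; ++j)
                    t2[idx(n, i, b, c)] += A[b * n + j] * t1[idx(n, i, j, c)];
    // Pass 3: v[a][b][c] = sum_i A[a][i] * t2[i][b][c]   (first direction)
    for (std::size_t a = 0; a < n; ++a)
        for (std::size_t b = 0; b < n; ++b)
            for (std::size_t c = 0; c < n; ++c)
                for (std::size_t i = 0; i < n; ++i)
                    v[idx(n, a, b, c)] += A[a * n + i] * t2[idx(n, i, b, c)];
    return v;
}

// Largest entrywise difference between two arrays of equal size.
double max_abs_diff(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double m = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        m = std::fmax(m, std::fabs(x[i] - y[i]));
    return m;
}
```

Both functions compute the same result up to rounding; the saving from O(n^6) to 3 n^4 operations is what makes the element-level work a sequence of small, dense matrix products. How those products are mapped onto Tensor Core fragments and shared memory in the paper is not reproduced here.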

Journal: ACM Transactions on Parallel Computing