EGEMM-TC

Boyuan Feng, Guoyang Chen, Yufei Ding, Yuke Wang, Weifeng Zhang, Yuan Xie

doi:10.1145/3437801.3441599

Abstract

Nvidia Tensor Cores achieve high performance on half-precision matrix inputs tailored to deep learning workloads. However, this limits their use in scientific computing, where precision requirements are higher. In this paper, we build Emulated GEMM on Tensor Cores (EGEMM-TC), which extends Tensor Cores to accelerate scientific computing applications without compromising their precision requirements. First, EGEMM-TC employs an extensible workflow of hardware profiling and operation design to generate a lightweight extended-precision emulation algorithm on Tensor Cores. Second, EGEMM-TC applies a set of Tensor Core kernel optimizations to achieve high performance, including highly efficient tensorization that exploits the Tensor Core memory architecture and instruction-level optimizations that coordinate the emulation computation with memory access. Third, EGEMM-TC incorporates a hardware-aware analytic model that offers broad flexibility for automatic performance tuning across various scientific computing workloads and input datasets. Extensive evaluations show that EGEMM-TC can achieve on average 3.13× and 11.18× speedup over the cuBLAS kernels and the CUDA-SDK kernels on CUDA Cores, respectively. Our case study on several scientific computing applications further confirms that EGEMM-TC generalizes the usage of Tensor Cores and achieves about 1.8× speedup compared to hand-tuned, highly optimized implementations running on CUDA Cores.
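EGEMM-TC derives its own emulation algorithm from hardware profiling, and that algorithm is not reproduced here. As a rough illustration of the general idea of extended-precision emulation on half-precision units, the sketch below uses the commonly described split of each value into a high FP16 part and a low FP16 residual, multiplies the parts as FP16 inputs, and accumulates in FP32. This is a generic technique from prior work, not necessarily the paper's method. The helper names (`to_half`, `to_single`, `split`, `dot_fp16`, `dot_emulated`) are hypothetical, rounding is simulated with Python's `struct` module, and the accumulation order and rounding of real Tensor Core hardware will differ. Inputs are assumed to be in a moderate range, so no scaling is applied.

```python
import struct

def to_half(x: float) -> float:
    """Round x to the nearest IEEE-754 binary16 value (a Tensor Core input format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_single(x: float) -> float:
    """Round x to the nearest IEEE-754 binary32 value (simulated FP32 accumulator)."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def split(x: float):
    """Hypothetical helper: split x into FP16 parts so that x ~= hi + lo."""
    hi = to_half(x)
    lo = to_half(x - hi)  # residual that the high part could not represent
    return hi, lo

def dot_fp16(a, b) -> float:
    """Baseline: plain FP16 inputs, FP32 accumulation (one half-precision pass)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_single(acc + to_half(x) * to_half(y))
    return acc

def dot_emulated(a, b) -> float:
    """Emulated dot product using three FP16 passes: hi*hi + hi*lo + lo*hi.

    Each FP16 x FP16 product is exact in FP32 (11 + 11 significand bits <= 24),
    so only the FP32 accumulation rounds. The lo*lo term, the smallest
    contribution, is dropped here to save one pass at a small cost in accuracy.
    """
    acc = 0.0
    for x, y in zip(a, b):
        xh, xl = split(x)
        yh, yl = split(y)
        acc = to_single(acc + xh * yh)
        acc = to_single(acc + xh * yl)
        acc = to_single(acc + xl * yh)
    return acc
```

For example, with `a = b = [1/3]` the single FP16 pass is off from 1/9 by roughly 5e-5, while the three-pass emulation lands within about 1e-7. The trade-off is three half-precision multiplies per product instead of one, which is why the kernel optimizations described above matter for keeping the emulation fast.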
