Improving GPU Throughput through Parallel Execution Using Tensor Cores and CUDA Cores

Khoa Ho, Saraju Mohanty, Hui Zhao, Adwait Jog

doi:10.1109/isvlsi54635.2022.00051

Abstract

To accelerate Machine Learning applications, recent GPUs use Tensor cores to speed up general matrix multiplication (GEMM), which is at the heart of deep learning. The Streaming Processors in such GPUs also contain CUDA cores for general-purpose computation. While Tensor cores can significantly improve GEMM performance, the CUDA cores remain idle while the Tensor cores are running, which leads to inefficient resource utilization. In this work, we propose offloading part of the GEMM operations from Tensor cores to CUDA cores to fully utilize GPU resources. We investigate the performance bottleneck in such offloading schemes and propose architectural optimizations to maximize GPU throughput. Our technique is purely hardware-based and does not require a new compiler or other software support. Our evaluation shows that the proposed scheme improves performance by up to 19%.
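
The work-splitting idea in the abstract can be illustrated with a small software analogy. The sketch below is not the authors' mechanism, which is described as purely hardware-based; it only shows how a GEMM can be partitioned by output rows between two execution paths, and what the ideal gain would be if both ran fully in parallel. The function names and the throughput figures (8 and 1) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def matmul(a, b):
    """Plain GEMM on lists of lists; stands in for either execution path."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_rows(m, tensor_rate, cuda_rate):
    """Split m output rows in proportion to the two (assumed) throughputs,
    so that both paths would finish at roughly the same time."""
    cuda_rows = round(m * cuda_rate / (tensor_rate + cuda_rate))
    return m - cuda_rows, cuda_rows


def gemm_offloaded(a, b, tensor_rate, cuda_rate):
    """Compute a @ b in two row blocks, mimicking partial offloading."""
    tensor_rows, _ = split_rows(len(a), tensor_rate, cuda_rate)
    top = matmul(a[:tensor_rows], b)     # share kept on the "Tensor core" path
    bottom = matmul(a[tensor_rows:], b)  # share offloaded to the "CUDA core" path
    return top + bottom                  # row blocks concatenate to the full result


def ideal_speedup(tensor_rate, cuda_rate):
    """Upper bound on the gain over the Tensor path alone, ignoring all
    contention and scheduling overheads (an idealized assumption)."""
    return (tensor_rate + cuda_rate) / tensor_rate


if __name__ == "__main__":
    a = [[i + j for j in range(3)] for i in range(9)]
    b = [[1, 2], [3, 4], [5, 6]]
    # Hypothetical throughputs: Tensor path 8 units, CUDA path 1 unit.
    print(split_rows(len(a), 8, 1))
    print(gemm_offloaded(a, b, 8, 1) == matmul(a, b))
    print(ideal_speedup(8, 1))
```

In this idealized model the gain is bounded by the ratio of combined to Tensor-only throughput. The abstract notes that real offloading schemes run into a performance bottleneck, which this sketch deliberately leaves out.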

Similar Papers

Leveraging GPU Tensor Cores for Double Precision Euclidean Distance Calculations
Benoit Gallet ... Michael Gowanlock
01 Dec 2022

Accelerating Sparse Deep Neural Network Inference Using GPU Tensor Cores
Yufei Sun ... Long Zheng
19 Sep 2022

Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks
Han Zhao ... Quan Chen
01 Oct 2021

EGEMM-TC
Boyuan Feng ... Yuan Xie
17 Feb 2021
