Efficient Mixed-Precision Tall-and-Skinny Matrix-Matrix Multiplication for GPUs

Hao Tang, Hiroaki Kobayashi, Masayuki Sato, Kazuhiko Komatsu

doi:10.15803/ijnc.11.2_267

Open Access

https://doi.org/10.15803/ijnc.11.2_267

Abstract

General matrix-matrix multiplication (GEMM) is a commonly used BLAS level-3 routine in big data analysis and scientific computing. To further enhance GEMM performance on GPUs, manufacturers have introduced dedicated hardware for tensor and matrix operations, called Tensor Core units, into modern GPU architectures. Mixed-precision GEMM based on Tensor Core units has been adopted by many BLAS libraries and deep learning frameworks. However, these implementations are usually designed for large square matrices and tend to perform poorly on irregularly shaped matrices, especially tall-and-skinny matrices. This paper discusses optimizing GEMM for tall-and-skinny matrices on GPUs with three optimization methods: task mapping, memory access, and efficient use of Tensor Core units by filling multiple fragments. First, the task mapping pattern of GEMM is optimized so that the implementation avoids launching too many thread blocks even when the input matrices are large. Second, the memory access pattern is optimized for half-precision tall-and-skinny matrices stored in row-major layout. Third, Tensor Core units are used effectively even for extremely skinny matrices by filling multiple fragments into a single Tensor Core operation. To examine the effectiveness of the proposed optimization methods, experiments are conducted on two cases of GEMM that take tall-and-skinny matrices as input. The evaluation results show that the optimized GEMM algorithms achieve speedups of 1.07x to 3.19x on NVIDIA V100 and 1.04x to 3.70x on NVIDIA A100 compared with the latest cuBLAS library. By reducing the number of Tensor Core operations and using the optimized memory access pattern, the optimized GEMM algorithms reduce the energy consumption of V100 and A100 by 34% to 74% and 62% to 82%, respectively.
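
The task-mapping idea in the abstract can be illustrated with a small CPU-side sketch. This is not the authors' kernel: it is a plain Python model under stated assumptions. It assumes one plausible tall-and-skinny case, C = A^T B with A of size m x n and B of size m x p where m is much larger than n and p, and it models "thread blocks" as a fixed number of loop iterations that each stride over chunks of the tall dimension. The names `to_half`, `tall_skinny_gemm`, `num_blocks`, and `chunk` are hypothetical. Inputs are rounded to half precision, while accumulation uses Python floats (double precision) as a stand-in for the higher-precision accumulators of a mixed-precision GEMM. The memory access and fragment-filling optimizations are not modeled.

```python
import struct


def to_half(x):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def tall_skinny_gemm(a, b, num_blocks=4, chunk=8):
    """Compute C = A^T B for tall-and-skinny A (m x n) and B (m x p).

    Illustrative model only: `num_blocks` plays the role of a fixed grid
    size, and `chunk` is the number of rows a block handles per step.
    """
    m, n, p = len(a), len(a[0]), len(b[0])

    # Mixed precision: inputs are stored in half precision ...
    a_h = [[to_half(v) for v in row] for row in a]
    b_h = [[to_half(v) for v in row] for row in b]

    # ... while each block accumulates into a private higher-precision tile.
    partials = [[[0.0] * p for _ in range(n)] for _ in range(num_blocks)]

    for block in range(num_blocks):
        # Block-stride loop: the number of blocks stays fixed as m grows,
        # each block simply processes more chunks of rows.
        for start in range(block * chunk, m, num_blocks * chunk):
            for k in range(start, min(start + chunk, m)):
                for i in range(n):
                    for j in range(p):
                        partials[block][i][j] += a_h[k][i] * b_h[k][j]

    # Final reduction of the per-block partial tiles into the small output.
    return [
        [sum(partials[blk][i][j] for blk in range(num_blocks)) for j in range(p)]
        for i in range(n)
    ]


if __name__ == "__main__":
    a = [[1.0, 2.0] for _ in range(100)]  # 100 x 2, tall and skinny
    b = [[0.5] for _ in range(100)]       # 100 x 1
    print(tall_skinny_gemm(a, b))
```

In this model the output tile stays small (n x p) regardless of m, so the work is split along the tall dimension and combined in a final reduction; on a GPU that reduction would typically be done with atomics or a second pass, which is a design choice this sketch does not take a position on.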

Journal: International Journal of Networking and Computing
Publication Date: Jan 1, 2021
Citations: 3
License type: free

Similar Papers

Matrix multiplication on batches of small matrices in half and half-complex precisions
Ahmad Abdelfattah ... Jack Dongarra
Journal of Parallel and Distributed Computing | VOL. 145
15 Jul 2020

Symbolic Matrix Multiplication for Multithreaded Sparse GEMM Utilizing Sparse Matrix Formats
Marcel Richter ... Gudula Runger
01 Jul 2018

Modeling the Energy Efficiency of GEMM using Optical Random Access Memory
Bingyi Zhang ... Ajey P Jacob
19 Sep 2022

KernelFaRer
João P L De Carvalho ... Ivan Korostelev
ACM Transactions on Architecture and Code Optimization | VOL. 18
28 Jun 2021
