Abstract

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than they are wide. In this case, NVIDIA’s current CUBLAS implementation delivers only a fraction of the potential performance as indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
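
As a rough illustration of the roofline bound referred to above, the memory-bound limit for a multiplication of the form C = Aᵀ B with tall & skinny factors A (K × M) and B (K × N) can be sketched as follows; the bandwidth figure in the example is an assumed ballpark value for a V100, not a number taken from the paper:

    % Memory-bound roofline estimate for C = A^T B, K >> M, N, double-precision real:
    % 2MNK flops against roughly 8K(M+N) bytes of memory traffic.
    P_{\mathrm{roof}} = \min\bigl(P_{\mathrm{peak}},\; I \cdot b_s\bigr),
    \qquad
    I = \frac{2MNK~\mathrm{flops}}{8K(M+N)~\mathrm{bytes}} = \frac{MN}{4(M+N)}\ \frac{\mathrm{flop}}{\mathrm{byte}}.

For example, M = N = 16 gives I = 2 flop/byte; with an assumed memory bandwidth of roughly 800 GB/s the achievable performance is capped at about 1.6 Tflop/s, far below the double-precision peak of the device, so the operation is strongly memory bound and the gap between this bound and CUBLAS is what motivates the work.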

Highlights

  • The general matrix-matrix multiplication (GEMM) is an essential linear algebra operation used in many numerical algorithms, and hardware vendors usually supply an implementation that is well optimized for their hardware

  • This paper presents the necessary implementation techniques to achieve near-perfect performance for two tall & skinny matrix-matrix multiplication variants on an NVIDIA V100 GPGPU with real- and complex-valued matrices

  • In comparison to the earlier conference version of this work, we have added a different variant of matrix-matrix multiplication (TSMM), provided a more in-depth performance analysis, extended the analysis to double-precision complex data types, and examined a new TSMTTSM thread mapping scheme


Summary

Introduction

The general matrix-matrix multiplication (GEMM) is an essential linear algebra operation used in many numerical algorithms, and hardware vendors usually supply an implementation that is well optimized for their hardware. In the case of NVIDIA, this is part of CUBLAS (NVIDIA, 2019a). Since these implementations are focused on mostly square matrices, they often perform poorly for matrices with unusual shapes. This paper covers two types of matrix multiplications with tall & skinny matrices, i.e. matrices that are much taller than they are wide. We define skinny as having a number of columns in the range [1, 64], and tall as having more than 10^6 rows. Both types of multiplications involve the two tall & skinny matrices A and B, with sizes K × M and K × N, respectively, K being the long dimension. We are interested in a highly efficient implementation of these operations using double-precision real and complex data types on the NVIDIA Volta GPGPU, which is nowadays used in many HPC systems.
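
To make the operations concrete, the following is a deliberately naive CUDA sketch of the TSMTTSM variant, C = Aᵀ B, using one thread per entry of the small M × N result. All names, the data layout (row-major with the long dimension K as the row index), and the launch configuration are assumptions made for illustration; this is not the generated, autotuned kernel described in the paper.

    // Naive TSMTTSM sketch: C (M x N) = A^T * B, with A (K x M) and B (K x N)
    // stored row-major, i.e. element (k, m) of A sits at A[k * M + m].
    // Illustration only -- not the tuned kernel from the paper.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void tsmttsm_naive(const double* A, const double* B, double* C,
                                  long K, int M, int N)
    {
        // One thread per entry of the small M x N result matrix.
        int m = blockIdx.y * blockDim.y + threadIdx.y;
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= M || n >= N) return;

        double acc = 0.0;
        for (long k = 0; k < K; ++k)       // serial reduction over the long dimension
            acc += A[k * M + m] * B[k * N + n];
        C[m * N + n] = acc;
    }

    int main()
    {
        const long K = 1L << 20;           // "tall": on the order of 10^6 rows
        const int  M = 4, N = 4;           // "skinny": few columns
        double *A, *B, *C;
        cudaMallocManaged(&A, K * M * sizeof(double));
        cudaMallocManaged(&B, K * N * sizeof(double));
        cudaMallocManaged(&C, (size_t)M * N * sizeof(double));
        for (long i = 0; i < K * M; ++i) A[i] = 1.0;
        for (long i = 0; i < K * N; ++i) B[i] = 2.0;

        dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
        tsmttsm_naive<<<grid, block>>>(A, B, C, K, M, N);
        cudaDeviceSynchronize();
        printf("C[0][0] = %.1f (expected %.1f)\n", C[0], 2.0 * K);

        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }

With only M·N threads in flight and a serial loop over K, a kernel like this cannot come close to saturating the memory bandwidth of a V100; exposing parallelism along the long dimension K and performing the resulting reduction efficiently is precisely the challenge addressed by the thread mapping, leap frogging, and global reduction schemes listed below.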

Application
Roofline model
Contribution
Related work
Hardware
Thread mapping options
TSMTTSM
Leap frogging
Global reduction
Thread mapping
Data from C
Transposition and leap frogging
Tile sizes
Analysis
Comparison with libraries
Impact of reductions
Unrolling
Source of C
Thread count
Conclusion and outlook