Performance Engineering for a Tall & Skinny Matrix Multiplication Kernels on GPUs

Dominik Ernst, Jonas Thies, Gerhard Wellein, Georg Hager

doi:10.1007/978-3-030-43229-4_43

Abstract

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often perform poorly for tall & skinny matrices, which are much taller than they are wide. In this case, NVIDIA's current CUBLAS implementation delivers only a fraction of the potential performance indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations, we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest, we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
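As a rough illustration of the roofline limit mentioned in the abstract, the sketch below estimates the performance ceiling for one plausible tall & skinny product. It assumes the operation C = A^T B with A of size K x M and B of size K x N, where K is much larger than M and N, so that memory traffic is dominated by streaming A and B once. The function name `roofline_gflops`, its parameters, and the hardware figures used in the example call are hypothetical placeholders, not values taken from the abstract.

```python
def roofline_gflops(m, n, peak_gflops, bandwidth_gbs,
                    bytes_per_elem=8, flops_per_fma=2):
    """Roofline ceiling in GFlop/s for C = A^T B (assumed operation).

    A is K x M and B is K x N with K >> M, N. Per row index k the kernel
    performs m*n fused multiply-adds and must load (m + n) elements;
    the small M x N result is assumed to stay on chip.
    Defaults model double-precision real data (8 bytes, 2 flops per FMA).
    """
    flops_per_row = flops_per_fma * m * n
    bytes_per_row = bytes_per_elem * (m + n)
    intensity = flops_per_row / bytes_per_row  # flop per byte
    # Performance is capped by either compute peak or memory bandwidth.
    return min(peak_gflops, intensity * bandwidth_gbs)


# Example with assumed (not measured) hardware numbers:
# 7000 GFlop/s compute peak, 800 GB/s memory bandwidth.
narrow = roofline_gflops(4, 4, peak_gflops=7000.0, bandwidth_gbs=800.0)
wide = roofline_gflops(1024, 1024, peak_gflops=7000.0, bandwidth_gbs=800.0)
print(narrow, wide)
```

Under these assumptions, narrow matrices are limited by memory bandwidth while wide ones reach the compute peak, which is why the attainable fraction of the roofline is the natural yardstick for tall & skinny kernels. For complex double-precision data one could pass `bytes_per_elem=16` and `flops_per_fma=8`.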


Publication Date: Jan 1, 2020
Citations: 3	License type: mit
