Abstract

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than they are wide. In this case, NVIDIA’s current CUBLAS implementation delivers only a fraction of the potential performance as indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
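
As a rough illustration of the roofline bound referred to above, the memory-bound limit for a multiplication of the form C = Aᵀ B with tall & skinny factors A (K × M) and B (K × N) can be sketched as follows; the bandwidth figure in the example is an assumed ballpark value for a V100, not a number taken from the paper:

    % Memory-bound roofline estimate for C = A^T B, K >> M, N, double-precision real:
    % 2MNK flops against roughly 8K(M+N) bytes of memory traffic.
    P_{\mathrm{roof}} = \min\bigl(P_{\mathrm{peak}},\; I \cdot b_s\bigr),
    \qquad
    I = \frac{2MNK~\mathrm{flops}}{8K(M+N)~\mathrm{bytes}} = \frac{MN}{4(M+N)}\ \frac{\mathrm{flop}}{\mathrm{byte}}.

For example, M = N = 16 gives I = 2 flop/byte; with an assumed memory bandwidth of roughly 800 GB/s the achievable performance is capped at about 1.6 Tflop/s, far below the double-precision peak of the device, so the operation is strongly memory bound and the gap between this bound and CUBLAS is what motivates the work.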

Highlights

  • The general matrix-matrix multiplication (GEMM) is an essential linear algebra operation used in many numerical algorithms, and hardware vendors usually supply an implementation that is well optimized for their hardware

  • This paper presents the necessary implementation techniques to achieve near-perfect performance for two tall & skinny matrix-matrix multiplication variants on an NVIDIA V100 GPGPU with real- and complex-valued matrices

  • In comparison to the earlier conference version of this work, we have added a different variant of matrix-matrix multiplication (TSMM), provided a more in-depth performance analysis, extended the analysis to double-precision complex data types, and examined a new TSMTTSM thread mapping scheme


Summary

Introduction

The general matrix-matrix multiplication (GEMM) is an essential linear algebra operation used in many numerical algorithms, and hardware vendors usually supply an implementation that is well optimized for their hardware. In the case of NVIDIA, this is part of CUBLAS (NVIDIA, 2019a). Since these implementations are focused on mostly square matrices, they often perform poorly for matrices with unusual shapes. This paper covers two types of matrix multiplications with tall & skinny matrices, i.e. matrices that are much taller than they are wide. We define skinny as having a number of columns in the range [1, 64], and tall as having more than 10^6 rows. Both types of multiplications involve the two tall & skinny matrices A and B, with sizes K × M and K × N, respectively, K being the long dimension. We are interested in a highly efficient implementation of these operations using double-precision real and complex data types on the NVIDIA Volta GPGPU, which is nowadays used in many HPC systems.
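
To make the operations concrete, the following is a deliberately naive CUDA sketch of the TSMTTSM variant, C = Aᵀ B, using one thread per entry of the small M × N result. All names, the data layout (row-major with the long dimension K as the row index), and the launch configuration are assumptions made for illustration; this is not the generated, autotuned kernel described in the paper.

    // Naive TSMTTSM sketch: C (M x N) = A^T * B, with A (K x M) and B (K x N)
    // stored row-major, i.e. element (k, m) of A sits at A[k * M + m].
    // Illustration only -- not the tuned kernel from the paper.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void tsmttsm_naive(const double* A, const double* B, double* C,
                                  long K, int M, int N)
    {
        // One thread per entry of the small M x N result matrix.
        int m = blockIdx.y * blockDim.y + threadIdx.y;
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= M || n >= N) return;

        double acc = 0.0;
        for (long k = 0; k < K; ++k)       // serial reduction over the long dimension
            acc += A[k * M + m] * B[k * N + n];
        C[m * N + n] = acc;
    }

    int main()
    {
        const long K = 1L << 20;           // "tall": on the order of 10^6 rows
        const int  M = 4, N = 4;           // "skinny": few columns
        double *A, *B, *C;
        cudaMallocManaged(&A, K * M * sizeof(double));
        cudaMallocManaged(&B, K * N * sizeof(double));
        cudaMallocManaged(&C, (size_t)M * N * sizeof(double));
        for (long i = 0; i < K * M; ++i) A[i] = 1.0;
        for (long i = 0; i < K * N; ++i) B[i] = 2.0;

        dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
        tsmttsm_naive<<<grid, block>>>(A, B, C, K, M, N);
        cudaDeviceSynchronize();
        printf("C[0][0] = %.1f (expected %.1f)\n", C[0], 2.0 * K);

        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }

With only M·N threads in flight and a serial loop over K, a kernel like this cannot come close to saturating the memory bandwidth of a V100; exposing parallelism along the long dimension K and performing the resulting reduction efficiently is precisely the challenge addressed by the thread mapping, leap frogging, and global reduction schemes listed below.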

Application
Roofline model
Contribution
Related work
Hardware
Thread mapping options
TSMTTSM
Leap frogging
Global reduction
Thread mapping
Data from C
Transposition and leap frogging
Tile sizes
Analysis
Comparison with libraries
Impact of reductions
Unrolling
Source of C
Thread count
Conclusion and outlook