Matrix-Matrix Multiplication Using Multiple GPUs Connected by Nvlink

Yea Rem Choi, Vsevolod Nikolskiy, Vladimir Stegailov

doi:10.1109/glosic50886.2020.9267865

Publication Date: Nov 17, 2020

Citations: 16

Affiliation: National Research University Higher School of Economics

Keywords: Sizes Of Tiles, Matrix-Matrix Multiplication

Abstract

In this work we present an original GPU-only parallel matrix-matrix multiplication algorithm (C = αA * B + βC) for servers with multiple GPUs connected by NVLink. The algorithm is implemented in CUDA. We examine its data transfer patterns, the overlap of communication and computation, and its overall performance. By adjusting the order of command calls and the tile sizes, we tune the algorithm so that asynchronous data transfers and kernel execution proceed without interruption. Two cases are considered: all data stored on a single GPU, and matrices distributed among several GPUs. The execution efficiency of the new algorithm is compared with that of cuBLAS-XT from the NVIDIA CUDA Toolkit.
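
The tile decomposition behind an update of the form C = αA * B + βC can be illustrated with a small CPU-only sketch. This is not the authors' CUDA implementation: the function name `tiled_gemm`, the `tile` and `num_devices` parameters, and the round-robin assignment of output tiles to devices are all assumptions made for illustration only.

```python
def tiled_gemm(alpha, A, B, beta, C, tile, num_devices=2):
    """Return (alpha*A*B + beta*C, owner), computed one output tile at a time.

    `owner` maps the top-left corner of each output tile to a hypothetical
    device index; the round-robin policy is an assumption, not the paper's.
    """
    m, k, n = len(A), len(B), len(B[0])
    # Start from beta*C, then accumulate alpha*A*B tile by tile.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    owner = {}
    tile_id = 0
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            owner[(i0, j0)] = tile_id % num_devices
            tile_id += 1
            # Walk the shared dimension in tile-sized steps.
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = 0
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        out[i][j] += alpha * acc
    return out, owner


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
result, owner = tiled_gemm(2, A, B, 3, C, tile=1)
print(result)
print(owner)
```

On real hardware, each output tile would typically be handled by a GPU kernel, with the tiles of A and B copied asynchronously so that transfers overlap with computation. The tile size then controls how much work is available to hide each transfer.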
