Comparing the performance of general matrix multiplication routine on heterogeneous computing systems

Aleksei Sorokin, Sergey Malkovsky, Georgiy Tsoy

doi:10.1016/j.jpdc.2021.10.002

Abstract

This paper presents the results of a study of the performance of the general matrix multiplication routine on modern heterogeneous computing systems. In addition to the single-threaded and multi-threaded performance of the routine for matrices of double-precision real and complex numbers on IBM POWER and Intel Xeon CPUs, the possibility of automatically offloading computations to NVIDIA GPUs, which is supported by certain BLAS library implementations, was studied. Special attention was paid to how performance is affected by the bandwidth of the interconnects that provide CPU-to-GPU communication. The results show that IBM computing systems with the high-speed NVLink interconnect demonstrate the best performance when performing matrix multiplication on GPUs. Accordingly, these systems can be used to accelerate tasks that rely on this routine without significantly altering existing software. It should be noted that the CPUs of the Intel computing system and the Intel MKL library show the best efficiency when performing operations on small matrices. The results can be used to develop approaches for improving the performance of software that uses the general matrix multiplication routine.

Full Text