Abstract

The cost of iteratively solving a sparse linear system for multiple right-hand sides is a common challenge in scientific computing. A tremendous number of algorithmic advances, such as eigenvector deflation and domain-specific multi-grid algorithms, have proven broadly beneficial in reducing this cost. However, they do not address the intrinsic memory-bandwidth constraints of the matrix–vector operation that dominates iterative solvers. Batching this operation across multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers naturally take advantage of such batched matrix–vector operations and further reduce the iterations to solution by sharing the Krylov space between solves. Practical implementations, however, typically suffer from vector–vector operation counts that scale quadratically with the number of right-hand sides. We present an implementation of the block Conjugate Gradient algorithm on NVIDIA GPUs which reduces the memory-bandwidth complexity of these vector–vector operations from quadratic to linear. As a representative case, we consider the domain of lattice quantum chromodynamics and present results for one of the fermion discretizations. Using the QUDA library as a framework, we demonstrate a 5× speedup compared to highly optimized independent Krylov solves on NVIDIA’s SaturnV cluster.
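To make the structure of the solver concrete, the following is a minimal NumPy sketch of the textbook block Conjugate Gradient iteration (O'Leary's formulation) for a symmetric positive-definite system with a block of right-hand sides. It illustrates the batched matrix–vector product and the small k×k reductions the abstract refers to; it is an illustrative assumption on our part, not the bandwidth-optimized GPU implementation in QUDA described by the paper.

```python
import numpy as np

def block_cg(A, B, tol=1e-10, max_iter=200):
    """Solve A X = B for an SPD matrix A (n x n) and a block of
    k right-hand sides B (n x k) with block Conjugate Gradient.

    One batched mat-vec (A @ P) serves all k vectors per iteration;
    the vector-vector work is folded into small k x k Gram matrices.
    """
    X = np.zeros_like(B)
    R = B - A @ X              # block residual (n x k)
    P = R.copy()               # block search directions (n x k)
    RtR = R.T @ R              # k x k Gram matrix of residuals
    for _ in range(max_iter):
        AP = A @ P             # one batched matrix-vector product
        alpha = np.linalg.solve(P.T @ AP, RtR)   # k x k block step size
        X += P @ alpha
        R -= AP @ alpha
        RtR_new = R.T @ R
        if np.sqrt(np.trace(RtR_new)) < tol:     # Frobenius norm of R
            break
        beta = np.linalg.solve(RtR, RtR_new)     # k x k block update
        P = R + P @ beta
        RtR = RtR_new
    return X
```

The shared Krylov space enters through the block updates: each right-hand side's new direction is a combination of all k residuals and directions via the k×k matrices `alpha` and `beta`, which is what accelerates convergence relative to k independent CG solves.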
