Abstract

Trading communication for redundant computation can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solvers. While $k$ iterations of the iterative solver can be unrolled to provide an $O(k)$ reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model, and the growth in redundant computation. This paper presents a systematic procedure to select this algorithmic parameter $k$, which governs the communication-computation tradeoff on hardware accelerators such as FPGAs and GPUs. We provide predictive models to understand this tradeoff and show how careful selection of $k$ can lead to performance improvements that would otherwise demand a significant increase in memory bandwidth. On an NVIDIA C2050 GPU, we demonstrate a $1.9\times$–$42.6\times$ speedup over standard iterative solvers across a range of benchmarks, and show that this speedup is limited by the growth in redundant computation. In contrast, for FPGAs, we present an architecture-aware algorithm that limits off-chip communication while allowing communication between the processing cores; this reduces redundant computation, permits larger $k$, and hence yields higher speedups. Across a range of benchmarks, our FPGA approach provides a $0.3\times$–$4.4\times$ speedup over same-generation GPU devices, with $k$ chosen carefully for both architectures.
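
The abstract describes unrolling $k$ solver iterations so that each processing core communicates once every $k$ steps, at the cost of redundant work in a ghost region whose depth grows with $k$. As a minimal illustrative sketch of this idea (not the authors' implementation), the NumPy example below applies it to plain Jacobi relaxation on a 1D Laplace stencil; the function name `k_step_jacobi` and the explicit `parts` partition list are hypothetical.

```python
import numpy as np

def k_step_jacobi(u, k, parts):
    """One 'communication round' of a k-step-unrolled Jacobi sweep
    for the 1D Laplace stencil u_new[i] = 0.5 * (u[i-1] + u[i+1]).

    Each partition fetches a depth-k ghost zone once, then performs k
    sweeps entirely locally; the ghost entries are updated redundantly
    at every sweep (O(k) extra work per sweep, per boundary) instead of
    being exchanged between sweeps.
    """
    n = len(u)
    new = u.copy()
    for lo, hi in parts:                          # each (lo, hi) is one "core"
        glo = max(lo - k, 0)                      # ghost-extended block:
        ghi = min(hi + k, n)                      # one fetch per k sweeps
        local = u[glo:ghi].copy()
        for _ in range(k):                        # k sweeps, no communication
            nxt = local.copy()
            nxt[1:-1] = 0.5 * (local[:-2] + local[2:])
            local = nxt
        new[lo:hi] = local[lo - glo:hi - glo]     # keep only the valid interior
    return new

# Usage: two rounds with k=4 reproduce eight plain Jacobi sweeps
# while exchanging halos only twice instead of eight times.
u = np.zeros(16)
u[0], u[-1] = 1.0, 1.0          # fixed Dirichlet boundary values
parts = [(0, 8), (8, 16)]       # two partitions ("cores")
for _ in range(2):
    u = k_step_jacobi(u, k=4, parts=parts)
```

Larger $k$ cuts communication further, but the redundant ghost-zone work grows with $k$, which is the tradeoff the abstract quantifies on GPU and relaxes on FPGA by letting cores exchange ghost data on-chip.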

