In this paper, we present PfSolve, a new, performant, cross-platform, open-source implementation of tridiagonal and bidiagonal matrix solvers for GPUs. Released as a stand-alone library, PfSolve can solve systems of arbitrary size that fit into the memory of a single GPU, with multi-GPU support a possible future extension. The code works in single precision, double precision, and double-double emulation of quad precision, and requires additional memory amounting to only \(0.1\%\) of the original system size. PfSolve is based on an in-house implementation of the Parallel Thomas algorithm, optimized for GPU execution through warp-level instructions and occupancy optimizations, which are discussed in detail in the paper. This work also presents an accuracy analysis of the Parallel Thomas algorithm for tridiagonal matrices with various dominance factors (approximately, the ratio of the off-diagonal to diagonal terms) and demonstrates that PfSolve achieves a considerable speedup over vendor solutions on modern HPC GPUs such as the Nvidia H100 and AMD MI210. The source code for PfSolve is available on GitHub.
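
For orientation, the kind of system the solver targets can be illustrated with the textbook serial Thomas algorithm. The sketch below is not PfSolve's code or API, and it is not the Parallel Thomas variant the paper describes; the function name `thomas_solve` and its argument layout are assumptions made for illustration only. It shows forward elimination followed by back substitution on a diagonally dominant tridiagonal system.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the serial Thomas algorithm.

    Illustrative sketch only (hypothetical helper, not the PfSolve API).
    lower[i] multiplies x[i-1] in row i (lower[0] is unused);
    upper[i] multiplies x[i+1] in row i (upper[n-1] is unused).
    No pivoting is performed, so the matrix is assumed to be
    diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the lower diagonal row by row.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution: recover the unknowns from last to first.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Example: diagonal entries 4, off-diagonal entries -1, so each
# off-diagonal to diagonal ratio has magnitude 0.25.
lower = [0.0, -1.0, -1.0, -1.0]
diag = [4.0, 4.0, 4.0, 4.0]
upper = [-1.0, -1.0, -1.0, 0.0]
rhs = [2.0, 4.0, 6.0, 13.0]
x = thomas_solve(lower, diag, upper, rhs)
```

Both sweeps are sequential in the row index, which is why a GPU implementation has to restructure the computation rather than port this loop directly.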