Abstract

The tridiagonal solver is an important kernel and is widely supported in mainstream numerical libraries. While parallel algorithms have been studied for many-core architectures, the performance of current algorithms and implementations is still hindered by sensitivity to input size and limited cross-platform portability. In this paper, we propose a novel algorithm, WM-pGE, for the batched solution of diagonally dominant tridiagonal systems. The algorithm balances the key design objectives, including computational complexity, memory complexity, parallelism, and input size sensitivity, better than existing algorithms. Moreover, we present a formulation that isolates the platform-dependent work in only four vector operators, which allows implementation and cross-platform optimization without loss of efficiency or generality. The results from our batched tridiagonal experiments show that the proposed algorithm outperforms the prior work PCR-pThomas by 25% and 12% on NVIDIA Tesla V100 in single and double precision, respectively. On Intel KNL, our method achieves a 10% improvement in performance over PCR-pThomas in double precision.
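For readers unfamiliar with the problem, the following is a minimal sketch of a batched tridiagonal solve using the classical sequential Thomas algorithm. It is not WM-pGE, whose details the abstract does not give, and it is not the PCR-pThomas implementation either. The function names `thomas_solve` and `solve_batch` and the list-based data layout are assumptions made for illustration only. Elimination without pivoting, as used here, is safe when the systems are diagonally dominant, which is the case the abstract targets.

```python
def thomas_solve(a, b, c, d):
    """Solve one tridiagonal system A x = d with the Thomas algorithm.

    a: sub-diagonal (a[0] is unused)
    b: main diagonal
    c: super-diagonal (c[-1] is unused)
    d: right-hand side
    Assumes A is diagonally dominant, so no pivoting is performed.
    """
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Back substitution: recover x from the last unknown upward.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_batch(systems):
    """Solve a batch of independent systems, each given as (a, b, c, d).

    Each system is independent of the others; this sketch simply loops
    over them one at a time.
    """
    return [thomas_solve(a, b, c, d) for (a, b, c, d) in systems]
```

In this sketch each system costs O(n) operations but the recurrence within one system is inherently sequential, so the only available parallelism is across the batch. That trade-off between operation count and parallelism is the kind of design objective the abstract says the proposed algorithm balances.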
