Abstract

The tridiagonal solver is an important kernel in many scientific and engineering applications. Although quite a few parallel algorithms have been exploited recently, challenges still remain when solving tridiagonal systems on many-core architectures. In this paper, quantitative analysis is conducted to guide the selection of algorithms on different architectures, and a fast register-PCR-pThomas algorithm for Intel MIC architecture is presented to achieve the best utilization of in-core vectorization and registers. By choosing the most suitable number of PCR reductions, we further propose an improved algorithm, named register-PCR-half-pThomas algorithm, which minimizes the computational cost and the number of registers for use. The optimizations of manual prefetching and strength reduction are also discussed for the new algorithm. Evaluation on Intel Xeon Phi 7110P shows that our register-PCR-pThomas can outperform the MKL solver by 7.7 times in single precision and 3.4 times in double precision. Moreover, our optimized register-PCR-half-pThomas can further boost the performance by another 43.9% in single precision and 20.4% in double precision. Our best algorithm can outperform the latest official GPU library (cuSPARSE) on Nvidia K40 by 3.7 times and 2.4 times in single and double precision, respectively.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.