Abstract

Engineering, scientific, and financial applications often require the simultaneous solution of a large number of independent tridiagonal systems of equations with varying coefficients. Since the number of systems is large enough to offer considerable parallelism on manycore systems, the choice between different tridiagonal solution algorithms, such as Thomas, Cyclic Reduction (CR) or Parallel Cyclic Reduction (PCR) needs to be reexamined. This work investigates the optimal choice of tridiagonal algorithm for CPU, Intel MIC, and NVIDIA GPU with a focus on minimizing the amount of data transfer to and from the main memory using novel algorithms and the register-blocking mechanism, and maximizing the achieved bandwidth. It also considers block tridiagonal solutions, which are sometimes required in Computational Fluid Dynamic (CFD) applications. A novel work-sharing and register blocking--based Thomas solver is also presented.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call