A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations

Azzam Haidar,Stanimire Tomov,Mawussi Zounon,Jack Dongarra,Ahmad Abdelfattah

doi:10.1109/tpds.2017.2783929

Abstract

We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to $6\times$ for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU.
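To make the term "batched factorization" concrete, the following is a minimal CPU reference sketch in plain Python. It is not the authors' GPU kernel. The function names `cholesky_small` and `batched_cholesky` are hypothetical and chosen only for illustration. The sketch shows that a batch consists of many independent small Cholesky factorizations, which is the independence a GPU design can exploit by processing the matrices concurrently.

```python
import math


def cholesky_small(a):
    """Return lower-triangular L with L * L^T == a for a small SPD matrix.

    `a` is a list of rows (list of lists of floats). Hypothetical helper,
    written for clarity rather than speed.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already computed in row j.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l


def batched_cholesky(batch):
    """Factor every matrix in the batch independently.

    No factorization depends on another, so the loop iterations could run
    in any order or concurrently (assumption: all inputs are SPD).
    """
    return [cholesky_small(a) for a in batch]


batch = [
    [[4.0, 2.0], [2.0, 5.0]],
    [[9.0, 3.0], [3.0, 5.0]],
]
factors = batched_cholesky(batch)
print(factors[0])  # [[2.0, 0.0], [1.0, 2.0]]
print(factors[1])  # [[3.0, 0.0], [1.0, 2.0]]
```

In this sketch each matrix is factored sequentially. The register blocking, memory-traffic, and concurrency-tuning techniques named in the abstract concern how such independent factorizations are laid out on the GPU, and they are not modeled here.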

Journal: IEEE Transactions on Parallel and Distributed Systems
Publication Date: Jan 3, 2018
Citations: 47
License type: publisher-specific-oa

Similar Papers

Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures
Ahmad Abdelfattah ... Jack Dongarra
Procedia Computer Science | VOL. 108
01 Jan 2017

Roundoff-Error-Free Algorithms for Solving Linear Systems via Cholesky and LU Factorizations
Adolfo R Escobedo ... Erick Moreno-Centeno
INFORMS Journal on Computing | VOL. 27
01 Nov 2015

Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors
Muthu Manikandan Baskaran ... Atanas Rountev
ACM SIGPLAN Notices | VOL. 44
14 Feb 2009

Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors
Muthu Manikandan Baskaran ... P Sadayappan
14 Feb 2009
