Abstract
Decomposing a matrix $\mathbf{A}$ into a lower triangular matrix $\mathbf{L}$ and an upper triangular matrix $\mathbf{U}$, known as LU decomposition, is an essential operation in numerical linear algebra. For a sparse matrix, LU decomposition often introduces more nonzero entries in the $\mathbf{L}$ and $\mathbf{U}$ factors than in the original matrix. A *symbolic factorization* step is needed to identify the nonzero structures of the $\mathbf{L}$ and $\mathbf{U}$ factors. Attracted by the enormous potential of Graphics Processing Units (GPUs), an array of efforts have deployed the various LU factorization steps on GPUs, with the exception, to the best of our knowledge, of symbolic factorization. This article introduces gSoFa, the first **G**PU-based **s**ymb**o**lic **fa**ctorization design, with the following three optimizations to enable scalable LU symbolic factorization for *nonsymmetric pattern* sparse matrices on GPUs. First, we introduce a novel fine-grained parallel symbolic factorization algorithm that is well suited to the *Single Instruction Multiple Thread* (SIMT) architecture of GPUs. Second, we tailor supernode detection into a SIMT-friendly process and strive to balance the workload, minimize communication, and saturate the GPU computing resources during supernode detection. Third, we introduce a three-pronged optimization to reduce the excessive space consumption faced by multi-source concurrent symbolic factorization. Taken together, gSoFa achieves up to 31× speedup from 1 to 44 Summit nodes (6 to 264 GPUs) and outperforms the state-of-the-art CPU project, on average, by 5×. Notably, gSoFa also achieves up to 47 percent of the peak memory throughput of a V100 GPU in the Summit supercomputer.
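To make the fill-in phenomenon the abstract refers to concrete, the following minimal sketch simulates Gaussian elimination on a boolean sparsity pattern (no pivoting). It is purely illustrative and is not gSoFa's algorithm; the function name `symbolic_lu` is hypothetical. Eliminating pivot $k$ makes entry $(i, j)$ nonzero whenever $a_{ik}$ and $a_{kj}$ are both nonzero, which is exactly the fill-in that a symbolic factorization step must predict.

```python
# Illustrative sketch only (not the gSoFa algorithm): compute the
# combined nonzero pattern of L+U by simulating elimination on a
# boolean sparsity pattern of A, without numerical values or pivoting.
import numpy as np

def symbolic_lu(pattern: np.ndarray) -> np.ndarray:
    """Given a boolean n x n sparsity pattern of A, return the
    filled nonzero pattern of L+U after elimination."""
    filled = pattern.copy()
    n = filled.shape[0]
    for k in range(n):
        # Rows below the pivot with a nonzero in column k,
        # and columns right of the pivot with a nonzero in row k.
        rows = np.nonzero(filled[k + 1:, k])[0] + k + 1
        cols = np.nonzero(filled[k, k + 1:])[0] + k + 1
        # Fill-in: the outer-product update touches all (rows x cols).
        filled[np.ix_(rows, cols)] = True
    return filled

# Arrow matrix: dense first row/column plus the diagonal.
A = np.eye(4, dtype=bool)
A[0, :] = True
A[:, 0] = True
print(symbolic_lu(A).sum())           # 16: fills completely
print(symbolic_lu(A[::-1, ::-1]).sum())  # 10: reversed ordering, no fill
```

The two orderings of the same arrow matrix show why the amount of fill-in depends on the matrix structure, which is what makes a dedicated (and, in gSoFa's case, GPU-parallel) symbolic factorization step worthwhile.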