Abstract
Decomposing a matrix $\mathbf{A}$ into a lower triangular matrix $\mathbf{L}$ and an upper triangular matrix $\mathbf{U}$, known as LU decomposition, is an essential operation in numerical linear algebra. For a sparse matrix, LU decomposition often introduces more nonzero entries in the $\mathbf{L}$ and $\mathbf{U}$ factors than in the original matrix. A *symbolic factorization* step is needed to identify the nonzero structures of the $\mathbf{L}$ and $\mathbf{U}$ factors. Attracted by the enormous potential of Graphics Processing Units (GPUs), an array of efforts have deployed the various LU factorization steps on GPUs, with the exception, to the best of our knowledge, of symbolic factorization. This article introduces gSoFa, the first **G**PU-based **s**ymb**o**lic **fa**ctorization design, with the following three optimizations to enable scalable LU symbolic factorization for *nonsymmetric pattern* sparse matrices on GPUs. First, we introduce a novel fine-grained parallel symbolic factorization algorithm that is well suited to the *Single Instruction Multiple Thread* (SIMT) architecture of GPUs. Second, we tailor supernode detection into a SIMT-friendly process and strive to balance the workload, minimize communication, and saturate the GPU computing resources during supernode detection. Third, we introduce a three-pronged optimization to reduce the excessive space consumption faced by multi-source concurrent symbolic factorization. Taken together, gSoFa achieves up to 31× speedup from 1 to 44 Summit nodes (6 to 264 GPUs) and outperforms the state-of-the-art CPU project, on average, by 5×. Notably, gSoFa also achieves up to 47 percent of the peak memory throughput of a V100 GPU in the Summit supercomputer.
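To make the fill-in phenomenon the abstract refers to concrete, the following minimal sketch simulates Gaussian elimination on a boolean sparsity pattern (no pivoting). It is purely illustrative and is not gSoFa's algorithm; the function name `symbolic_lu` is hypothetical. Eliminating pivot $k$ makes entry $(i, j)$ nonzero whenever $a_{ik}$ and $a_{kj}$ are both nonzero, which is exactly the fill-in that a symbolic factorization step must predict.

```python
# Illustrative sketch only (not the gSoFa algorithm): compute the
# combined nonzero pattern of L+U by simulating elimination on a
# boolean sparsity pattern of A, without numerical values or pivoting.
import numpy as np

def symbolic_lu(pattern: np.ndarray) -> np.ndarray:
    """Given a boolean n x n sparsity pattern of A, return the
    filled nonzero pattern of L+U after elimination."""
    filled = pattern.copy()
    n = filled.shape[0]
    for k in range(n):
        # Rows below the pivot with a nonzero in column k,
        # and columns right of the pivot with a nonzero in row k.
        rows = np.nonzero(filled[k + 1:, k])[0] + k + 1
        cols = np.nonzero(filled[k, k + 1:])[0] + k + 1
        # Fill-in: the outer-product update touches all (rows x cols).
        filled[np.ix_(rows, cols)] = True
    return filled

# Arrow matrix: dense first row/column plus the diagonal.
A = np.eye(4, dtype=bool)
A[0, :] = True
A[:, 0] = True
print(symbolic_lu(A).sum())           # 16: fills completely
print(symbolic_lu(A[::-1, ::-1]).sum())  # 10: reversed ordering, no fill
```

The two orderings of the same arrow matrix show why the amount of fill-in depends on the matrix structure, which is what makes a dedicated (and, in gSoFa's case, GPU-parallel) symbolic factorization step worthwhile.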