Abstract

Lower-upper (LU) factorization for sparse matrices is the most important computing step in circuit simulation. However, parallelizing LU factorization on graphics processing units (GPUs) is difficult due to intrinsic data dependencies and irregular memory access, which diminish GPU computing power. In this paper, we propose a new sparse LU solver on GPUs for circuit simulation and more general scientific computing. The new method, called the GPU-accelerated LU factorization (GLU) solver, is based on a hybrid right-looking LU factorization algorithm for sparse matrices. We show that on GPU platforms more concurrency can be exploited in the right-looking method than in the left-looking method, which is more popular for circuit analysis. At the same time, GLU preserves the benefits of the column-based left-looking LU method, such as symbolic analysis and column-level concurrency. The resulting parallel GPU solver allows parallelization of all three loops of the LU factorization, whereas the existing GPU-based left-looking approach can parallelize only two of them. Experimental results show that the proposed GLU solver delivers $5.71\times $ and $1.46\times $ speedup over the single-threaded and 16-threaded PARDISO solvers, respectively, $19.56\times $ speedup over the KLU solver, $47.13\times $ over the UMFPACK solver, and $1.47\times $ speedup over a recently proposed GPU-based left-looking LU solver on a set of typical circuit matrices from the University of Florida (UFL) sparse matrix collection.
Furthermore, on a set of general matrices from the UFL collection, GLU achieves $6.38\times $ and $1.12\times $ speedup over the single-threaded and 16-threaded PARDISO solvers, respectively, $39.39\times $ speedup over the KLU solver, $24.04\times $ over the UMFPACK solver, and $2.35\times $ speedup over the same GPU-based left-looking LU solver. In addition, a comparison on self-generated $RLC$ mesh networks shows a similar trend, which further validates the advantage of the proposed method over existing sparse LU solvers.
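To make the "three loops" claim concrete, the sketch below shows a minimal dense right-looking LU factorization (Doolittle form, no pivoting). This is an illustrative toy, not the paper's GLU implementation: GLU operates on sparse matrices with symbolic pre-analysis and GPU kernels. The point is the loop structure: after each pivot column `k` is scaled, every entry of the trailing rank-1 submatrix update is independent, so both inner loops (and, with dependency analysis, independent pivot columns) can map to parallel GPU threads.

```python
# Minimal dense right-looking LU sketch (Doolittle, no pivoting).
# Assumes A is square and all pivots are nonzero; purely illustrative.

def lu_right_looking(A):
    """Factor A in place so that A = L*U, with unit-diagonal L stored
    below the diagonal and U stored on and above it."""
    n = len(A)
    for k in range(n):                 # loop 1: pivot columns (carry dependencies)
        for i in range(k + 1, n):      # loop 2: compute L multipliers in column k
            A[i][k] /= A[k][k]
        for i in range(k + 1, n):      # loops 2 & 3: rank-1 trailing update;
            for j in range(k + 1, n):  # each (i, j) entry is independent, so
                A[i][j] -= A[i][k] * A[k][j]  # both loops suit GPU threads
    return A
```

In the left-looking variant, by contrast, each column is built by gathering updates from previously factored columns, which serializes one of these loop levels.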
