A new hybrid GPU-CPU sparse LDLT factorization algorithm with GPU and CPU factorizing concurrently

Yunmou Liu, Pu Chen, Zhuogen Li, Hui Du

doi:10.1016/j.jocs.2024.102312

Abstract

This paper proposes a new task assignment scheme for sparse LDLT factorization on a hybrid GPU-CPU platform, together with a corresponding efficient supernodal algorithm, GPU Node and CPU Pipelining (GNCP). GNCP assigns a large number of weakly coupled tasks to the GPU and the CPU respectively, so that the two can execute different factorization tasks concurrently without explicit synchronization, which has not been achieved in previous research. More precisely, based on the number of CPU threads, the global memory of the GPU, and the size of the matrix, we introduce the concept of the truncation level in the partition tree built by multi-level graph partitioning. The FLOPs associated with the decoupled nodes in the truncation level account for about 65% of the total computation in the tested numerical examples. GNCP assigns the factorization of each node in the truncation level to the GPU: the data of each node are copied to GPU global memory and factorized entirely and independently by the GPU, one node at a time. Once the factorization of any node in the truncation level has been completed, the cross-factorizations of this node to its ancestors and the subsequent related operations are decoupled from the other GPU tasks. These tasks, which account for about 35% of the total computation, are assigned to CPU threads with a reentrant pipelining strategy. With this design, the CPU and GPU work concurrently and a large part of the factorization is performed in overlapped time. Numerical tests on matrices from the SuiteSparse Matrix Collection show higher performance than the well-known hybrid GPU-CPU solver CHOLMOD.
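The task split described in the abstract can be illustrated with a toy producer/consumer model. This is a minimal sketch under stated assumptions, not the GNCP implementation: no GPU is used, no numerical factorization is performed, and a plain thread-safe queue stands in for the reentrant pipelining strategy. The names `run_task_split_sketch`, `truncation_nodes`, and `ancestors` are hypothetical and chosen only for illustration.

```python
import queue
import threading


def run_task_split_sketch(truncation_nodes, ancestors, n_cpu_threads=2):
    """Toy model: one 'GPU' thread finishes truncation-level nodes one by one,
    while 'CPU' threads concurrently handle the ancestor updates of each
    finished node. All names and data layouts here are hypothetical.
    """
    finished = queue.Queue()   # nodes whose factorization is complete
    updates = []               # (node, ancestor) updates performed by CPU threads
    lock = threading.Lock()

    def gpu_worker():
        for node in truncation_nodes:
            # Stand-in for: copy node data to GPU memory and factorize it.
            finished.put(node)
        # One sentinel per CPU thread signals that no more nodes will arrive.
        for _ in range(n_cpu_threads):
            finished.put(None)

    def cpu_worker():
        while True:
            node = finished.get()
            if node is None:
                break
            # Stand-in for: update each ancestor with this node's factor.
            for anc in ancestors.get(node, []):
                with lock:
                    updates.append((node, anc))

    threads = [threading.Thread(target=gpu_worker)]
    threads += [threading.Thread(target=cpu_worker) for _ in range(n_cpu_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Sort because the order in which CPU threads record updates is not fixed.
    return sorted(updates)


if __name__ == "__main__":
    # Hypothetical partition tree: nodes 0, 1, 2 sit in the truncation level;
    # nodes 3 and 6 are their ancestors.
    tree = {0: [3, 6], 1: [3, 6], 2: [6]}
    print(run_task_split_sketch([0, 1, 2], tree))
```

In this sketch the only coupling between the two sides is the queue of finished nodes, which mirrors the abstract's point that the CPU work for a node becomes available as soon as that node's factorization is complete, without a global synchronization step.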

Full Text