Abstract

Solving triangular systems is the building block of the preconditioned GMRES algorithm. Inexact preconditioning is attractive on accelerators because of its high degree of parallelism. In this paper, we propose and implement an iterative, inexact block triangular solve on multiple GPUs within PETSc’s framework. In addition, by developing a distributed block sparse matrix-vector multiplication procedure and optimizing the vector operations, we form a multi-GPU-enabled preconditioned GMRES with the block Jacobi preconditioner. The implementation employs the GPU-Direct technique to avoid host-device memory copies. Preconditioning steps based on PETSc’s structure and on the cuSPARSE library are also investigated for performance comparison. The experiments show that the developed GMRES with inexact preconditioning on 8 GPUs achieves up to a 4.4x speedup over the CPU-only implementation with exact preconditioning using 8 MPI processes.

Highlights

  • Solving a large sparse linear system of equations is always necessary in scientific applications

  • Khodja et al. [3] implemented the Generalized Minimal Residual (GMRES) algorithm on a GPU cluster using the Message Passing Interface (MPI) and CUDA, focusing on minimizing inter-process communication through compressed storage and hypergraph partitioning techniques

  • Algorithm 5 lists the main steps of the function which we develop and integrate into PETSc. The first step, called the preprocessing phase, estimates the memory requirement, allocates adequate space, and extracts the parallelism available for the subsequent solve phase. The preprocessing step needs to be executed only once on the local GPUs, because the lower (L(Bi)) and upper (U(Bi)) factors remain unchanged during the iterative process of GMRES once they have been constructed by the incomplete LU (ILU) factorization. The second step solves two block sparse triangular systems by calling cusparseDbsrsv2_solve twice, with (L(Bi)) and (U(Bi)) respectively
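The triangular solve in the second step can be illustrated on the CPU with plain block forward substitution. This is a minimal sketch, not PETSc’s or cuSPARSE’s implementation: we assume a unit block lower triangular factor, as produced for L by ILU, whose strictly lower blocks are stored in BSR arrays with row-major layout inside each block; all names here are ours.

```c
#include <assert.h>
#include <stddef.h>

/* Solve L * y = b by block forward substitution, where L is unit block
 * lower triangular: the diagonal blocks are identities (as for the L
 * factor of ILU) and only the strictly lower blocks are stored in the
 * BSR arrays rowptr / colval / blkval. mb is the number of block rows
 * and bs the (square) block size. */
static void bsr_lower_solve(int mb, int bs, const int *rowptr,
                            const int *colval, const double *blkval,
                            const double *b, double *y) {
  for (int i = 0; i < mb; ++i) {
    /* start from the right-hand side of block row i */
    for (int r = 0; r < bs; ++r) y[i * bs + r] = b[i * bs + r];
    /* subtract contributions of the already-computed block unknowns */
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
      const int j = colval[k];            /* block column, j < i */
      const double *blk = blkval + (size_t)k * bs * bs;
      for (int r = 0; r < bs; ++r)
        for (int c = 0; c < bs; ++c)
          y[i * bs + r] -= blk[r * bs + c] * y[j * bs + c];
    }
  }
}
```

The solve with (U(Bi)) is the mirror image, a block backward substitution that additionally inverts the diagonal blocks; on the GPU both are handled by cusparseDbsrsv2_solve after its analysis phase has extracted the level-scheduling parallelism.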



Introduction

Solving a large sparse linear system of equations is always necessary in scientific applications. Gao et al. [8] proposed an efficient GPU kernel for the sparse matrix-vector multiplication (SpMV) in GMRES and applied the optimized GMRES to solving the two-dimensional Maxwell’s equations. He et al. [9] presented an efficient GPU implementation of GMRES with ILU preconditioners for solving large linear dynamic systems. Popular numerical libraries such as Intel MKL [30], NVIDIA’s cuSPARSE [28], and PETSc [31] support the BCSR format. In this format, a block sparse matrix A with nnzb nonzero blocks is represented by block rows using three arrays: rowptr, colval, and blkval. We assume 0-based indexing, as in the C programming language.
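A minimal CPU sketch of the BCSR layout and the corresponding SpMV may make the three arrays concrete. The array names follow the text; the row-major layout inside each block is our assumption for illustration (cuSPARSE, for instance, lets the caller choose the intra-block direction).

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a BCSR matrix with mb block rows and square bs x bs
 * blocks. rowptr[i]..rowptr[i+1]-1 index the nonzero blocks of block
 * row i, colval[k] gives the block column of the k-th stored block,
 * and blkval stores each block contiguously (row-major within a
 * block, assumed here for illustration). */
static void bsr_spmv(int mb, int bs, const int *rowptr, const int *colval,
                     const double *blkval, const double *x, double *y) {
  for (int i = 0; i < mb; ++i) {
    for (int r = 0; r < bs; ++r) y[i * bs + r] = 0.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
      const int j = colval[k];                 /* block column index */
      const double *blk = blkval + (size_t)k * bs * bs;
      for (int r = 0; r < bs; ++r)             /* dense bs x bs GEMV */
        for (int c = 0; c < bs; ++c)
          y[i * bs + r] += blk[r * bs + c] * x[j * bs + c];
    }
  }
}
```

For example, a 4x4 matrix with 2x2 blocks and nonzero blocks at block positions (0,0), (0,1), and (1,1) is stored as rowptr = {0, 2, 3}, colval = {0, 1, 1}, and twelve entries in blkval.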

