Batched linear solvers, which solve many small, related but independent problems, are increasingly important for highly parallel processors such as graphics processing units (GPUs). GPUs need a substantial amount of work to operate efficiently, so solving small problems one by one is not an option. Because each problem is small, implementing a parallel partitioning scheme and mapping the problem to hardware is not trivial. Recently, significant attention has been given to batched dense linear algebra. However, there is also interest in using sparse iterative solvers in batched form.

An example use case is found in a gyrokinetic Particle-In-Cell (PIC) code used for modeling magnetically confined fusion plasma devices. The collision operator has been identified as a bottleneck, and a proxy app has been created to facilitate optimization and porting to GPUs. The current collision kernel linear solver does not run on the GPU, which is a major bottleneck. As these matrices are sparse and well-conditioned, batched sparse iterative solvers are an attractive option.

A batched sparse iterative solver capability has recently been developed in the Ginkgo library. In this paper, we describe how Ginkgo's batched solver technology can be integrated into the XGC collision kernel to accelerate the simulation process. We compare solve times on NVIDIA V100 and A100 GPUs and AMD MI100 GPUs against one dual-socket Intel Xeon Skylake CPU node with 40 cores, using matrices from the collision kernel of XGC. Further, we present the speedups observed for the overall collision kernel in comparison to different modern CPUs on multiple supercomputer systems.
The results suggest that Ginkgo's batched sparse iterative solvers are well suited for efficient utilization of the GPU for this problem, and the performance portability of Ginkgo in conjunction with Kokkos (used within XGC as the heterogeneous programming model) allows seamless execution on exascale-oriented heterogeneous architectures.