Conjugate Gradient Solvers with High Accuracy and Bit-wise Reproducibility between CPU and GPU using Ozaki scheme

Daichi Mukunoki, Katsuhisa Ozaki, Roman Iakymchuk, Takeshi Ogita

doi:10.1145/3432261.3432270

Abstract

In Krylov subspace methods such as the Conjugate Gradient (CG) method, the number of iterations until convergence may increase because rounding errors in floating-point computations reduce computational accuracy. At the same time, because the order of computation is nondeterministic in parallel computation, the result and the convergence behavior may differ across computational environments, even for the same input. In this study, we present an accurate and reproducible implementation of the unpreconditioned CG method on x86 CPUs and NVIDIA GPUs. In our method, all variables are stored in FP64, while all inner product operations (including matrix-vector multiplications) are performed using the Ozaki scheme. The scheme delivers correctly rounded results as well as bit-level reproducibility across different computational environments. In this paper, we show examples where the standard FP64 implementation of CG produces nonidentical results across different CPUs and GPUs. We then demonstrate the applicability and effectiveness of our approach in terms of accuracy and reproducibility, as well as its performance on both CPUs and GPUs. Furthermore, we compare the performance of our method against an existing accurate and reproducible CG implementation based on the Exact Basic Linear Algebra Subprograms (ExBLAS) on CPUs.
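
The two effects described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not implement the Ozaki scheme: exact rational arithmetic from the standard `fractions` module is used here only as a stand-in that also yields a correctly rounded, order-independent inner product. The names `dot_fp64`, `dot_exact`, and `cg` are hypothetical and chosen for this illustration. As in the abstract, all vectors stay in FP64 and only the inner products (including those inside the matrix-vector product) are replaced.

```python
from fractions import Fraction


def dot_fp64(x, y):
    """Plain FP64 inner product; the result depends on the summation order."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += a * b
    return acc


def dot_exact(x, y):
    """Correctly rounded inner product via exact rational arithmetic.

    Stand-in for the Ozaki scheme (an assumption of this sketch): the sum is
    accumulated exactly and rounded to FP64 once at the end, so the result
    does not depend on the order of the terms.
    """
    total = sum((Fraction(a) * Fraction(b) for a, b in zip(x, y)), Fraction(0))
    return float(total)


def cg(A, b, dot, tol=1e-12, max_iter=100):
    """Unpreconditioned CG for a dense SPD matrix A given as a list of rows.

    Vectors are stored as FP64 lists; every inner product, including each row
    of the matrix-vector product, goes through the supplied `dot`.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual for the initial guess x = 0
    p = list(r)              # initial search direction
    rr = dot(r, r)
    bb = dot(b, b)
    for k in range(max_iter):
        if rr <= tol * tol * bb:          # relative residual test
            return x, k
        Ap = [dot(row, p) for row in A]   # matrix-vector product
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


# The same three terms summed in two different orders.
ones = [1.0, 1.0, 1.0]
order_a = [1e16, 1.0, -1e16]
order_b = [1e16, -1e16, 1.0]
fp64_a, fp64_b = dot_fp64(order_a, ones), dot_fp64(order_b, ones)
exact_a, exact_b = dot_exact(order_a, ones), dot_exact(order_b, ones)

# A small SPD system whose exact solution is [1/11, 7/11].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_exact_dot, iters = cg(A, b, dot_exact)
```

In this sketch `fp64_a` and `fp64_b` differ because the term 1.0 is absorbed by rounding in one order and not in the other, while `exact_a` and `exact_b` agree. Exact rational arithmetic is far too slow for practical solvers; it serves here only to show the property that the Ozaki scheme is reported to provide.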

Full Text