A Massively Parallel Semicoarsening Multigrid Linear Solver on Multi-Core and Multi-GPU Architectures

A M Manea, H A Tchelepi

doi:10.2118/182718-ms

Abstract

In this work, we have designed and implemented a massively parallel version of the Semicoarsening Multigrid Solver (Schaffer 1998), which is capable of handling highly heterogeneous and anisotropic 3D reservoirs, on a parallel architecture with multiple GPUs. For comparison purposes, the same algorithm was also implemented on a shared-memory multi-core architecture. The implementation exploits the parallelism in every module of the original Multigrid algorithm, including both the setup stage and the solution stage, without modifying the basic steps of the original algorithm. The benefits of this approach are twofold: maintaining the inherent strong linear convergence of the serial Multigrid algorithm, and taking advantage of the shared-memory architecture to minimize the need for communication. The design of the algorithm uses a combination of plane relaxation and semicoarsening to efficiently handle anisotropies in 3D (Dendy et al. 1989). Since the z-direction in most reservoir models is a direction of strong coupling compared to the x- and y-directions, semicoarsening is employed in the z-direction, and plane relaxation is used on x-y planes. Besides the 2D systems that must be solved for plane relaxation, a set of 2D systems must also be solved on each multigrid level during the setup stage to get an approximate representation of the exact prolongation operator described in Schaffer (1998). To handle both types of 2D systems, a massively parallel version of the 2D Black Box Multigrid (Alcouffe et al. 1981) was designed. To be able to handle problems involving high anisotropies in the x- and y-directions, the 2D Black Box Multigrid uses alternating line relaxation with zebra ordering to parallelize across multiple line solves.
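As a rough illustration of semicoarsening in the z-direction, the sketch below lists the grid dimensions on each multigrid level when only the number of x-y planes is reduced. The function name `semicoarsen_z`, the rule of keeping roughly every other plane (`(nz + 1) // 2`), and stopping at a single plane are assumptions made for this example; the abstract does not specify these details.

```python
def semicoarsen_z(nx, ny, nz):
    """Return the (nx, ny, nz) grid size on each level, coarsening only in z.

    Assumption: each coarse level keeps about half of the x-y planes and the
    hierarchy stops once a single plane remains.
    """
    levels = [(nx, ny, nz)]
    while nz > 1:
        nz = (nz + 1) // 2  # halve the plane count, rounding up
        levels.append((nx, ny, nz))
    return levels


if __name__ == "__main__":
    # 60 x 220 x 85 is the SPE10 Model 2 base grid, used here only as an
    # illustrative size.
    for level, (nx, ny, nz) in enumerate(semicoarsen_z(60, 220, 85)):
        print(f"level {level}: {nx} x {ny} x {nz} ({nz} x-y planes to relax)")
```

Because nx and ny never change, every level still requires 2D solves on full-size x-y planes, which is why an efficient 2D solver matters at all levels.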
Due to the inherent granularity difference between GPU threads and multi-core threads, line relaxation was designed to use the Thomas algorithm on the multi-core architecture and Parallel Cyclic Reduction (NVIDIA Corporation 2014b) on the GPU architecture. In both the 3D Semicoarsening Multigrid and the 2D Black Box Multigrid, V-cycling was used to avoid spending more time at coarser levels, which would reduce the parallel efficiency. To minimize the expensive communication between the host and the GPU (and among GPUs), every 2D solve is explicitly handled by a single GPU. The two versions of the solver were tested using various highly heterogeneous multi-million-cell problems derived from the SPE10 Second Dataset benchmark. For sufficiently large problems, the GPU implementation, running on Kepler-based K40c cards, is found to be always faster than the multi-core implementation running on 12 Intel® Xeon® E5-2620 v2 2.10 GHz cores. In addition, the inherently serial nature of multiplicative multigrid, along with the approach taken to minimize communication over PCIe, was found to limit scalability beyond 3-4 cores/GPUs.
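To make the granularity contrast between the two tridiagonal line solvers concrete, the following is a minimal pure-Python sketch, not the authors' implementation or any library routine. The function names `thomas` and `pcr` are hypothetical, and both assume the system a[i]·x[i-1] + b[i]·x[i] + c[i]·x[i+1] = d[i] is diagonally dominant, so no pivoting is needed. The Thomas algorithm is a sequential sweep that suits one coarse-grained CPU thread per line. In Parallel Cyclic Reduction, every equation is updated independently within a step, so the inner loop over `i` could map to one fine-grained GPU thread per equation; here it is emulated sequentially.

```python
def thomas(a, b, c, d):
    """Sequential Thomas algorithm: forward elimination, then back substitution."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def pcr(a, b, c, d):
    """Parallel Cyclic Reduction, emulated sequentially.

    At stride s, equation i couples x[i-s], x[i], x[i+s]. Each step eliminates
    the two neighbours using equations i-s and i+s, doubling the stride. After
    about log2(n) steps every equation is decoupled: b[i] * x[i] = d[i].
    """
    n = len(d)
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0   # no x[-1] exists
    c[-1] = 0.0  # no x[n] exists
    s = 1
    while s < n:
        na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
        for i in range(n):  # iterations are independent: one GPU thread each
            lo, hi = i - s, i + s
            if lo >= 0:
                alpha = -a[i] / b[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        s *= 2
    return [d[i] / b[i] for i in range(n)]


if __name__ == "__main__":
    n = 7
    a = [0.0] + [-1.0] * (n - 1)
    b = [4.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    d = [float(i + 1) for i in range(n)]
    xt, xp = thomas(a, b, c, d), pcr(a, b, c, d)
    print("max difference:", max(abs(u - v) for u, v in zip(xt, xp)))
```

The trade-off this sketch illustrates: Thomas does O(n) work but is inherently serial, while PCR does O(n log n) work in about log2(n) fully parallel steps.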

Full Text