Summary

To realize the potential of the latest high-performance computing (HPC) architectures for reservoir simulation, scalable linear solvers are necessary. We describe a parallel algebraic multiscale solver (AMS) for the pressure equation of heterogeneous reservoir models. AMS is a two-level algorithm that uses domain decomposition with a localization assumption. In AMS, basis functions, which are local (subdomain) solutions computed during the setup phase, are used to construct the coarse-scale system and the grid-transfer operators between the fine and coarse levels. The solution phase is composed of two stages: global and local. The global stage involves solving the coarse-scale system and interpolating the solution to the fine grid. The local stage involves applying a smoother to the fine-scale approximation. The design and implementation of a scalable AMS on multicore and many-core architectures, including the decomposition, memory allocation, data flow, and compute kernels, are described in detail. These adaptations are necessary to obtain good scalability on state-of-the-art HPC systems. The specific methods and parameters, such as the coarsening ratio (Cr), the basis-function solver, and the relaxation scheme, have significant effects on the asymptotic convergence rate and the parallel computational efficiency. The balance between convergence rate and parallel efficiency as a function of Cr and the local-stage parameters is analyzed in detail. The performance of AMS is demonstrated using heterogeneous 3D reservoir models, including geostatistically generated fields and models derived from SPE10 (Christie and Blunt 2001). The problems range in size from several million to 128 million cells. AMS shows excellent strong-scaling behavior; that is, performance on fixed-size problems as a function of the number of cores. Specifically, for a 128-million-cell problem, a ninefold speedup is obtained on a single-node 12-core shared-memory architecture (dual-socket multicore Intel Xeon E5-2620 v2), and more than a 12-fold speedup on a single-node 20-core shared-memory architecture (dual-socket multicore Intel Xeon E5-2690 v2). These are encouraging results given the limited memory bandwidth shared by the cores within a single node, which tends to be the major bottleneck for truly scalable solvers. We also compare the robustness and performance of our method with those of the parallel system algebraic multigrid (SAMG) solver (Stüben 2012) from Fraunhofer SCAI.
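For concreteness, the two-level structure described above can be summarized as one AMS iteration. The following is a minimal sketch only: the prolongation P assembles the locally computed basis functions as described in the summary, while the specific restriction R and smoother M (for instance, R = P^T in a Galerkin variant, with ILU(0) or damped-Jacobi relaxation) are illustrative assumptions; the paper itself analyzes particular choices of basis-function solver and relaxation scheme.

```latex
% One AMS iteration (sketch). Assumptions for illustration:
% P -- prolongation whose columns are the local basis functions (setup phase)
% R -- restriction operator (e.g., R = P^T in a Galerkin variant)
% M -- fine-scale smoother (e.g., ILU(0) or damped Jacobi)
\begin{align*}
  A_c      &= R A P
           & &\text{coarse-scale system (setup phase)}\\
  x^{k+1/2} &= x^{k} + P A_c^{-1} R \left(b - A x^{k}\right)
           & &\text{global stage: coarse solve + interpolation}\\
  x^{k+1}  &= x^{k+1/2} + M^{-1} \left(b - A x^{k+1/2}\right)
           & &\text{local stage: smoothing on the fine grid}
\end{align*}
```

In this reading, the cost of the setup phase (first line) is amortized over the solution phase, and the coarsening ratio Cr controls the size of A_c and the support of the basis functions, which is the trade-off between convergence rate and parallel efficiency that the paper analyzes.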