Summary With the advent of the multicore central-processing unit (CPU), today's commodity PC clusters are effectively a collection of interconnected parallel computers, each with multiple multicore CPUs and large shared random access memory (RAM), connected together by means of high-speed networks. Each computer, referred to as a compute node, is a powerful parallel computer on its own. Each compute node can be equipped further with acceleration devices such as the general-purpose graphical processing unit (GPGPU) to further speed up computational-intensive portions of the simulator. Reservoir-simulation methods that can exploit this heterogeneous hardware system can be used to solve very-large-scale reservoir-simulation models and run significantly faster than conventional simulators. Because typical PC clusters are essentially distributed share-memory computers, this suggests that the use of the mixed-paradigm parallelism (distributed-shared memory), such as message-passing interface and open multiprocessing (MPI-OMP), should work well for computational efficiency and memory use. In this work, we compare and contrast the single-paradigm programming models, MPI or OMP, with the mixed paradigm, MPI-OMP, programming model for a class of solver method that is suited for the different modes of parallelism. The results showed that the distributed memory (MPI-only) model has superior multicompute-node scalability, whereas the shared memory (OMP-only) model has superior parallel performance on a single compute node. The mixed MPI-OMP model and OMP-only model are more memory-efficient for the multicore architecture than the MPI-only model because they require less or no halo-cell storage for the subdomains. To exploit the fine-grain shared memory parallelism available on the GPGPU architecture, algorithms should be suited to the single-instruction multiple-data (SIMD) parallelism, and any recursive operations are serialized. In addition, solver methods and data store need to be reworked to coalesce memory access and to avoid shared memory-bank conflicts. Wherever possible, the cost of data transfer through the peripheral component interconnect express (PCIe) bus between the CPU and GPGPU needs to be hidden by means of asynchronous communication. We applied multiparadigm parallelism to accelerate compositional reservoir simulation on a GPGPU-equipped PC cluster. On a dual-CPU-dual-GPGPU compute node, the parallelized solver running on the dual-GPGPU Fermi M2090Q achieved up to 19 times speedup over the serial CPU (1-core) results and up to 3.7 times speedup over the parallel dual-CPU X5675 results in a mixed MPI + OMP paradigm for a 1.728-million-cell compositional model. Parallel performance shows a strong dependency on the subdomain sizes. Parallel CPU solve has a higher performance for smaller domain partitions, whereas GPGPU solve requires large partitions for each chip for good parallel performance. This is related to improved cache efficiency on the CPU for small subdomains and the loading requirement for massive parallelism on the GPGPU. Therefore, for a given model, the multinode parallel performance decreases for the GPGPU relative to the CPU as the model is further subdivided into smaller subdomains to be solved on more compute nodes. To illustrate this, a modified SPE5 (Killough and Kossack 1987) model with various grid dimensions was run to generate comparative results. Parallel performances for three field compositional models of various sizes and dimensions are included to further elucidate and contrast CPU-GPGPU single-node and multiple-node performances. A PC cluster with the Tesla M2070Q GPGPU and the 6-core Xeon X5675 Westmere was used to produce the majority of the reported results. Another PC cluster with the Tesla M2090Q GPGPU was available for some cases, and the results are reported for the modified SPE5 (Killough and Kossack 1987) problems for comparison.
Read full abstract