An efficient GPU implementation and scaling for higher-order 3D stencils

Omer Anjum,Mohammad Almasri,Simon Garcia De Gonzalo,Wen-Mei Hwu

doi:10.1016/j.ins.2021.11.042

Abstract

Stencil computation patterns are the backbone of many scientific and engineering simulations. The stencil computation is known to be constrained by its high demand of memory bandwidth, which limits performance on accelerators such as GPUs. Prior GPU-based approaches concentrated on stencils with only axis-aligned grid points with 2D caching schemes. However, for stencils with non-axis grid points, prior approaches use 3D caching or multi-pass 2D caching schemes. These methods suffer from either large number of global memory accesses or required large size of the shared memory. In this work, we present an efficient GPU implementation scheme “Scatter Without Write Conflict” (SWiC) for large advanced 3D stencil patterns involving non-axis-aligned grid points. Unlike other 3D caching schemes, SWiC only needs 2D caching and a single pass over the stencil per iteration. SWiC achieves, significant reductions in the global memory accesses, without increasing the size of shared memory. SWiC can also be applied to simple axis-aligned stencils without any performance loss. Moreover, we propose a scalable implementation for the halo region exchange of 3D stencils on multi-GPU nodes. For evaluation, we test SWiC on three Nvidia GPU generations and show that our approach significantly outperforms existing state-of-the-art GPU implementations with a speedup ranging from 1.6× to 5.75×. We also provide a detailed scaling analysis in multi-node and multi-GPU environments.

Full Text