Abstract

This paper describes optimizations for high-dimensional stencil computations with complex memory access patterns on accelerators, as they appear in the five-dimensional fusion plasma turbulence codes GYSELA and GT5D. The two codes exhibit different types of memory access patterns: indirect memory access in GYSELA, which uses a Semi-Lagrangian scheme, and strided memory access in GT5D, which uses a Finite-Difference scheme. We focus on the affinity of these memory access patterns to accelerators such as GPGPUs and Xeon Phi coprocessors. On both devices, the Array of Structures of Arrays (AoSoA) data layout is preferable because it yields contiguous memory accesses. On Xeon Phi, effective use of the local cache, achieved by improving spatial and temporal data locality, is shown to be critical. On GPGPU, using texture memory improves the performance of the indirect memory accesses in the Semi-Lagrangian scheme, and reusing registers by exploiting the physical symmetry of the Finite-Difference scheme reduces the number of memory accesses. Through these optimizations, we achieve speedups of 3.9x on Xeon Phi and 8.1x on GPGPU for the Semi-Lagrangian scheme, and of 1.4x on Xeon Phi and 3.9x on GPGPU for the Finite-Difference scheme, relative to the fully optimized codes on Sandy Bridge.
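To illustrate the AoSoA idea mentioned above, the following is a minimal sketch of how a flat AoSoA buffer can be indexed. It is not taken from GYSELA or GT5D: the function names `aosoa_offset` and `pack_aosoa`, the parameter `tile` (standing in for a SIMD or warp width), and the notion of "fields" are assumptions made purely for illustration. Elements are grouped into blocks of `tile` consecutive indices, and within each block every field occupies its own contiguous run of `tile` values, so a vector lane group reading one field touches adjacent memory.

```python
def aosoa_offset(i, field, n_fields, tile):
    """Flat offset of element i of the given field in an AoSoA buffer.

    Hypothetical layout: [block][field][lane], with lane the fastest index.
    """
    block, lane = divmod(i, tile)
    return (block * n_fields + field) * tile + lane


def pack_aosoa(fields, tile):
    """Pack equal-length lists (one list per field) into one flat AoSoA list."""
    n_fields = len(fields)
    n = len(fields[0])
    n_blocks = -(-n // tile)  # ceiling division; the last block is zero-padded
    buf = [0.0] * (n_blocks * n_fields * tile)
    for f, values in enumerate(fields):
        for i, v in enumerate(values):
            buf[aosoa_offset(i, f, n_fields, tile)] = v
    return buf
```

With two fields of four elements and `tile = 2`, the packed buffer interleaves the fields in runs of two values each, in contrast to a plain Array of Structures (fields alternate per element) or a plain Structure of Arrays (each field stored in full before the next).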
