Fourier pseudo-spectral methods for nonlinear partial differential equations are of wide interest in many areas of advanced computational science, including direct numerical simulation of three-dimensional (3-D) turbulence governed by the Navier-Stokes equations in fluid dynamics. This paper presents a new capability for simulating turbulence at a record resolution of up to 35 trillion grid points on the world's first exascale computer, Frontier, comprising AMD MI250x GPUs with HPE's Slingshot interconnect and operated by the US Department of Energy's Oak Ridge Leadership Computing Facility (OLCF). Key programming strategies designed to take maximum advantage of the machine architecture involve performing almost all computations on the GPU, which has the same memory capacity as the CPU; performing all-to-all communication among sets of parallel processes directly on the GPU; and targeting GPUs efficiently using OpenMP offloading for intensive number-crunching, including 1-D Fast Fourier Transforms (FFTs) performed using AMD ROCm library calls. With 99% of Frontier's computing power residing on the GPUs, leaving the CPU idle yields a net performance gain by avoiding the overhead of data movement between host and device, except when needed for some I/O purposes. Memory footprint, including the size of communication buffers for MPI_ALLTOALL, is managed carefully to maximize the problem size possible for a given node count. Detailed performance data, including separate contributions from different categories of operations to the elapsed wall time per step, are reported for five grid resolutions, from 2048³ on a single node to 32768³ on 4096 or 8192 nodes out of 9408 on the system. Both 1-D and 2-D domain decompositions, which divide a 3-D periodic domain into slabs and pencils respectively, are implemented.
The present code suite (labeled by the acronym GESTS, GPUs for Extreme Scale Turbulence Simulations) achieves a figure of merit (in grid points per second) exceeding goals set in the Center for Accelerated Application Readiness (CAAR) program for Frontier. The performance attained is highly favorable in both weak scaling and strong scaling, with notable departures only for 2048³, where communication is entirely intra-node, and for 32768³, where small message sizes pose a challenge. Communication performance is addressed further using a lightweight test code that performs all-to-all communication in a manner matching the full turbulence simulation code. Performance at large problem sizes is affected both by small message sizes due to high node counts and by features of the dragonfly network topology on the machine, but is consistent with official expectations of sustained performance on Frontier. Overall, although not perfect, the scalability achieved at the extreme problem size of 32768³ (and up to 8192 nodes, which corresponds to hardware rated at just under 1 exaflop/sec of theoretical peak computational performance) is arguably better than the scalability observed using prior state-of-the-art algorithms on Frontier's predecessor machine (Summit) at OLCF. New science results for the study of intermittency in turbulence enabled by this code and its extensions are to be reported separately in the near future.
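
The problem sizes and decompositions quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of GESTS: the function names are hypothetical, the five resolutions are assumed to be successive doublings from 2048 to 32768, and 4-byte (single-precision) words are assumed purely to size one scalar field. The rank limits reflect the general property that a slab (1-D) decomposition of an N³ grid can use at most N processes, while a pencil (2-D) decomposition can use up to N².

```python
# Back-of-the-envelope sizing for an N^3 periodic grid.
# All names here are hypothetical; nothing below is taken from the GESTS code.

def grid_points(n: int) -> int:
    """Total number of grid points in an n x n x n domain."""
    return n ** 3


def max_ranks(n: int, decomposition: str) -> int:
    """Upper bound on the number of processes for a given decomposition.

    A slab (1-D) decomposition splits one direction only, so at most n
    processes can each hold one plane. A pencil (2-D) decomposition splits
    two directions, so at most n * n processes can each hold one line.
    """
    if decomposition == "slab":
        return n
    if decomposition == "pencil":
        return n * n
    raise ValueError(f"unknown decomposition: {decomposition!r}")


def bytes_per_node(n: int, nodes: int, bytes_per_word: int = 4) -> float:
    """Memory for ONE scalar field per node (assumes 4-byte words by default)."""
    return grid_points(n) * bytes_per_word / nodes


if __name__ == "__main__":
    # Assumed resolutions: successive doublings between the two quoted endpoints.
    for n in (2048, 4096, 8192, 16384, 32768):
        print(
            f"N={n:6d}  points={grid_points(n):.3e}  "
            f"slab ranks<={max_ranks(n, 'slab')}  "
            f"pencil ranks<={max_ranks(n, 'pencil')}"
        )
    gib = bytes_per_node(32768, 4096) / 2**30
    print(f"One single-precision field at 32768^3 on 4096 nodes: {gib:.0f} GiB per node")
```

For N = 32768 the point count is 2⁴⁵ = 35,184,372,088,832, which matches the 35 trillion grid points quoted above. Under the single-precision assumption, one scalar field at that resolution spread over 4096 nodes occupies 32 GiB per node, which gives a sense of why the memory footprint, including communication buffers, has to be managed carefully.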