Weak Scale Research Articles

Exascale High Performance Computing (HPC) represents a tremendous opportunity to push the boundaries of Computational Fluid Dynamics (CFD), but despite the consolidated trend towards the use of Graphics Processing Units (GPUs), programmability is still an issue. STREAmS-2 (Bernardini et al. Comput. Phys. Commun. 285 (2023) 108644) is a compressible solver for canonical wall-bounded turbulent flows capable of harvesting the potential of NVIDIA GPUs. Here we extend the already available CUDA Fortran backend with a novel HIP backend targeting AMD GPU architectures. The main implementation strategies are discussed along with a novel Python tool that can generate the HIP and CPU code versions allowing developers to focus their attention only on the CUDA Fortran backend. Single GPU performance is analysed focusing on NVIDIA A100 and AMD MI250x cards which are currently at the core of several HPC clusters. The gap between peak GPU performance and STREAmS-2 performance is found to be generally smaller for NVIDIA cards. Roofline analysis allows tracing this behavior to unexpectedly different computational intensities of the same kernel using the two cards. Additional single-GPU comparisons are performed to assess the impact of grid size, number of parallelized loops, thread masking and thread divergence. Parallel performance is measured on the two largest EuroHPC pre-exascale systems, LUMI (AMD GPUs) and Leonardo (NVIDIA GPUs). Strong scalability reveals more than 80% efficiency up to 16 nodes for Leonardo and up to 32 for LUMI. Weak scalability shows an impressive efficiency of over 95% up to the maximum number of nodes tested (256 for LUMI and 512 for Leonardo). 
This analysis shows that STREAmS-2 is the perfect candidate to fully exploit the power of current pre-exascale HPC systems in Europe, allowing users to simulate flows with over a trillion mesh points, thus reducing the gap between the Reynolds numbers achievable in high-fidelity simulations and those of real engineering applications.
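The strong and weak scaling efficiencies quoted in the abstract above follow the usual definitions, sketched below. This is a minimal illustration and not code from STREAmS-2: the function names and the timings are hypothetical, chosen only to show the arithmetic.

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Fixed total problem size: ideal time per step scales as 1/n.

    t_ref, n_ref: time per step and node count of the reference run.
    t_n, n:       time per step and node count of the larger run.
    """
    return (t_ref * n_ref) / (t_n * n)


def weak_scaling_efficiency(t_ref, t_n):
    """Fixed problem size per node: ideal time per step stays constant."""
    return t_ref / t_n


# Hypothetical timings in seconds per step (NOT measurements from the paper).
strong = strong_scaling_efficiency(t_ref=8.0, n_ref=1, t_n=0.6, n=16)
weak = weak_scaling_efficiency(t_ref=1.00, t_n=1.04)
print(f"strong scaling efficiency: {strong:.1%}")
print(f"weak scaling efficiency:   {weak:.1%}")
```

In strong scaling the total mesh is fixed, so per-node work shrinks as nodes are added and communication eventually dominates. In weak scaling the per-node work is held constant, which is why efficiencies are typically higher there.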


Fourier pseudo-spectral methods for nonlinear partial differential equations are of wide interest in many areas of advanced computational science, including direct numerical simulation of three-dimensional (3-D) turbulence governed by the Navier-Stokes equations in fluid dynamics. This paper presents a new capability for simulating turbulence at a new record resolution up to 35 trillion grid points, on the world's first exascale computer, Frontier, comprising AMD MI250x GPUs with HPE's Slingshot interconnect and operated by the US Department of Energy's Oak Ridge Leadership Computing Facility (OLCF). Key programming strategies designed to take maximum advantage of the machine architecture involve performing almost all computations on the GPU which has the same memory capacity as the CPU, performing all-to-all communication among sets of parallel processes directly on the GPU, and targeting GPUs efficiently using OpenMP offloading for intensive number-crunching including 1-D Fast Fourier Transforms (FFT) performed using AMD ROCm library calls. With 99% of computing power on Frontier being on the GPU, leaving the CPU idle leads to a net performance gain via avoiding the overhead of data movement between host and device except when needed for some I/O purposes. Memory footprint including the size of communication buffers for MPI_ALLTOALL is managed carefully to maximize the largest problem size possible for a given node count. Detailed performance data including separate contributions from different categories of operations to the elapsed wall time per step are reported for five grid resolutions, from 2048^3 on a single node to 32768^3 on 4096 or 8192 nodes out of 9408 on the system. Both 1D and 2D domain decompositions which divide a 3D periodic domain into slabs and pencils respectively are implemented.
The present code suite (labeled by the acronym GESTS, GPUs for Extreme Scale Turbulence Simulations) achieves a figure of merit (in grid points per second) exceeding goals set in the Center for Accelerated Application Readiness (CAAR) program for Frontier. The performance attained is highly favorable in both weak scaling and strong scaling, with notable departures only for 2048^3 where communication is entirely intra-node, and for 32768^3, where a challenge due to small message sizes does arise. Communication performance is addressed further using a lightweight test code that performs all-to-all communication in a manner matching the full turbulence simulation code. Performance at large problem sizes is affected by both small message size due to high node counts as well as dragonfly network topology features on the machine, but is consistent with official expectations of sustained performance on Frontier. Overall, although not perfect, the scalability achieved at the extreme problem size of 32768^3 (and up to 8192 nodes, which corresponds to hardware rated at just under 1 exaflop/sec of theoretical peak computational performance) is arguably better than the scalability observed using prior state-of-the-art algorithms on Frontier's predecessor machine (Summit) at OLCF. New science results for the study of intermittency in turbulence enabled by this code and its extensions are to be reported separately in the near future.
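The grid sizes and the slab versus pencil decompositions described above can be checked with a little arithmetic. The sketch below is not taken from GESTS: the helper names are hypothetical, and the rank limits assume the textbook constraint that every process owns at least one full plane (slabs) or one full line (pencils) of an N^3 grid.

```python
def grid_points(n):
    """Total number of points in a cubic n x n x n grid."""
    return n ** 3


def max_ranks(n, decomposition):
    """Upper bound on the number of parallel processes for an n^3 grid.

    Assumes each process needs at least one plane ("slab", 1D decomposition)
    or at least one line ("pencil", 2D decomposition).
    """
    if decomposition == "slab":
        return n
    if decomposition == "pencil":
        return n * n
    raise ValueError(f"unknown decomposition: {decomposition!r}")


for n in (2048, 32768):
    print(f"N = {n}: {grid_points(n):,} grid points, "
          f"slab limit {max_ranks(n, 'slab'):,} ranks, "
          f"pencil limit {max_ranks(n, 'pencil'):,} ranks")

# 32768^3 = 2^45 = 35,184,372,088,832, i.e. about 35 trillion grid points.
```

Under these assumptions the pencil layout admits far more processes than the slab layout for the same grid, at the cost of an extra all-to-all transpose per 3D transform.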


Articles published on Weak Scale

From GPU to CPU (and Beyond): Extending Hardware Support in GPUSPH Through a SYCL‐Inspired Interface

Decoding the Gaugino Code Naturally at High-Lumi LHCs

Bringing the Peccei-Quinn mechanism down to Earth

High-speed turbulent flows towards the exascale: STREAmS-2 porting and performance

Fast inspirals and the treatment of orbital resonances

Sustainable rewritable paper based on photoresponsive tungsten oxide quantum dots for anti-counterfeiting and waterproofing

Resolving the ultracollinear paradox with effective field theory

GPU-enabled extreme-scale turbulence simulations: Fourier pseudo-spectral algorithms at the exascale using OpenMP offloading

A multiverse outside of the swampland

Local Second Order Møller-Plesset Theory with a Single Threshold Using Orthogonal Virtual Orbitals: A Distributed Memory Implementation.

Superior ablation resistance of plasma sprayed SiC based coating by structural optimized powders

Massively parallel axisymmetric fluid model for streamer discharges

Toward an efficient second‐order method for computing the surface gravitational potential on spherical‐polar meshes

Thermoviscous dissipation of nonlinear acoustic waves in channels with wavy walls.

An approximate block factorization preconditioner for mixed-dimensional beam-solid interaction

Composite dark matter and neutrino masses from a light hidden sector

Electroweak evolution equations and isospin conservation

Weak time-scale separation at the onset of oscillatory magnetoconvection in rapidly rotating fluids

Stau pairs from natural SUSY at high luminosity LHC

Asynchronous global–local non-invasive coupling for nonlinear monotone patches: Application to plasticity problems
