Current supercomputers often have a heterogeneous architecture that combines conventional Central Processing Units (CPUs) with Graphics Processing Units (GPUs). At the same time, numerical simulation tasks frequently involve multiphysics scenarios whose components run on different hardware for various reasons, such as architectural requirements or pragmatism. This leads naturally to a software design in which different simulation modules are mapped to different subsystems of the heterogeneous architecture. We present a detailed performance analysis of such a hybrid four-way coupled simulation of a fully resolved particle-laden flow. The Eulerian representation of the flow runs on GPUs, while the Lagrangian model for the particles runs on conventional CPUs. Two characteristic model situations, a dense and a dilute particle system, serve as benchmark scenarios. First, a roofline model is employed to predict the node-level performance and to show that the lattice-Boltzmann-based Eulerian fluid simulation reaches very good performance on a single GPU. Furthermore, the GPU-GPU communication in a large-scale Eulerian flow simulation causes only moderate slowdowns. This is due to the efficiency of the CUDA-aware MPI communication, combined with the use of communication hiding techniques. On 1024 A100 GPUs, an overall parallel efficiency of up to 71% is achieved. While the flow simulation has good performance characteristics, the integration of the stiff Lagrangian particle system requires frequent CPU-CPU communication that can become a bottleneck, especially when simulating the dense particle system. Special attention is also paid to the CPU-GPU communication overhead, since this communication is essential for coupling the particles to the flow simulation. Thanks to our problem-aware co-partitioning, however, this overhead is found to be negligible.
As a lesson learned from this development, four criteria are postulated that a hybrid implementation must meet for the efficient use of heterogeneous supercomputers.