The solution of the pressure Poisson equation arising in the numerical solution of the incompressible Navier–Stokes equations (INSE) is by far the most expensive part of the computational procedure, and often the main limiting factor for parallel implementations. Improvements in iterative linear solvers, e.g. Krylov-based techniques and multigrid preconditioners, have been applied successfully to the solution of the INSE on CPU-based parallel computers. These numerical schemes, however, do not necessarily perform well on GPUs, mainly due to differences in the hardware architecture. Our previous work using many P100 GPUs of a flagship supercomputer showed that porting a highly optimized MPI-parallel CPU-based INSE solver to GPUs significantly accelerates the underlying numerical algorithms, while the overall speedup remains limited (Zolfaghari et al. [3]). The performance loss was mainly due to the Poisson solver, in particular the V-cycle geometric multigrid preconditioner. We also observed that the pure compute time of the GPU kernels remained nearly constant as the grid size increased. Motivated by these observations, we present herein an algebraically simpler, yet more advanced parallel implementation for the solution of the Poisson problem on large numbers of distributed GPUs. Data parallelism is achieved by using the classical Jacobi method with successive over-relaxation and an optimized iterative driver routine. Task parallelism is enhanced by minimizing GPU-GPU data exchanges as the iterations proceed, which reduces the communication overhead. The hybrid parallelism yields a nearly 300-fold reduction in time-to-solution, and hence in computational cost (measured in node-hours), for the Poisson problem compared to our best-case CPU-based parallel implementation, which uses a preconditioned BiCGstab method. The Poisson solver is then embedded in a flow solver with an explicit third-order Runge-Kutta scheme for time integration, which had previously been ported to GPUs. The flow solver is validated and computationally benchmarked for the transition and decay of the Taylor-Green vortex at Re = 1600 and for the flow around a solid sphere at Re_D = 3700. Good strong scaling is demonstrated for both benchmarks. Further, nearly 70% lower electrical energy consumption than the CPU implementation is reported for the Taylor-Green vortex case. We finally deploy the solver for DNS of systolic flow in a bileaflet mechanical heart valve and present new insight into the complex laminar-turbulent transition process in this prosthesis.
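To make the relaxation step concrete, the sketch below shows a single-GPU, relaxed Jacobi sweep for a 3D pressure Poisson problem with a 7-point stencil and ping-pong buffers, written in CUDA. It illustrates only the general technique named in the abstract, not the paper's implementation: the grid size N, the relaxation factor OMEGA, the zero Dirichlet boundaries, the constant source term, and the fixed sweep count are all assumptions for this example, and the multi-GPU halo exchanges whose minimization the abstract describes are omitted.

```cuda
// Illustrative sketch only: a relaxed Jacobi sweep for the 3D Poisson
// equation  lap(p) = f  on a uniform grid with spacing H, using a 7-point
// stencil and ping-pong buffers. Grid size, relaxation factor, boundary
// conditions, and iteration count are hypothetical choices, not taken from
// the paper; inter-GPU halo exchanges are omitted.
#include <cstdio>
#include <cuda_runtime.h>

#define N      64          // interior points per direction (assumed)
#define H      (1.0 / N)   // grid spacing
#define OMEGA  0.9         // relaxation factor (placeholder value)

__device__ __host__ inline int idx(int i, int j, int k) {
    return (k * (N + 2) + j) * (N + 2) + i;   // layout with one ghost layer
}

// One relaxed Jacobi sweep: p_new = (1 - omega) * p_old + omega * p_jacobi
__global__ void jacobi_sweep(const double* __restrict__ p_old,
                             double* __restrict__ p_new,
                             const double* __restrict__ f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    int k = blockIdx.z * blockDim.z + threadIdx.z + 1;
    if (i > N || j > N || k > N) return;      // skip threads outside interior

    double nb = p_old[idx(i - 1, j, k)] + p_old[idx(i + 1, j, k)]
              + p_old[idx(i, j - 1, k)] + p_old[idx(i, j + 1, k)]
              + p_old[idx(i, j, k - 1)] + p_old[idx(i, j, k + 1)];
    double p_jac = (nb - H * H * f[idx(i, j, k)]) / 6.0;
    p_new[idx(i, j, k)] = (1.0 - OMEGA) * p_old[idx(i, j, k)] + OMEGA * p_jac;
}

int main() {
    size_t bytes = (size_t)(N + 2) * (N + 2) * (N + 2) * sizeof(double);
    double *p0, *p1, *f;
    cudaMallocManaged(&p0, bytes);
    cudaMallocManaged(&p1, bytes);
    cudaMallocManaged(&f,  bytes);
    cudaMemset(p0, 0, bytes);                 // zero initial guess and
    cudaMemset(p1, 0, bytes);                 // zero Dirichlet boundaries
    for (int k = 0; k <= N + 1; ++k)          // constant source term (assumed)
        for (int j = 0; j <= N + 1; ++j)
            for (int i = 0; i <= N + 1; ++i)
                f[idx(i, j, k)] = 1.0;

    dim3 block(8, 8, 8);
    dim3 grid((N + 7) / 8, (N + 7) / 8, (N + 7) / 8);
    for (int it = 0; it < 200; ++it) {        // fixed sweep count for brevity
        jacobi_sweep<<<grid, block>>>(p0, p1, f);
        double* tmp = p0; p0 = p1; p1 = tmp;  // swap ping-pong buffers
    }
    cudaDeviceSynchronize();
    printf("p at domain centre: %e\n", p0[idx(N / 2, N / 2, N / 2)]);

    cudaFree(p0); cudaFree(p1); cudaFree(f);
    return 0;
}
```

In a distributed setting, each sweep would be followed by (or overlapped with) an exchange of the ghost layers between neighbouring GPUs; the communication-reduction strategy mentioned in the abstract concerns how often and how much of this exchange is performed as the iterations proceed.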