We propose a simple one-step index (OSI) algorithm for solving the lattice Boltzmann equation, particularly the streaming of particle distribution functions (PDFs) on a single grid system. The OSI algorithm is derived from the conventional A-B pattern. In this algorithm, the memory addresses of the PDFs are fixed and consistent with collision principles. The streaming process is computed implicitly by reassigning the PDF indexes according to the time step, spatial coordinates, and direction. The algorithm is simple to program because it reads and writes each PDF only once per time step and does not require the synchronization of odd and even time steps. In this implementation, the PDFs are stored in a structure-of-arrays (SoA) layout, which suits the memory access pattern of graphics processing units (GPUs). The accuracy and single-precision performance of the proposed algorithm were validated and tested for a three-dimensional lid-driven cavity flow simulation with the D3Q19 model on an NVIDIA A100 (40 GB, PCIe) using CUDA and OpenACC. Performances of 8.4 and 8.1 giga lattice updates per second were obtained for CUDA and OpenACC, respectively. OpenACC reached up to 95% of the CUDA performance with significantly less programming work. The bandwidth usage rates on a single GPU were 96% and 94% for CUDA and OpenACC, respectively, close to the theoretical values. For multi-GPU runs, the lattice Boltzmann method was parallelized using CUDA and MPI. Finally, computation was overlapped with communication to improve parallel efficiency, and the weak-scaling parallel efficiency exceeded 0.98 on up to 512 GPUs.
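The idea of streaming by index reassignment rather than by data movement can be illustrated with a toy example. The sketch below is not the OSI algorithm itself, whose details are not given in this abstract; it assumes a periodic one-dimensional D1Q3 lattice, a plain BGK collision, and a time-shifted address rule, and all names (`soa`, `slot`, `step_ab`, `step_indexed`, `run`) are hypothetical. It compares a conventional two-array A-B step with a single-array step in which every PDF keeps its memory address and is read and written once per time step.

```python
# Toy D1Q3 lattice on a periodic 1D domain (assumed setup, not the paper's D3Q19 case).
Q = 3
C = (0, 1, -1)            # lattice velocities
W = (2 / 3, 1 / 6, 1 / 6) # lattice weights
OMEGA = 1.2               # BGK relaxation rate (arbitrary choice)


def soa(i, x, n):
    """Flat structure-of-arrays address: all nodes of direction i are contiguous."""
    return i * n + x


def collide(f):
    """BGK collision on the Q PDFs of one node; returns post-collision values."""
    rho = sum(f)
    u = (f[1] - f[2]) / rho
    out = []
    for i in range(Q):
        cu = C[i] * u
        feq = W[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u * u)
        out.append(f[i] + OMEGA * (feq - f[i]))
    return out


def step_ab(src, n):
    """Reference A-B pattern: read from src, collide, push into a second array."""
    dst = [0.0] * (Q * n)
    for x in range(n):
        post = collide([src[soa(i, x, n)] for i in range(Q)])
        for i in range(Q):
            dst[soa(i, (x + C[i]) % n, n)] = post[i]
    return dst


def slot(i, x, t, n):
    """Assumed address rule: where the logical PDF f_i(x, t) lives in storage."""
    return soa(i, (x - C[i] * t) % n, n)


def step_indexed(f, t, n):
    """Single-array step: collide in place at time-shifted addresses.
    No value is moved; advancing t by one plays the role of streaming."""
    for x in range(n):
        addr = [slot(i, x, t, n) for i in range(Q)]
        post = collide([f[a] for a in addr])
        for i in range(Q):
            f[addr[i]] = post[i]


def run(n=8, steps=5):
    """Advance both variants and return their largest absolute difference."""
    init = [W[i] * (1.0 + 0.1 * ((7 * x + i) % 5)) for i in range(Q) for x in range(n)]
    ab = list(init)
    one = list(init)
    for t in range(steps):
        ab = step_ab(ab, n)
        step_indexed(one, t, n)
    return max(
        abs(ab[soa(i, x, n)] - one[slot(i, x, steps, n)])
        for i in range(Q)
        for x in range(n)
    )
```

Calling `run()` advances both variants from the same initial state and returns the largest difference between them, read back through the shifted addresses. In this toy setting each storage slot is touched by exactly one node per step, so the in-place update needs no second array; handling walls, such as those of a lid-driven cavity, would require additional rules that this sketch does not attempt.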