This work presents a graphics processing unit (GPU) parallel algorithm for a cell-centered finite volume lattice Boltzmann method (FVLBM) on unstructured meshes. The parallelization is performed in physical space. To reduce the frequency of GPU memory accesses, the algorithm uses coalesced access to GPU memory. In addition, to avoid both race conditions, which lead to data anomalies such as dirty reads or phantom reads, and double counting in the flux calculation, the efficient face-based data structure commonly used for the flux calculation in cells in the central processing unit (CPU) version of FVLBM is modified into a face-based loop that computes the fluxes on all faces, followed by a cell-based loop that assembles the final residuals in all cells. The proposed GPU parallel algorithm therefore needs no resource locks and retains the high efficiency of the face-based data structure in the flux computation, which enhances its parallel efficiency. To demonstrate the computational efficiency of the proposed algorithm, several benchmark cases are computed in double precision on an NVIDIA GeForce RTX 3090 Ti GPU: (a) the lid-driven flow in a two-dimensional (2D) square cavity, (b) a 2D flow past a cylinder, and (c) the lid-driven flow in a three-dimensional (3D) cubic cavity. The numerical results show that the proposed GPU parallel algorithm can be as accurate as the original serial CPU scheme while achieving a speedup of 1 to 2 orders of magnitude.
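The two-loop structure described in the abstract can be illustrated with a minimal serial sketch, written here in Python for readability rather than in CUDA. The toy mesh, the placeholder flux (a plain difference of cell values standing in for the actual FVLBM flux), and all function and variable names are assumptions made for illustration; none of them come from the paper. In a GPU version, each iteration of the face loop and of the cell loop would presumably map to one thread.

```python
# Hypothetical toy mesh: 4 cells, 4 interior faces.
# face_cells[f] = (owner, neighbour); a flux is counted positive when it
# leaves the owner cell and enters the neighbour cell.
face_cells = [(0, 1), (2, 3), (0, 2), (1, 3)]
phi = [1.0, 2.0, 4.0, 8.0]  # one scalar per cell (illustrative data)
n_cells = len(phi)


def face_fluxes(face_cells, phi):
    """Pass 1 (face-based): every face computes its flux exactly once.

    Each iteration writes only flux[f], so iterations are independent.
    The difference below is a placeholder, not the FVLBM flux.
    """
    return [phi[owner] - phi[neigh] for owner, neigh in face_cells]


def build_cell_faces(n_cells, face_cells):
    """Precompute, for each cell, its faces and the sign of each flux."""
    cell_faces = [[] for _ in range(n_cells)]
    for f, (owner, neigh) in enumerate(face_cells):
        cell_faces[owner].append((f, 1.0))   # flux leaves the owner
        cell_faces[neigh].append((f, -1.0))  # same flux enters the neighbour
    return cell_faces


def cell_residuals(cell_faces, flux):
    """Pass 2 (cell-based): every cell gathers the fluxes of its own faces.

    Each iteration writes only its own residual entry, so no two
    iterations write to the same location and no lock is required.
    """
    return [sum(sign * flux[f] for f, sign in faces) for faces in cell_faces]


def scatter_residuals(n_cells, face_cells, phi):
    """Serial reference: one face loop that scatters into both cells.

    Run in parallel over faces, the two updates below could collide
    when two faces share a cell.
    """
    res = [0.0] * n_cells
    for owner, neigh in face_cells:
        flux = phi[owner] - phi[neigh]
        res[owner] += flux
        res[neigh] -= flux
    return res


flux = face_fluxes(face_cells, phi)
cell_faces = build_cell_faces(n_cells, face_cells)
residual_two_pass = cell_residuals(cell_faces, flux)
residual_scatter = scatter_residuals(n_cells, face_cells, phi)
```

Both variants give the same residuals on this mesh. The two-pass form trades extra storage for the face fluxes against the freedom to run every face and every cell independently.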