This paper presents a parallel implementation of the Discrete Unified Gas Kinetic Scheme (DUGKS) on the GPU system using the CUDA Fortran and CUDA C++ programming languages. Firstly, we conducted an extensive revision of our original CPU-based code, resulting in a threefold decrease in memory usage. This new implementation is also paired with a novel approach to compute cell face flux using trilinear interpolation. It is shown analytically that the interpolation-based approach to flux calculation is more accurate compared to the one used in the original DUGKS. The initial simulation results using this new approach suggest that trilinear interpolation can reduce numerical errors on a coarse mesh. For example, in the case of the decaying Taylor-Green vortex flow at a 1283 mesh resolution, the relative numerical error in the energy dissipation rate at t⁎=2, using the spectral simulation result as the benchmark, is approximately 30% lower than that of the original implementation. The improved GPU DUGKS method is applied to laminar and turbulent flows in periodic and wall-bounded boundary configurations. A performance comparison of the GPU implementation is also presented and compared to the previous CPU implementation. A maximum speedup of 7.64x was achieved on a desktop-level GPU compared to a 32-core CPU. The strong scaling test, conducted on an eight-GPU node, demonstrated the efficient utilization of available multiple GPU resources by the code. Program summaryProgram Title: DUGKS-GPUCPC Library link to program files:https://doi.org/10.17632/yykv5s9g2n.1Developer's repository link:https://github.com/kairzhan/DUGKS-GPUCode Ocean capsule:https://codeocean.com/capsule/3b1e4f74-cdd3-4781-8923-8514d5923dfb/Licensing provisions: GNU General Public License 3Programming language: CUDA C++, CUDA FortranSupplementary material:Nature of problem: DUGKS-GPU is an advanced numerical code designed to accelerate the simulation of turbulent fluid flows through the application of the kinetic method known as the Discrete Unified Gas Kinetic Scheme (DUGKS). This computational tool comprises a CUDA Fortran version, tailored for a single GPU, as well as a CUDA C++ version developed for multi-GPU platforms.Solution method: The code employs a second-order finite-volume discretization of the discrete velocity Boltzmann equation with the BGK collision model. It can be used to simulate small to medium-scale problems on modern multi-GPU platforms with CUDA architecture.