Peridynamics (PD) using integral equations has unique advantages in modeling structural damage evolution, but the expensive computational and memory costs associated with the nonlocal nature of PD severely limit its engineering applications. This study presents a CUDA-based multi-GPU parallel scheme for meshfree PD simulation, which is compatible with dual-horizon PD supporting multiscale spatial discretization. Efficient utilization of computing resources across all available GPUs is achieved by rationalizing data structures across multiple GPUs and using the CUDA memory model. Using atomic operations to avoid computational dual-bond forces further improves code performance. The classical Kalthoff-Winkler experiment and compact-tensile fatigue test were first carried out using 3 NVIDIA Tesla T4 GPUs. Two examples showed that the proposed scheme significantly accelerates the PD model and that the acceleration enhances approximately linearly with the number of GPUs. The speedup of the PD interaction force computation reaches up to 560.7x. Compared to double-precision floating-point values, the GPU performs much better in handling single-precision values. Finally, a PD simulation of rail head heat conduction under moving heat sources with nearly 76 million particles was realized, demonstrating the potential of the proposed scheme to support large-scale PD simulations.