GPU-HADVPPM: high-efficient parallel GPU design of the Piecewise Parabolic Method (PPM) for horizontal advection in air quality model (CAMx)

Kai Cao, Qizhong Wu

doi:10.5194/egusphere-egu23-4859

Publication Date: May 15, 2023

Abstract

With semiconductor technology gradually approaching its physical and thermal limits, graphics processing units (GPUs) are becoming an attractive solution in many scientific applications due to their high performance. This paper presents an application of GPU accelerators in an air quality model. We demonstrate an approach that runs a PPM solver of horizontal advection (HADVPPM) for the air quality model CAMx on GPU clusters. Specifically, we first convert HADVPPM from its original Fortran form to new Compute Unified Device Architecture C (CUDA C) code so that it can run on the GPU (GPU-HADVPPM). Then, a series of optimization measures are taken to improve the overall computing performance of CAMx, including reducing the CPU-GPU communication frequency, increasing the amount of data computed on the GPU, and optimizing the GPU memory access order. Finally, a heterogeneous, hybrid programming paradigm (MPI+CUDA) is presented and used with GPU-HADVPPM on GPU clusters. After the consistency of the results is verified, offline experiments show that running GPU-HADVPPM on a single K40 GPU and a single V100 GPU achieves up to 845.4x and 1113.6x acceleration, respectively. With the optimization schemes applied, the CAMx model coupled with GPU-HADVPPM shows a 12.7x and 94.8x improvement in computational efficiency using one GPU accelerator card on the K40 and V100 clusters, respectively. The multi-GPU acceleration algorithm achieves a 3.9x speedup with 8 CPU cores and 8 GPU accelerators on the V100 cluster.

Figure 1. The calling and computation process of the HADVPPM function on the CPU-GPU.

Figure 2. (a) The offline performance of the HADVPPM scheme on CPU and GPU. The wall times for the offline performance experiments are in milliseconds (ms); (b) The total elapsed time of CAMx-CUDA V1.3 on multiple GPUs. The elapsed times are in seconds (s). The orange bar indicates the elapsed time of CAMx on the CPU, the blue bar shows the elapsed time on the CPU-GPU heterogeneous platform, and the red line indicates the speedup ratio on the heterogeneous platform.
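
The abstract does not say how the GPU memory access order was optimized. As general background only, and not as the authors' implementation, the sketch below shows the index arithmetic that usually matters when a Fortran (column-major) array is addressed from C-style code: neighbouring GPU threads read neighbouring memory only if they are mapped to the fastest-varying index. The function names, the array shape, and the choice of Python for the arithmetic are all assumptions made for illustration.

```python
# Illustrative sketch only: offsets into a Fortran-ordered (column-major)
# 3-D array when it is viewed as a flat buffer from C/CUDA-style code.
# All names and sizes here are hypothetical, not taken from CAMx.

def flat_index(i, j, k, nx, ny):
    """0-based offset of element (i, j, k) in a column-major nx*ny*nz array."""
    return i + nx * (j + ny * k)

def thread_stride(axis, nx, ny):
    """Distance in elements between the cells read by two neighbouring
    threads when consecutive thread ids are mapped to the given axis."""
    step = {"i": (1, 0, 0), "j": (0, 1, 0), "k": (0, 0, 1)}
    if axis not in step:
        raise ValueError("axis must be 'i', 'j' or 'k'")
    di, dj, dk = step[axis]
    return flat_index(di, dj, dk, nx, ny) - flat_index(0, 0, 0, nx, ny)

# Hypothetical grid dimensions.
nx, ny = 128, 96

stride_i = thread_stride("i", nx, ny)  # contiguous: neighbours are adjacent
stride_j = thread_stride("j", nx, ny)  # neighbours are nx elements apart
stride_k = thread_stride("k", nx, ny)  # neighbours are nx*ny elements apart

print(stride_i, stride_j, stride_k)
```

Under these assumptions, mapping consecutive threads to `i` gives a stride of one element, which is the pattern GPUs can coalesce into few memory transactions; mapping them to `j` or `k` spreads the same reads across memory.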
