Abstract
The importance of graphics processing units (GPUs) in accelerating HPC applications is evident from the large number of GPU-enabled supercomputing clusters. Many of these HPC applications use the Message Passing Interface (MPI) as their programming model. These MPI applications frequently exchange data that is noncontiguous in GPU memory, and MPI provides derived datatypes (DDTs) to represent such data. Past research on DDTs has mainly focused on optimizing the pack–unpack kernels. Modern host channel adapters (HCAs), however, are capable of gathering data from and scattering data to noncontiguous GPU memory regions. We propose a low-overhead HCA-assisted scheme that improves the performance of GPU-based noncontiguous exchanges without using GPU-based pack–unpack kernels. We show that the proposed scheme provides up to a 2× improvement over the existing pack-based scheme at the benchmark level. Furthermore, we show up to a 17% improvement with the SW4Lite application compared to other MPI libraries, such as MVAPICH2-GDR and OpenMPI+UCX.
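The difference between a pack-based exchange and a gather/scatter-based one can be illustrated with a minimal host-side sketch in Python. This is not the proposed scheme and involves no GPU, HCA, or MPI library; it only models a strided layout whose parameters (`count`, `blocklength`, `stride`) mirror those of `MPI_Type_vector`. The function names `vector_segments`, `pack`, and `unpack` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (byte_offset, byte_length)


def vector_segments(count: int, blocklength: int, stride: int,
                    elem_size: int, base: int = 0) -> List[Segment]:
    """Describe a strided layout as a list of contiguous byte segments.

    A list like this is the kind of description a gather/scatter-capable
    device could consume directly, instead of a packed staging buffer.
    """
    return [(base + i * stride * elem_size, blocklength * elem_size)
            for i in range(count)]


def pack(src: bytes, segments: List[Segment]) -> bytes:
    """Pack-based path: copy every segment into one contiguous buffer."""
    return b"".join(src[off:off + length] for off, length in segments)


def unpack(packed: bytes, segments: List[Segment], dest: bytearray) -> None:
    """Receiver side of the pack-based path: scatter the packed bytes back."""
    pos = 0
    for off, length in segments:
        dest[off:off + length] = packed[pos:pos + length]
        pos += length


# A 4x4 row-major matrix of 1-byte elements; select column 1.
matrix = bytes(range(16))
segments = vector_segments(count=4, blocklength=1, stride=4,
                           elem_size=1, base=1)

packed = pack(matrix, segments)   # extra copy on the sender
dest = bytearray(16)
unpack(packed, segments, dest)    # extra copy on the receiver
```

In the pack-based path, the sender and receiver each perform an additional copy through a contiguous staging buffer. A gather/scatter-based path would instead hand a segment list such as `segments` to the network hardware, which is the general idea behind avoiding the pack–unpack kernels.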