Abstract

Deep learning workloads on modern multi-graphics processing unit (GPU) nodes depend heavily on intranode interconnects, such as NVLink and PCIe, for high-performance communication. In this article, we take on the challenge of designing an interconnect-aware multipath GPU-to-GPU communication mechanism using Unified Communication X (UCX) that utilizes all available bandwidth on both NVLink-based systems and systems that use a mixture of NVLink and PCIe. Our proposed multipath data transfer mechanism pipelines and stripes each message across multiple intrasocket communication channels and memory regions, achieving 1.84× higher bandwidth for Open MPI (Message Passing Interface) on NVLink-based systems and 1.23× on NVLink-and-PCIe systems. We then build on this mechanism to propose a three-stage hierarchical, pipelined MPI_Allreduce design as well as a flat, pipelined two-stage algorithm for two different node topologies. For large messages, our proposed algorithms achieve substantial speedups over other MPI implementations. We also observe significant speedups for the proposed MPI_Allreduce with Horovod + TensorFlow across a variety of deep learning models.
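To illustrate the general idea of striping and pipelining a message across several intranode paths, the following minimal C sketch divides a large buffer into chunks and assigns them round-robin to multiple channels. This is not the paper's UCX implementation: the channel count, the chunk (pipeline) granularity, and the memcpy stand-in for an actual NVLink or PCIe transfer are assumptions made purely for illustration.

```c
/* Illustrative sketch only: striping a message into chunks and issuing them
 * round-robin across multiple channels, so that several intranode paths
 * (e.g., NVLink, PCIe) could be driven concurrently. The channel count,
 * chunk size, and the memcpy stand-in for a real transfer are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_CHANNELS 4            /* assumed number of intranode paths   */
#define CHUNK_SIZE   (1 << 20)    /* assumed 1 MiB pipeline granularity  */

/* Stand-in for posting one chunk on a given channel. Here it is just a
 * memcpy into a per-channel staging buffer so the sketch is self-contained;
 * a real design would post an asynchronous transfer on that path. */
static void channel_send(char *staging[], int ch, const char *src,
                         size_t offset, size_t len) {
    memcpy(staging[ch] + offset, src + offset, len);
}

int main(void) {
    const size_t msg_size = 64UL << 20;          /* 64 MiB example message */
    char *msg = malloc(msg_size);
    char *staging[NUM_CHANNELS];
    for (int c = 0; c < NUM_CHANNELS; c++)
        staging[c] = malloc(msg_size);
    memset(msg, 0xAB, msg_size);

    /* Stripe + pipeline: walk the message in chunk-sized pieces and assign
     * each chunk to the next channel round-robin. With asynchronous channel
     * operations, later chunks can be in flight while earlier ones complete. */
    size_t offset = 0;
    int ch = 0;
    while (offset < msg_size) {
        size_t len = (msg_size - offset < CHUNK_SIZE) ? msg_size - offset
                                                      : CHUNK_SIZE;
        channel_send(staging, ch, msg, offset, len);
        offset += len;
        ch = (ch + 1) % NUM_CHANNELS;
    }

    printf("striped %zu bytes across %d channels\n", msg_size, NUM_CHANNELS);

    for (int c = 0; c < NUM_CHANNELS; c++) free(staging[c]);
    free(msg);
    return 0;
}
```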
