Abstract

Deep learning workloads on modern multi-graphics processing unit (GPU) nodes depend heavily on intranode interconnects, such as NVLink and PCIe, for high-performance communication. In this article, we take on the challenge of designing an interconnect-aware multipath GPU-to-GPU communication mechanism using Unified Communication X (UCX) that utilizes all available bandwidth on both NVLink-based systems and systems that use a mixture of NVLink and PCIe. Our proposed multipath data transfer mechanism pipelines and stripes each message across multiple intrasocket communication channels and memory regions, achieving 1.84× higher bandwidth for Open MPI (Message Passing Interface) on NVLink-based systems and 1.23× on NVLink-and-PCIe systems. We then build on this mechanism to propose a three-stage hierarchical, pipelined MPI_Allreduce design as well as a flat, pipelined two-stage algorithm for two different node topologies. For large messages, our proposed algorithms achieve substantial speedups over other MPI implementations. We also observe significant speedups for the proposed MPI_Allreduce with Horovod + TensorFlow across a variety of deep learning models.
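To illustrate the general idea of striping and pipelining a message across several intranode paths, the following minimal C sketch divides a large buffer into chunks and assigns them round-robin to multiple channels. This is not the paper's UCX implementation: the channel count, the chunk (pipeline) granularity, and the memcpy stand-in for an actual NVLink or PCIe transfer are assumptions made purely for illustration.

```c
/* Illustrative sketch only: striping a message into chunks and issuing them
 * round-robin across multiple channels, so that several intranode paths
 * (e.g., NVLink, PCIe) could be driven concurrently. The channel count,
 * chunk size, and the memcpy stand-in for a real transfer are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_CHANNELS 4            /* assumed number of intranode paths   */
#define CHUNK_SIZE   (1 << 20)    /* assumed 1 MiB pipeline granularity  */

/* Stand-in for posting one chunk on a given channel. Here it is just a
 * memcpy into a per-channel staging buffer so the sketch is self-contained;
 * a real design would post an asynchronous transfer on that path. */
static void channel_send(char *staging[], int ch, const char *src,
                         size_t offset, size_t len) {
    memcpy(staging[ch] + offset, src + offset, len);
}

int main(void) {
    const size_t msg_size = 64UL << 20;          /* 64 MiB example message */
    char *msg = malloc(msg_size);
    char *staging[NUM_CHANNELS];
    for (int c = 0; c < NUM_CHANNELS; c++)
        staging[c] = malloc(msg_size);
    memset(msg, 0xAB, msg_size);

    /* Stripe + pipeline: walk the message in chunk-sized pieces and assign
     * each chunk to the next channel round-robin. With asynchronous channel
     * operations, later chunks can be in flight while earlier ones complete. */
    size_t offset = 0;
    int ch = 0;
    while (offset < msg_size) {
        size_t len = (msg_size - offset < CHUNK_SIZE) ? msg_size - offset
                                                      : CHUNK_SIZE;
        channel_send(staging, ch, msg, offset, len);
        offset += len;
        ch = (ch + 1) % NUM_CHANNELS;
    }

    printf("striped %zu bytes across %d channels\n", msg_size, NUM_CHANNELS);

    for (int c = 0; c < NUM_CHANNELS; c++) free(staging[c]);
    free(msg);
    return 0;
}
```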
