Abstract

Computing engines are becoming more sophisticated and complex in order to address the vast needs of disparate domains. A typical high-performance computational engine is composed of several accelerator units, in most cases GPUs, plus one or more CPU controllers. All of these components are becoming increasingly interconnected to satisfy the bandwidth and latency demands of modern workloads. Due to these constraints, solutions for efficiently interconnecting them or systematically managing their traffic, such as PCIe v3 and NVLink v1 and v2 on the hardware side, and the NVIDIA Collective Communication Library (NCCL) and the AMD ROCm layer on the software side, are becoming more commonplace inside HPC systems and cloud data centers. However, as the number of accelerators increases, workloads (especially machine learning) may fail to fully exploit the computational substrate due to inefficient use of the hardware interconnects. Such scenarios can lead to performance bottlenecks in which high-bandwidth links are left unused by the underlying libraries while under-performing links are overused. This work proposes Workload Optimization Through Inter-GPU Re-routing (WOTIR), a set of enhanced NCCL-based collective primitives that aim to boost bandwidth utilization (through more efficient routing) and reduce communication overhead. WOTIR targets pairs of GPUs with no direct NVLink communication path (which would otherwise fall back to PCIe) and instead re-routes their communication through intermediate GPUs, bridging NVLink segments and avoiding PCIe transfers. This method allows the maximum possible utilization of the NVLink bandwidth between the GPUs without routing through the PCIe bus. Using this method, we observe a reduction of up to 34 percent in execution time for selected machine learning workloads when non-optimal GPU allocations arise.
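The paper body details the actual WOTIR mechanism inside NCCL; as a rough illustration of the re-routing idea only, the sketch below relays a buffer between two GPUs through an intermediate GPU using the standard CUDA peer-to-peer runtime API. The device IDs, the assumed topology (GPUs 0-1 and 1-2 joined by NVLink, GPUs 0-2 reachable only over PCIe), the payload size, and the simple two-hop staging scheme are all illustrative assumptions, not the paper's implementation.

```c
/*
 * Minimal sketch of the re-routing idea; NOT the paper's WOTIR code.
 * Assumed topology: GPU 0 <-NVLink-> GPU 1 <-NVLink-> GPU 2, while
 * GPU 0 and GPU 2 reach each other only over PCIe. Instead of a direct
 * 0 -> 2 copy (PCIe), we stage through GPU 1 so every hop rides NVLink.
 */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(void) {
    const int src = 0, mid = 1, dst = 2;  /* illustrative device IDs */
    const size_t bytes = 64UL << 20;      /* 64 MiB payload (arbitrary) */
    void *buf_src, *buf_mid, *buf_dst;

    /* Verify that each hop supports peer-to-peer access. */
    int p2p_src_mid = 0, p2p_mid_dst = 0;
    CHECK(cudaDeviceCanAccessPeer(&p2p_src_mid, src, mid));
    CHECK(cudaDeviceCanAccessPeer(&p2p_mid_dst, mid, dst));
    if (!p2p_src_mid || !p2p_mid_dst) {
        fprintf(stderr, "no peer path through the intermediate GPU\n");
        return EXIT_FAILURE;
    }

    /* Enable direct P2P along each hop and allocate staging buffers. */
    CHECK(cudaSetDevice(src));
    CHECK(cudaDeviceEnablePeerAccess(mid, 0));
    CHECK(cudaMalloc(&buf_src, bytes));

    CHECK(cudaSetDevice(mid));
    CHECK(cudaDeviceEnablePeerAccess(src, 0));
    CHECK(cudaDeviceEnablePeerAccess(dst, 0));
    CHECK(cudaMalloc(&buf_mid, bytes));

    CHECK(cudaSetDevice(dst));
    CHECK(cudaDeviceEnablePeerAccess(mid, 0));
    CHECK(cudaMalloc(&buf_dst, bytes));

    /* Two NVLink hops replace one direct PCIe transfer. */
    CHECK(cudaMemcpyPeer(buf_mid, mid, buf_src, src, bytes)); /* hop 1 */
    CHECK(cudaMemcpyPeer(buf_dst, dst, buf_mid, mid, bytes)); /* hop 2 */
    CHECK(cudaDeviceSynchronize());

    printf("relayed %zu bytes via GPU %d\n", bytes, mid);
    return 0;
}
```

A production version, as the abstract suggests, would live inside the collective primitives themselves and pipeline chunks across the two hops rather than staging the whole buffer at once.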
