Abstract

The recent advent of advanced fabrics like NVIDIA NVLink is enabling the deployment of dense Graphics Processing Unit (GPU) systems such as DGX-2 and Summit. The Message Passing Interface (MPI) has been the dominant programming model for designing distributed applications on such clusters. The MPI Tools Interface (MPI_T) provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper level to detect performance and scalability issues. However, the lack of low-overhead, scalable monitoring tools has thus far prevented a comprehensive study of the efficiency and utilization of high-performance interconnects such as NVLink on GPU-enabled clusters. In this paper, we address this deficiency by proposing and designing an in-depth, real-time analysis, profiling, and visualization tool for high-performance GPU-enabled clusters with NVLink. The proposed tool builds on top of the OSU InfiniBand Network Analysis and Monitoring Tool (INAM) and provides insights into the efficiency of different communication patterns by examining the utilization of the underlying GPU interconnects. The contributions of the proposed tool are two-fold: 1) it allows domain scientists and system administrators to understand how applications and runtime libraries interact with the underlying high-performance interconnects, and 2) it enables designers of high-performance communication libraries to gain low-level knowledge to optimize existing designs and develop new algorithms that optimally utilize cutting-edge interconnects on GPU clusters. To the best of our knowledge, this is the first tool capable of presenting a unified and holistic view of MPI-level and fabric-level information for emerging NVLink-enabled high-performance GPU clusters.
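
As a minimal sketch of the kind of introspection MPI_T enables, the following C program enumerates the performance variables (pvars) an MPI library exposes; the specific counters, their names, and their semantics are implementation-dependent (e.g., MVAPICH2), and this example illustrates the standard MPI_T interface rather than the tool described in the paper.

```c
/* Minimal sketch: list the MPI_T performance variables (pvars) exported by
 * the MPI library. External tools can read such pvars at runtime to observe
 * MPI-level behavior. Compile with: mpicc -o list_pvars list_pvars.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, num_pvars;

    MPI_Init(&argc, &argv);
    /* MPI_T has its own initialization, independent of MPI_Init */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_pvar_get_num(&num_pvars);
    printf("MPI library exposes %d performance variables\n", num_pvars);

    for (int i = 0; i < num_pvars; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, binding, readonly, continuous, atomic;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        /* Query metadata for pvar i; which counters appear here is
         * entirely up to the MPI implementation. */
        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &datatype, &enumtype, desc, &desc_len,
                            &binding, &readonly, &continuous, &atomic);
        printf("  [%d] %s : %s\n", i, name, desc);
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```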
