Abstract

GPU accelerators have established themselves in state-of-the-art clusters by offering high performance and energy efficiency. In such systems, efficient inter-process GPU communication is of paramount importance to application performance. This paper investigates various algorithms, in conjunction with the latest GPU features, to improve GPU collective operations. First, we propose a GPU Shared Buffer-aware (GSB) algorithm and a Binomial Tree Based (BTB) algorithm for GPU collectives on single-GPU nodes. We then propose a hierarchical framework for clusters with multi-GPU nodes. By studying various combinations of algorithms, we highlight the importance of choosing the right algorithm within each level. The evaluation of our framework on MPI_Allreduce shows promising performance results for large message sizes. To address its shortcoming on small and medium messages, we present the benefit of the Hyper-Q feature and the MPS service in jointly using CUDA IPC and host-staged copy types to perform multiple inter-process communications. However, we argue that efficient designs are still required to further harness this potential. Accordingly, we propose a static and a dynamic algorithm for MPI_Allgather and MPI_Allreduce and present their effectiveness on various message sizes. Our profiling results indicate that the achieved performance is mainly rooted in overlapping different copy types.
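The abstract attributes the performance gains to overlapping the two inter-process copy types (CUDA IPC and host-staged) under Hyper-Q and MPS. As a rough illustration of what issuing both copy types concurrently could look like, here is a minimal sketch of the receiving side; it is not the paper's implementation, and the handle exchange, the shared pinned buffer `staged_host`, and the function name `receive_both` are all assumptions for illustration.

```cuda
// Illustrative sketch only -- not the paper's design. Shows the two
// inter-process copy types the abstract contrasts, issued on separate
// CUDA streams so that, under Hyper-Q/MPS, they may overlap.
// Assumed context: `handle` was received from a peer process out of band
// (e.g. via MPI), and `staged_host` is a pinned shared host buffer the
// peer has already filled with its device data.
#include <cuda_runtime.h>
#include <stdio.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); \
            return 1;                                                   \
        }                                                               \
    } while (0)

int receive_both(cudaIpcMemHandle_t handle, const float *staged_host,
                 float *d_out_ipc, float *d_out_staged, size_t n) {
    cudaStream_t s_ipc, s_staged;
    CHECK(cudaStreamCreate(&s_ipc));
    CHECK(cudaStreamCreate(&s_staged));

    // Copy type 1: CUDA IPC -- map the peer's device buffer into this
    // process and copy device-to-device, bypassing the host entirely.
    void *peer_dev = NULL;
    CHECK(cudaIpcOpenMemHandle(&peer_dev, handle,
                               cudaIpcMemLazyEnablePeerAccess));
    CHECK(cudaMemcpyAsync(d_out_ipc, peer_dev, n * sizeof(float),
                          cudaMemcpyDeviceToDevice, s_ipc));

    // Copy type 2: host-staged -- pull the data the peer staged in
    // shared (pinned) host memory down to this process's device.
    CHECK(cudaMemcpyAsync(d_out_staged, staged_host, n * sizeof(float),
                          cudaMemcpyHostToDevice, s_staged));

    // Both copies are now in flight on independent streams; overlapping
    // them is the effect the abstract credits for the achieved performance.
    CHECK(cudaStreamSynchronize(s_ipc));
    CHECK(cudaStreamSynchronize(s_staged));

    CHECK(cudaIpcCloseMemHandle(peer_dev));
    CHECK(cudaStreamDestroy(s_ipc));
    CHECK(cudaStreamDestroy(s_staged));
    return 0;
}
```

A static algorithm, as the abstract distinguishes, would presumably fix which peers use which copy type ahead of time, while a dynamic one would decide at runtime; either way the overlap relies on the two transfers occupying independent hardware paths.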
