Abstract
Collective communications account for 20-90% of total execution time in many MPI applications. In this paper, we propose strategies for automatically identifying the most time-consuming collective operations that also act as scalability bottlenecks. We then explore the use of MPI-3’s non-blocking collectives for these operations, and we restructure the code so that independent computation adequately overlaps with the non-blocking collective communication. Applying these strategies to different graph and machine learning applications, we obtained performance improvements of up to 33% for large-scale runs on a Cray supercomputer.