Abstract

Non-blocking collectives have recently been standardized by the Message Passing Interface (MPI) Forum. However, intelligent designs offered by MPI communication runtimes are likely to be the key factors that drive their adoption. While hardware-based solutions for non-blocking collective operations have shown promise, they require specialized hardware support and currently have several performance and scalability limitations. Alternatively, researchers have proposed software-based, functional-partitioning solutions for non-blocking collectives that rely on spare cores in each node to progress non-blocking collectives. However, these designs also require additional memory resources and involve expensive copy operations. Such factors limit the overall performance and scalability benefits associated with using non-blocking collectives in MPI. In this paper, we propose a high-performance, shared-memory-backed, user-level approach based on functional partitioning to design MPI-3 non-blocking collectives. Our approach relies on using one "Communication Servlet" (CS) thread per node to seamlessly execute the non-blocking collective operations on behalf of the application processes. Our design also eliminates the need for additional memory resources and expensive copy operations between the application processes and the CS. We demonstrate that our solution can deliver near-perfect computation/communication overlap with large-message, dense collective operations, such as MPI_Ialltoallv, while using just one core per node. We also study the benefits of our approach with a popular parallel 3D-FFT kernel, which has been re-designed to use the MPI_Ialltoallv operation. We observe that our proposed designs can improve the performance of the P3DFFT kernel by up to 27% with 2,048 processes on the TACC Stampede system.

