Abstract
Scientific computing has seen an immense growth in recent years. The Message Passing Interface (MPI) has become the de-facto standard for parallel programming model for distributed memory systems. As the system scale increases, application writers often try to increase the overlap of computation and communication. The MPI-2 standard expanded MPI to include one-sided communication semantics that has the potential for overlapping computation with communication. In this model synchronization between processes needs to be done explicitly to ensure completion before using the data. Fence is one of the mechanisms of providing such synchronization in the one-sided model. In this paper, we study a set of different alternatives for designing the fence synchronization mechanisms. We analyze the various trade-offs of these designs on networks like InfiniBand that provide Remote Direct Memory Access (RDMA) capabilities. We propose a novel design for implementing fence synchronization that uses RDMA write with Immediate mechanism (Fence-Imm-RI) provided by InfiniBand networks. We then characterize the performance of different designs with various one-sided communication pattern microbenchmarks for both latency as well as overlap capability. The new Fence-Imm-RI scheme performs the best in scenarios that require low synchronization overhead as well as good overlap capability (close to 90% overlap for large messages) as opposed to the other designs that can provide either low synchronization overhead or good overlap capability.
Paper version not known (
Free)
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have