Abstract

Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.
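
As a brief, hedged illustration of the interface in question (a generic MPI-3.0 sketch, not code from the paper's library), a process exposes memory through a window and then accesses remote memory directly; the window layout and ring-neighbor target below are illustrative choices:

```c
/* Hedged sketch (not the paper's library code): the MPI-3.0 one-sided
 * interface allocates a window of remotely accessible memory and
 * writes to a neighbor with a direct put. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *buf;   /* window memory, allocated by MPI */
    MPI_Win win;
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1.0;

    /* Passive-target mode: lock all targets once, then communicate. */
    MPI_Win_lock_all(0, win);
    double val = (double)rank;
    int target = (rank + 1) % size;
    MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);   /* wait for remote completion */
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```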

Highlights

  • We provide a detailed performance evaluation and performance models that can be used for algorithm development and to project scalability to future systems

  • We demonstrate how MPI-3.0 can be implemented over remote direct memory access (RDMA) networks to achieve performance similar to UPC and Fortran Coarrays while offering all of MPI’s convenient functionality (e.g., topologies and datatypes)

  • We provide detailed performance models that help in choosing among the multiple synchronization and communication options (a simple sketch follows below)
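
As a hedged illustration of how such models guide choices (the symbols here are generic placeholders, not the paper's fitted parameters), alternatives can often be compared through simple linear latency-bandwidth models:

```latex
% Generic latency-bandwidth model for transferring s bytes:
T(s) \approx \alpha + \beta \, s
% \alpha: startup latency; \beta: time per byte (inverse bandwidth).
% Choosing between two implementation options then reduces to
% comparing their predicted T(s) at the message sizes the
% application actually uses.
```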

Summary

Motivation

Network interfaces evolve rapidly to implement a growing set of features directly in hardware. Accessing remote memory through RDMA instead of explicit message passing saves (1) time, by avoiding message matching and synchronization overheads, (2) energy, by reducing data movement, e.g., avoiding additional copies of eager messages, and (3) space, by removing the need for receiver buffering. Several programming environments allow more or less direct access to RDMA hardware: PGAS languages such as Unified Parallel C (UPC [38]) or Fortran Coarrays [17], and libraries such as Cray SHMEM [3] or MPI-2.2 One Sided [27]. We assume that we only have small bounded buffer space at each process, no remote software agent, and only put, get, and some basic atomic operations for remote memory access. This makes our protocols applicable to all current RDMA networks and is forward-looking towards exascale interconnect architectures. All protocols and schemes discussed in the following can be used on any RDMA-capable network.
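
To make the assumed primitive set concrete, here is a minimal, hedged sketch (not taken from the paper) that uses only an atomic fetch-and-op on window memory, the kind of basic remote atomic from which such bufferless protocols can be built; the ticket-counter use is illustrative:

```c
/* Hedged sketch: a shared counter at rank 0 advanced with an atomic
 * fetch-and-add, using only the primitives assumed above (no remote
 * software agent, no receiver buffering). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long *counter;
    MPI_Win win;
    /* Rank 0 exposes one long in the window; all others expose nothing. */
    MPI_Aint winsize = (rank == 0) ? sizeof(long) : 0;
    MPI_Win_allocate(winsize, sizeof(long), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    if (rank == 0) *counter = 0;
    MPI_Barrier(MPI_COMM_WORLD);   /* ensure the counter is initialized */

    MPI_Win_lock_all(0, win);
    long one = 1, ticket;
    /* Atomic fetch-and-add on the counter residing at rank 0. */
    MPI_Fetch_and_op(&one, &ticket, MPI_LONG, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);
    MPI_Win_unlock_all(win);

    printf("rank %d got ticket %ld\n", rank, ticket);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Remote fetch-and-add patterns like this one can underlie scalable synchronization and dynamic work distribution without any receiver-side involvement.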

Use-case
Scalable window creation
Scalable window synchronization
Communication functions
Blocking calls
Detailed performance modeling and evaluation
Latency and bandwidth
Synchronization schemes
Application evaluation
Distributed hashtable
Dynamic sparse data exchange
MIMD lattice computation
Related work
Findings
Discussion and conclusions