Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures

Dhabaleswar K Panda, Mingzhe Li, Hari Subramoni, Xiaoyi Lu

doi:10.1145/3236367.3236371

Abstract

Intel Knights Landing (KNL) and IBM POWER architectures are becoming widely deployed on modern supercomputing systems due to their powerful components. The MPI Remote Memory Access (RMA) model, which provides one-sided communication semantics, is seen as an attractive approach for developing High-Performance Data Analytics (HPDA) applications, such as graph processing, that have irregular communication characteristics. To take advantage of the large number of hardware threads offered by KNL and POWER, HPDA applications and the MPI RMA runtime need to be redesigned for optimal performance. In this paper, we propose multi-threading and lock-free designs in the MPI runtime as well as in the Graph500 application on KNL and POWER architectures. At the micro-benchmark level, our proposed runtime-level designs reduce the latency of uni-directional MPI_Put and MPI_Get by up to 3X compared to Intel MPI and Spectrum MPI. At the application level, with 1,024 processes on 32 KNL nodes, our proposed design outperforms the Intel MPI library by 32%. With 512 processes on eight POWER nodes, it outperforms the Spectrum MPI library by 19%. To the best of our knowledge, this is the first paper to design and evaluate MPI RMA-based graph processing applications on KNL and POWER architectures.

Full Text