Abstract
High Performance Computing (HPC) applications are highly optimized to maximize the resources allocated to a job, such as compute, memory, and storage. Optimal performance for MPI applications requires the best possible affinity across all allocated resources. Setting process affinity to compute resources is typically well defined, i.e., MPI processes on a compute node have processor affinity set for a one-to-one mapping between MPI processes and physical processing cores, and several well-established methods exist to map MPI processes to a compute node efficiently. With the growing complexity of HPC systems, platforms are designed with complex compute and I/O subsystems: the I/O capacity attached to a node is expanded with PCIe switches, resulting in large numbers of PCIe endpoint devices. With this heterogeneity, application programmers are forced to think harder about affinitizing processes, since performance depends not only on compute placement but also on the NUMA placement of I/O devices. Mapping a process to processor cores and the closest I/O device(s) is not straightforward. While operating systems do a reasonable job of keeping a process physically located near its processor core(s) and memory, they lack the application developer's knowledge of the process workflow and of optimal I/O resource allocation when more than one I/O device is connected to the compute node. In this paper we look at ways to ease these affinity choices by abstracting the device selection algorithm away from the MPI application layer. MPI continues to be the dominant programming model for HPC, and hence our focus in this paper is limited to providing a solution for MPI-based applications; the solution can be extended to other HPC programming models such as Partitioned Global Address Space (PGAS) or hybrid MPI and PGAS applications. We propose to address NUMA effects at the MPI runtime level, independent of MPI applications. Our experiments are conducted on a two-node system where each node is a two-socket Intel® Xeon® server attached to up to four Intel® Omni-Path fabric devices connected over PCIe. The performance benefits of affinitizing MPI processes with the best possible network device are evident from the results, where we observe up to 40% improvement in uni-directional bandwidth, 48% in bi-directional bandwidth, 32% improvement in latency, and up to 40% improvement in message rate.
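To make the idea of runtime-level, NUMA-aware device selection concrete, the sketch below shows one way such a selection could be made with the hwloc topology library: the process queries its current CPU binding and then reports the fabric (OpenFabrics-class) OS devices, such as an Omni-Path hfi1 port, whose attachment point shares a locality domain with that binding. This is only an illustrative sketch assuming hwloc 2.x, not the implementation used in the paper.

```c
/* Illustrative sketch (hwloc 2.x assumed): find fabric devices local to
 * this process's current CPU binding. Not the paper's implementation. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    /* Keep I/O objects (PCI and OS devices) in the discovered topology. */
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_ALL);
    hwloc_topology_load(topo);

    /* Where is this process currently bound? */
    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, cpuset, HWLOC_CPUBIND_PROCESS);

    /* Walk OS devices (e.g., "hfi1_0") and report those whose closest
     * non-I/O ancestor (socket/NUMA domain) overlaps our binding. */
    hwloc_obj_t osdev = NULL;
    while ((osdev = hwloc_get_next_osdev(topo, osdev)) != NULL) {
        if (osdev->attr->osdev.type != HWLOC_OBJ_OSDEV_OPENFABRICS)
            continue; /* consider only fabric-class devices */
        hwloc_obj_t anc = hwloc_get_non_io_ancestor_obj(topo, osdev);
        if (anc && anc->cpuset && hwloc_bitmap_intersects(anc->cpuset, cpuset))
            printf("local fabric device: %s\n", osdev->name);
    }

    hwloc_bitmap_free(cpuset);
    hwloc_topology_destroy(topo);
    return 0;
}
```

An MPI runtime could perform a query of this kind once per rank at initialization and steer each rank's traffic to a device in its own NUMA domain, which is the kind of application-transparent selection the abstract describes.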