Abstract
This paper investigates algorithms for barrier synchronization in MPI applications on HPC clusters of NUMA machines. We consider the case in which all MPI processes that need to be synchronized reside on the same multi-socket NUMA machine. This problem arises, in particular, in hierarchical (topology-aware) barriers. Barrier algorithms for SMP/NUMA systems use shared counters and flags in memory for communication between processes. To minimize barrier latency, it is important to place the shared counters and flags in the memory of the NUMA node that has the minimal total distance to the other NUMA nodes in use. We propose the MinNumaDist algorithm for choosing the root process, which allocates the shared flags and counters in the memory of its NUMA node. The algorithm selects as root the rank with the minimal total distance from its NUMA node to the NUMA nodes of all remaining processes. It reduces barrier synchronization time on asymmetric subsystems of processor cores (where NUMA nodes and processor sockets are assigned different numbers of processes). Our experiments on dual-socket NUMA machines show that MinNumaDist decreases the latency of centralized barrier algorithms (central counter, flat tree, flat tree gather/release) by 10-170% on average.
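The root selection rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `min_numa_dist_root`, the rank-to-node mapping, the distance matrix values, and the lowest-rank tie-break are all assumptions made for illustration. The abstract does not say how NUMA distances are obtained; here they are taken as a given matrix, in the style of the node distances reported by the operating system.

```python
def min_numa_dist_root(rank_to_node, dist):
    """Pick the rank whose NUMA node has the smallest total distance
    to the NUMA nodes of all other ranks.

    rank_to_node[r] is the NUMA node of rank r (hypothetical input).
    dist[a][b] is the distance from node a to node b (hypothetical input).
    Ties are broken in favor of the lowest rank (an assumption).
    """
    best_rank, best_total = None, None
    for r, node_r in enumerate(rank_to_node):
        # Sum the distances from this rank's node to every other rank's node.
        total = sum(
            dist[node_r][node_q]
            for q, node_q in enumerate(rank_to_node)
            if q != r
        )
        if best_total is None or total < best_total:
            best_rank, best_total = r, total
    return best_rank, best_total


# Hypothetical dual-socket machine with two NUMA nodes.
dist = [[10, 21],
        [21, 10]]

# Asymmetric placement: one process on node 0, three on node 1.
rank_to_node = [0, 1, 1, 1]

root, total = min_numa_dist_root(rank_to_node, dist)
print(root, total)
```

In this asymmetric placement, rank 0 has a total distance of 63 while each rank on node 1 has 41, so the sketch picks a rank on the more heavily populated node. With a symmetric placement, all ranks have the same total and the choice of root makes no difference under this metric.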