Abstract

The performance of collective communication operations determines the overall performance of MPI applications. Different algorithms have been developed and implemented for each MPI collective operation, but none proved superior in all situations. Therefore, MPI implementations have to solve the problem of selecting the optimal algorithm for the collective operation depending on the platform, the number of processes involved, the message size(s), etc. The current solution method is purely empirical.Recently, an alternative solution method using analytical performance models of collective algorithms has been proposed and proved both accurate and efficient for one-process-per-CPU configurations. The method derives the analytical performance models of algorithms from their code implementation rather than from high-level mathematical definitions, and estimates the parameters of the models separately for each algorithm. The method is network and topology oblivious and uses the Hockney model for point-to-point communications. In this paper, we extend that selection method to the case of clusters of multi-core processors, where each core of the platform runs a process of the MPI application.We present the proposed approach using Open MPI broadcast algorithms, and experimentally validate it on three different clusters of multi-core processors, Grisou, Gros and MareNostrum4.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call