Abstract

Xeon Phi coprocessors are gaining popularity in the high performance computing community owing to their highly parallel environment and x86 compatibility. The coprocessors, which conform to Intel's Many Integrated Core (MIC) architecture, are also being deployed at large scale because they yield high performance per watt. Each Xeon Phi coprocessor, despite offering 1 Teraflop of performance, is connected to its host system as a PCIe device and hence experiences the accompanying bandwidth and latency degradations. MPI libraries need to be designed in an architecture-aware manner and must leverage the software stacks available on the MIC to minimize the time spent in communication. Along with the optimization of send-receive MPI primitives, collectives, which are widely used by scientific applications, need to be designed at the algorithm level in a way that alleviates architectural bottlenecks. In this work, we propose novel algorithms based on hierarchical communication designs and pipelining techniques to improve the performance of the MPI_Gather collective. At the micro-benchmark level, for a 256-process MPI job with the root of the gather on the MIC, the proposed algorithms reduce the average MPI_Gather latency by up to 83% and 87% compared to the existing MVAPICH2 and Intel MPI implementations of the operation, respectively.
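
The general idea behind a hierarchical gather can be illustrated with a small simulation. The sketch below is not the algorithm proposed in this work; it is a generic two-level gather written in plain Python without any MPI library, and the names `hierarchical_gather`, `send_bufs`, and `group_size` are hypothetical. It assumes that ranks are split into contiguous groups (for example, one group per host or coprocessor), that the lowest rank in each group acts as its leader, and that the root is rank 0.

```python
def hierarchical_gather(send_bufs, group_size):
    """Simulate a two-level gather to rank 0 (illustrative sketch only).

    send_bufs[r] is the contribution of rank r. Ranks are split into
    contiguous groups of at most group_size; the lowest rank in each
    group is its leader. Returns the gathered list in rank order and
    the number of messages that cross group boundaries.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    n = len(send_bufs)

    # Stage 1: each leader collects the contributions of its own group.
    leader_bufs = {}
    for leader in range(0, n, group_size):
        members = range(leader, min(leader + group_size, n))
        leader_bufs[leader] = [send_bufs[r] for r in members]

    # Stage 2: the root (rank 0, itself a leader) collects one
    # aggregated message from every other leader, in rank order.
    gathered = []
    inter_group_messages = 0
    for leader in sorted(leader_bufs):
        if leader != 0:
            inter_group_messages += 1
        gathered.extend(leader_bufs[leader])
    return gathered, inter_group_messages


if __name__ == "__main__":
    bufs = [f"r{i}" for i in range(8)]
    data, crossings = hierarchical_gather(bufs, group_size=4)
    print(data, crossings)
```

With 8 ranks in groups of 4, only one aggregated message crosses a group boundary, whereas a flat gather to rank 0 would send four separate messages across it. Pipelining, which the abstract also mentions, is not modeled here.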
