Optimized MPI Gather Collective for Many Integrated Core (MIC) InfiniBand Clusters

A Venkatesh, Dhabaleswar K Panda, K Kandalla

doi:10.1109/xsw.2013.12

Abstract

Xeon Phi coprocessors are gaining popularity in the high performance computing community owing to their highly parallel environment and x86 compatibility. The coprocessors, which conform to Intel's Many Integrated Core (MIC) architecture, are also being deployed at large scale because they yield high performance per watt. Each Xeon Phi coprocessor, despite offering 1 Teraflop of performance, is attached to the system as a PCIe device and hence experiences the accompanying bandwidth and latency degradations. MPI libraries need to be designed in an architecture-aware manner and must leverage the software stacks available on the MIC to minimize the time spent in communication. Along with the optimization of send-receive MPI primitives, collectives, which are widely used by scientific applications, need to be designed at the algorithm level in a way that alleviates architectural bottlenecks. In this work, we propose novel algorithms based on hierarchical communication designs and pipelining techniques to improve the performance of the MPI_Gather collective. At the micro-benchmark level, for a 256-process MPI job with the root of the gather on the MIC, the proposed algorithms reduce the average MPI_Gather latency by up to 83% and 87% compared to the existing MVAPICH2 and Intel MPI implementations of the operation, respectively.
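To make the idea of a hierarchical gather concrete, the following is a minimal sketch that simulates a generic two-level scheme in plain Python, without any MPI library. It is not the algorithm proposed in this paper: the function name `hierarchical_gather`, the fixed `ranks_per_node` grouping, the choice of the lowest rank on each node as leader, and rank 0 as root are all assumptions made for illustration, and pipelining is not modeled.

```python
from typing import Dict, List, Tuple


def hierarchical_gather(blocks: List[bytes], ranks_per_node: int) -> Tuple[List[bytes], int]:
    """Simulate a two-level gather (hypothetical illustration, not the paper's design).

    blocks[i] is the contribution of rank i; rank 0 is assumed to be the root.
    Returns the gathered blocks in rank order and the number of messages
    that reach the root from other nodes.
    """
    if ranks_per_node <= 0:
        raise ValueError("ranks_per_node must be positive")

    # Stage 1: each node leader (assumed: lowest rank on the node)
    # collects the blocks of the ranks that share its node.
    leader_buffers: Dict[int, List[bytes]] = {}
    for rank, block in enumerate(blocks):
        leader = (rank // ranks_per_node) * ranks_per_node
        leader_buffers.setdefault(leader, []).append(block)

    # Stage 2: the root receives one aggregated message per remote leader.
    gathered: List[bytes] = []
    inter_node_messages = 0
    for leader in sorted(leader_buffers):
        if leader != 0:  # the root's own node needs no inter-node message
            inter_node_messages += 1
        gathered.extend(leader_buffers[leader])
    return gathered, inter_node_messages
```

With 8 ranks and 4 ranks per node, a flat gather would deliver 7 separate messages to the root, 4 of them from the other node, whereas this sketch sends a single aggregated message across nodes. The appeal of such schemes in general is that they reduce the number of transfers over slow links; here those would be the PCIe and InfiniBand paths mentioned in the abstract.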

Full Text