Optimization of MPI collective operations on the IBM Blue Gene/Q supercomputer

Sameer Kumar, Dong Chen, Daniel Faraj, Philip Heidelberger, Amith Mamidala

doi:10.1177/1094342014552086

Abstract

The Blue Gene/Q (BG/Q) machine is the latest in the line of IBM massively parallel supercomputers, designed to scale to 262,144 nodes and 16 million threads. Each BG/Q node has 68 hardware threads. Hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, enable applications to achieve high throughput on BG/Q. In this paper, we present scalable algorithms to optimize MPI collective operations by taking advantage of the various features of the BG/Q torus and collective networks. We achieve an 8-byte double-sum MPI_Allreduce latency of 10.25 μs on 1,572,864 MPI ranks. We accelerate the summing of network packets with local buffers by using the Quad Processing SIMD unit in the BG/Q cores and by executing the sums on multiple communication threads supported by the optimized communication libraries. The net gain is a peak throughput of 6.3 GB/s for double-sum allreduce. We also achieve over 90% of network peak for MPI_Alltoall with 65,536 MPI ranks.

Full Text