Abstract

With processor speeds no longer doubling every 18-24 months owing to the exponential increase in power consumption and heat dissipation, modern high-end computing systems tend to rely less on the performance of single processing units and instead achieve high performance by using the parallelism of a massive number of low-frequency/low-power processing cores. Using such low-frequency cores, however, puts a premium on the end-host pre- and post-communication processing required within communication stacks, such as the Message Passing Interface (MPI) implementation. Similarly, small amounts of serialization within the communication stack that were acceptable on small/medium systems can be brutal on massively parallel systems. Thus, in this paper, we study the different non-data-communication overheads within the MPI implementation on the IBM Blue Gene/P system. Specifically, we analyze various aspects of MPI, including the overhead of the MPI stack itself, the overhead of allocating and queueing requests, queue searches within the MPI stack, multi-request operations, and various others. Our experiments, which scale up to 131,072 cores of the largest Blue Gene/P system in the world (80% of the total system size), reveal several insights into overheads in the MPI stack that were not previously considered significant but can have a substantial impact on such massive systems.
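To make the kinds of overheads mentioned above concrete, the sketch below is an illustrative MPI program (not taken from the paper) in which each nonblocking receive allocates and queues an internal request object, and each MPI_Waitany call must scan the whole request array; the request count NUM_REQS and the rank roles are assumptions chosen purely for illustration.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int NUM_REQS = 1024;   /* assumed count, for illustration only */
    MPI_Request reqs[1024];
    int bufs[1024];

    if (rank == 0) {
        /* Each MPI_Irecv allocates and enqueues an internal request --
         * one of the non-data-communication costs the paper measures. */
        for (int i = 0; i < NUM_REQS; i++)
            MPI_Irecv(&bufs[i], 1, MPI_INT, MPI_ANY_SOURCE, i,
                      MPI_COMM_WORLD, &reqs[i]);

        /* Each MPI_Waitany call searches the full request array, so the
         * total completion cost grows roughly quadratically in NUM_REQS. */
        for (int i = 0; i < NUM_REQS; i++) {
            int idx;
            MPI_Waitany(NUM_REQS, reqs, &idx, MPI_STATUS_IGNORE);
        }
    } else if (rank == 1) {
        for (int i = 0; i < NUM_REQS; i++) {
            int val = i;
            MPI_Send(&val, 1, MPI_INT, 0, i, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```

None of this bookkeeping moves application data, yet on low-frequency cores such request management and queue searching can become a measurable fraction of communication time, which is the effect the paper quantifies.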
