Abstract

Many state-of-the-art parallel algorithms, which are widely used in scientific applications executed on high-end computing systems, were designed in the twentieth century with relatively small-scale parallelism in mind. Indeed, while in the 1990s a system with a few hundred cores was considered a powerful supercomputer, modern top supercomputers have millions of cores. In this paper, we present a hierarchical approach to the optimization of message-passing parallel algorithms for execution on large-scale distributed-memory systems. The idea is to reduce the communication cost by introducing hierarchy, and hence more parallelism, into the communication scheme. We apply this approach to SUMMA, the state-of-the-art parallel algorithm for matrix-matrix multiplication, and demonstrate both theoretically and experimentally that the resulting Hierarchical SUMMA significantly reduces the communication cost and improves the overall performance on large-scale platforms.
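The abstract does not detail the hierarchical scheme itself, so the following is a minimal sketch of the baseline SUMMA communication pattern it builds on, assuming (these are illustrative assumptions, not taken from the paper) a square process grid, one panel exchanged per grid step, and a hypothetical local block size nb. The point to notice is that each step performs two collective broadcasts over the full grid rows and columns; the hierarchical variant would replace each flat MPI_Bcast with a two-level broadcast over process groups.

/* summa.c - hedged sketch of the baseline SUMMA communication pattern.
 * Assumptions (not from the paper): P is a perfect square, each process
 * owns an nb x nb block of A, B, and C, and one panel is exchanged per
 * grid step. Build: mpicc summa.c -o summa -lm ; run: mpirun -np 4 ./summa
 */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    int q   = (int)(sqrt((double)P) + 0.5); /* q x q process grid */
    int row = rank / q, col = rank % q;
    const int nb = 128;                     /* example local block size */

    /* One communicator per grid row and per grid column: SUMMA's two
     * broadcast directions. Rank within row_comm equals the grid column. */
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    double *A  = malloc((size_t)nb * nb * sizeof *A);
    double *B  = malloc((size_t)nb * nb * sizeof *B);
    double *C  = calloc((size_t)nb * nb, sizeof *C);
    double *Ap = malloc((size_t)nb * nb * sizeof *Ap); /* received A panel */
    double *Bp = malloc((size_t)nb * nb * sizeof *Bp); /* received B panel */
    for (int i = 0; i < nb * nb; i++) { A[i] = 1.0; B[i] = 1.0; }

    /* SUMMA main loop: at step t, grid column t broadcasts its A panel
     * along rows and grid row t broadcasts its B panel along columns;
     * every process then accumulates one rank-nb update into C. */
    for (int t = 0; t < q; t++) {
        if (col == t) memcpy(Ap, A, (size_t)nb * nb * sizeof *Ap);
        MPI_Bcast(Ap, nb * nb, MPI_DOUBLE, t, row_comm);
        if (row == t) memcpy(Bp, B, (size_t)nb * nb * sizeof *Bp);
        MPI_Bcast(Bp, nb * nb, MPI_DOUBLE, t, col_comm);

        /* Naive local GEMM; a real code would call BLAS dgemm here. */
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++)
                for (int j = 0; j < nb; j++)
                    C[i * nb + j] += Ap[i * nb + k] * Bp[k * nb + j];
    }

    /* With all-ones inputs, every entry of C should equal q * nb. */
    if (rank == 0)
        printf("C[0] = %.1f (expected %d)\n", C[0], q * nb);

    free(A); free(B); free(C); free(Ap); free(Bp);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}

As the grid grows, these per-step broadcasts over entire grid rows and columns are exactly the term that comes to dominate the runtime, which is the cost the paper's hierarchical scheme targets by grouping processes and broadcasting in two levels.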
