Abstract

We present one matrix multiplication algorithm for two-dimensional arrays of processing nodes, and one algorithm for three-dimensional nodal arrays. One-dimensional nodal arrays are treated as a degenerate case. The algorithms are designed to fully utilize the communications bandwidth of high-degree networks in which the one-, two-, or three-dimensional arrays may be embedded. For binary n-cubes, our algorithms offer a speedup of the communication over previous algorithms for square matrices and square two-dimensional arrays by a factor of n/2. Configuring the N = 2^n processing nodes as a three-dimensional array may reduce the communication complexity by a factor of N^{1/6} compared to a two-dimensional nodal array. The three-dimensional algorithm requires temporary storage proportional to the length of the nodal array axis aligned with the axis shared between the multiplier and the multiplicand. The optimal two-dimensional nodal array shape with respect to communication has a ratio between the numbers of node rows and columns equal to the ratio between the numbers of matrix rows and columns of the product matrix, with the product matrix accumulated in place. The optimal three-dimensional nodal array shape has a ratio between the lengths of the machine axes approximately equal to the ratio between the lengths of the three axes in matrix multiplication. For product matrices of extreme shape, one-dimensional nodal array shapes are optimal when N/n ≲ 2P/R for P > R, or N/n ≲ 2R/P for R ≥ P, where P is the number of rows and R the number of columns of the product matrix. All our algorithms use standard communication functions.
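
The shape rules stated above (a two-dimensional node grid whose row:column ratio matches the row:column ratio of the product matrix, and a three-dimensional grid whose axis ratios approximate the P:Q:R axes of C(P×R) = A(P×Q)·B(Q×R)) can be illustrated with a small sketch. The snippet below is not the paper's construction; the function names grid_shape_2d and grid_shape_3d and the brute-force divisor search are assumptions made for illustration only, picking the factorization of N whose axis ratios best match the matrix axis ratios.

```python
import math

def grid_shape_2d(P, R, N):
    """Illustrative only: choose a rows x cols node grid with rows * cols == N
    whose aspect ratio approximates P:R, the row/column ratio of the product matrix."""
    best = None
    for rows in range(1, N + 1):
        if N % rows:
            continue
        cols = N // rows
        # Compare ratios on a log scale so over- and under-shoots count equally.
        err = abs(math.log(rows / cols) - math.log(P / R))
        if best is None or err < best[0]:
            best = (err, rows, cols)
    return best[1], best[2]

def grid_shape_3d(P, Q, R, N):
    """Illustrative only: choose n1 x n2 x n3 with n1 * n2 * n3 == N and
    n1:n2:n3 roughly P:Q:R, the three axes of C(PxR) = A(PxQ) * B(QxR)."""
    best = None
    for n1 in range(1, N + 1):
        if N % n1:
            continue
        for n2 in range(1, N // n1 + 1):
            if (N // n1) % n2:
                continue
            n3 = N // (n1 * n2)
            err = (abs(math.log(n1 / n2) - math.log(P / Q)) +
                   abs(math.log(n2 / n3) - math.log(Q / R)))
            if best is None or err < best[0]:
                best = (err, n1, n2, n3)
    return best[1], best[2], best[3]

if __name__ == "__main__":
    # N = 2^6 = 64 nodes, cubic problem: the 3D rule gives a 4 x 4 x 4 grid.
    print(grid_shape_3d(1024, 1024, 1024, 64))   # -> (4, 4, 4)
    # Tall product matrix (P:R = 16:1): the 2D rule gives a 32 x 2 grid.
    print(grid_shape_2d(4096, 256, 64))          # -> (32, 2)
```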
