A Tight I/O Lower Bound for Matrix Multiplication

Tyler M. Smith, Robert A. van de Geijn

doi:10.1145/3362694

Abstract

A tight lower bound for required I/O when computing a matrix-matrix multiplication on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of phases needed to perform C:=AB, where each phase is a series of operations involving S reads and writes to and from fast memory, and S is the size of fast memory. A lower bound on the number of phases was then determined by obtaining an upper bound on the number of scalar multiplications performed per phase. This paper follows the same high-level approach, but improves the lower bound by considering C:=AB+C instead of C:=AB, and obtains the maximum number of scalar fused multiply-adds (FMAs) per phase instead of scalar additions. Key to obtaining the new result is the decoupling of the per-phase I/O from the size of fast memory. The new lower bound is $2mnk/\sqrt{S} - 2S$. The constant for the leading term is an improvement by a factor of $4\sqrt{2}$. A theoretical algorithm that attains the lower bound is given, and the sense in which the state-of-the-art Goto's algorithm also meets the lower bound is discussed.
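To make the bound concrete, the following minimal Python sketch evaluates the expression stated in the abstract. The function names are hypothetical, and two things are assumed rather than stated in the abstract: that m, n, k are the dimensions of an m-by-k matrix A and a k-by-n matrix B, and that both S and the resulting I/O count are measured in matrix elements. The leading term attributed to prior work is only inferred from the stated improvement factor of 4√2.

```python
import math


def io_lower_bound(m: int, n: int, k: int, S: int) -> float:
    """Evaluate 2mnk/sqrt(S) - 2S, the bound quoted in the abstract.

    Assumption: S and the result are counted in matrix elements.
    """
    return 2 * m * n * k / math.sqrt(S) - 2 * S


def leading_term(m: int, n: int, k: int, S: int) -> float:
    """Leading term of the new bound: 2mnk/sqrt(S)."""
    return 2 * m * n * k / math.sqrt(S)


def implied_prior_leading_term(m: int, n: int, k: int, S: int) -> float:
    """Leading term implied for prior work.

    Inferred by dividing the new leading term by the stated
    improvement factor 4*sqrt(2); not quoted from the abstract.
    """
    return leading_term(m, n, k, S) / (4 * math.sqrt(2))


if __name__ == "__main__":
    # Illustrative sizes only: 1024^3 problem, fast memory of 4096 elements.
    print(io_lower_bound(1024, 1024, 1024, 4096))  # 33546240.0
```

For large problems the 2S correction is small next to the leading term, so the bound is effectively 2mnk/√S.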

Journal: ACM Transactions on Mathematical Software
Publication Date: May 7, 2020
Citations: 4

Similar Papers

Minimizing Communication in Numerical Linear Algebra
Grey Ballard ... Oded Schwartz
SIAM Journal on Matrix Analysis and Applications | VOL. 32
01 Jul 2011

3D-LIN: A configurable low-latency interconnect for multi-core clusters with 3D stacked L1 memory
Giulia Beanato ... Yusuf Leblebici
01 Oct 2012

3D-LIN: A configurable low-latency interconnect for multi-core clusters with 3D stacked L1 memory
Giulia Beanato ... Igor Loi
01 Oct 2012

Scalable Data Management on Hybrid Memory System for Deep Neural Network Applications
Wei Rang ... Donglin Yang
15 Dec 2021
