Processor-Aware Cache-Oblivious Algorithms

Yuan Tang, Weiguo Gao

doi:10.1145/3472456.3472506

Abstract

Frigo et al. proposed an ideal cache model and a recursive technique for designing sequential cache-efficient algorithms in a cache-oblivious fashion. Ballard et al. pointed out that extending this technique to an arbitrary architecture is a fundamental open problem, and raised a second open question: how to parallelize Strassen’s algorithm exactly and efficiently on an arbitrary number of processors. We propose a novel way of partitioning a cache-oblivious algorithm to achieve perfect strong scaling on an arbitrary number of processors, even a prime number, within a certain range in a shared-memory setting. Our approach is Processor-Aware but Cache-Oblivious (PACO). We apply the approach to classic rectangular matrix-matrix multiplication (MM) and to Strassen’s algorithm, and we provide an almost exact solution to the open problem of parallelizing Strassen. Though this paper focuses mainly on a homogeneous shared-memory setting, we also discuss extensions of our approach to distributed-memory and heterogeneous settings. Our approach may provide a new perspective on extending the recursive cache-oblivious technique to an arbitrary architecture. Preliminary experiments show that our MM algorithm significantly outperforms Intel MKL’s dgemm. A full version of this paper is hosted on arXiv.
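
For readers unfamiliar with the recursive cache-oblivious technique mentioned above, the following is a minimal sequential sketch of rectangular MM in that style: it repeatedly halves the largest of the three dimensions and never refers to a cache size. It is not the PACO partitioning proposed in this paper. The names `co_matmul` and `matmul` and the `base` cutoff are assumptions introduced for illustration only; a strictly cache-oblivious formulation recurses down to single elements, and the cutoff merely limits recursion overhead.

```python
def co_matmul(A, B, C, i0, i1, j0, j1, k0, k1, base=16):
    """Accumulate C[i0:i1][j0:j1] += A[i0:i1][k0:k1] * B[k0:k1][j0:j1]."""
    m, n, k = i1 - i0, j1 - j0, k1 - k0
    if max(m, n, k) <= base:
        # Base case: plain triple loop on a small block.
        for i in range(i0, i1):
            for p in range(k0, k1):
                a = A[i][p]
                for j in range(j0, j1):
                    C[i][j] += a * B[p][j]
        return
    # Recursive case: split the largest dimension in half.
    if m >= n and m >= k:
        mid = i0 + m // 2
        co_matmul(A, B, C, i0, mid, j0, j1, k0, k1, base)
        co_matmul(A, B, C, mid, i1, j0, j1, k0, k1, base)
    elif n >= k:
        mid = j0 + n // 2
        co_matmul(A, B, C, i0, i1, j0, mid, k0, k1, base)
        co_matmul(A, B, C, i0, i1, mid, j1, k0, k1, base)
    else:
        # Splitting k: both halves update the same block of C, one after the other.
        mid = k0 + k // 2
        co_matmul(A, B, C, i0, i1, j0, j1, k0, mid, base)
        co_matmul(A, B, C, i0, i1, j0, j1, mid, k1, base)


def matmul(A, B, base=16):
    """Return A * B for list-of-lists matrices (A is m-by-k, B is k-by-n)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    co_matmul(A, B, C, 0, m, 0, n, 0, k, base)
    return C
```

In this sketch the two halves of a split along the row or column dimension write to disjoint parts of `C`, while the two halves of a split along the inner dimension both accumulate into the same block and so depend on each other.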

Full Text