Balanced Partitioning of Several Cache-Oblivious Algorithms

Yuan Tang

doi:10.1145/3350755.3400214

Abstract

Frigo et al. proposed an ideal cache model and a recursive technique for designing sequential cache-efficient algorithms in a cache-oblivious fashion. Ballard et al. pointed out that extending this technique to an arbitrary architecture is a fundamental open problem. Ballard et al. also raised the open question of how to parallelize Strassen's algorithm exactly and efficiently on an arbitrary number of processors. We propose a novel way of partitioning a cache-oblivious algorithm to achieve perfect strong scaling on an arbitrary number of processors, even a prime number, within a certain range in a shared-memory setting. Our approach is Processor-Aware but Cache-Oblivious (PACO). We demonstrate our approach on several important cache-oblivious algorithms, including longest common subsequence (LCS), classic rectangular matrix multiplication (MM), Strassen's algorithm, and comparison-based sorting. With our approach, we provide an almost exact solution to the open problem of parallelizing Strassen's algorithm. We discuss how to extend our approach to a distributed-memory architecture, or even a heterogeneous computing system. Hence, our work may provide a new perspective on the fundamental open problem of extending the recursive cache-oblivious technique to an arbitrary architecture. In preliminary experiments, our algorithms significantly outperform state-of-the-art Processor-Oblivious (PO) and Processor-Aware (PA) counterparts.

Full Text