Abstract

A user-level scheduling scheme, combined with a specific data alignment for matrix multiplication on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures, is presented. Addressing the data-locality problem that can occur in such systems potentially alleviates memory bottlenecks. We show experimentally that a thread scheduler that is agnostic to data placement on a ccNUMA machine (e.g., OpenMP 3.1) produces a high number of cache misses. To overcome this memory-contention problem, we show how proper memory mapping and scheduling tune an existing matrix multiplication implementation, reducing the number of cache misses by 67% and, consequently, the computation time by up to 22%. Finally, we present the relationship between cache misses and the achieved speedup as a novel figure of merit for the quality of the method.
