Abstract

A user-level scheduling scheme, combined with a specific data alignment for matrix multiplication on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures, is presented. Addressing the data-locality problem that can occur in such systems potentially alleviates memory bottlenecks. We show experimentally that a thread scheduler that is agnostic to data placement on a ccNUMA machine (e.g., OpenMP 3.1) produces a high number of cache misses. To overcome this memory-contention problem, we show how proper memory mapping and scheduling tune an existing matrix multiplication implementation, reducing the number of cache misses by 67% and, consequently, the computation time by up to 22%. Finally, we present the relationship between cache misses and the achieved speedup as a novel figure of merit for the quality of the method.
