Abstract

Many multicore multiprocessors have a non-uniform memory architecture (NUMA), and for good performance, data and computations must be partitioned so that (ideally) all threads execute on the processor that holds their data. However, many multithreaded applications make heavy use of shared data structures that are accessed by all threads of the application, and automatic data placement and thread scheduling for such applications remains difficult. We present a template library for shared data structures that allows a programmer to express both the data layout (how the data space is partitioned) and the thread mapping and scheduling (when and where a thread is executed). The template library supports programmers in partitioning computations and data to reduce the fraction of costly remote memory accesses on NUMA multicore multiprocessors. Initial experience with ferret, a program with irregular memory access patterns from the PARSEC benchmark suite, shows that this approach can reduce the share of remote accesses from 42% to 10% and yields a performance improvement of 3% without overwhelming the programmer.
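To make the idea concrete, the sketch below shows a hypothetical C++ template, PartitionedArray, that expresses a data layout by placing each slice of a shared array on a specific NUMA node (via libnuma's numa_alloc_onnode) and expresses thread mapping by pinning each worker thread to the node that holds its slice (numa_run_on_node). This is a minimal illustration under those assumptions, not the paper's actual library API; all class and variable names are invented, and it requires Linux with libnuma (compile with -lnuma).

    // Illustrative sketch only; not the paper's template library.
    #include <numa.h>
    #include <thread>
    #include <vector>
    #include <cstddef>
    #include <cstdio>

    // Hypothetical partitioned array: slice n lives on NUMA node n.
    template <typename T>
    class PartitionedArray {
    public:
        PartitionedArray(std::size_t per_node, int nodes)
            : per_node_(per_node), nodes_(nodes) {
            for (int n = 0; n < nodes_; ++n)
                slices_.push_back(static_cast<T*>(
                    numa_alloc_onnode(per_node_ * sizeof(T), n)));
        }
        ~PartitionedArray() {
            for (int n = 0; n < nodes_; ++n)
                numa_free(slices_[n], per_node_ * sizeof(T));
        }
        T* slice(int node) { return slices_[node]; }
        std::size_t per_node() const { return per_node_; }
        int nodes() const { return nodes_; }
    private:
        std::size_t per_node_;
        int nodes_;
        std::vector<T*> slices_;
    };

    int main() {
        if (numa_available() < 0) return 1;          // no NUMA support on this machine
        const int nodes = numa_max_node() + 1;
        PartitionedArray<double> data(1 << 20, nodes);

        // Thread mapping: one worker per node, pinned to the node holding its slice,
        // so all of its accesses below are node-local.
        std::vector<std::thread> workers;
        for (int n = 0; n < nodes; ++n) {
            workers.emplace_back([&, n] {
                numa_run_on_node(n);
                double* s = data.slice(n);
                for (std::size_t i = 0; i < data.per_node(); ++i)
                    s[i] = static_cast<double>(i);
            });
        }
        for (auto& t : workers) t.join();
        std::printf("initialized %d node-local slices\n", nodes);
        return 0;
    }

The point of such an interface is that the partitioning (per_node, nodes) and the thread placement (which node each worker runs on) are stated explicitly by the programmer rather than left to the operating system's default page placement and scheduler.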
