Abstract

Reducing the cost of memory accesses, in terms of both performance and energy consumption, is a major challenge in shared-memory architectures. Modern systems have deep and complex memory hierarchies with multiple cache levels and memory controllers, leading to Non-Uniform Memory Access (NUMA) behavior. In such systems, there are two ways to improve memory affinity: first, by mapping threads that share data to cores with a shared cache, cache usage and communication performance are optimized; second, by mapping memory pages to the memory controllers that perform the most accesses to them, while avoiding overloaded controllers, the average cost of accesses is reduced. We call these two techniques thread mapping and data mapping, respectively. Thread and data mapping should be performed in an integrated way to achieve a compounding effect that results in higher improvements overall. Previous work in this area either requires expensive tracing operations to perform the mapping or requires changes to the hardware or to the parallel application. In this paper, we propose kMAF, a mechanism that performs integrated thread and data mapping in the kernel. kMAF uses the page faults of parallel applications to characterize their memory access behavior and performs the mapping during the execution of the application based on the detected behavior. In an evaluation with a large set of parallel benchmarks executing on three NUMA architectures, kMAF achieved substantial performance and energy efficiency improvements, close to an Oracle-based mechanism and significantly higher than previous proposals.
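As an illustration of the data-mapping half of this idea, the following user-space C sketch migrates each page to the NUMA node that accesses it most often, using the Linux move_pages(2) system call. This is not kMAF's implementation (which works inside the kernel and detects accesses through page faults); the page_stats structure, its access counts, and the way they would be gathered are assumptions made only for the example. The thread-mapping half would analogously pin threads that share pages to cores with a common cache, for instance via sched_setaffinity(2).

/*
 * Illustrative user-space sketch of the data-mapping step: migrate each
 * page to the NUMA node that accesses it most. kMAF itself runs inside
 * the kernel; here the per-page access counts (struct page_stats) are
 * assumed to have been collected already, and only move_pages(2) is a
 * real interface. Build on Linux with libnuma:
 *     gcc datamap.c -o datamap -lnuma
 */
#include <numa.h>       /* numa_available, numa_max_node */
#include <numaif.h>     /* move_pages, MPOL_MF_MOVE */
#include <stdio.h>      /* perror */
#include <stdlib.h>     /* posix_memalign, free */
#include <unistd.h>     /* sysconf */

#define MAX_NODES 8

/* Hypothetical per-page statistics, e.g. filled in from page-fault
 * samples: counts[n] = accesses observed from NUMA node n. */
struct page_stats {
    void *addr;                 /* page-aligned address   */
    long counts[MAX_NODES];     /* per-node access counts */
};

/* NUMA node that issued the most observed accesses to this page. */
static int preferred_node(const struct page_stats *ps, int num_nodes)
{
    int best = 0;
    for (int n = 1; n < num_nodes; n++)
        if (ps->counts[n] > ps->counts[best])
            best = n;
    return best;
}

/* Data mapping: move every sampled page to its preferred node. */
static void map_pages(struct page_stats *stats, int npages, int num_nodes)
{
    for (int i = 0; i < npages; i++) {
        void *pages[1] = { stats[i].addr };
        int nodes[1]   = { preferred_node(&stats[i], num_nodes) };
        int status[1];

        /* pid 0 = calling process; MPOL_MF_MOVE moves private pages. */
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
            perror("move_pages");
    }
}

int main(void)
{
    if (numa_available() < 0)
        return 1;                       /* no NUMA support */

    int num_nodes = numa_max_node() + 1;
    if (num_nodes > MAX_NODES)
        num_nodes = MAX_NODES;

    /* Fake statistics for a single page, just to exercise map_pages(). */
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf;
    if (posix_memalign(&buf, page_size, page_size) != 0)
        return 1;
    ((char *)buf)[0] = 1;               /* fault the page in first */

    struct page_stats stats = { .addr = buf };
    stats.counts[num_nodes - 1] = 100;  /* pretend the last node uses it most */

    map_pages(&stats, 1, num_nodes);
    free(buf);
    return 0;
}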
