Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

Yoongu Kim,Michael Papamichael,Onur Mutlu,Mor Harchol-Balter

doi:10.1109/micro.2010.51

Abstract

In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput, 2) we introduce a ``niceness'' metric that captures a thread's propensity to interfere with other threads, 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such ``light'' threads only use a small fraction of the total available memory bandwidth. On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved. We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6% compared to ATLAS (previous work providing the best system throughput) and 7.6%/4.6% compared to PAR-BS (previous work providing the best fairness).

Highlights

High latency of off-chip memory accesses has long been a critical bottleneck in thread performance
Compared to Parallelism-aware batch scheduling (PAR-BS), the most fair previous algorithm, Thread Cluster Memory scheduling (TCM) provides significantly better system throughput (7.6% higher weighted speedup) and better fairness (4.6% lower maximum slowdown)
PAR-BS suffers from relatively low system throughput since memory requests from memory-intensive threads can block those from memory-non-intensive threads

Summary

Introduction

High latency of off-chip memory accesses has long been a critical bottleneck in thread performance. This has been further exacerbated in chip-multiprocessors where memory is shared among concurrently executing threads; when a thread accesses memory, it contends with other threads and, as a result, can be slowed down compared to when it has the memory entirely to itself. TCM defines a thread’s memory access behavior using three components as identified by previous work: memory intensity [5], bank-level parallelism [14], and rowbuffer locality [19]. Memory intensity is the frequency at which a thread misses in the last-level cache and generates memory requests. It is measured in the unit of (cache) misses per thousand instructions or MPKI. It is the existence of multiple memory banks and their particular internal organization that give rise to banklevel parallelism and row-buffer locality, respectively

Methods

Results

Conclusion