Abstract

As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making better use of the total available cache capacity, but they also induce higher overall L1 miss latencies because of the longer average distance between two nodes and the potential congestion at certain nodes. One of the main causes of the long L1 miss latencies is accesses to the home nodes of the directory. However, we have observed that there is a high probability that the target data of an L1 miss resides in the L1 cache of a neighbor node; in such cases, these long-distance accesses to the home nodes can potentially be avoided. We organize the multi-core into clusters of \(2\times 2\) nodes and, in order to leverage the aforementioned property, we introduce the Cluster Cache Monitor (CCM). The CCM is a hardware structure in charge of detecting whether an L1 miss can be served by one of the cluster L1 caches, and two cluster-related states are added to the coherence protocol in order to avoid long-distance accesses to home nodes upon hits in the cluster L1 caches. We evaluate this approach on a 64-node multi-core using the SPLASH-2 and PARSEC benchmarks, and we find that the CCM can reduce the execution time by 15 % and the energy by 14 %, while saving 28 % of the directory storage area compared to a standard multi-core with a shared L2. We also show that the CCM outperforms recent mechanisms such as ASR, DCC and RNUCA.
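
To make the cluster organization concrete, the short C sketch below shows one way the 64 nodes of an 8×8 mesh could be grouped into the \(2\times 2\) clusters mentioned above. The row-major node numbering and the helper names are illustrative assumptions, not details taken from the paper.

#include <stdio.h>

/* Hypothetical sketch: grouping the 64 nodes of an 8x8 mesh into 2x2
 * clusters. Node numbering (row-major) and function names are assumed
 * for illustration only. */

#define MESH_DIM    8   /* 8x8 = 64 nodes */
#define CLUSTER_DIM 2   /* 2x2 = 4 nodes per cluster */

/* Cluster index of a node, counting clusters row-major over the mesh. */
static int cluster_of(int node)
{
    int x = node % MESH_DIM;
    int y = node / MESH_DIM;
    return (y / CLUSTER_DIM) * (MESH_DIM / CLUSTER_DIM) + (x / CLUSTER_DIM);
}

/* Fill peers[] with the four node IDs belonging to the same cluster. */
static void cluster_peers(int node, int peers[CLUSTER_DIM * CLUSTER_DIM])
{
    int x0 = (node % MESH_DIM) / CLUSTER_DIM * CLUSTER_DIM;
    int y0 = (node / MESH_DIM) / CLUSTER_DIM * CLUSTER_DIM;
    int i = 0;
    for (int dy = 0; dy < CLUSTER_DIM; dy++)
        for (int dx = 0; dx < CLUSTER_DIM; dx++)
            peers[i++] = (y0 + dy) * MESH_DIM + (x0 + dx);
}

int main(void)
{
    int peers[4];
    cluster_peers(27, peers);
    printf("node 27 is in cluster %d, peers:", cluster_of(27));
    for (int i = 0; i < 4; i++)
        printf(" %d", peers[i]);
    printf("\n");
    return 0;
}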

Highlights

  • Each node of a multi-core usually contains a core, a private L1 cache, and L2 storage [4,7,28]; see Fig. 1

  • Since the Cluster Cache Monitor (CCM) must process the requests of the four network interfaces, and since this simple Cluster Tag Array (CTA) implementation can process only one request at a time, we buffer incoming requests in order to cope with L1 miss requests from two or more network interfaces issued in the same cycle; see the bottom of Fig. 6a and the sketch after this list

  • The CCM requires less overall energy than Adaptive Selective Replication (ASR), Distributed Cooperative Caching (DCC) and Reactive NUCA (RNUCA) for two reasons: (1) the CCM is small compared to the respective additional structures of ASR and DCC, so the L1 + L2 static energy is lower, and (2) the overall execution time with the CCM is lower than with all three other mechanisms, thanks in large part to the lower L1 miss latency
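
As a rough illustration of the request buffering mentioned in the second highlight, the C sketch below models a small FIFO in front of a single-ported CTA: the four network interfaces of a \(2\times 2\) cluster may present L1 miss requests in the same cycle, but the CTA drains only one lookup per cycle. Buffer depth, structure names and fields are assumptions for illustration, not the paper's design.

#include <stdbool.h>
#include <stdio.h>

#define NUM_IF   4   /* network interfaces per 2x2 cluster */
#define BUF_SIZE 8   /* assumed FIFO depth */

typedef struct { int src_if; unsigned long addr; } Request;

typedef struct {
    Request slots[BUF_SIZE];
    int head, tail, count;
} RequestBuffer;

/* Enqueue a request coming from a network interface. */
static bool buf_push(RequestBuffer *b, Request r)
{
    if (b->count == BUF_SIZE) return false;   /* back-pressure the interface */
    b->slots[b->tail] = r;
    b->tail = (b->tail + 1) % BUF_SIZE;
    b->count++;
    return true;
}

/* Dequeue the oldest pending request, if any. */
static bool buf_pop(RequestBuffer *b, Request *r)
{
    if (b->count == 0) return false;
    *r = b->slots[b->head];
    b->head = (b->head + 1) % BUF_SIZE;
    b->count--;
    return true;
}

int main(void)
{
    RequestBuffer buf = {0};

    /* Two interfaces issue L1 miss requests in the same cycle... */
    buf_push(&buf, (Request){ .src_if = 0, .addr = 0x1000 });
    buf_push(&buf, (Request){ .src_if = 3, .addr = 0x2040 });

    /* ...and the single-ported CTA serves one lookup per cycle. */
    Request r;
    int cycle = 0;
    while (buf_pop(&buf, &r))
        printf("cycle %d: CTA lookup for IF%d, addr 0x%lx\n",
               cycle++, r.src_if, r.addr);
    return 0;
}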

Summary

Introduction and Motivation

Each node of a multi-core usually contains a core, a private L1 cache, and L2 storage [4,7,28]; see Fig. 1. Upon an L1 miss, the request is sent to the home node of the directory, which can be several hops away in the mesh. This long access latency to the home node is all the more wasteful, both time-wise and energy-wise, if the requested data itself is located in a nearby node, as illustrated by the owner node in the figure. We observe that this case is frequent: upon an L1 miss in one node, there is a high probability that the requested data is located in the L1 cache of a nearby node, a form of node-level spatial locality. 59 % of L1 misses can be serviced by neighbor cluster nodes, and up to 89 % for radiosity. This behavior is inherent to the pattern of access to shared data among threads of parallel programs.
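
The C sketch below illustrates the resulting decision on an L1 miss under a CCM-like scheme: probe the cluster first and fall back to the distant home node only on a cluster miss. The probe function is a stand-in placeholder, and all names and the toy hit condition are assumptions for illustration, not the paper's implementation.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool hit;    /* does some L1 in the cluster hold the block?     */
    int  owner;  /* node ID of that L1, valid only when hit == true */
} CtaResult;

/* Placeholder for the real Cluster Tag Array probe; here every other
 * 64-byte block is pretended to be held by cluster node 2. */
static CtaResult cta_probe(unsigned long addr)
{
    CtaResult r = { .hit = ((addr >> 6) & 1) == 0, .owner = 2 };
    return r;
}

/* Return the node that will service the miss: a cluster neighbor on a
 * CTA hit, otherwise the (possibly distant) home node of the directory. */
static int service_l1_miss(unsigned long addr, int home_node)
{
    CtaResult r = cta_probe(addr);
    if (r.hit)
        return r.owner;       /* short intra-cluster hop */
    return home_node;         /* usual long-distance request */
}

int main(void)
{
    printf("miss on 0x1000 serviced by node %d\n", service_l1_miss(0x1000, 45));
    printf("miss on 0x1040 serviced by node %d\n", service_l1_miss(0x1040, 45));
    return 0;
}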

Related Work
Replication
Migration
Leveraging Data Proximity
Hierarchical Coherence
Overview of the CCM Operations
Hardware Structure of CCM
MRUTB and CRB
Modifications of Network Interface
CCM Coherence Protocol
Experimental Methodology
Energy
Performance
L1 Miss Latency Analysis
Network Traffic
Hardware Cost
Cluster Size
CTA Conflicts
Findings
Conclusions