This paper presents a study of macro data load, a novel mechanism for increasing the amount of loaded data reuse within a processor. A macro data load brings into the processor the maximum-width data the cache port allows. In a 64-bit processor, for example, a byte load will bring a full 64-bit word from the cache and save it in an internal hardware structure, while using only the specified byte of that word for itself. The saved data can be opportunistically reused by later loads internally, reducing the number of relatively expensive cache accesses. We present a comprehensive availability study using a generalized memory data reuse table (MDRT) to quantify the memory data reuse opportunities available in a set of benchmark programs drawn from the SPEC2k and MiBench suites, and to demonstrate the efficacy of the proposed scheme. The macro data load mechanism is shown to open up significantly more loaded data reuse opportunities than previous schemes, which have no support for spatial locality. We observe 15.1 percent (SPEC2k integer), 20.9 percent (SPEC2k floating-point), and 45.8 percent (MiBench) more load-to-load forwarding instances when a 256-entry MDRT is used. We also describe a modified load-store queue design as a possible implementation of the proposed concept. Our quantitative study using a realistic processor model shows that 21.3 percent, 14.8 percent, and 23.6 percent of L1 cache accesses in the SPEC2k integer, floating-point, and MiBench programs can be eliminated, resulting in a related energy reduction of 11.4 percent, 9.0 percent, and 14.3 percent on average, respectively.
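To make the mechanism concrete, the following is a minimal behavioral sketch in C of how a macro data load might be modeled. It is not the paper's load-store-queue implementation: the fully associative lookup, FIFO replacement, and the names macro_load and cache_read64 are all illustrative assumptions. The sketch only captures the core idea, namely fetching the full 64-bit word on every load and forwarding later loads from a 256-entry MDRT-like table instead of accessing the cache.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MDRT_ENTRIES 256   /* matches the 256-entry table studied in the paper */

/* One MDRT entry: an 8-byte-aligned address and the full word loaded there. */
typedef struct {
    bool     valid;
    uint64_t aligned_addr;   /* addr & ~7ULL */
    uint64_t word;           /* full 64-bit data captured by the macro load */
} mdrt_entry_t;

static mdrt_entry_t mdrt[MDRT_ENTRIES];
static unsigned     next_victim;       /* simple FIFO replacement (an assumption) */

/* Stand-in for an L1 data-cache access; counts the accesses we could not avoid. */
static unsigned long cache_accesses;
static uint64_t cache_read64(uint64_t aligned_addr) {
    cache_accesses++;
    return aligned_addr * 0x9E3779B97F4A7C15ULL;   /* dummy memory contents */
}

/* Macro data load: always fetch the full 64-bit word containing the requested
 * bytes, reuse it from the MDRT when possible, and return only the `size`-byte
 * field the instruction actually asked for. Loads crossing an 8-byte boundary
 * are not handled in this sketch. */
uint64_t macro_load(uint64_t addr, unsigned size) {
    uint64_t aligned = addr & ~7ULL;
    uint64_t word = 0;
    bool hit = false;

    for (unsigned i = 0; i < MDRT_ENTRIES; i++) {
        if (mdrt[i].valid && mdrt[i].aligned_addr == aligned) {
            word = mdrt[i].word;       /* load-to-load forwarding: no cache access */
            hit = true;
            break;
        }
    }
    if (!hit) {
        word = cache_read64(aligned);  /* one maximum-width cache access */
        mdrt[next_victim] = (mdrt_entry_t){ true, aligned, word };
        next_victim = (next_victim + 1) % MDRT_ENTRIES;
    }

    /* Extract the requested bytes (little-endian layout assumed). */
    unsigned shift = (unsigned)(addr & 7ULL) * 8;
    uint64_t mask  = (size == 8) ? ~0ULL : ((1ULL << (size * 8)) - 1);
    return (word >> shift) & mask;
}

int main(void) {
    macro_load(0x1000, 1);   /* byte load: misses, fetches full word 0x1000-0x1007 */
    macro_load(0x1004, 4);   /* later load to the same 8-byte block: hits the MDRT */
    printf("cache accesses: %lu\n", cache_accesses);   /* prints 1, not 2 */
    return 0;
}
```

The second load in main illustrates the spatial-locality reuse that a conventional load-to-load forwarding scheme, which records only the bytes a load actually consumed, would miss.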