Average Memory Latency Research Articles

The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to myriad factors, such as cache thrashing and extensive multithreading. Such high L1 miss rates in turn place high demands on the shared L2 bandwidth. Extensive congestion in the L2 access path therefore results in high memory access latencies. In memory-intensive applications, these latencies get exposed due to a lack of active compute threads to mask such high latencies. In this article, we aim to reduce the pressure on the shared L2 bandwidth, thereby reducing the memory access latencies that lie in the critical path. We identify significant replication of data among private L1 caches, presenting an opportunity to reuse data among L1s. We further show how this reuse can be exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on L2. In the proposed architecture, we connect the L1 caches with a lightweight ring network to facilitate intercore communication of shared data. We show that this technique reduces traffic to the L2 cache by an average of 29%, freeing up the bandwidth for other accesses. We also show that the CCN reduces the average memory latency by 24%, thereby reducing core stall cycles by 26% on average. This translates into an overall performance improvement of 14.7% on average (and up to 49%) for applications that exhibit reuse across L1 caches. In doing so, the CCN incurs a nominal area and energy overhead of 1.3% and 2.5%, respectively. Notably, the performance improvement with our proposed CCN compares favorably to the performance improvement achieved by simply doubling the number of L2 banks by up to 34%.
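The bandwidth argument in the abstract above can be illustrated with a toy model. The sketch below is not the paper's simulator: it assumes unbounded L1 caches with no evictions, probes peers one hop at a time around a unidirectional ring, and only counts how many requests reach L2. The function name `count_l2_requests` and the trace format are hypothetical, chosen for illustration.

```python
def count_l2_requests(accesses, num_cores, cooperative):
    """Count requests that reach the shared L2.

    accesses: list of (core_id, line_address) tuples, in program order.
    cooperative: if True, an L1 miss first probes peer L1s over the ring.
    Assumption: L1s are unbounded sets with no evictions or coherence traffic.
    """
    l1 = [set() for _ in range(num_cores)]
    l2_requests = 0
    for core, addr in accesses:
        if addr in l1[core]:
            continue  # local L1 hit, nothing leaves the core
        served_by_peer = False
        if cooperative:
            # Walk the ring hop by hop, starting from the next core.
            for hop in range(1, num_cores):
                peer = (core + hop) % num_cores
                if addr in l1[peer]:
                    served_by_peer = True
                    break
        if not served_by_peer:
            l2_requests += 1  # no peer holds the line, so fetch it from L2
        l1[core].add(addr)  # fill the local L1 in either case
    return l2_requests


# Three cores read the same line 0xA; core 0 also reads 0xB and re-reads 0xA.
trace = [(0, 0xA), (1, 0xA), (2, 0xA), (0, 0xB), (0, 0xA)]
baseline = count_l2_requests(trace, num_cores=3, cooperative=False)
with_ccn = count_l2_requests(trace, num_cores=3, cooperative=True)
print(baseline, with_ccn)
```

In this trace the replicated line 0xA is fetched from L2 once instead of three times when peers are probed, which is the kind of cross-L1 reuse the abstract describes. A real design would also have to account for ring latency, limited L1 capacity, and probe filtering, all of which this sketch ignores.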

Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describe how to effectively capture and exploit it in order to perform load value prediction. Temporal and spatial locality are attributes of storage locations, and describe the future likelihood of references to those locations or their close neighbors. In a similar vein, value locality describes the likelihood of the recurrence of a previously-seen value within a storage location. Modern processors already exploit value locality in a very restricted sense through the use of control speculation (i.e. branch prediction), which seeks to predict the future value of a single condition bit based on previously-seen values. Our work extends this to predict entire 32- and 64-bit register values based on previously-seen values. We find that, just as condition bits are fairly predictable on a per-static-branch basis, full register values being loaded from memory are frequently predictable as well. Furthermore, we show that simple microarchitectural enhancements to two modern microprocessor implementations (based on the PowerPC 620 and Alpha 21164) that enable load value prediction can effectively exploit value locality to collapse true dependencies, reduce average memory latency and bandwidth requirements, and provide measurable performance gains.
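Value locality, as defined in the abstract above, can be illustrated with a minimal last-value predictor. The sketch below is an assumption-laden toy, not the microarchitecture evaluated in the paper: it keeps one entry per static load (keyed by a hypothetical program-counter value), remembers the last loaded value, and only predicts once a small saturating confidence counter reaches a threshold. The class name, the threshold, and the counter width are all illustrative choices.

```python
class LastValuePredictor:
    """Toy last-value load predictor with a saturating confidence counter."""

    def __init__(self, threshold=2, max_conf=3):
        self.table = {}  # pc -> [last_value, confidence]
        self.threshold = threshold
        self.max_conf = max_conf

    def predict(self, pc):
        """Return the predicted value, or None if confidence is too low."""
        entry = self.table.get(pc)
        if entry is not None and entry[1] >= self.threshold:
            return entry[0]
        return None

    def update(self, pc, actual):
        """Train the entry for this load with the value actually loaded."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0]
        elif entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.max_conf)  # value repeated
        else:
            entry[0] = actual  # value changed: replace it and reset confidence
            entry[1] = 0


def run_trace(trace, predictor):
    """Replay (pc, loaded_value) pairs; return (correct, wrong) predictions."""
    correct = wrong = 0
    for pc, value in trace:
        guess = predictor.predict(pc)
        if guess is not None:
            if guess == value:
                correct += 1
            else:
                wrong += 1
        predictor.update(pc, value)
    return correct, wrong


# One static load returns 7 six times, then 9 once.
predictor = LastValuePredictor()
print(run_trace([(0x40, 7)] * 6 + [(0x40, 9)], predictor))
```

The confidence counter is there so that a load whose value changes often is simply not predicted, since a wrong prediction in a real pipeline would need recovery while a withheld prediction costs nothing extra.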

Articles published on Average Memory Latency

Rainbow: A composable coherence protocol for multi‐chip servers

ShaVe-ICE

Cooperative Caching for GPUs

Shared Memory Multicore MicroBlaze System with SMP Linux Support

A fine‐grained thread‐aware management policy for shared caches

A Power-Aware Multi-Level Cache Organization Effective for Multi-Core Embedded Systems

Asymmetric Cache Coherency

A third-generation SPARC V9 64-b microprocessor

The limits and effectiveness of data prefetching on scalable multiprocessors

Value locality and load value prediction
