Hardware Prefetching Research Articles

Because of stringent power constraints, aggressive latency-hiding approaches, such as prefetching, are absent in state-of-the-art embedded processors. There are two main reasons that make prefetching power inefficient. First, compiler-inserted prefetch instructions increase code size and, therefore, could increase I-cache power. Second, inaccurate prefetching (especially hardware prefetching) leads to high D-cache power consumption because of useless accesses. In this work, we show that it is possible to support power-efficient prefetching through bit-differential offset assignment. We target the prefetching of relocatable stack variables with a high degree of precision. By assigning the offsets of stack variables in such a way that most consecutive addresses differ by 1 bit, we can prefetch them with compact prefetch instructions to save I-cache power. The compiler first generates an access graph of consecutive memory references and then attempts a layout of the memory locations in the smallest hypercube. Each dimension of the hypercube represents a 1-bit differential addressing. The embedding is carried out in as compact a hypercube as possible in order to save memory space. Each load/store instruction carries a hint regarding prefetching the next memory reference by encoding its differential address with respect to the current one. To reduce D-cache power cost, we further attempt to assign offsets so that most of the consecutive accesses map to the same cache line. Our prefetching is done using a one-entry line buffer [Wilson et al. 1996]. Consequently, many look-ups in the D-cache reduce to incremental ones. This results in D-cache activity reduction and power savings. Our prefetcher requires both compiler and hardware support. In this paper, we provide an implementation on a processor model close to ARM, with small modifications to the ISA. We tackle issues such as out-of-order commit, predication, and speculation through simple modifications to the processor pipeline on noncritical paths. Our goal in this work is to boost performance while maintaining or lowering power consumption. Our results show a 12% speedup and a slight power reduction. The runtime virtual space loss for stack and static data is about 11.8%.
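The offset-assignment idea in the abstract above can be illustrated with a small sketch. It is not the paper's compiler algorithm: it assumes a simple greedy placement, treats offsets as slot indices rather than byte addresses, and ignores the cache-line objective. All function names (`access_graph`, `assign_offsets`, `prefetch_hint`, `hint_coverage`) are hypothetical and introduced only for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of bit positions in which two offsets differ."""
    return bin(a ^ b).count("1")


def access_graph(trace):
    """Weight each unordered pair of variables by how often they are accessed back to back."""
    edges = Counter()
    for prev, cur in zip(trace, trace[1:]):
        if prev != cur:
            edges[tuple(sorted((prev, cur)))] += 1
    return edges


def assign_offsets(trace):
    """Greedily place variables on the corners of the smallest hypercube that fits them.

    Assumption: a greedy heuristic stands in for the paper's embedding step.
    """
    variables = list(dict.fromkeys(trace))            # first-use order
    dims = max(1, (len(variables) - 1).bit_length())  # hypercube dimensions
    edges = access_graph(trace)
    offsets, free = {}, list(range(1 << dims))
    for var in variables:
        def score(slot):
            # Total edge weight to already-placed neighbours exactly one bit away.
            total = 0
            for (u, v), weight in edges.items():
                other = v if u == var else u if v == var else None
                if other in offsets and hamming(offsets[other], slot) == 1:
                    total += weight
            return total
        best = max(free, key=score)  # ties resolve to the lowest free slot
        offsets[var] = best
        free.remove(best)
    return offsets, dims


def prefetch_hint(offsets, cur, nxt):
    """Bit index to flip to reach the next offset, or None if no 1-bit hint exists."""
    diff = offsets[cur] ^ offsets[nxt]
    return diff.bit_length() - 1 if hamming(offsets[cur], offsets[nxt]) == 1 else None


def hint_coverage(trace, offsets):
    """Fraction of consecutive accesses to distinct variables reachable by a 1-bit hint."""
    pairs = [(a, b) for a, b in zip(trace, trace[1:]) if a != b]
    hinted = sum(prefetch_hint(offsets, a, b) is not None for a, b in pairs)
    return hinted / len(pairs) if pairs else 1.0


trace = ["a", "b", "c", "d", "a", "b", "c", "d"]
offsets, dims = assign_offsets(trace)
print(offsets, dims, hint_coverage(trace, offsets))
```

For a cyclic access pattern over four variables, the greedy placement ends up following a Gray-code order, so every transition can be described by a single bit index. A hint of that size is what would let a load/store instruction name its successor compactly.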

Pointer-chasing applications tend to traverse composite data structures consisting of multiple independent pointer chains. While the traversal of any single pointer chain leads to the serialization of memory operations, the traversal of independent pointer chains provides a source of memory parallelism. This article investigates exploiting such interchain memory parallelism for the purpose of memory latency tolerance, using a technique called multi-chain prefetching. Previous works [Roth et al. 1998; Roth and Sohi 1999] have proposed prefetching simple pointer-based structures in a multi-chain fashion. However, our work enables multi-chain prefetching for arbitrary data structures composed of lists, trees, and arrays.

This article makes five contributions in the context of multi-chain prefetching. First, we introduce a framework for compactly describing linked data structure (LDS) traversals, providing the data layout and traversal code work information necessary for prefetching. Second, we present an off-line scheduling algorithm for computing a prefetch schedule from the LDS descriptors that overlaps serialized cache misses across separate pointer-chain traversals. Our analysis focuses on static traversals. We also propose using speculation to identify independent pointer chains in dynamic traversals. Third, we propose a hardware prefetch engine that traverses pointer-based data structures and overlaps multiple pointer chains according to the computed prefetch schedule. Fourth, we present a compiler that extracts LDS descriptors via static analysis of the application source code, thus automating multi-chain prefetching. Finally, we conduct an experimental evaluation of compiler-instrumented multi-chain prefetching and compare it against jump pointer prefetching [Luk and Mowry 1996], prefetch arrays [Karlsson et al. 2000], and predictor-directed stream buffers (PSB) [Sherwood et al. 2000].

Our results show compiler-instrumented multi-chain prefetching improves execution time by 40% across six pointer-chasing kernels from the Olden benchmark suite [Rogers et al. 1995], and by 3% across four SPECint2000 benchmarks. Compared to jump pointer prefetching and prefetch arrays, multi-chain prefetching achieves 34% and 11% higher performance for the selected Olden and SPECint2000 benchmarks, respectively. Compared to PSB, multi-chain prefetching achieves 27% higher performance for the selected Olden benchmarks, but PSB outperforms multi-chain prefetching by 0.2% for the selected SPECint2000 benchmarks. An ideal PSB with an infinite Markov predictor achieves comparable performance to multi-chain prefetching, coming within 6% across all benchmarks. Finally, speculation can enable multi-chain prefetching for some dynamic traversal codes, but our technique loses its effectiveness when the pointer-chain traversal order is highly dynamic.
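The source of memory parallelism described in this abstract can be made concrete with a toy latency model. The sketch below is not the article's scheduling algorithm: it assumes every node visit misses in the cache, every miss costs the same fixed latency, and the memory system can keep at most `max_outstanding` misses in flight. The function names and parameters are hypothetical.

```python
def serialized_latency(chain_lengths, miss_latency):
    """Miss latency when the chains are walked one after another, one miss at a time."""
    return sum(chain_lengths) * miss_latency


def overlapped_latency(chain_lengths, miss_latency, max_outstanding):
    """Miss latency when independent chains advance in lock step.

    Each round, up to `max_outstanding` chains issue the miss for their next
    node; misses within a round overlap, so a round costs one miss latency.
    Assumption: the longest remaining chains are served first.
    """
    remaining = [n for n in chain_lengths if n > 0]
    rounds = 0
    while remaining:
        remaining.sort(reverse=True)
        for i in range(min(max_outstanding, len(remaining))):
            remaining[i] -= 1
        remaining = [n for n in remaining if n > 0]
        rounds += 1
    return rounds * miss_latency


chains = [4, 4, 4]  # three independent pointer chains of four nodes each
print(serialized_latency(chains, 100))
print(overlapped_latency(chains, 100, max_outstanding=3))
```

Within a single chain the misses stay serialized, because each node's address comes from the previous node. The saving in this model comes only from advancing several independent chains at once, which is why it disappears when `max_outstanding` is 1.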

Articles published on Hardware Prefetching

High-Performance Data Cache Memory Architecture

Comparing memory systems for chip multiprocessors

Power-efficient prefetching for embedded processors

Cache-conscious coallocation of hot data streams

Two-phase prediction of L1 data cache misses

Power-efficient prefetching via bit-differential offset assignment on embedded processors

A general framework for prefetch scheduling in linked data structures and its application to multi-chain prefetching

Dual Cache Architecture for Low Cost and High Performance

An intelligent cache system with hardware prefetching for high performance

Guided region prefetching

Improving Data Prefetching Efficacy in Multimedia Applications

Timekeeping in the memory system

Designing a modern memory hierarchy with hardware prefetching

Optimal loop scheduling for hiding memory latency based on two-level partitioning and prefetching

Dynamic access ordering for streamed computations

Tolerating late memory traps in ILP processors

An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors

CPU cache prefetching: Timing evaluation of hardware implementations

Performance evaluation and cost analysis of cache protocol extensions for shared-memory multiprocessors

An evaluation of memory consistency models for shared-memory systems with ILP processors
