Abstract

Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead. In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead. Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.

Highlights

  • Satisfying loads as early as possible is critical for performance

  • To detect register reuse and perform load-to-load forwarding through the physical register file, we introduce a register translation table, the Address Tag-Register Tag (AT-RT) table (a minimal sketch follows this list)

  • These results demonstrate that Address Tag-Register Tag (AT-RT) achieves the full performance of pipeline prefetching with even better energy savings than register reuse
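
To make the AT-RT mechanism concrete, below is a minimal C++ sketch of such a register translation table: a direct-mapped tag array mapping a load's line address to the physical register that holds its data. The entry count, tag granularity, and all identifiers here are illustrative assumptions rather than the paper's configuration, and invalidation is shown as an explicit call where the paper instead reuses the existing memory-order-violation detection hardware.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    struct ATRTEntry {
        bool          valid   = false;
        std::uint64_t addrTag = 0;  // load address tag (line granularity)
        std::uint16_t physReg = 0;  // physical register holding the data
    };

    class ATRTTable {
        static constexpr std::size_t kEntries = 64;  // illustrative size
        std::array<ATRTEntry, kEntries> table{};

        static std::size_t slot(std::uint64_t lineAddr) {
            return lineAddr % kEntries;
        }

    public:
        // On a load: if an earlier load to the same line is still live in
        // the PRF, return its physical register so the data can be
        // forwarded without an L1 access.
        std::optional<std::uint16_t> lookup(std::uint64_t lineAddr) const {
            const ATRTEntry& e = table[slot(lineAddr)];
            if (e.valid && e.addrTag == lineAddr) return e.physReg;
            return std::nullopt;
        }

        // When a load or pipeline prefetch writes its result into physical
        // register preg, record the tag so later loads can reuse the value.
        void insert(std::uint64_t lineAddr, std::uint16_t preg) {
            table[slot(lineAddr)] = {true, lineAddr, preg};
        }

        // A store to a tagged line (or reallocation of the register) must
        // invalidate the entry; this sketch models that as an explicit call.
        void invalidate(std::uint64_t lineAddr) {
            ATRTEntry& e = table[slot(lineAddr)];
            if (e.valid && e.addrTag == lineAddr) e.valid = false;
        }
    };

Keeping an entry valid after the producing instruction retires is what extends temporal reuse beyond the instruction's lifetime, as the highlights note.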


INTRODUCTION

Validating each pipeline prefetch is necessary to account for local and remote stores, and it results in a doubling of L1 cache accesses for all pipeline prefetches. This problem can be addressed by installing prefetched data in intermediate storage between the CPU and the L1 [15, 27]. While previous pipeline prefetching techniques either increase pressure on the L1 [5, 39] or require extra data storage [15, 27], our solution avoids both problems and further reduces L1 accesses by forwarding the data from the PRF when reuse is detected. This provides spatial reuse without the data storage and data movement overheads of an L0. By using the PRF for data storage, we accomplish this with only the overhead of a small tag array (0.5% of the CPU core area).
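
As a minimal illustration of the coalescing idea above (a sketch under assumptions, not the paper's implementation), the C++ fragment below merges loads that target the same cache line onto a single in-flight prefetch; the line size, the inflight map, and issueOrCoalesce are hypothetical names introduced here.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    constexpr std::uint64_t kLineBytes = 64;  // assumed cache-line size

    inline std::uint64_t lineOf(std::uint64_t addr) {
        return addr / kLineBytes;
    }

    // In-flight prefetches keyed by line address; each value lists the
    // loads waiting to be satisfied by that single L1 access.
    std::unordered_map<std::uint64_t, std::vector<int>> inflight;

    // Returns true if the load coalesced onto an in-flight prefetch
    // (spatial reuse, no extra L1 access); false if a new prefetch must
    // be issued.
    bool issueOrCoalesce(int loadId, std::uint64_t addr) {
        auto [it, fresh] = inflight.try_emplace(lineOf(addr));
        it->second.push_back(loadId);
        return !fresh;  // fresh means this load created the entry
    }

When the prefetch completes, every load recorded for that line can be satisfied from the same physical register, which is where the spatial reuse comes from.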

Pipeline Prefetching and Value-prediction
Register Sharing
Detecting Memory Ordering Violations
MOTIVATION
Forwarding Validation
Interaction with the Memory Bypass Predictor
SRAM Cache Layout
Coalescing Loads Efficiently
Exposing more Locality than the LQ Window
Simulation and Modeling
Memory Access Reduction
CONCLUSION