Abstract

Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead. In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead. Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.

Highlights

  • Satisfying loads as early as possible is critical for performance

  • To detect register reuse and perform load-to-load forwarding through the physical register file, we introduce a register translation table, the Address Tag-Register Tag (AT-RT) table (a minimal sketch follows this list)

  • These results demonstrate that Address Tag-Register Tag (AT-RT) achieves the full performance of pipeline prefetching with even better energy savings than register reuse
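
To make the AT-RT mechanism concrete, below is a minimal C++ sketch of such a register translation table: a direct-mapped tag array mapping a load's line address to the physical register that holds its data. The entry count, tag granularity, and all identifiers here are illustrative assumptions rather than the paper's configuration, and invalidation is shown as an explicit call where the paper instead reuses the existing memory-order-violation detection hardware.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    struct ATRTEntry {
        bool          valid   = false;
        std::uint64_t addrTag = 0;  // load address tag (line granularity)
        std::uint16_t physReg = 0;  // physical register holding the data
    };

    class ATRTTable {
        static constexpr std::size_t kEntries = 64;  // illustrative size
        std::array<ATRTEntry, kEntries> table{};

        static std::size_t slot(std::uint64_t lineAddr) {
            return lineAddr % kEntries;
        }

    public:
        // On a load: if an earlier load to the same line is still live in
        // the PRF, return its physical register so the data can be
        // forwarded without an L1 access.
        std::optional<std::uint16_t> lookup(std::uint64_t lineAddr) const {
            const ATRTEntry& e = table[slot(lineAddr)];
            if (e.valid && e.addrTag == lineAddr) return e.physReg;
            return std::nullopt;
        }

        // When a load or pipeline prefetch writes its result into physical
        // register preg, record the tag so later loads can reuse the value.
        void insert(std::uint64_t lineAddr, std::uint16_t preg) {
            table[slot(lineAddr)] = {true, lineAddr, preg};
        }

        // A store to a tagged line (or reallocation of the register) must
        // invalidate the entry; this sketch models that as an explicit call.
        void invalidate(std::uint64_t lineAddr) {
            ATRTEntry& e = table[slot(lineAddr)];
            if (e.valid && e.addrTag == lineAddr) e.valid = false;
        }
    };

Keeping an entry valid after the producing instruction retires is what extends temporal reuse beyond the instruction's lifetime, as the highlights note.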


INTRODUCTION

Validating each pipeline prefetch is necessary to account for local and remote stores, and it results in a doubling of L1 cache accesses for all pipeline prefetches. This problem can be addressed by installing prefetched data in intermediate storage between the CPU and the L1 [15, 27]. While previous pipeline prefetching techniques either increase pressure on the L1 [5, 39] or require extra data storage [15, 27], our solution avoids both problems and further reduces L1 accesses by forwarding the data from the PRF when reuse is detected. This provides spatial reuse without the data storage and data movement overheads of an L0. By using the PRF for data storage, we accomplish this with only the overhead of a small tag array (0.5% of the CPU core area).
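
As a minimal illustration of the coalescing idea above (a sketch under assumptions, not the paper's implementation), the C++ fragment below merges loads that target the same cache line onto a single in-flight prefetch; the line size, the inflight map, and issueOrCoalesce are hypothetical names introduced here.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    constexpr std::uint64_t kLineBytes = 64;  // assumed cache-line size

    inline std::uint64_t lineOf(std::uint64_t addr) {
        return addr / kLineBytes;
    }

    // In-flight prefetches keyed by line address; each value lists the
    // loads waiting to be satisfied by that single L1 access.
    std::unordered_map<std::uint64_t, std::vector<int>> inflight;

    // Returns true if the load coalesced onto an in-flight prefetch
    // (spatial reuse, no extra L1 access); false if a new prefetch must
    // be issued.
    bool issueOrCoalesce(int loadId, std::uint64_t addr) {
        auto [it, fresh] = inflight.try_emplace(lineOf(addr));
        it->second.push_back(loadId);
        return !fresh;  // fresh means this load created the entry
    }

When the prefetch completes, every load recorded for that line can be satisfied from the same physical register, which is where the spatial reuse comes from.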

Pipeline Prefetching and Value-prediction
Register Sharing
Detecting Memory Ordering Violations
MOTIVATION
Forwarding Validation
Interaction with the Memory Bypass Predictor
SRAM Cache Layout
Coalescing Loads Efficiently
Exposing more Locality than the LQ Window
Simulation and Modeling
Memory Access Reduction
CONCLUSION