Abstract
Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead. In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead. Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.
Highlights
Satisfying loads as early as possible is critical for performance
To detect register reuse and perform load-to-load forwarding through the physical register file, we introduce a register translation table, the Address Tag, Register Tag (AT-RT) table
These results demonstrate that Address Tag-Register Tag (AT-RT) is able to achieve the full performance of pipeline prefetching with even better energy savings than register reuse
Summary
This is necessary to account for local and remote stores, and it results in a doubling of L1 cache accesses for all pipeline prefetches. This problem can be addressed by installing prefetched data in intermediate storage between the CPU and the L1 [15, 27]. While previous pipeline prefetching techniques either increase pressure on the L1 [5, 39] or require extra data storage [15, 27], our solution avoids both problems: it reduces L1 accesses by forwarding the data from the PRF when reuse is detected. This provides spatial reuse without the data storage and data movement overheads of an L0. By using the PRF for data storage, we accomplish this with only the overhead of a small tag array (0.5% of the CPU core area).
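To make the mechanism concrete, the following is a minimal software sketch of the AT-RT idea: a small table maps cache-line address tags to physical register tags, so a later load to the same line can be forwarded from the physical register file instead of accessing the L1, and a local or remote store to the line invalidates the entry so forwarding stays safe. All names, the table organization, and the line size here are illustrative assumptions, not the paper's actual hardware design.

```python
LINE_BYTES = 64  # assumed cache-line size for this sketch


class ATRTTable:
    """Toy model of an Address Tag -> Register Tag (AT-RT) table."""

    def __init__(self):
        # Maps a cache-line address tag to the physical register id
        # whose value currently mirrors that line's load data.
        self.entries = {}

    def lookup(self, load_addr):
        """Return the physical register holding this line, or None on miss."""
        return self.entries.get(load_addr // LINE_BYTES)

    def install(self, load_addr, phys_reg):
        """Record that phys_reg now holds the data loaded from this line."""
        self.entries[load_addr // LINE_BYTES] = phys_reg

    def invalidate(self, store_addr):
        """A local or remote store to the line makes forwarding unsafe;
        drop the entry so subsequent loads access the L1 again."""
        self.entries.pop(store_addr // LINE_BYTES, None)


table = ATRTTable()
table.install(0x1000, phys_reg=7)    # a load to line 0x1000 filled register p7
assert table.lookup(0x1008) == 7     # later load, same line: forward from PRF
table.invalidate(0x1004)             # a store to that line invalidates the entry
assert table.lookup(0x1000) is None  # next load must go to the L1
```

The sketch deliberately keys entries by whole cache lines, mirroring how coalescing prefetches to one line enables the spatial reuse described above; the real design would bound the table's size and integrate invalidation with the existing memory order violation detection hardware rather than an explicit call.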
More From: ACM Transactions on Architecture and Code Optimization