Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation

Akihiro Yamamoto,Yusuke Tanaka,Hideki Ando,Toshio Shimada

doi:10.1145/1327171.1327175

Abstract

This paper proposes an instruction pre-execution scheme that reduces latency and early scheduling of loads for a high performance processor. Our scheme exploits the difference between the available amount of instruction-level parallelism with an unlimited number of physical registers and that with an actual number of physical registers. We introduce a scheme called two-step physical register deallocation. Our scheme deallocates physical registers at the renaming stage as a first step, and eliminates pipeline stalls caused by a physical register shortage. Instructions wait for the final deallocation as a second step in the instruction window. While waiting, the scheme allows pre-execution of instructions. This enables prefetching of load data and early calculation of memory effective addresses. In particular, our execution-based scheme has the strength on prefetch of data with an irregular access pattern. Considering the strength of an automatic prefetcher for a regular access pattern, combining it with our scheme offers the best use of our scheme. The evaluation results show that the combined scheme significantly improve performance over a processor with an automatic prefetcher.

Full Text