Abstract

The resource-sharing nature of Simultaneous Multithreading (SMT) processors and the presence of long-latency instructions from concurrent threads make the instruction scheduling window (IW), a primary shared component among the key pipeline structures in SMT, a performance bottleneck. Because of tight constraints on its physical size, the IW is under severe pressure to accommodate instructions from multiple threads while avoiding resource monopolization by low-ILP threads. Optimizing both efficiency and fairness in IW utilization, so that SMT delivers its attainable performance in the presence of long-latency instructions, is particularly challenging. Most existing optimization schemes for SMT processors rely on the fetch policy to control which instructions are allowed to enter the pipeline, while little effort has been devoted to controlling the long-latency instructions already resident in the IW. In this paper, we propose streamline buffers to handle long-latency instructions that have already entered the pipeline and clog the IW while the controlling fetch policies take time to react. Each streamline buffer extracts from the IW, and holds, a chain of instructions from one thread that are stalled by a dependence on a long-latency load. When the load value returns, the streamline buffer serves these instructions directly for in-order execution, avoiding any instruction replay. This operates as a supplement to the conventional IW, which in parallel serves the remaining instructions for out-of-order (o-o-o) execution. Analysis of SPEC2000 integer and FP benchmarks reveals that instructions dependent on long-latency loads typically have their first source operand ready within 5-15 percent of their total wait time in the IW. Our scheme exploits this asymmetry in source-operand ready times to achieve a complexity-effective design.
Compared to the baseline SMT architecture, our design, working in conjunction with the earlier proposed ICOUNT.2.8 fetch policy for 4 threads, reduces the IW full rate by 9.4 percent (11 percent for 2 threads), improves average IPC for MIXED workloads by 9.6 percent (8 percent for MEM workloads and 4.4 percent for CPU workloads), and improves fairness by 7.56 percent (7.24 percent for 2 threads). Similar enhancements are observed in conjunction with the RR.2.8 fetch policy. Further, our scheme combined with DCRA improves performance on average by 21.7 percent, whereas DCRA alone improves it by 16.3 percent.
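To make the mechanism concrete, the following is a minimal behavioral sketch, not the paper's hardware design: it models an issue window in program order and shows how a streamline buffer could pull out the transitive dependence chain of a long-latency load, then drain that chain in order once the load value returns. All names (`StreamlineBuffer`, `extract_chain`, the instruction dictionary format) are illustrative assumptions.

```python
from collections import deque

class StreamlineBuffer:
    """Illustrative sketch: holds the dependence chain of one long-latency
    load from one thread, to be drained in program order (no replay) once
    the load value returns."""

    def __init__(self):
        self.fifo = deque()

    def capture(self, instr):
        # Instructions arrive in program order, so a FIFO preserves it.
        self.fifo.append(instr)

    def drain(self):
        # Called when the load value returns: issue the chain in order.
        issued = []
        while self.fifo:
            issued.append(self.fifo.popleft())
        return issued

def extract_chain(window, load_dest):
    """Split the issue window: instructions (transitively) dependent on the
    long-latency load's destination register move to a streamline buffer;
    the rest stay in the window for out-of-order execution."""
    buf = StreamlineBuffer()
    tainted = {load_dest}   # registers whose values await the load
    remaining = []
    for instr in window:    # window assumed to be in program order
        if tainted & set(instr["srcs"]):
            tainted.add(instr["dest"])  # propagate the dependence
            buf.capture(instr)
        else:
            remaining.append(instr)
    return remaining, buf

window = [
    {"op": "add", "srcs": ["r1"], "dest": "r2"},  # depends on the load (r1)
    {"op": "mul", "srcs": ["r3"], "dest": "r4"},  # independent
    {"op": "sub", "srcs": ["r2"], "dest": "r5"},  # transitively dependent
]
remaining, buf = extract_chain(window, "r1")
# remaining holds only "mul"; the buffer drains "add" then "sub" in order.
```

In this toy model, the independent `mul` keeps its IW slot, while the `add`/`sub` chain no longer occupies the window during the load's latency, which is the clogging effect the streamline buffers target.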
