Instruction Window Research Articles

Superscalar processors currently have the potential to fetch multiple basic blocks per cycle by employing one of several recently proposed instruction fetch mechanisms. However, this increased fetch bandwidth cannot be exploited unless pipeline stages further downstream improve correspondingly. In particular, register renaming a large number of instructions per cycle is difficult. A large instruction window, needed to receive multiple basic blocks per cycle, will slow down dependence resolution and instruction issue. This paper addresses these and related issues by proposing (i) partitioning of the instruction window into multiple blocks, each holding a dynamic code sequence; (ii) logical partitioning of the register file into a global file and several local files, the latter holding registers local to a dynamic code sequence; (iii) the dynamic recording and reuse of register renaming information for registers local to a dynamic code sequence. Performance studies show these mechanisms improve performance over traditional superscalar processors by factors ranging from 1.5 to a little over 3 for the SPEC Integer programs. Next, it is observed that several of the loops in the benchmarks display vector-like behavior during execution, even though the static loop bodies are likely too complex for compile-time vectorization. A dynamic loop vectorization mechanism that builds on top of the above mechanisms is briefly outlined. The mechanism vectorizes up to 60% of the dynamic instructions for some programs, although the average number of iterations per loop is quite small.

Detecting independent operations is a prime objective for computers that are capable of issuing and executing multiple operations simultaneously. The number of instructions that are simultaneously examined for detecting those that are independent is the scope of concurrency detection. The authors present an analytical model for predicting the performance impact of varying the scope of concurrency detection as a function of available resources, such as the number of pipelines in a superscalar architecture. The model can show where a performance bottleneck might be: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection. The cost associated with speculative execution is examined via a set of probability distributions that characterize the inherent parallelism in the instruction stream. These results were derived using traces from the Multiflow TRACE SCHEDULING compacting FORTRAN 77 and C compilers. The experiments provide misprediction delay estimates for 11 common application-level benchmarks under scope constraints, assuming speculative, out-of-order execution and run-time scheduling. The throughput prediction of the analytical model is shown to be close to the measured static throughput of the compiler output.
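The three bottlenecks named in the abstract above can be illustrated with a toy issue simulator. This is only a sketch under stated assumptions and is not the authors' analytical model: every instruction is assumed to take one cycle, the window always holds the oldest unissued instructions, and the names `simulate`, `TRACE`, `scope`, and `pipelines` are hypothetical.

```python
def simulate(deps, scope, pipelines):
    """Count cycles needed to issue a whole trace (toy model, unit latency).

    deps[i] is the set of earlier instruction indices whose results
    instruction i needs. `scope` is how many of the oldest unissued
    instructions are examined per cycle; `pipelines` caps issues per cycle.
    """
    issued = set()                    # instructions issued in earlier cycles
    pending = list(range(len(deps)))  # unissued instructions, program order
    cycles = 0
    while pending:
        cycles += 1
        window = pending[:scope]      # only these are examined this cycle
        # Ready means every producer issued in an earlier cycle.
        ready = [i for i in window if deps[i] <= issued]
        chosen = set(ready[:pipelines])
        issued |= chosen
        pending = [i for i in pending if i not in chosen]
    return cycles


# Hypothetical trace: two independent 4-instruction dependence chains,
# laid out one after the other in program order.
TRACE = [set(), {0}, {1}, {2}, set(), {4}, {5}, {6}]

if __name__ == "__main__":
    for scope, pipelines in [(1, 2), (4, 2), (8, 2), (8, 1), (8, 4)]:
        print(scope, pipelines, simulate(TRACE, scope, pipelines))
```

With this trace, a scope of 1 serializes issue regardless of pipelines (insufficient scope), a scope of 8 with one pipeline is limited by resources, and a scope of 8 with four pipelines still needs four cycles because the trace contains only two independent chains (insufficient parallelism).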

Articles published on Instruction Window

Memory dependence prediction using store sets

Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences

Complexity-effective superscalar processors

Dynamic speculation and synchronization of data dependences

One billion transistors, one uniprocessor, one chip

Instruction window size trade-offs and characterization of program parallelism

Interrupt handling for out-of-order execution processors

The expandable split window paradigm for exploiting fine-grain parallelism

Hiding memory latency using dynamic scheduling in shared-memory multiprocessors
