A Comparison of Asymptotically Scalable Superscalar Processors

B C Kuszmaul, D S Henry, G H Loh

doi:10.1007/s00224-001-1029-z

Abstract

The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as $\Theta(n^2)$, where $n$ is the fetch width, the issue width, or the window size. This paper describes two scalable processor architectures, Ultrascalar I and Ultrascalar II, and compares their VLSI complexities (gate delays, wire-length delays, and area). Both processors are implemented by a large collection of ALUs with controllers (together called execution stations) connected together by a network of parallel-prefix tree circuits. A fat-tree network connects an interleaved cache to the execution stations. These networks provide the full functionality of superscalar processors, including renaming, out-of-order execution, and speculative execution. The difference between the processors is in the mechanism used to transmit register values from one execution station to another. Both architectures use a parallel-prefix tree to communicate the register values between the execution stations. Ultrascalar I transmits an entire copy of the register file to each station, and the station chooses which register values it needs based on the instruction. Ultrascalar I uses an H-tree layout. Ultrascalar II uses a mesh-of-trees and, to reduce the number of wires required on the chip, carefully sends only the register values that will actually be needed by each subtree. The complexity results are as follows. The complexity is described for a processor that has an instruction-set architecture containing $L$ logical registers and can execute $n$ instructions in parallel. The chip provides enough memory bandwidth to execute up to $M(n)$ memory operations per cycle. ($M$ is assumed to have a certain regularity property.) In all the processors, the VLSI area is the square of the wire delay.
Ultrascalar I has gate delay $O(\log n)$ and wire delay $\tau_{\text{wires}} = \Theta(\sqrt{n}L)$ if $M(n)$ is $O(n^{1/2-\varepsilon})$, $\tau_{\text{wires}} = \Theta(\sqrt{n}(L+\log n))$ if $M(n)$ is $\Theta(n^{1/2})$, and $\tau_{\text{wires}} = \Theta(\sqrt{n}L+M(n))$ if $M(n)$ is $\Omega(n^{1/2+\varepsilon})$, for $\varepsilon > 0$. Ultrascalar II has gate delay $\Theta(\log L+\log n)$. Its wire delay is $\Theta(n)$, which is optimal for $n=O(L)$. Thus, Ultrascalar II dominates Ultrascalar I for $n=O(L^2)$; otherwise Ultrascalar I dominates Ultrascalar II. We introduce a hybrid ultrascalar that uses a two-level layout scheme: clusters of execution stations are laid out using the Ultrascalar II mesh-of-trees layout, and then the clusters are connected together using the H-tree layout of Ultrascalar I. For the hybrid (in which $n \ge L$), the wire delay is $\Theta(\sqrt{nL}+M(n))$, which is optimal. For $n \ge L$, the hybrid dominates both Ultrascalar I and Ultrascalar II. We also present an empirical comparison of Ultrascalar I and the hybrid, both laid out using the Magic VLSI editor. For a processor that has 32 32-bit registers and a simple integer ALU, the hybrid requires about 11 times less area than Ultrascalar I.
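
The asymptotic wire-delay bounds above can be compared numerically with a minimal Python sketch. It rests on several assumptions that are not part of the paper: all hidden constant factors are set to 1, the memory bandwidth is modeled by the hypothetical form M(n) = n**p, and the hybrid bound is read as sqrt(n*L) + M(n). All function names are illustrative. The absolute numbers therefore mean nothing; only the relative growth and the crossover near n = L**2 are of interest.

```python
import math


def memory_bandwidth(n, p):
    """Hypothetical bandwidth model M(n) = n**p (an assumption for illustration)."""
    return n ** p


def ultrascalar_1(n, L, p):
    """Growth term of the Ultrascalar I wire delay, constants dropped.

    The regime is selected by the exponent p of the assumed M(n) = n**p.
    """
    if p < 0.5:
        return math.sqrt(n) * L
    if p == 0.5:  # exact comparison is fine here: p is a chosen model parameter
        return math.sqrt(n) * (L + math.log2(n))
    return math.sqrt(n) * L + memory_bandwidth(n, p)


def ultrascalar_2(n):
    """Growth term of the Ultrascalar II wire delay, Theta(n), constants dropped."""
    return float(n)


def hybrid(n, L, p):
    """Growth term of the hybrid wire delay, read as sqrt(n*L) + M(n), for n >= L."""
    if n < L:
        raise ValueError("the hybrid bound is stated only for n >= L")
    return math.sqrt(n * L) + memory_bandwidth(n, p)


def area(wire_delay):
    """VLSI area as the square of the wire delay, as stated in the abstract."""
    return wire_delay ** 2


# Illustrative sweep: L = 32 logical registers, M(n) = n**0.25 (assumed).
L, p = 32, 0.25
for n in (256, 1024, 4096):
    print(n, ultrascalar_1(n, L, p), ultrascalar_2(n), hybrid(n, L, p))
```

With L = 32, the Ultrascalar II term is the smaller of the two original designs below n = L**2 = 1024 and the Ultrascalar I term is the smaller above it, while the hybrid term stays below both for n >= L. This is a statement about the growth terms only, not about measured delays or the empirical area comparison.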
