Abstract
The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as Θ(n 2 ) where n is the fetch width, the issue width, or the window size. This paper describes two scalable processor architectures, Ultrascalar I and Ultrascalar II, and compares their VLSI complexities (gate delays, wire-length delays, and area.) Both processors are implemented by a large collection of ALUs with controllers (together called execution stations ) connected together by a network of parallel-prefix tree circuits. A fat-tree network connects an interleaved cache to the execution stations. These networks provide the full functionality of superscalar processors including renaming, out-of-order execution, and speculative execution. The difference between the processors is in the mechanism used to transmit register values from one execution station to another. Both architectures use a parallel-prefix tree to communicate the register values between the execution stations. Ultrascalar I transmits an entire copy of the register file to each station, and the station chooses which register values it needs based on the instruction. Ultrascalar I uses an H-tree layout. Ultrascalar II uses a mesh-of-trees and carefully sends only the register values that will actually be needed by each subtree to reduce the number of wires required on the chip. The complexity results are as follows: The complexity is described for a processor which has an instruction-set architecture containing L logical registers and can execute n instructions in parallel. The chip provides enough memory bandwidth to execute up to M(n) memory operations per cycle. (M is assumed to have a certain regularity property.) In all the processors, the VLSI area is the square of the wire delay. Ultrascalar I has gate delay O(log n) and wire delay \tauwires = \Theta(\sqrt{n}L) if $M(n)$ is $O(n^{1/2-\varepsilon})$, \tauwires = \Theta(\sqrt{n}(L+\log n)) if $M(n)$ is $\Theta(n^{1/2})$, \tauwires = \Theta(\sqrt{n}L+M(n)) if $M(n)$ is $\Omega(n^{1/2+\varepsilon})$ for ɛ>0 . Ultrascalar II has gate delay Θ(log L+log n) . The wire delay is Θ(n) , which is optimal for n=O(L) . Thus, Ultrascalar II dominates Ultrascalar I for n=O(L 2 ) , otherwise Ultrascalar I dominates Ultrascalar II. We introduce a hybrid ultrascalar that uses a two-level layout scheme: Clusters of execution stations are layed out using the Ultrascalar II mesh-of-trees layout, and then the clusters are connected together using the H-tree layout of Ultrascalar I. For the hybrid (in which n≥ L ), the wire delay is Θ(\sqrt nL+M(n)) , which is optimal. For n≥ L , the hybrid dominates both Ultrascalar I and Ultrascalar II. We also present an empirical comparison of Ultrascalar I and the hybrid, both layed out using the Magic VLSI editor. For a processor that has 32 32-bit registers and a simple integer ALU, the hybrid requires about 11 times less area.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.