Superscalar Research Articles

Chip multiprocessors - also called multi-core microprocessors or CMPs for short - are now the only way to build high-performance microprocessors, for a variety of reasons. Large uniprocessors are no longer scaling in performance, because only a limited amount of parallelism can be extracted from a typical instruction stream using conventional superscalar instruction issue techniques. In addition, one cannot simply ratchet up the clock speed on today's processors, or the power dissipation will become prohibitive in all but water-cooled systems. Compounding these problems is the simple fact that, with the immense numbers of transistors available on today's microprocessor chips, it is too costly to design and debug ever-larger processors every year or two. CMPs avoid these problems by filling up a processor die with multiple, relatively simple processor cores instead of just one huge core. The exact size of a CMP's cores can vary from very simple pipelines to moderately complex superscalar processors, but once a core has been selected, the CMP's performance can easily scale across silicon process generations simply by stamping down more copies of the hard-to-design, high-speed processor core in each successive chip generation. In addition, parallel code execution, obtained by spreading multiple threads of execution across the various cores, can achieve significantly higher performance than would be possible using only a single core. While parallel threads are already common in many useful workloads, there are still important workloads that are hard to divide into parallel threads. The low inter-processor communication latency between the cores in a CMP helps make a much wider range of applications viable candidates for parallel execution than was possible with conventional, multi-chip multiprocessors; nevertheless, limited parallelism in key applications is the main factor limiting acceptance of CMPs in some types of systems.

The analysis of program executions reveals that most integer and multimedia applications make heavy use of narrow-width operations, i.e., instructions exclusively using narrow-width operands and producing a narrow-width result. Moreover, this usage is relatively well distributed over the application. We observed this program property on the MediaBench and SPEC2000 benchmarks, with about 40% of the instructions being narrow-width operations. Current superscalar processors use 64-bit datapaths to execute all the instructions of the applications. In this paper, we suggest the use of a width-partitioned microarchitecture (WPM) to master the hardware complexity of a superscalar processor. For a four-way issue machine, we split the processor into two two-way clusters: the main cluster executing 64-bit operations, load/store, and complex operations, and a narrow cluster executing the 16-bit operations. We resort to partitioning to decouple the treatment of the narrow-width operations from that of the other program instructions. This provides the benefit of greatly simplifying the design of the critical processor components in each cluster (e.g., the register file and the bypass network). The dynamic interleaving of the two instruction types keeps the workload balanced among clusters. WPM also helps to reduce the complexity of the interconnection fabric and of the issue logic. In fact, since the 16-bit cluster can only communicate narrow-width data, the datapath width of the interconnect fabric can be significantly reduced, yielding a corresponding saving of the interconnect power and area. We explore different possible configurations of WPM, discussing the various implementation tradeoffs. We also examine a speculative steering heuristic to distribute the narrow-width operations among clusters. A detailed analysis of the complexity factors shows that using WPM instead of a classical 64-bit two-cluster microarchitecture can save power and silicon area with a minimal impact on the overall performance.
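
To make the notion of a narrow-width operation concrete, the following is a minimal sketch of how an instruction could be classified and steered to one of the two clusters. It is not the steering heuristic from the paper: it inspects the actual operand and result values after the fact, whereas the abstract describes a speculative heuristic. The `Op` record, the `kind` labels ("alu", "mem", "complex"), the signed two's-complement interpretation of 16 bits, and the small trace are all assumptions made for illustration.

```python
from dataclasses import dataclass

NARROW_BITS = 16  # narrow-cluster datapath width, as stated in the abstract


def fits_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """True if a signed integer is representable in `bits` bits.

    Assumption: values are treated as signed two's-complement integers.
    """
    limit = 1 << (bits - 1)
    return -limit <= value < limit


@dataclass
class Op:
    kind: str        # hypothetical labels: "alu", "mem", "complex"
    sources: tuple   # source operand values
    result: int      # value produced by the operation


def steer(op: Op) -> str:
    """Pick a cluster for one operation (oracle version, not speculative)."""
    # Loads/stores and complex operations always go to the main cluster.
    if op.kind != "alu":
        return "main"
    # Narrow-width operation: every operand AND the result fit in 16 bits.
    if all(fits_narrow(v) for v in (*op.sources, op.result)):
        return "narrow"
    return "main"


# Made-up trace, for illustration only (not benchmark data).
trace = [
    Op("alu", (3, 4), 7),
    Op("alu", (30000, 30000), 60000),  # result does not fit in 16 bits
    Op("mem", (16, 0), 5),
    Op("alu", (-1, 1), 0),
    Op("complex", (6, 7), 42),
]

placement = [steer(op) for op in trace]
narrow_share = placement.count("narrow") / len(trace)
print(placement, narrow_share)
```

A real design cannot see the result before issuing the instruction, which is why a hardware implementation would have to predict narrowness and recover when the prediction is wrong; the sketch only illustrates the definition being predicted.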

Articles published on Superscalar

Exploiting computer resources for fast nearest neighbor classification

Hybrid multi-core architecture for boosting single-threaded performance

Power Estimation of Partitioned Register Files in a Clustered Architecture with Performance Evaluation

ALP

Efficient architecture/compiler co-exploration using analytical models

Dynamic branch prediction and control speculation

SIMDE: An educational simulator of ILP architectures with dynamic and static scheduling

Hardware/Software Interface Design of Godson-2 Simultaneous Multithreading Processor

A Top-Down Approach to Architecting CPI Component Performance Counters

Trace Cache Miss Rate

An Energy-Efficient Instruction Scheduler Design with Two-Level Shelving and Adaptive Banking

Implementing a 1GHz Four-Issue Out-of-Order Execution Microprocessor in a Standard Cell ASIC Methodology

Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency

A case for a complexity-effective, width-partitioned microarchitecture

Modeling out-of-order processors for WCET analysis

CAVA

MicroThread Based (MTB) coarse grained fault tolerance superscalar processor architecture

Area-Performance Trade-offs in Tiled Dataflow Architectures

Sequential in-core sorting performance for a SQL data service and for parallel sorting on heterogeneous clusters

Cache replacement algorithms with nonuniform miss costs
