ALU Instructions Research Articles

Even in the multicore era, there is continuous demand to increase the performance of single-threaded applications. However, the conventional path of increasing both issue width and instruction window size inevitably leads to the power wall. Value Prediction (VP) was proposed in the mid-1990s as an alternative path to further enhance the performance of wide-issue superscalar processors. Still, it was considered until recently that a performance-effective implementation of Value Prediction would add tremendous complexity and power consumption to almost every stage of the pipeline. Nonetheless, recent work in the field of VP has shown that, given an efficient confidence estimation mechanism, prediction validation can be removed from the out-of-order engine and delayed until commit time. As a result, recovering from mispredictions via selective replay can be avoided and a much simpler mechanism, pipeline squashing, can be used, while the out-of-order engine remains mostly unmodified. Yet VP with validation at commit time places strong constraints on the Physical Register File: write ports are needed to write predicted results and read ports are needed to validate them at commit time, potentially making the overall number of ports impractical. Fortunately, VP also implies that many single-cycle ALU instructions have their operands predicted in the front-end and can be executed in-place, in-order. Similarly, the execution of single-cycle instructions whose result has been predicted can be delayed until commit time, since predictions are validated at commit time. Consequently, a significant number of instructions (10% to 60% in our experiments) can bypass the out-of-order engine, allowing a reduction of the issue width, which is a major contributor to both out-of-order engine complexity and register file port requirements. This reduction paves the way for a truly practical implementation of Value Prediction.
Furthermore, since Value Prediction in itself usually increases performance, our resulting {Early | Out-of-Order | Late} Execution architecture, EOLE, is often more efficient than a baseline VP-augmented 6-issue superscalar while having a significantly narrower 4-issue out-of-order engine.
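
The routing idea in this abstract can be pictured with a minimal sketch. It is only an illustration under stated assumptions, not the authors' implementation: the `Instr` fields, the `route` function, and the `bypass_fraction` helper are hypothetical names, and the rule set is reduced to the two conditions the abstract mentions (operands predicted in the front-end, or a predicted result on a single-cycle instruction).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instr:
    """Hypothetical, simplified view of one decoded instruction."""
    name: str
    single_cycle_alu: bool    # assumed flag: e.g. add, and, shift
    operands_predicted: bool  # all source operands known in the front-end
    result_predicted: bool    # result predicted with high confidence


def route(instr: Instr) -> str:
    """Return 'early', 'late', or 'ooo' (out-of-order) for one instruction.

    Assumption: only single-cycle ALU instructions may bypass the
    out-of-order engine, as the abstract describes.
    """
    if not instr.single_cycle_alu:
        return "ooo"
    if instr.operands_predicted:
        return "early"   # executed in-place, in-order, in the front-end
    if instr.result_predicted:
        return "late"    # executed and validated at commit time
    return "ooo"


def bypass_fraction(instrs: List[Instr]) -> float:
    """Fraction of instructions that never enter the out-of-order engine."""
    if not instrs:
        return 0.0
    bypassed = sum(1 for i in instrs if route(i) != "ooo")
    return bypassed / len(instrs)


# Made-up four-instruction trace, for illustration only.
trace = [
    Instr("add", True, True, False),    # operands predicted -> early
    Instr("xor", True, False, True),    # result predicted -> late
    Instr("load", False, False, True),  # not single-cycle ALU -> ooo
    Instr("sub", True, False, False),   # nothing predicted -> ooo
]
routes = [route(i) for i in trace]
fraction = bypass_fraction(trace)
```

In this toy trace, half of the instructions bypass the out-of-order engine; the 10% to 60% range quoted in the abstract comes from the authors' experiments, not from this sketch.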

Branch predication is a program transformation technique that combines the instructions of multiple branches of an if statement into a straight-line sequence and associates each instruction of the sequence with a predicate. Branch predication improves the execution of branch statements on processors that support predicated execution of instructions, e.g., Intel IA-64, because the transformation improves instruction scheduling and might help cache performance. This paper proposes a novel software-based branch predication technique for GPUs. The main motivation is that branch instructions can easily become a performance bottleneck for a GPU program, because of the cost of branch instructions compared to ALU instructions and the possibility of low ALU utilization due to the separation of ALU instructions within control flow blocks. Due to the SIMD nature and massively multi-threaded architecture of the GPU, branching can be costly if more than one path is taken by a set of concurrent threads in a kernel. In this paper we show that branch predication can enable instruction packing, a VLIW-like GPU feature designed to increase the parallel execution of independent instructions, and can also decrease the number of control flow instructions, thereby improving the performance of GPU kernels with both single and multiple branch paths. The key to our branch predication technique is a set of transformation rules that take into consideration the specifics of the GPU architecture and implement software-based predicated execution of instructions on the GPU with little to no overhead. Furthermore, we identify architectural and program factors that affect the effectiveness of our technique and build a benefit analysis model for the transformation. The implementation of our technique on synthetic benchmarks and real-world applications demonstrates its effectiveness.
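
As a rough illustration of the transformation described above, the sketch below converts a two-way branch into a straight-line sequence in which both sides are computed and a predicate selects the result. It is a conceptual model in Python, not GPU code and not the paper's transformation rules; the function names and the arithmetic-select trick are assumptions made for the example.

```python
from typing import List


def branchy(xs: List[int]) -> List[int]:
    """Original form: each element takes one of two control-flow paths."""
    out = []
    for x in xs:
        if x > 0:
            out.append(x * 2)
        else:
            out.append(x - 1)
    return out


def predicated(xs: List[int]) -> List[int]:
    """If-converted form: no branch in the loop body.

    Both sides are always evaluated, and the predicate p (0 or 1)
    selects which value is kept. This assumes both sides are free of
    side effects, which is what makes the conversion safe here.
    """
    out = []
    for x in xs:
        p = int(x > 0)          # predicate as 0/1
        then_val = x * 2        # work from the 'if' side
        else_val = x - 1        # work from the 'else' side
        out.append(p * then_val + (1 - p) * else_val)
    return out


lanes = [3, -2, 0]
result_branchy = branchy(lanes)
result_predicated = predicated(lanes)
```

Both functions return the same values; the predicated version trades extra arithmetic for the absence of divergent control flow, which is the trade-off the abstract's benefit analysis model is meant to weigh.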

Articles published on ALU Instructions

Enhancing network-on-chip performance by 32-bit RISC processor based on power and area efficiency

EOLE

Software-based branch predication for AMD GPUs

An Instruction Scheduler for Dynamic ALU Cascading Adoption

On the Effectiveness of Flow Aggregation in Improving Instruction Reuse in Network Processing Applications

Exploiting value locality to exceed the dataflow limit

The Performance Impact of Exploiting Branch ILP with Tree Representation of ILP Code

The RISC processor DMN-6: a unified data-control flow architecture
