A method for applying loop unrolling and software pipelining to instruction-level parallel architectures

Abstract

A considerable part of program execution time is consumed by loops, so loop optimization is highly effective, especially for the innermost loops of a program. Software pipelining and loop unrolling are known methods for loop optimization. Software pipelining is advantageous in that the code becomes only slightly longer. This method, however, is difficult to apply if the loop includes branching when the parallelism is limited. Loop unrolling, on the other hand, is free of such limitations but suffers from a number of drawbacks. In particular, the code size grows substantially and it is difficult to determine the optimal number of body replications. To solve these problems, it seems important to combine software pipelining with loop unrolling so as to utilize the advantages of both techniques while paying due regard to the properties of the programs under consideration and to the machine resources available. This paper describes a method for applying optimal loop unrolling and effective software pipelining to achieve this goal. Program characteristics obtained by means of an extended PDG (program dependence graph) are taken into consideration as well as machine resources. © 1998 Scripta Technica, Syst Comp Jpn, 29(9): 62–73, 1998
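
The trade-off described in the abstract can be made concrete with a source-level sketch. The code below is not the paper's method; it only illustrates the two transformations by hand on a hypothetical `saxpy`-style loop (the function names, the unroll factor of 4, and the assumption that `x` and `y` do not overlap are all choices made for this illustration). The unrolled version replicates the body and needs a remainder loop, so the code grows with the unroll factor. The software-pipelined version keeps a single loop body but overlaps the loads of iteration `i + 1` with the compute and store of iteration `i`, at the cost of a short prologue and epilogue.

```c
#include <stddef.h>

/* Baseline: one iteration per trip, y[i] = a * x[i] + y[i]. */
void saxpy_plain(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Loop unrolling by 4 (factor chosen arbitrarily for illustration).
 * Four independent statements per trip expose more parallelism,
 * but the body is four times longer and a remainder loop is needed. */
void saxpy_unrolled4(size_t n, float a, const float *x, float *y) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
    for (; i < n; i++)          /* remainder iterations */
        y[i] = a * x[i] + y[i];
}

/* Hand-written software pipelining with two overlapped iterations.
 * Assumes x and y do not overlap. */
void saxpy_pipelined(size_t n, float a, const float *x, float *y) {
    if (n == 0)
        return;
    /* Prologue: issue the loads of iteration 0. */
    float xv = x[0];
    float yv = y[0];
    /* Kernel: compute/store of iteration i overlaps loads of i + 1. */
    for (size_t i = 0; i + 1 < n; i++) {
        float r = a * xv + yv;  /* compute stage, iteration i   */
        xv = x[i + 1];          /* load stage, iteration i + 1  */
        yv = y[i + 1];
        y[i] = r;               /* store stage, iteration i     */
    }
    /* Epilogue: finish the last iteration. */
    y[n - 1] = a * xv + yv;
}
```

All three functions compute the same result; what differs is code size and how much independent work is visible to the scheduler in each trip, which is the balance a combined approach has to strike against the available machine resources.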

Similar Papers
  • Book Chapter
  • Cite Count Icon 3
  • 10.1007/978-3-642-13374-9_19
Using the Meeting Graph Framework to Minimise Kernel Loop Unrolling for Scheduled Loops
  • Jan 1, 2010
  • Mounira Bachir + 2 more

This paper improves our previous research effort [1] by providing an efficient method for kernel loop unrolling minimisation in the case of already scheduled loops, where circular lifetime intervals are known. When loops are software pipelined, the number of values simultaneously alive becomes exactly known, giving better opportunities for kernel loop unrolling. Furthermore, fixing circular lifetime intervals allows us to reduce the algorithmic complexity of our method compared to [1] by computing a new search space for minimal kernel loop unrolling. The meeting graph (MG) is one of the frameworks proposed in the literature [3] which models loop unrolling and register allocation together in a common formal framework for software pipelined loops. Although MG significantly improves loop register allocation, the computed loop unrolling may lead to impractical code growth. This work proposes to minimise the loop unrolling degree in the meeting graph by adapting the approach described in [1]. We explain how to reduce the search space for minimal kernel loop unrolling in the context of MG, yielding a reduced algorithmic complexity. Furthermore, our experiments on SPEC2000, SPEC2006, MEDIABENCH and FFMPEG show that in concrete cases the loop unrolling minimisation is very fast and the minimal loop unrolling degree for 75% of the optimised loops is equal to 1 (i.e. no unroll), while it is equal to 7 when the software pipelining (SWP) schedule is not fixed.

  • Research Article
  • Cite Count Icon 4
  • 10.1007/s10766-012-0203-z
Minimal Unroll Factor for Code Generation of Software Pipelining
  • Jul 17, 2012
  • International Journal of Parallel Programming
  • Mounira Bachir + 4 more

We address the problem of generating compact code from software pipelined loops. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates lifetime intervals spanning multiple loop iterations. These intervals require periodic register allocation (also called variable expansion), which in turn yields a code generation challenge. We are looking for the minimal unrolling factor enabling the periodic register allocation of software pipelined kernels. This challenge is generally addressed through one of: (1) hardware support in the form of rotating register files, which solve the unrolling problem but are expensive in hardware; (2) register renaming by inserting register moves, which increase the number of operations in the loop, and may damage the schedule of the software pipeline and reduce throughput; (3) post-pass loop unrolling that does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers (maximal number of values simultaneously alive) are sufficient for periodic register allocation (Eisenbeis et al. in PACT ’95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264–267, Manchester, UK, 1995; Hendren et al. in CC ’92: Proceedings of the 4th International Conference on Compiler Construction, pages 176–191, London, UK, 1992). However, the best existing heuristic for controlling this code growth, modulo variable expansion (Lam in SIGPLAN Not 23(7):318–328, 1988), may not apply the correct amount of loop unrolling to guarantee that MAXLIVE registers are enough, which may result in register spills (Eisenbeis et al. in PACT ’95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264–267, Manchester, UK, 1995).
This paper presents our research results on the open problem of minimal loop unrolling, allowing a software-only code generation that does not trade the optimality of the initiation interval (II) for the compactness of the generated code. Our novel idea is to use the remaining free registers after periodic register allocation to relax the constraints on register reuse. The problem of minimal loop unrolling arises either before or after software pipelining, either with a single or with multiple register types (classes). We provide a formal problem definition for each scenario, and we propose and study a dedicated algorithm for each problem. Our solutions are implemented within an industrial-strength compiler for a VLIW embedded processor from STMicroelectronics, and validated on multiple benchmarks suites.
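
To give a feel for why post-pass unrolling can cause code growth, here is a minimal sketch of how an unroll degree is commonly derived under modulo variable expansion. It is not the algorithm of this paper. It assumes the usual textbook formulation: a value that stays alive for `lifetime` cycles in a kernel with initiation interval `ii` needs `q = ceil(lifetime / ii)` register copies, and the kernel is unrolled either by the maximum of the `q` values (smaller code, possibly more registers) or by their least common multiple (fewest registers per value, larger code). The function names and the sample lifetimes are hypothetical.

```c
#include <stddef.h>

/* Ceiling of a / b for b > 0. */
static unsigned ceil_div(unsigned a, unsigned b) {
    return (a + b - 1) / b;
}

/* Greatest common divisor (Euclid). */
static unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Register copies needed by one value: at least 1. Assumes ii > 0. */
static unsigned copies_needed(unsigned lifetime, unsigned ii) {
    unsigned q = ceil_div(lifetime, ii);
    return q == 0 ? 1 : q;
}

/* Unroll degree taken as the maximum number of copies. */
unsigned unroll_degree_max(const unsigned *lifetime, size_t n, unsigned ii) {
    unsigned u = 1;
    for (size_t i = 0; i < n; i++) {
        unsigned q = copies_needed(lifetime[i], ii);
        if (q > u)
            u = q;
    }
    return u;
}

/* Unroll degree taken as the least common multiple of the copies. */
unsigned unroll_degree_lcm(const unsigned *lifetime, size_t n, unsigned ii) {
    unsigned u = 1;
    for (size_t i = 0; i < n; i++) {
        unsigned q = copies_needed(lifetime[i], ii);
        u = u / gcd_u(u, q) * q;
    }
    return u;
}
```

For example, with lifetimes of 3, 5 and 7 cycles and an initiation interval of 2, the values need 2, 3 and 4 copies, so the maximum-based degree is 4 while the LCM-based degree is 12. The gap between the two grows quickly with the number of distinct lifetimes, which is the kind of code growth the abstract refers to.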

  • Research Article
  • Cite Count Icon 21
  • 10.1145/79505.79508
A study of scalar compilation techniques for pipelined supercomputers
  • Sep 1, 1990
  • ACM Transactions on Mathematical Software
  • Shlomo Weiss + 1 more

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Our study indicates that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler. Finally, we show that the combination of loop unrolling and dynamic software pipelining, as implemented by a decoupled computer, substantially outperforms the vector CRAY-1S.

  • Research Article
  • Cite Count Icon 38
  • 10.1145/36177.36191
A study of scalar compilation techniques for pipelined supercomputers
  • Oct 1, 1987
  • ACM SIGARCH Computer Architecture News
  • Shlomo Weiss + 1 more

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.

  • Research Article
  • Cite Count Icon 7
  • 10.1145/36205.36191
A study of scalar compilation techniques for pipelined supercomputers
  • Oct 1, 1987
  • ACM SIGPLAN Notices
  • Shlomo Weiss + 1 more

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.

  • Research Article
  • Cite Count Icon 63
  • 10.1145/36204.36191
A study of scalar compilation techniques for pipelined supercomputers
  • Oct 1, 1987
  • ACM SIGOPS Operating Systems Review
  • Shlomo Weiss + 1 more

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.

  • Book Chapter
  • Cite Count Icon 14
  • 10.1007/3-540-61053-7_49
Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops
  • Jan 1, 1996
  • Jian Wang + 1 more

The objective of software pipelining is to generate code which can maximally exploit instruction-level parallelism (ILP) in modern multi-issue processor architectures, such as VLIW and superscalar processors. Since the amount of ILP is usually fixed to a small number (four to eight), using state-of-the-art software pipelining scheduling techniques, modern compilers have been able to schedule instructions in a small window of successive iterations and keep the machine resources usefully busy. To maximally take advantage of software pipelining, it is beneficial if the number of iterations of the loops to be software pipelined (called trip counts in this paper) is large. Therefore, software pipelining of nested loops becomes important, especially when the innermost loops have smaller trip counts. This paper presents a loop transformation which extends software pipelining from the innermost loops to the enclosing loop nests. Unlike some popular loop transformation techniques (e.g. unimodular transformation) targeted to multi-processor machines (where the goal has been to maximally expose loop-level parallelism, i.e. the transformed loop nests have the maximum number of doall loops), the goal of our transformation, pipelining-dovetailing, is to extend the software pipelining of the innermost loop to the surrounding loop nests. Thus all iterations of the loop nests can be smoothly software pipelined through, and the number of effective trip counts is maximized. We also define the condition under which pipelining-dovetailing is valid. As a result, a software pipelining framework is derived for loop nests which integrates software pipelining and pipelining-dovetailing together.

Keywords: Instruction-Level Parallelism, Fine-Grain Parallelism, Software Pipelining, Loop Scheduling, Nested Loop, Very Long Instruction Word (VLIW), Superscalar

  • Conference Article
  • Cite Count Icon 1
  • 10.1109/hpcsim.2012.6266972
On the effectiveness of register moves to minimise post-pass unrolling in software pipelined loops
  • Jul 1, 2012
  • Mounira Bachir + 2 more

Software pipelining is a powerful technique to expose fine-grain parallelism, but it results in variables staying alive across more than one kernel iteration. It requires periodic register allocation and is challenging for code generation: the lack of a reliable solution currently restricts the applicability of software pipelining. The classical software solution that does not alter the computation throughput consists in unrolling the loop a posteriori [11], [10]. However, the resulting unrolling degree is often unacceptable and may reach absurd levels. Alternatively, loop unrolling can be avoided thanks to software register renaming. This is achieved through the insertion of move operations, but this may increase the initiation interval (II), which nullifies the benefits of software pipelining. This article aims at tightly controlling the post-pass loop unrolling necessary to generate code. We study the potential of live range splitting to reduce kernel loop unrolling, introducing additional move instructions without increasing the II. We provide a complete formalisation of the problem, an algorithm, and extensive experiments. Our algorithm yields low unrolling degrees in most cases, with no increase of the II.

  • Conference Article
  • Cite Count Icon 6
  • 10.1109/iwia.2001.955198
Characteristics of loop unrolling effect: software pipelining and memory latency hiding
  • Jan 1, 2001
  • Hiroyuki + 1 more

Recently, loop unrolling has been shown in a new light from the superscalar architectural point of view. In this paper, we show that in addition to the superscalar effect and the scalar replacement effect, loop unrolling can hide memory latency, and that the combination of those effects improves the performance of loop unrolling. A major contribution of this paper is that the analysis is done symbolically and quantitatively. Although these effects have been known as major factors in the performance of loop unrolling, no quantitative approach has been tried. Our analysis clarifies the behaviour of superscalar functions and memory latency hiding in loop unrolling.

  • Single Report
  • Cite Count Icon 2
  • 10.21236/ada326916
Effects of Loop Unrolling and Loop Fusion on Register Pressure and Code Performance.
  • Jun 1, 1997
  • Dale Shires

Many of today's high-performance computer processors are super-scalar. They can dispatch multiple instructions per cycle and, hence, provide what is commonly referred to as instruction-level parallelism. This super-scalar capability, combined with software pipelining, can increase processor throughput dramatically. Achieving maximum throughput, however, is nontrivial. Compilers must engage in aggressive optimization techniques, such as loop unrolling, speculative code motion, etc., to structure code to take full advantage of the underlying computer architecture. The phase-ordering implications of these optimizations are not well understood and are the subject of continuing research. Of particular interest are optimizations that enhance instruction-level parallelism. Two of these are loop unrolling and loop fusion. These are source-level optimizations that can be performed by either the programmer or the compiler. These optimizations have dramatic effects on the compiler's instruction scheduler. Performed too aggressively, these optimizations can increase register pressure and result in costly memory references. This paper details experiments performed to measure the effects of these source-level code transformations and how they influenced register pressure and code performance.

  • Conference Article
  • Cite Count Icon 5
  • 10.1109/hicss.1995.375390
A comparative evaluation of software techniques to hide memory latency
  • Jan 4, 1995
  • L.K John + 3 more

Software-oriented techniques to hide memory latency in superscalar and superpipelined machines include loop unrolling, software pipelining, and software cache prefetching. Issuing the data fetch request prior to actual need for data allows overlap of accessing with useful computations. Loop unrolling and software pipelining do not necessitate microarchitecture or instruction set architecture changes, whereas software controlled prefetching does. While studies on the benefits of the individual techniques have been done, no study evaluates all of these techniques within a consistent framework. This paper attempts to remedy this by providing a comparative evaluation of the features and benefits of the techniques. Loop unrolling and static scheduling of loads are seen to produce significant improvement in performance at lower latencies. Software pipelining is observed to be better than software controlled prefetching at lower latencies, but at higher latencies, software prefetching outperforms software pipelining. Aggressive prefetching beyond conditional branches can detrimentally affect performance by increasing the memory bandwidth requirements and bus traffic.

  • Conference Article
  • 10.1109/hpcsim.2011.5999826
Loop unrolling minimisation in the presence of multiple register types: A viable alternative to modulo variable expansion
  • Jul 1, 2011
  • Mounira Bachir + 3 more


  • Conference Article
  • Cite Count Icon 2
  • 10.1109/edtc.1994.326831
Optimal scheduling and software pipelining of repetitive signal flow graphs with delay line optimization
  • Jan 1, 1994
  • F Depuydt + 3 more

Software pipelining can have an enormous impact on the clock cycle count and hence on the performance of a real-time signal processing design. Because it pays off to invest CPU time in the optimal software pipelining of time-critical parts of a design, an integer programming approach is proposed for simultaneous scheduling and software pipelining. The integer programming techniques in the literature do not support cyclic (repetitive) signal flow graphs, and/or do not allow optimization of the storage cost of delay lines during software pipelining. The new contributions in this paper are the full integration of software pipelining and scheduling, based on a new timing model that supports cyclic signal flow graphs and optimization of delay line storage costs. Experiments with several real-time signal processing applications have shown the practical applicability of the approach.

  • Book Chapter
  • Cite Count Icon 5
  • 10.1016/b978-044482106-5/50033-8
Resource-Constrained Software Pipelining for High-Level Synthesis of DSP Systems
  • Jan 1, 1995
  • Algorithms and Parallel VLSI Architectures III
  • F Sánchez + 1 more


  • Conference Article
  • 10.23919/elinfocom.2018.8330584
Symbolic execution using approximate computing (SEAC) — A novel branch hazard distribution method
  • Jan 1, 2018
  • Oladiran G Olaleye + 3 more

To improve the overall performance of computer systems, instruction-level parallelism (ILP) has been widely exploited. However, branch hazards, conditional and unconditional, still limit the efficiency of most ILP techniques. Compiler techniques such as loop unrolling, software pipelining, and trace scheduling are being used to increase the amount of parallelism available in systems with fairly predictable branches, while predicated instructions have been useful in eliminating branch hazards in specific cases. The limitations imposed on ILP by branch hazards, however, are significant in large blocks of code or, at best, hidden at the expense of processor resources. As a result, researchers are exploring the techniques of approximate computing, which, when applied, would be suitable only for fault-tolerant systems. Some are also working on methods of code approximation, which mainly involve hazard minimization by distribution over specific parts of code segments. In this work, we propose and demonstrate a novel branch hazard distribution technique, Symbolic Execution using Approximate Computing (SEAC). We applied the proposed technique to a test program and ran simulation experiments using the Detailed CPU model in the gem5 simulator. Simulation results show that SEAC is 3.57, 1.95 and 1.32 times better than the best among the tested conventional ILP techniques, based on speedup, energy saving, and branch hazard distribution coefficient respectively.
