Unrolling shape for out-of-order processors

Abstract

Loop unrolling is today one of the most effective optimizations for modern architectures. Unrolling shape was proposed to give an analytical model for loop unrolling performance. It was applied to in-order processors, and was proved to give an accurate performance model for loop unrolling in terms of software pipelining and cache miss alleviation. In this paper, we apply unrolling shape to out-of-order processors. We present a scheme for calculating $PL_{OOO}$, the pipelining term of a loop unrolled by factor $l$, as $PL_{OOO}(l) = \left(N_{ins}(l)/F + N_{Occpy}(l)\right)/l$, where $N_{ins}(l)$ is the number of instructions in the loop unrolled by factor $l$, $F$ is the fetch rate of the architecture, and $N_{Occpy}(l)$ is the number of store instructions scheduled after the $N_{ins}(l)/F$-th cycle. A pipelining term for in-order processors is essential for calculating $N_{Occpy}(l)$, so the scheme for out-of-order processors builds on unrolling shape for in-order processors. Experiments show that our scheme is precise in calculating the behaviour of loop unrolling on out-of-order processors. We show that our scheme quantitatively expresses the effect of loop unrolling as that of infinitely unrolled loops on in-order processors. Furthermore, we reveal that the old folklore that loop unrolling reduces loop overhead has been revived on out-of-order processors as a performance improvement factor, expressed as $\frac{d}{dl} PL_{OOO}$ (Aho et al., 1986).
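As a rough illustration of the pipelining term stated in the abstract, PL_OOO(l) = (Nins(l)/F + NOccpy(l)) / l, the sketch below simply evaluates the formula for given inputs. The function name `pl_ooo`, the instruction-count model `6 * l + 2`, and the constant value used for NOccpy(l) are hypothetical choices made for this example only. The abstract does not say how NOccpy(l) is derived beyond noting that it needs the in-order pipelining term, so it is treated here as a plain input.

```python
from fractions import Fraction


def pl_ooo(n_ins, fetch_rate, n_occpy, l):
    """Evaluate PL_OOO(l) = (Nins(l)/F + NOccpy(l)) / l exactly.

    n_ins      -- Nins(l): instructions in the loop unrolled by factor l
    fetch_rate -- F: fetch rate of the architecture
    n_occpy    -- NOccpy(l): stores scheduled after the Nins(l)/F-th cycle
    l          -- unrolling factor
    """
    if l < 1 or fetch_rate < 1:
        raise ValueError("l and fetch_rate must be positive")
    return (Fraction(n_ins, fetch_rate) + n_occpy) / l


# Hypothetical loop: 6 body instructions per original iteration plus
# 2 loop-overhead instructions per unrolled iteration (an assumption).
for l in (1, 2, 4):
    n_ins = 6 * l + 2
    # NOccpy(l) = 1 is an assumed placeholder, not a value from the paper.
    print(l, float(pl_ooo(n_ins, fetch_rate=4, n_occpy=1, l=l)))
```

Under these assumed inputs the per-iteration term shrinks as l grows, because the fixed overhead instructions are shared by more original iterations. This matches the abstract's remark about loop overhead only qualitatively.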

Similar Papers
  • Book Chapter
  • Cite count: 3
  • 10.1007/978-3-642-13374-9_19
Using the Meeting Graph Framework to Minimise Kernel Loop Unrolling for Scheduled Loops
  • Jan 1, 2010
  • Mounira Bachir + 2 more

This paper improves our previous research effort [1] by providing an efficient method for kernel loop unrolling minimisation in the case of already scheduled loops, where circular lifetime intervals are known. When loops are software pipelined, the number of values simultaneously alive becomes exactly known, giving better opportunities for kernel loop unrolling. Furthermore, fixing circular lifetime intervals allows us to reduce the algorithmic complexity of our method compared to [1] by computing a new search space for minimal kernel loop unrolling. The meeting graph (MG) is one of the frameworks proposed in the literature [3] which models loop unrolling and register allocation together in a common formal framework for software pipelined loops. Although MG significantly improves loop register allocation, the computed loop unrolling may lead to impractical code growth. This work proposes to minimise the loop unrolling degree in the meeting graph by adapting the approach described in [1]. We explain how to reduce the search space for minimal kernel loop unrolling in the context of MG, yielding a reduced algorithmic complexity. Furthermore, our experiments on SPEC2000, SPEC2006, MEDIABENCH and FFMPEG show that in concrete cases the loop unrolling minimisation is very fast and the minimal loop unrolling degree for 75% of the optimised loops is equal to 1 (i.e. no unroll), while it is equal to 7 when the software pipelining (SWP) schedule is not fixed.

  • Research Article
  • 10.1002/(sici)1520-684x(199808)29:9<62::aid-scj7>3.0.co;2-h
A method for applying loop unrolling and software pipelining to instruction-level parallel architectures
  • Aug 1, 1998
  • Systems and Computers in Japan
  • Nobuhiro Kondo + 3 more

A considerable part of program execution time is consumed by loops, so that loop optimization is highly effective especially for the innermost loops of a program. Software pipelining and loop unrolling are known methods for loop optimization. Software pipelining is advantageous in that the code becomes only slightly longer. This method, however, is difficult to apply if the loop includes branching when the parallelism is limited. On the other hand, loop unrolling, while being free of such limitations, suffers from a number of drawbacks. In particular the code size grows substantially and it is difficult to determine the optimal number of body replications. In order to solve these problems, it seems important to combine software pipelining with loop unrolling so as to utilize the advantages of both techniques while paying due regard to properties of programs under consideration and to the machine resources available. This paper describes a method for applying optimal loop unrolling and effective software pipelining to achieve this goal. Program characteristics obtained by means of an extended PDG (program dependence graph) are taken into consideration as well as machine resources. © 1998 Scripta Technica, Syst Comp Jpn, 29(9): 62–73, 1998

  • Research Article
  • Cite count: 21
  • 10.1145/79505.79508
A study of scalar compilation techniques for pipelined supercomputers
  • Sep 1, 1990
  • ACM Transactions on Mathematical Software
  • Shlomo Weiss + 1 more

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Our study indicates that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler. Finally, we show that the combination of loop unrolling and dynamic software pipelining, as implemented by a decoupled computer, substantially outperforms the vector CRAY-1S.

  • Research Article
  • Cite count: 38
  • 10.1145/36177.36191
A study of scalar compilation techniques for pipelined supercomputers
  • Oct 1, 1987
  • ACM SIGARCH Computer Architecture News
  • Shlomo Weiss + 1 more

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.

  • Research Article
  • Cite count: 7
  • 10.1145/36205.36191
A study of scalar compilation techniques for pipelined supercomputers
  • Oct 1, 1987
  • ACM SIGPLAN Notices
  • Shlomo Weiss + 1 more

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.

  • Research Article
  • Cite count: 63
  • 10.1145/36204.36191
A study of scalar compilation techniques for pipelined supercomputers
  • Oct 1, 1987
  • ACM SIGOPS Operating Systems Review
  • Shlomo Weiss + 1 more

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.

  • Book Chapter
  • Cite count: 7
  • 10.1007/978-3-642-21878-1_60
Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC
  • Jan 1, 2011
  • Torsten Hoefler

Application performance is critical in high-performance computing (HPC); however, it is not considered in a systematic way in the HPC software development process. Integrated performance models could improve this situation. Advanced analytic performance modeling and performance analysis tools exist in isolation but have similar goals and could benefit mutually. We find that existing analysis tools could be extended to support analytic performance modeling and performance models could be used to improve the understanding of real application performance artifacts. We show a simple example of how a tool could support developers of analytic performance models. Finally, we propose to implement a strategy for integrated tool-supported performance modeling during the whole software development process.

Keywords: Message Passing Interface, Performance Tool, Software Development Process, Target Architecture, Critical Block

  • Conference Article
  • Cite count: 20
  • 10.1109/ecrts.2009.9
Combining Worst-Case Timing Models, Loop Unrolling, and Static Loop Analysis for WCET Minimization
  • Jul 1, 2009
  • Paul Lokuciejewski + 1 more

Program loops are notorious for their optimization potential on modern high-performance architectures. Compilers aim at their aggressive transformation to achieve large improvements of the program performance. In particular, the loop unrolling optimization has been shown over the past decades to be highly effective, achieving significant increases in average-case performance. In this paper, we present loop unrolling that is tailored towards real-time systems. Our novel optimization is driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. To exploit maximal optimization potential, the determination of a suitable unrolling factor is based on precise loop iteration counts provided by a static loop analysis. In addition, our heuristics avoid adverse effects of unrolling which result from instruction cache overflows and the generation of additional spill code. Results on 45 real-life benchmarks demonstrate that aggressive loop unrolling can yield WCET reductions of up to 13.7% over simple, naive approaches employed by many production compilers.

  • Conference Article
  • Cite count: 6
  • 10.1109/iwia.2001.955198
Characteristics of loop unrolling effect: software pipelining and memory latency hiding
  • Jan 1, 2001
  • Hiroyuki + 1 more

Recently loop unrolling has been shown in a new light from the superscalar architectural point of view. In this paper, we show that in addition to the superscalar effect and the scalar replacement effect, loop unrolling can hide memory latency, and that the combination of those effects improves the performance of loop unrolling. A major contribution of this paper is that the analysis is done symbolically and quantitatively. Although these effects have been known as major factors that affect the performance of loop unrolling, no quantitative approach has been tried. Our analysis clarifies the behaviour of superscalar functions and memory latency hiding in loop unrolling.

  • Book Chapter
  • Cite count: 2
  • 10.1007/978-3-540-39920-9_9
An Unfolding-Based Loop Optimization Technique
  • Jan 1, 2003
  • Litong Song + 2 more

Loops in programs are the source of many optimizations for improving program performance, particularly on modern high-performance architectures as well as vector and multithreaded systems. Techniques such as loop invariant code motion, loop unrolling and loop peeling have demonstrated their utility in compiler optimizations. However, many of these techniques can only be used in very limited cases when the loops are "well-structured" and easy to analyze. For instance, loop invariant code motion works only when invariant code is inside loops; loop unrolling and loop peeling work effectively when the array references are either constants or affine functions of index variable. It is our contention that there are many opportunities overlooked by limiting the optimizations to well structured loops. In many cases, even "badly-structured" loops may be transformed into well structured loops. As a case in point, we show how some loop-dependent code can be transformed into loop-invariant code by transforming the loops. Our technique described in this paper relies on unfolding the loop for several initial iterations such that more opportunities may be exposed for many other existing compiler optimization techniques such as loop invariant code motion, loop peeling, loop unrolling, and so on.

Keywords: Affine Function, Control Dependence, Compiler Optimization, Instruction Level Parallelism, Dependence Edge

  • Research Article
  • Cite count: 10
  • 10.1145/967278.967284
What can we gain by unfolding loops?
  • Feb 1, 2004
  • ACM SIGPLAN Notices
  • Litong Song + 1 more

Loops in programs are the source of many optimizations for improving program performance, particularly on modern high-performance architectures as well as vector and multithreaded systems. Techniques such as loop invariant code motion, loop unrolling and loop peeling have demonstrated their utility in compiler optimizations. However, many of these techniques can only be used in very limited cases when the loops are "well-structured" and easy to analyze. For instance, loop invariant code motion works only when invariant code is inside loops; loop unrolling and loop peeling work effectively when the array references are either constants or affine functions of index variable. It is our contention that there are many opportunities overlooked by limiting the optimizations to "well structured" loops. In many cases, even "badly-structured" loops may be transformed into "well structured" loops. As a case in point, we show how some loop-dependent code can be transformed into loop-independent code by transforming the loops. Our technique described in this paper relies on unfolding the loop for several initial iterations such that more opportunities may be exposed for many other existing compiler optimization techniques such as loop invariant code motion, loop peeling, loop unrolling and so on.

  • Conference Article
  • Cite count: 129
  • 10.1145/166955.166994
An analytic performance model of disk arrays
  • Jun 1, 1993
  • Edward K Lee + 1 more

As disk arrays become widely used, tools for understanding and analyzing their performance become increasingly important. In particular, performance models can be invaluable in both configuring and designing disk arrays. Accurate analytic performance models are preferable to other types of models because they can be quickly evaluated, are applicable under a wide range of system and workload parameters, and can be manipulated by a range of mathematical techniques. Unfortunately, analytic performance models of disk arrays are difficult to formulate due to the presence of queueing and fork-join synchronization; a disk array request is broken up into independent disk requests which must all complete to satisfy the original request. In this paper, we develop and validate an analytic performance model for disk arrays. We derive simple equations for approximating their utilization, response time and throughput. We validate the analytic model via simulation, investigate the error introduced by each approximation used in deriving the analytic model, and examine the validity of some of the conclusions that can be drawn from the model.

  • Research Article
  • Cite count: 24
  • 10.1145/166962.166994
An analytic performance model of disk arrays
  • Jun 1, 1993
  • ACM SIGMETRICS Performance Evaluation Review
  • Edward K Lee + 1 more

As disk arrays become widely used, tools for understanding and analyzing their performance become increasingly important. In particular, performance models can be invaluable in both configuring and designing disk arrays. Accurate analytic performance models are preferable to other types of models because they can be quickly evaluated, are applicable under a wide range of system and workload parameters, and can be manipulated by a range of mathematical techniques. Unfortunately, analytic performance models of disk arrays are difficult to formulate due to the presence of queueing and fork-join synchronization; a disk array request is broken up into independent disk requests which must all complete to satisfy the original request. In this paper, we develop and validate an analytic performance model for disk arrays. We derive simple equations for approximating their utilization, response time and throughput. We validate the analytic model via simulation, investigate the error introduced by each approximation used in deriving the analytic model, and examine the validity of some of the conclusions that can be drawn from the model.

  • Conference Article
  • 10.1109/pdp.2012.49
On Optimizing the Longest Common Subsequence Problem by Loop Unrolling Along Wavefronts
  • Feb 1, 2012
  • Johann Steinbrecher + 1 more

Loop unrolling is a loop transformation where a few loop iterations are grouped as a super iteration to expose more independent instructions and to decrease the total loop overhead. This paper characterizes loop unrolling by the unrolling factor (the number of iterations in a super iteration) and the unrolling direction (the choice of iterations to be grouped to form the super iteration). We use loop unrolling for maximizing instruction-level parallelism in the longest common subsequence problem. To increase the number of independent instructions in the super iteration, we use a linear schedule to group iterations on the same wavefront, a hyperplane in the loop iteration space. Then, the loop is unrolled along the wavefront, which guarantees that all iterations in the same super iteration are independent. The selection of the optimal unrolling factor is based on the assumption that if all the pipelines are saturated, the performance should not be bad. Two necessary conditions and a sufficient condition for optimality are presented and used to find the optimal unrolling factor. The total execution time is expressed as a function of algorithm parameters, architecture parameters and the unrolling factor. A benchmark of the technique scores a 1.475 speed-up over traditional methods.
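The wavefront idea in this abstract can be sketched with the standard dynamic-programming table for the longest common subsequence. The code below is not the authors' implementation. It only shows the traversal order, assuming the usual recurrence in which cell (i, j) depends on (i-1, j), (i, j-1) and (i-1, j-1). The function name `lcs_length_wavefront` is hypothetical, and Python gains no instruction-level parallelism from this ordering; the point is only that every cell visited by the inner loop is independent of the others on the same wavefront.

```python
def lcs_length_wavefront(a, b):
    """Length of the longest common subsequence, computed wavefront by wavefront."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j]; row 0 and column 0 stay 0.
    table = [[0] * (m + 1) for _ in range(n + 1)]

    # A wavefront is the set of cells with i + j == d. Each such cell reads only
    # cells with i + j equal to d - 1 or d - 2, so cells on one wavefront are
    # mutually independent and could be grouped into one super iteration.
    for d in range(2, n + m + 1):
        i_lo = max(1, d - m)   # keeps j = d - i <= m
        i_hi = min(n, d - 1)   # keeps j = d - i >= 1
        for i in range(i_lo, i_hi + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]
```

In a compiled setting, the inner loop over `i` is the one that would be unrolled by the chosen factor, since none of its iterations waits on another.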

  • Research Article
  • Cite count: 4
  • 10.1007/s10766-012-0203-z
Minimal Unroll Factor for Code Generation of Software Pipelining
  • Jul 17, 2012
  • International Journal of Parallel Programming
  • Mounira Bachir + 4 more

We address the problem of generating compact code from software pipelined loops. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates lifetime intervals spanning multiple loop iterations. These intervals require periodic register allocation (also called variable expansion), which in turn yields a code generation challenge. We are looking for the minimal unrolling factor enabling the periodic register allocation of software pipelined kernels. This challenge is generally addressed through one of: (1) hardware support in the form of rotating register files, which solve the unrolling problem but are expensive in hardware; (2) register renaming by inserting register moves, which increase the number of operations in the loop, and may damage the schedule of the software pipeline and reduce throughput; (3) post-pass loop unrolling that does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers (maximal number of values simultaneously alive) are sufficient for periodic register allocation (Eisenbeis et al. in PACT ’95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264–267, Manchester, UK, 1995; Hendren et al. in CC ’92: Proceedings of the 4th International Conference on Compiler Construction, pages 176–191, London, UK, 1992). However, the best existing heuristic for controlling this code growth, modulo variable expansion (Lam in SIGPLAN Not 23(7):318–328, 1988), may not apply the correct amount of loop unrolling to guarantee that MAXLIVE registers are enough, which may result in register spills (Eisenbeis et al. in PACT ’95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264–267, Manchester, UK, 1995).
This paper presents our research results on the open problem of minimal loop unrolling, allowing a software-only code generation that does not trade the optimality of the initiation interval (II) for the compactness of the generated code. Our novel idea is to use the remaining free registers after periodic register allocation to relax the constraints on register reuse. The problem of minimal loop unrolling arises either before or after software pipelining, either with a single or with multiple register types (classes). We provide a formal problem definition for each scenario, and we propose and study a dedicated algorithm for each problem. Our solutions are implemented within an industrial-strength compiler for a VLIW embedded processor from STMicroelectronics, and validated on multiple benchmark suites.
