Aggressive loop unrolling in a retargetable, optimizing compiler
A well-known code transformation for improving the run-time performance of a program is loop unrolling. The most obvious benefit of unrolling a loop is that the transformed loop usually requires fewer instruction executions than the original loop. The reduction in instruction executions comes from two sources: the number of branch instructions executed is reduced, and the control variable is modified fewer times. In addition, for architectures with features designed to exploit instruction-level parallelism, loop unrolling can expose greater levels of instruction-level parallelism. Loop unrolling is an effective code transformation, often improving the execution performance of programs that spend much of their execution time in loops by 10 to 30 percent. Possibly because of the effectiveness of a simple application of loop unrolling, it has not been studied as extensively as other code improvements such as register allocation or common subexpression elimination. The result is that many compilers employ simplistic loop unrolling algorithms that miss many opportunities for improving run-time performance. This paper describes how aggressive loop unrolling is done in a retargetable optimizing compiler. Using a set of 32 benchmark programs, the effectiveness of this more aggressive approach to loop unrolling is evaluated. The results show that aggressive loop unrolling can yield an additional performance increase of 10 to 20 percent over the simple, naive approaches employed by many production compilers.
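The two sources of savings named in the abstract above (fewer branches and fewer control-variable updates) can be seen in a hand-unrolled loop. The following is a minimal sketch, not the compiler algorithm the paper describes; the function names and the unroll factor of 4 are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Original loop: one compare-and-branch and one index update per element. */
long sum_rolled(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4 (an arbitrary, illustrative factor): one compare-and-branch
 * and one index update per four elements. A remainder loop handles the
 * final n % 4 elements so the result matches sum_rolled for every n. */
long sum_unrolled4(const int *a, size_t n)
{
    long sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++)          /* remainder iterations */
        sum += a[i];
    return sum;
}
```

For an array of 10 elements, the unrolled version runs its main loop twice and its remainder loop twice, instead of ten iterations of the original loop.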
- Conference Article
20
- 10.1109/ecrts.2009.9
- Jul 1, 2009
Program loops are notorious for their optimization potential on modern high-performance architectures. Compilers transform them aggressively to achieve large improvements in program performance. In particular, loop unrolling has proven highly effective over the past decades, achieving significant increases in average-case performance. In this paper, we present loop unrolling that is tailored towards real-time systems. Our novel optimization is driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. To exploit maximal optimization potential, the determination of a suitable unrolling factor is based on precise loop iteration counts provided by a static loop analysis. In addition, our heuristics avoid adverse effects of unrolling which result from instruction cache overflows and the generation of additional spill code. Results on 45 real-life benchmarks demonstrate that aggressive loop unrolling can yield WCET reductions of up to 13.7% over simple, naive approaches employed by many production compilers.
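To make the idea of choosing an unrolling factor from a known iteration count and a code-size limit concrete, here is a minimal sketch. It is not the heuristic from the paper: the function `pick_unroll_factor`, its parameters, and the rule "largest divisor of the iteration count whose unrolled body fits a byte budget" are all assumptions made for illustration, and spill-code effects are not modelled.

```c
#include <assert.h>

/* Hypothetical helper (not from the paper): return the largest factor
 * f <= max_factor such that
 *   - f divides the statically known iteration count, so no remainder
 *     loop is needed, and
 *   - f copies of the loop body fit into cache_budget_bytes.
 * Returns 1 (no unrolling) if no such factor exists. */
unsigned pick_unroll_factor(unsigned iterations, unsigned body_bytes,
                            unsigned cache_budget_bytes, unsigned max_factor)
{
    assert(body_bytes > 0);
    unsigned best = 1;
    for (unsigned f = 2; f <= max_factor && f <= iterations; f++) {
        if (iterations % f != 0)
            continue;                       /* would need a remainder loop */
        if ((unsigned long long)f * body_bytes > cache_budget_bytes)
            break;                          /* larger factors only grow */
        best = f;
    }
    return best;
}
```

With 100 iterations, a 40-byte body, a 512-byte budget, and a maximum factor of 16, the sketch selects 10: the divisors 2, 4, 5, and 10 all fit, and no larger divisor lies within the allowed range.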
- Book Chapter
3
- 10.1007/978-3-642-13374-9_19
- Jan 1, 2010
This paper improves our previous research effort [1] by providing an efficient method for kernel loop unrolling minimisation in the case of already scheduled loops, where circular lifetime intervals are known. When loops are software pipelined, the number of values simultaneously alive becomes exactly known, giving better opportunities for kernel loop unrolling. Furthermore, fixing circular lifetime intervals allows us to reduce the algorithmic complexity of our method compared to [1] by computing a new search space for minimal kernel loop unrolling. The meeting graph (MG) is one of the frameworks proposed in the literature [3] which models loop unrolling and register allocation together in a common formal framework for software pipelined loops. Although MG significantly improves loop register allocation, the computed loop unrolling may lead to impractical code growth. This work proposes to minimise the loop unrolling degree in the meeting graph by adapting the approach described in [1]. We explain how to reduce the search space for minimal kernel loop unrolling in the context of MG, yielding a reduced algorithmic complexity. Furthermore, our experiments on SPEC2000, SPEC2006, MEDIABENCH and FFMPEG show that in concrete cases the loop unrolling minimisation is very fast and the minimal loop unrolling degree for 75% of the optimised loops is equal to 1 (i.e. no unrolling), while it is equal to 7 when the software pipelining (SWP) schedule is not fixed.
- Conference Article
18
- 10.1145/2892208.2892219
- Mar 17, 2016
Register allocation is a much studied problem. A particularly important context for optimizing register allocation is within loops, since a significant fraction of the execution time of programs is often inside loop code. A variety of algorithms have been proposed in the past for register allocation, but the complexity of the problem has resulted in a decoupling of several important aspects, including loop unrolling, register promotion, and instruction reordering. In this paper, we develop an approach to register allocation and promotion in a unified optimization framework that simultaneously considers the impact of loop unrolling and instruction scheduling. This is done via a novel instruction tiling approach where instructions within a loop are represented along one dimension and innermost loop iterations along the other dimension. By exploiting the regularity along the loop dimension, and imposing essential dependence based constraints on intra-tile execution order, the problem of optimizing register pressure is cast in a constraint programming formalism. Experimental results are provided from thousands of innermost loops extracted from the SPEC benchmarks, demonstrating improvements over the current state-of-the-art.
- Conference Article
- 10.1117/12.527107
- Apr 19, 2004
- Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
As wireless video products evolve, they demand more sophisticated processing at higher resolutions and frame rates. Computational performance and energy efficiency have become critical design issues. This paper presents the Quantized Color Pack eXtension (QCPX) combined with a loop unrolling (LU) technique to improve execution performance and energy efficiency of color image and video processing applications. QCPX applied to a 32-bit datapath processor supports parallel operations on two packed 16-bit YCbCr (Y: luminance, Cr and Cb: chrominance) color pixels, providing greater subword-level parallelism by increasing the number of smaller color pixels packed into a word. Instruction-level parallelism can be further enhanced through loop unrolling. These techniques provide greater performance and efficiency for multimedia workloads on mobile systems. Experimental results on a set of media benchmark applications indicate that the LU plus QCPX-optimized version achieves a speedup ranging from 3.8 to 7.9 while reducing energy consumption by 76% to 87% relative to the baseline version on identically configured, dynamically scheduled ILP superscalar processors. The LU plus QCPX-optimized version also outperforms the LU plus MDMX-like (MIPS’s multimedia extension) version.
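Subword-level parallelism of the kind mentioned above, two 16-bit values processed in one 32-bit word, can be emulated in portable C. The sketch below is not QCPX, which is a hardware instruction-set extension; it only illustrates the general idea with a standard bit-masking technique, and the function name `add16x2` is an assumption.

```c
#include <stdint.h>

/* Add two pairs of 16-bit lanes packed into 32-bit words, with each lane
 * wrapping independently (no carry crosses from the low lane into the
 * high lane). The top bit of each lane is masked off before the add and
 * then restored with XOR, which is what blocks the cross-lane carry. */
uint32_t add16x2(uint32_t x, uint32_t y)
{
    const uint32_t top_bits = 0x80008000u;      /* bit 15 of each lane */
    uint32_t partial = (x & ~top_bits) + (y & ~top_bits);
    return partial ^ ((x ^ y) & top_bits);
}
```

A single call performs two 16-bit additions; a processor with packed-arithmetic instructions does the same in one operation without the masking overhead.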
- Conference Article
27
- 10.5555/225160.225184
- Dec 1, 1995
Exploitation of instruction-level parallelism is an effective mechanism for improving the performance of modern super-scalar/VLIW processors. Various software techniques can be applied to increase instruction-level parallelism. This paper describes and evaluates a software technique, dynamic memory disambiguation, that permits loops containing loads and stores to be scheduled more aggressively, thereby exposing more instruction-level parallelism. The results of our evaluation show that when dynamic memory disambiguation is applied in conjunction with loop unrolling, register renaming, and static memory disambiguation, the ILP of memory-intensive benchmarks can be increased by as much as 300 percent over loops where dynamic memory disambiguation is not performed. Our measurements also indicate that for the programs that benefit the most from these optimizations, the register usage does not exceed the number of registers on most high-performance processors.
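The general idea of resolving memory dependences at run time can be sketched at the source level: test once whether two address ranges overlap and, if they do not, run a loop version in which loads are moved above earlier stores. This is an illustrative sketch only, not the compiler technique evaluated in the paper; the names `ranges_disjoint` and `increment_copy` are assumptions, and comparing pointers through `uintptr_t` relies on the flat address model of common platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Run-time check: nonzero if [dst, dst+n) and [src, src+n) do not overlap. */
static int ranges_disjoint(const int *dst, const int *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t bytes = (uintptr_t)(n * sizeof(int));
    return d + bytes <= s || s + bytes <= d;
}

/* dst[i] = src[i] + 1, with the sequential semantics of a simple loop. */
void increment_copy(int *dst, const int *src, size_t n)
{
    size_t i = 0;
    if (ranges_disjoint(dst, src, n)) {
        /* Aggressive version: unrolled by 2, both loads issued before
         * either store. Safe only because no store can feed a load. */
        for (; i + 2 <= n; i += 2) {
            int x0 = src[i];
            int x1 = src[i + 1];
            dst[i] = x0 + 1;
            dst[i + 1] = x1 + 1;
        }
    }
    /* Conservative version: strict load/store order. Runs the whole loop
     * when the ranges may overlap, and the remainder otherwise. */
    for (; i < n; i++)
        dst[i] = src[i] + 1;
}
```

When the ranges overlap, for example `increment_copy(buf + 1, buf, 4)`, the check fails and the conservative loop preserves the store-to-load ordering of the original code.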
- Conference Article
52
- 10.1109/micro.1995.476820
- Nov 1, 1995
Exploitation of instruction-level parallelism is an effective mechanism for improving the performance of modern super-scalar/VLIW processors. Various software techniques can be applied to increase instruction-level parallelism. This paper describes and evaluates a software technique, dynamic memory disambiguation, that permits loops containing loads and stores to be scheduled more aggressively, thereby exposing more instruction-level parallelism. The results of our evaluation show that when dynamic memory disambiguation is applied in conjunction with loop unrolling, register renaming, and static memory disambiguation, the ILP of memory-intensive benchmarks can be increased by as much as 300 percent over loops where dynamic memory disambiguation is not performed. Our measurements also indicate that for the programs that benefit the most from these optimizations, the register usage does not exceed the number of registers on most high-performance processors.
- Conference Article
19
- 10.1145/3377555.3377890
- Feb 22, 2020
Loop unrolling is a widely adopted loop transformation, commonly used for enabling subsequent optimizations. Straight-line-code vectorization (SLP) is an optimization that benefits from unrolling. SLP converts isomorphic instruction sequences into vector code. Since unrolling generates repeated isomorphic instruction sequences, it enables SLP to vectorize more code. However, most production compilers apply these optimizations independently and without coordination. Unrolling is commonly tuned to avoid code bloat rather than to maximize the potential for vectorization, leading to missed vectorization opportunities. We propose VALU, a novel loop unrolling heuristic that takes vectorization into account when making unrolling decisions. Our heuristic is powered by an analysis that estimates the potential benefit of SLP vectorization for the unrolled version of the loop. Our heuristic then selects the unrolling factor that maximizes the utilization of the vector units. VALU also forwards the vectorizable code to SLP, allowing it to bypass its greedy search for vectorizable seed instructions, exposing more vectorization opportunities. Our evaluation on a production compiler shows that VALU uncovers many vectorization opportunities that were missed by the default loop unroller and vectorizers. This results in more vectorized code and significant performance speedups for 17 of the kernels of the TSVC benchmark suite, reaching up to 2× speedup over the already highly optimized -O3. Our evaluation on full benchmarks from FreeBench and MiBench shows that VALU results in a geo-mean speedup of 1.06×.
- Research Article
97
- 10.1145/1027084.1027087
- Oct 1, 2004
- ACM Transactions on Design Automation of Electronic Systems
We present a high-level synthesis methodology that applies a coordinated set of coarse-grain and fine-grain parallelizing transformations. The transformations are applied both during a pre-synthesis phase and during scheduling, with the objective of optimizing the results of synthesis and reducing the impact of control flow constructs on the quality of results. We first apply a set of source level presynthesis transformations that include common sub-expression elimination (CSE), copy propagation, dead code elimination and loop-invariant code motion, along with more coarse-level code restructuring transformations such as loop unrolling. We then explore scheduling techniques that use a set of aggressive speculative code motions to maximally parallelize the design by re-ordering, speculating and sometimes even duplicating operations in the design. In particular, we present a new technique called "Dynamic CSE" that dynamically coordinates CSE and code motions such as speculation and conditional speculation during scheduling. We implemented our parallelizing high-level synthesis in the SPARK framework. This framework takes a behavioral description in ANSI-C as input and generates synthesizable register-transfer level VHDL. Our results from computationally expensive portions of three moderately complex design targets, namely, MPEG-1, MPEG-2 and the GIMP image processing tool, validate the utility of our approach to the behavioral synthesis of designs with complex control flows.
- Conference Article
3
- 10.1109/icpp.1996.538560
- Aug 12, 1996
We propose a scheme to estimate the exact minimum parallel execution time of a single loop with loop-carried dependences in medium and fine grain parallel execution. The minimum parallel execution time of a loop is given by the critical path length of the dependence graph which represents the code obtained from the fully unrolled loop. However, unrolling loops with a large number of iterations requires too much computation time and storage space to be practical. The proposed scheme provides the minimum parallel execution time without unrolling the loop at all, by reducing the problem to an integer linear programming problem and employing the simplex method and a branch-and-bound algorithm to solve it. We also show an experimental implementation of the proposed scheme with the Livermore Benchmark Kernels to demonstrate that the computational complexity of our scheme is independent of the number of iterations of the given loop.
- Research Article
- 10.1166/jolpe.2015.1361
- Mar 1, 2015
- Journal of Low Power Electronics
The present work introduces a compilation technique to reduce runtime leakage power of functional units of a processor by combining loop unrolling with power gating. The instructions in the unrolled loop are scheduled to provide opportunities for power gating the functional units which are not used for a considerable amount of time. An algorithm that saves maximum leakage energy without performance loss due to execution of power gating instructions has been introduced. The algorithm unrolls the loop, schedules the instructions, and finally inserts power gating instructions. The present work is explained using two illustrative examples, one without loop-carried dependence and the other with loop-carried dependence. It is observed that the number of clock cycles taken by the power gating instructions is less than or equal to the number of clock cycles saved by loop unrolling. This results in a 23–64% reduction of the total energy consumed by the benchmark programs without any degradation of performance.
- Single Report
2
- 10.21236/ada326916
- Jun 1, 1997
Many of today's high-performance computer processors are super-scalar. They can dispatch multiple instructions per cycle and, hence, provide what is commonly referred to as instruction-level parallelism. This super-scalar capability, combined with software pipelining, can increase processor throughput dramatically. Achieving maximum throughput, however, is nontrivial. Compilers must engage in aggressive optimization techniques, such as loop unrolling, speculative code motion, etc., to structure code to take full advantage of the underlying computer architecture. The phase-ordering implications of these optimizations are not well understood and are the subject of continuing research. Of particular interest are optimizations that enhance instruction-level parallelism. Two of these are loop unrolling and loop fusion. These are source-level optimizations that can be performed by either the programmer or the compiler. These optimizations have dramatic effects on the compiler's instruction scheduler. Performed too aggressively, these optimizations can increase register pressure and result in costly memory references. This paper details experiments performed to measure the effects of these source-level code transformations and how they influenced register pressure and code performance.
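Loop fusion, the second source-level transformation named above, merges adjacent loops over the same range into one. The sketch below is a generic illustration, not code from the report; the function names are assumptions.

```c
#include <stddef.h>

/* Two separate loops: each element is loaded and stored twice, and the
 * loop overhead (compare, branch, index update) is paid twice. */
void scale_offset_two_loops(float *a, size_t n, float scale, float offset)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= scale;
    for (size_t i = 0; i < n; i++)
        a[i] += offset;
}

/* Fused loop: each element is loaded and stored once. The intermediate
 * product stays in a register, which is also why fusing many loops can
 * raise register pressure. */
void scale_offset_fused(float *a, size_t n, float scale, float offset)
{
    for (size_t i = 0; i < n; i++) {
        float t = a[i] * scale;
        a[i] = t + offset;
    }
}
```

Fusion is legal here because iteration i of the second loop depends only on iteration i of the first; with larger fused bodies, more temporaries like `t` are live at once, and a compiler may have to spill some of them to memory.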
- Research Article
- 10.3390/electronics13081425
- Apr 10, 2024
- Electronics
Loop unrolling can provide more instruction-level parallelism opportunities for code and enables a greater range of instruction pipeline scheduling. In high-performance very-long-instruction-word (VLIW) digital signal processors (DSPs), there are special registers to address. To further improve the instruction-level parallelism of code for such DSPs by making full use of these registers, in this paper, we propose a more effective loop unrolling approach through extending memory accessing (LUAEMA). In this approach, the final unrolling factor is computed by a model in which every register kind and every memory accessing operation are considered. For basic digital signal processing algorithms, the unrolling factor under the LUAEMA is larger than that under the conventional loop unrolling approach. We also provide the opportunity to reduce the number of instructions in a loop during the code transformation of loop unrolling. The experimental results show that the loop unrolling approach proposed in this paper can achieve an average speedup ratio ranging from 1.14 to 1.81 compared with the conventional loop unrolling approach. For some algorithms, the peak speedup ratio is up to 2.11.
- Book Chapter
- 10.1007/3-540-48311-x_175
- Jan 1, 1999
Research in Instruction-Level Parallelism (ILP) is concerned with architectural innovations in the processor to expose parallelism between the execution of instructions. Of course, the relationship with research on the memory hierarchy and on compiler optimisation techniques is very strong. Another point is that such research needs tools to simulate the mechanisms. Thus, researchers have to develop their own tools. Such a tool is detailed in the paper on code cloning tracing by Lafage et al from IRISA. Most of these topics are represented in this workshop, although there are no papers on the lower levels of the memory hierarchy. The memory hierarchy, and particularly the first-level cache, is highly related to ILP research since superscalar processors place higher demands on it for obtaining more instructions and more data per cycle. In addition to the requirement of higher bandwidths, latency is also an important issue. One way to reduce the latency is prefetching, as proposed by Chi and Yuan. Another issue is the way the cache is managed. Software can provide hints for better management, which might result in a good speedup, as in the paper of Lebeck et al. A long-standing question is what should be in the hardware and what should be left to the compiler. A return to simpler processors, with part of the job left to the compiler, may happen in the future. Thus, we should pay attention to compiler studies. Moreover, compiler studies might have an impact on the architecture. The papers by Norris and Fenwick and by Genius and Lelait concern compiler optimisations. The first one deals with register allocation, which can have a great impact on the reordering of instructions, while the second one applies register allocation techniques to data in memory in order to improve the use of the cache. The VLIW architecture highly depends on the quality of the compiler. The paper of Ebcioglu et al gives encouraging results for such an approach. Increasing the ILP is, ultimately, limited by data dependencies.
- Research Article
- 10.1002/(sici)1520-684x(199808)29:9<62::aid-scj7>3.0.co;2-h
- Aug 1, 1998
- Systems and Computers in Japan
A considerable part of program execution time is consumed by loops, so that loop optimization is highly effective especially for the innermost loops of a program. Software pipelining and loop unrolling are known methods for loop optimization. Software pipelining is advantageous in that the code becomes only slightly longer. This method, however, is difficult to apply if the loop includes branching when the parallelism is limited. On the other hand, loop unrolling, while being free of such limitations, suffers from a number of drawbacks. In particular the code size grows substantially and it is difficult to determine the optimal number of body replications. In order to solve these problems, it seems important to combine software pipelining with loop unrolling so as to utilize the advantages of both techniques while paying due regard to properties of programs under consideration and to the machine resources available. This paper describes a method for applying optimal loop unrolling and effective software pipelining to achieve this goal. Program characteristics obtained by means of an extended PDG (program dependence graph) are taken into consideration as well as machine resources. © 1998 Scripta Technica, Syst Comp Jpn, 29(9): 62–73, 1998
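Software pipelining is normally applied by the compiler at the instruction level, but its shape can be shown by hand: the load for the next iteration is overlapped with the computation of the current one, with a prologue to start the pipeline and an epilogue to drain it. The sketch below is a generic two-stage illustration, not the method of the paper; the function name `scale_pipelined` is an assumption.

```c
#include <stddef.h>

/* dst[i] = src[i] * k, written as a two-stage software pipeline.
 * Stage 1 (load) of iteration i+1 overlaps stage 2 (multiply and store)
 * of iteration i. Assumes dst and src are either the same array or do
 * not overlap. */
void scale_pipelined(int *dst, const int *src, size_t n, int k)
{
    if (n == 0)
        return;
    int loaded = src[0];                /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {
        int next = src[i + 1];          /* stage 1 of iteration i + 1 */
        dst[i] = loaded * k;            /* stage 2 of iteration i */
        loaded = next;
    }
    dst[n - 1] = loaded * k;            /* epilogue: drain the pipeline */
}
```

The kernel loop body is the same size as the original body, which reflects the abstract's point that software pipelining lengthens the code only slightly (here, by the prologue and epilogue), whereas unrolling replicates the whole body.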
- Research Article
- 10.36948/ijfmr.2021.v03i04.37540
- Jul 8, 2021
- International Journal For Multidisciplinary Research
Modern computational workloads demand exceptional performance and efficiency, necessitating the effective utilization of advanced CPU features such as SIMD (Single Instruction Multiple Data), instruction-level parallelism (ILP), and branch prediction. This paper explores optimization techniques that address inefficiencies at the algorithmic, architectural, and system levels, enabling software to align with hardware capabilities. Key techniques include resolving data dependencies, enhancing memory locality, utilizing compiler intrinsics, applying tail call optimizations, and employing strategies like loop unrolling, blocking, vectorization, and function inlining. Tail call optimization and breaking dependency chains are analyzed to improve parallelism and reduce processing overhead. Both manual and compiler-driven approaches are evaluated, providing insights into their trade-offs and synergies. Experimental results from benchmarks, such as matrix multiplication and particle simulations, demonstrate significant gains, with up to a 3x increase in instructions per cycle (IPC) and a 40% reduction in execution time. These findings highlight the critical role of optimizing software for architectural features like cache hierarchies, pipelining, and vector widths. This study provides techniques to maximize CPU efficiency, bridging the gap between hardware potential and software performance. Future directions include extending these methodologies to hybrid architectures like GPUs and integrating machine learning models for dynamic runtime optimization.
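Breaking a dependency chain, one of the techniques listed above, is often combined with unrolling: a reduction is split across several independent accumulators so that consecutive additions no longer wait on each other. The sketch below is a generic illustration rather than code from the paper; the function names and the choice of two accumulators are assumptions.

```c
#include <stddef.h>

/* Single accumulator: every addition depends on the previous one, so the
 * additions form one serial chain of length n. */
double sum_one_chain(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 2 with two accumulators: the two chains are independent,
 * so a superscalar processor can overlap their additions. */
double sum_two_chains(const double *a, size_t n)
{
    double sum0 = 0.0, sum1 = 0.0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        sum0 += a[i];
        sum1 += a[i + 1];
    }
    if (i < n)                  /* odd element count: one value is left */
        sum0 += a[i];
    return sum0 + sum1;
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits for general inputs; this is why compilers generally do not perform this reassociation on floating-point code unless explicitly permitted, and why it is often done by hand.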