
What can we gain by unfolding loops?

Abstract

Loops in programs are the source of many optimizations for improving program performance, particularly on modern high-performance architectures as well as vector and multithreaded systems. Techniques such as loop invariant code motion, loop unrolling and loop peeling have demonstrated their utility in compiler optimizations. However, many of these techniques can only be used in very limited cases, when the loops are "well-structured" and easy to analyze. For instance, loop invariant code motion works only when invariant code is inside loops; loop unrolling and loop peeling work effectively when the array references are either constants or affine functions of the index variable. It is our contention that many opportunities are overlooked by limiting the optimizations to "well-structured" loops. In many cases, even "badly-structured" loops may be transformed into "well-structured" loops. As a case in point, we show how some loop-dependent code can be transformed into loop-independent code by transforming the loops. The technique described in this paper relies on unfolding the loop for several initial iterations so that more opportunities are exposed for existing compiler optimization techniques such as loop invariant code motion, loop peeling and loop unrolling.
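A minimal Python sketch may help make the idea concrete. It is an illustration only, not the authors' algorithm or notation: the function names `expensive`, `original` and `unfolded`, and the update pattern `x = y; y = c`, are assumptions chosen for the example. In the original loop, `expensive(x)` looks loop-dependent because `x` changes inside the body, but after two iterations `x` always equals `c`. Unfolding those two iterations leaves a residual loop whose computation is invariant and can be hoisted.

```python
def expensive(v):
    # Stand-in for a costly, side-effect-free computation (hypothetical).
    return v * v + 1


def original(x, y, c, n):
    """Loop in which expensive(x) appears to depend on the iteration."""
    out = []
    for _ in range(n):
        out.append(expensive(x))
        x = y  # x takes the previous value of y
        y = c  # y becomes the constant c from the first iteration on
    return out


def unfolded(x, y, c, n):
    """Same result, with the first two iterations unfolded.

    Iteration 0 sees the initial x, iteration 1 sees the initial y,
    and every later iteration sees c, so expensive(c) is invariant
    in the residual loop and is computed only once.
    """
    out = []
    if n >= 1:
        out.append(expensive(x))  # unfolded iteration 0
    if n >= 2:
        out.append(expensive(y))  # unfolded iteration 1
    if n > 2:
        hoisted = expensive(c)    # invariant code moved out of the loop
        for _ in range(2, n):
            out.append(hoisted)
    return out
```

Both functions return the same list for any `n >= 0`; the unfolded version calls `expensive` at most three times regardless of `n`.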

Similar Papers
  • Book Chapter
  • Cite count: 2
  • 10.1007/978-3-540-39920-9_9
An Unfolding-Based Loop Optimization Technique
  • Jan 1, 2003
  • Litong Song + 2 more

Loops in programs are the source of many optimizations for improving program performance, particularly on modern high-performance architectures as well as vector and multithreaded systems. Techniques such as loop invariant code motion, loop unrolling and loop peeling have demonstrated their utility in compiler optimizations. However, many of these techniques can only be used in very limited cases when the loops are "well-structured" and easy to analyze. For instance, loop invariant code motion works only when invariant code is inside loops; loop unrolling and loop peeling work effectively when the array references are either constants or affine functions of index variable. It is our contention that there are many opportunities overlooked by limiting the optimizations to well structured loops. In many cases, even "badly-structured" loops may be transformed into well structured loops. As a case in point, we show how some loop-dependent code can be transformed into loop-invariant code by transforming the loops. Our technique described in this paper relies on unfolding the loop for several initial iterations such that more opportunities may be exposed for many other existing compiler optimization techniques such as loop invariant code motion, loop peeling, loop unrolling, and so on.

Keywords (machine-generated, not supplied by the authors): Affine Function, Control Dependence, Compiler Optimization, Instruction Level Parallelism, Dependence Edge

  • Book Chapter
  • Cite count: 1
  • 10.1007/11802839_55
High-Level Synthesis Using SPARK and Systolic Array
  • Jan 1, 2006
  • Jae-Jin Lee + 1 more

Recently, the SPARK parallelizing high-level synthesis software tool has been developed. It takes behavioral ANSI-C code as input, schedules it using speculative code motions and loop transformations, generates a finite state machine for the scheduled design graph, and finally outputs synthesizable RTL VHDL code. To handle loops, SPARK employs various loop transformations such as loop invariant code motion, loop unrolling, loop index variable elimination and loop shifting. In loop synthesis, however, SPARK does not produce circuit descriptions whose quality can compete with manual designs. With the objective of improving the quality of high-level synthesis results for designs with loops, this paper shows an upgrade of SPARK that transforms nested loops into a 2-D systolic array to increase parallelism. The C-to-VHDL loop synthesis in this paper achieves synthesis results that are better than those achieved by the current version of SPARK for matrix-matrix multiplication and an FIR filter, and it can be incorporated into the SPARK parallelizing high-level synthesis framework.

Keywords (machine-generated, not supplied by the authors): Nest Loop, Systolic Array, Total Execution Time, Hardware Complexity, Synthesis Result

  • Conference Article
  • Cite count: 7
  • 10.1145/1534530.1534550
The effect of unrolling and inlining for Python bytecode optimizations
  • May 4, 2009
  • Yosi Ben Asher + 1 more

In this study, we consider bytecode optimizations for Python, a programming language which combines object-oriented concepts with features of scripting languages, such as dynamic dictionaries. Due to its design nature, Python is relatively slow compared to other languages. It operates through compiling the code into powerful bytecode instructions that are executed by an interpreter. Python's speed is limited due to its interpreter design, and thus there is a significant need to optimize the language. In this paper, we discuss one possible approach and limitations in optimizing Python based on bytecode transformations. In the first stage of the proposed optimizer, the bytecode is expanded using function inline and loop unrolling. The second stage of transformations simplifies the bytecode by applying a complete set of data-flow optimizations, including constant propagation, algebraic simplifications, dead code elimination, copy propagation, common sub expressions elimination, loop invariant code motion and strength reduction. While these optimizations are known and their implementation mechanism (data flow analysis) is well developed, they have not been successfully implemented in Python due to its dynamic features which prevent their use. In this work we attempt to understand the dynamic features of Python and how these features affect and limit the implementation of these optimizations. In particular, we consider the significant effects of first unrolling and then inlining on the ability to apply the remaining optimizations. The results of our experiments indicate that these optimizations can indeed be implemented and dramatically improve execution times.

  • Book Chapter
  • Cite count: 1
  • 10.1007/11688839_11
Loop Transformations in the Ahead-of-Time Optimization of Java Bytecode
  • Jan 1, 2006
  • Simon Hammond + 1 more

Loop optimizations such as loop unrolling, unfolding and invariant code motion have long been used in a wide variety of compilers to improve the running time of applications. In this paper we present a series of experimental results detailing the effect these techniques have on the running time of Java applications following ahead of time optimization. We also detail the optimization tools and transformations developed for this paper which extend the SOOT framework discussed in a number of previous papers on the subject. Our experimentation, conducted on the SciMark 2.0 benchmarking suite, demonstrates that when optimized using the techniques mentioned, Java applications can benefit from performance improvements of up to 20%. We finish with a discussion of the results obtained, including results on how the optimizations affect JIT compilation and class size and proceed to argue that ahead-of-time loop unrolling and unfolding optimization may have a role to play in improving the performance of Java applications, particularly in scientific applications.

  • Research Article
  • Cite count: 46
  • 10.1109/tce.2017.015072
DSP design protection in CE through algorithmic transformation based structural obfuscation
  • Nov 1, 2017
  • IEEE Transactions on Consumer Electronics
  • Anirban Sengupta + 3 more

Structural obfuscation offers a means to effectively secure the contents of intellectual property (IP) cores used in an electronic system-on-chip (SoC). This work proposes a novel structural obfuscation methodology for protecting a digital signal processor (DSP) IP core at the architectural synthesis design stage. The proposed approach specifically targets protection of IP cores that involve complex loops. Five different algorithmic level transformation techniques are employed: loop unrolling, loop invariant code motion, tree height reduction/increment, logic transformation and redundant operation removal. Each of these can yield camouflaged, functionally equivalent designs. In addition, a low-cost obfuscated design is generated by the proposed approach through the use of multi-stage algorithmic transformation and particle swarm optimization (PSO)-driven design space exploration (DSE). The proposed approach yielded an enhancement in obfuscation of 22% and a reduction in obfuscated design cost of 55% compared to similar prior art.

  • Conference Article
  • Cite count: 6
  • 10.1109/iccd.1988.25700
CTP-A family of optimizing compilers for the NS32532 microprocessor
  • Oct 3, 1988
  • C Bendelac + 1 more

Techniques for generating highly optimized code for a pipelined microprocessor, the NS32532, and its fast floating point slave processor, the NS32580, are described in the context of the CTP family of optimizing compilers. All CTP compilers are constructed from three separate parts: a language-dependent compiler front-end, a shared global optimizer, and a shared code generator. In addition to most classical transformations, such as value propagation, redundant and dead code elimination, loop invariant code motion, global strength reduction and register allocation, the CTP compilers also perform less common optimizations, such as loop unrolling, basic block reorganization, code reordering, and profile feedback utilization. The relative influence of the different optimizations on the performance of the NS32532 using several standard benchmark programs is presented.

  • Book Chapter
  • 10.1201/9781003127598-1-1
Securing Dedicated DSP Co-processors (Hardware IP) using Structural Obfuscation for IoT-oriented Platforms
  • Jul 14, 2021
  • Anirban Sengupta + 1 more

The Internet of Things (IoT) has become an integral part of modern life. IoT-oriented platforms are comprised of digital signal processing (DSP) coprocessors suitable for low power, high performance applications, compared to traditional counterparts such as microprocessors. However, DSP coprocessors are not entirely designed in-house due to the global design supply chain, resulting in security threats at the hardware level. Some of the prominent hardware security threats for such devices used in IoT-oriented platforms could be backdoor Trojan insertion, reverse engineering, etc. This chapter discusses some of the standard structural obfuscation approaches used for securing dedicated DSP coprocessors, as well as the structural obfuscation approaches that make the DSP hardware unobvious (and uninterpretable) from an attacker’s perspective. More explicitly, state of the art structural obfuscation approaches such as compiler-driven transformation techniques, hybrid transformation techniques, hologram based obfuscation techniques and key-based structural obfuscation techniques are discussed. Adopting a distinct and integrated approach, it aims to elaborate on the transformation processes for structural obfuscation, such as logic transformation, tree height transformation, partitioning, loop unrolling, loop invariant code motion, folding knob, redundant operation elimination, and so on. Demonstrations use DSP applications such as finite impulse response filter, discrete cosine transformation and other digital filters. Also presented is a comparative analysis of the structural obfuscation approaches used for DSP applications.

  • Research Article
  • 10.52783/cana.v32.2860
The Opticode: A User-Centric Tool for Enhancing Software Efficiency and Minimizing Errors Through Dead Code Elimination and Loop Invariant Code Motion Techniques
  • Dec 18, 2024
  • Communications on Applied Nonlinear Analysis
  • Tulshihar Patil

Introduction: This article introduces OptiCode, a complex software tool that uses loop invariant code motion and dead code elimination, among other advanced code optimization techniques, to improve code efficiency and decrease compile time. Using Loop Invariant Code Motion (LICM) and Abstract Syntax Trees (ASTs) for precise code analysis, OptiCode efficiently detects and eliminates redundant code, as well as optimizes loop structures by removing 4.87% of dead code with an efficiency of 5.38. OptiCode outperforms other apps in comparison, as seen by the considerable compile time savings and excellent efficiency ratings that it achieves. Objectives: To remove the unused code and elements affecting the efficacy of the code. Methods: Source code is passed as input, then the lexer performs the tokenization. Tokenized words are processed by the parser to assess the syntax of the code. A customized Abstract Syntax Tree removes the dead code, and the data is passed to a customized Loop Invariant Code Motion pass, which optimizes the looping structure in the code. Finally, the optimized code is generated. Results: OptiCode outperformed Taskapp (3.89), Agilla (3.67), and Rfmtoleds (3.45) with its greatest efficiency rating of 5.38 on a 10-point scale in our comparison research. The 731 lines of code in the OptiCode codebase include 150 lines of dead code and 57 variables that aren't used. Conclusions: The code is optimized to save space and time. Performance increases as the number of lines increases.

  • Research Article
  • Cite count: 97
  • 10.1145/1027084.1027087
Coordinated parallelizing compiler optimizations and high-level synthesis
  • Oct 1, 2004
  • ACM Transactions on Design Automation of Electronic Systems
  • Sumit Gupta + 3 more

We present a high-level synthesis methodology that applies a coordinated set of coarse-grain and fine-grain parallelizing transformations. The transformations are applied both during a pre-synthesis phase and during scheduling, with the objective of optimizing the results of synthesis and reducing the impact of control flow constructs on the quality of results. We first apply a set of source level presynthesis transformations that include common sub-expression elimination (CSE), copy propagation, dead code elimination and loop-invariant code motion, along with more coarse-level code restructuring transformations such as loop unrolling. We then explore scheduling techniques that use a set of aggressive speculative code motions to maximally parallelize the design by re-ordering, speculating and sometimes even duplicating operations in the design. In particular, we present a new technique called "Dynamic CSE" that dynamically coordinates CSE and code motions such as speculation and conditional speculation during scheduling. We implemented our parallelizing high-level synthesis in the SPARK framework. This framework takes a behavioral description in ANSI-C as input and generates synthesizable register-transfer level VHDL. Our results from computationally expensive portions of three moderately complex design targets, namely, MPEG-1, MPEG-2 and the GIMP image processing tool, validate the utility of our approach to the behavioral synthesis of designs with complex control flows.

  • Conference Article
  • Cite count: 20
  • 10.1109/ecrts.2009.9
Combining Worst-Case Timing Models, Loop Unrolling, and Static Loop Analysis for WCET Minimization
  • Jul 1, 2009
  • Paul Lokuciejewski + 1 more

Program loops are notorious for their optimization potential on modern high-performance architectures. Compilers aim at their aggressive transformation to achieve large improvements of the program performance. In particular, loop unrolling has been shown over the past decades to be highly effective, achieving significant increases of the average-case performance. In this paper, we present loop unrolling that is tailored towards real-time systems. Our novel optimization is driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. To exploit maximal optimization potential, the determination of a suitable unrolling factor is based on precise loop iteration counts provided by a static loop analysis. In addition, our heuristics avoid adverse effects of unrolling which result from instruction cache overflows and the generation of additional spill code. Results on 45 real-life benchmarks demonstrate that aggressive loop unrolling can yield WCET reductions of up to 13.7% over simple, naive approaches employed by many production compilers.
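For readers unfamiliar with the transformation itself, the sketch below shows plain loop unrolling by a fixed factor of 4, written out by hand in Python. It only illustrates what an unrolling factor means; it does not model WCET, cache effects or spill code, and the function names `total` and `total_unrolled4` are assumptions made for this example. When the iteration count is known, the main loop covers the largest multiple of the factor and a short remainder loop handles what is left.

```python
def total(values):
    """Reference loop: one addition per iteration."""
    acc = 0
    for v in values:
        acc += v
    return acc


def total_unrolled4(values):
    """The same loop unrolled by a factor of 4 (hypothetical factor).

    The main loop performs four additions per iteration, so the loop
    test and index update run a quarter as often. The remainder loop
    handles the last len(values) % 4 elements.
    """
    acc = 0
    n = len(values)
    main_end = n - n % 4          # largest multiple of 4 not exceeding n
    for i in range(0, main_end, 4):
        acc += values[i]
        acc += values[i + 1]
        acc += values[i + 2]
        acc += values[i + 3]
    for i in range(main_end, n):  # remainder iterations
        acc += values[i]
    return acc
```

In a compiler the benefit comes from fewer branches and more scheduling freedom in the larger body; in interpreted Python the sketch is only meant to show the shape of the transformed loop.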

  • Conference Article
  • Cite count: 7
  • 10.1109/icapp.2002.1173607
A technique for variable dependence driven loop peeling
  • Oct 23, 2002
  • Litong Song + 1 more

Loops in programs are the source of many optimizations leading to performance improvements, particularly on modern high-performance architectures as well as vector and multithreaded systems. Among the optimization techniques, loop peeling is an important technique that can be used to parallelize computations. The technique relies on moving computations in early iterations out of the loop body such that the remaining iterations can be executed in parallel. A key issue in applying loop peeling is the number of iterations that must be peeled off from the loop body. Current techniques use heuristics or ad hoc techniques to peel a fixed number of iterations or a speculated number of iterations. To our knowledge, there is no formal or systematic technique that compilers can use to determine the number of iterations that must be peeled off based on the program characteristics. In this paper we introduce one technique that uses variable dependence analysis for identifying the number of iterations to be peeled off. Our goal is to find general techniques that can accurately determine the ideal number of iterations for loop peeling, while working within the context of other loop optimizations including code motion.
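A small Python sketch of loop peeling follows. It is a textbook-style illustration under assumed names (`original`, `peeled`), not the dependence analysis proposed in the paper. In the original loop every iteration reads `a[0]`, but iteration 0 also writes it, so the later iterations depend on the first one. Peeling exactly one iteration removes that dependence: afterwards `a[0]` is no longer written, and the remaining iterations are independent of each other.

```python
def original(a):
    """Every iteration reads a[0]; iteration 0 also writes it."""
    a = list(a)  # work on a copy so the caller's list is unchanged
    for i in range(len(a)):
        a[i] = a[i] + a[0]
    return a


def peeled(a):
    """Same result with the first iteration peeled off.

    After the peeled iteration, a[0] is fixed, so the remaining
    iterations carry no dependence between them and could run in
    parallel (here they are simply written as one comprehension).
    """
    a = list(a)
    if not a:
        return a
    a[0] = a[0] + a[0]                  # peeled iteration 0
    first = a[0]                        # invariant from here on
    a[1:] = [v + first for v in a[1:]]  # independent iterations
    return a
```

Here the number of iterations to peel is obvious by inspection; the abstract's point is that a compiler needs a systematic way to derive that number.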

  • Conference Article
  • Cite count: 1
  • 10.1145/1289881.1289912
Facilitating compiler optimizations through the dynamic mapping of alternate register structures
  • Sep 30, 2007
  • Chris Zimmer + 4 more

Aggressive compiler optimizations such as software pipelining and loop invariant code motion can significantly improve application performance, but these transformations often require the use of several additional registers to hold data values across one or more loop iterations. Compilers that target embedded systems may often have difficulty exploiting these optimizations since many embedded systems typically do not have as many general purpose registers available. Alternate register structures like register queues can be used to facilitate the application of these optimizations due to common reference patterns. In this paper, we propose a microarchitectural technique that permits these alternate register structures to be efficiently mapped into a given processor architecture and automatically exploited by an optimizing compiler. We show that this minimally invasive technique can be used to facilitate the application of software pipelining and loop invariant code motion for a variety of embedded benchmarks. This leads to performance improvements for the embedded processor, as well as new opportunities for further aggressive optimization of embedded systems software due to a significant decrease in the register pressure of tight loops.

  • Research Article
  • 10.1145/2490301.2451136
DeAliaser
  • Mar 16, 2013
  • ACM SIGARCH Computer Architecture News
  • Wonsun Ahn + 2 more

Alias analysis is a critical component in many compiler optimizations. A promising approach to reduce the complexity of alias analysis is to use speculation. The approach consists of performing optimizations assuming the alias relationships that are true most of the time, and repairing the code when such relationships are found not to hold through runtime checks. This paper proposes a general alias speculation scheme that leverages upcoming hardware support for transactions with the help of some ISA extensions. The ability of transactions to checkpoint and roll back frees the compiler to pursue aggressive optimizations without having to worry about recovery code. Also, exposing the memory conflict detection hardware in transactions to software allows runtime checking of aliases with little or no overhead. We test the potential of the novel alias speculation approach with Loop Invariant Code Motion (LICM), Global Value Numbering (GVN), and Partial Redundancy Elimination (PRE) optimization passes. On average, they are shown to reduce program execution time by 9% in SPEC FP2006 applications and 3% in SPEC INT2006 applications over the alias analysis of a state-of-the-art compiler.
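The general shape of alias speculation can be sketched in software as follows. This is only an analogy written in Python with an explicit identity check; it does not use hardware transactions or ISA extensions, and the function names `fill_original` and `fill_speculative` are assumptions for the example. The optimized path hoists the load of `src[0]` out of the loop on the assumption that `dst` and `src` do not alias, and falls back to the unoptimized loop when a runtime check shows that they do.

```python
def fill_original(dst, src, k):
    """Unoptimized loop: src[0] is re-read on every iteration
    because a write to dst[i] might change it when dst aliases src."""
    for i in range(len(dst)):
        dst[i] = src[0] * k + i
    return dst


def fill_speculative(dst, src, k):
    """Loop invariant code motion guarded by a runtime alias check.

    Two distinct Python lists never share storage, so an identity
    check is enough here to decide whether hoisting is safe.
    """
    if dst is src:
        # Speculation failed: run the safe, unoptimized version.
        return fill_original(dst, src, k)
    base = src[0] * k if dst else 0  # hoisted; guarded for empty dst
    for i in range(len(dst)):
        dst[i] = base + i
    return dst
```

With distinct lists the hoisted version computes `src[0] * k` once; when the same list is passed for both arguments, the result still matches the original loop because the fallback path is taken.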

  • Conference Article
  • Cite count: 6
  • 10.1145/2451116.2451136
DeAliaser
  • Mar 16, 2013
  • Wonsun Ahn + 2 more

Alias analysis is a critical component in many compiler optimizations. A promising approach to reduce the complexity of alias analysis is to use speculation. The approach consists of performing optimizations assuming the alias relationships that are true most of the time, and repairing the code when such relationships are found not to hold through runtime checks. This paper proposes a general alias speculation scheme that leverages upcoming hardware support for transactions with the help of some ISA extensions. The ability of transactions to checkpoint and roll back frees the compiler to pursue aggressive optimizations without having to worry about recovery code. Also, exposing the memory conflict detection hardware in transactions to software allows runtime checking of aliases with little or no overhead. We test the potential of the novel alias speculation approach with Loop Invariant Code Motion (LICM), Global Value Numbering (GVN), and Partial Redundancy Elimination (PRE) optimization passes. On average, they are shown to reduce program execution time by 9% in SPEC FP2006 applications and 3% in SPEC INT2006 applications over the alias analysis of a state-of-the-art compiler.

  • Research Article
  • Cite count: 1
  • 10.1145/2499368.2451136
DeAliaser
  • Mar 16, 2013
  • ACM SIGPLAN Notices
  • Wonsun Ahn + 2 more

Alias analysis is a critical component in many compiler optimizations. A promising approach to reduce the complexity of alias analysis is to use speculation. The approach consists of performing optimizations assuming the alias relationships that are true most of the time, and repairing the code when such relationships are found not to hold through runtime checks. This paper proposes a general alias speculation scheme that leverages upcoming hardware support for transactions with the help of some ISA extensions. The ability of transactions to checkpoint and roll back frees the compiler to pursue aggressive optimizations without having to worry about recovery code. Also, exposing the memory conflict detection hardware in transactions to software allows runtime checking of aliases with little or no overhead. We test the potential of the novel alias speculation approach with Loop Invariant Code Motion (LICM), Global Value Numbering (GVN), and Partial Redundancy Elimination (PRE) optimization passes. On average, they are shown to reduce program execution time by 9% in SPEC FP2006 applications and 3% in SPEC INT2006 applications over the alias analysis of a state-of-the-art compiler.
