Minimal Unroll Factor for Code Generation of Software Pipelining
We address the problem of generating compact code from software pipelined loops. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates lifetime intervals spanning multiple loop iterations. These intervals require periodic register allocation (also called variable expansion), which in turn yields a code generation challenge. We are looking for the minimal unrolling factor enabling the periodic register allocation of software pipelined kernels. This challenge is generally addressed through one of: (1) hardware support in the form of rotating register files, which solve the unrolling problem but are expensive in hardware; (2) register renaming by inserting register moves, which increases the number of operations in the loop, and may damage the schedule of the software pipeline and reduce throughput; (3) post-pass loop unrolling that does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers (maximal number of values simultaneously alive) are sufficient for periodic register allocation (Eisenbeis et al. in PACT ’95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264–267, Manchester, UK, 1995; Hendren et al. in CC ’92: Proceedings of the 4th International Conference on Compiler Construction, pages 176–191, London, UK, 1992). However, the best existing heuristic for controlling this code growth, modulo variable expansion (Lam in SIGPLAN Not 23(7):318–328, 1988), may not apply the correct amount of loop unrolling to guarantee that MAXLIVE registers are enough, which may result in register spills (Eisenbeis et al. in PACT ’95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264–267, Manchester, UK, 1995).
This paper presents our research results on the open problem of minimal loop unrolling, allowing a software-only code generation that does not trade the optimality of the initiation interval (II) for the compactness of the generated code. Our novel idea is to use the remaining free registers after periodic register allocation to relax the constraints on register reuse. The problem of minimal loop unrolling arises either before or after software pipelining, either with a single or with multiple register types (classes). We provide a formal problem definition for each scenario, and we propose and study a dedicated algorithm for each problem. Our solutions are implemented within an industrial-strength compiler for a VLIW embedded processor from STMicroelectronics, and validated on multiple benchmark suites.
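The abstract above defines MAXLIVE as the maximal number of values simultaneously alive in the software pipelined kernel. As a minimal sketch (not the authors' implementation), the following Python computes MAXLIVE by folding each lifetime interval modulo the II, and contrasts it with a simplified form of the modulo variable expansion unroll factor. The function names, the half-open `(start, end)` interval encoding, and the `max(ceil(lifetime / II))` rule are assumptions made for illustration.

```python
from math import ceil

def maxlive(intervals, ii):
    """Peak register pressure of a software pipelined kernel.

    intervals: list of (start, end) cycles for one iteration; a value is
               assumed live during the half-open range [start, end).
    ii:        initiation interval (a new iteration starts every ii cycles).
    """
    if ii <= 0:
        raise ValueError("initiation interval must be positive")
    pressure = [0] * ii
    for start, end in intervals:
        # In steady state, instances from successive iterations overlap,
        # so each live cycle contributes to kernel slot (cycle mod ii).
        for cycle in range(start, end):
            pressure[cycle % ii] += 1
    return max(pressure)

def mve_unroll_factor(intervals, ii):
    """Simplified modulo-variable-expansion unroll factor (assumed form):
    the largest number of II-long windows spanned by any single lifetime."""
    return max((ceil((end - start) / ii) for start, end in intervals), default=1)

# Hypothetical kernel with three values and II = 2.
example = [(0, 3), (1, 2), (2, 7)]
print(maxlive(example, 2), mve_unroll_factor(example, 2))
```

In this hypothetical example the per-value register counts used by the simplified heuristic (2, 1 and 3) sum to 6, one more than MAXLIVE, which illustrates why such a heuristic may exceed the MAXLIVE bound.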
- Conference Article
6
- 10.1145/1375657.1375677
- Jun 12, 2008
This paper solves an open problem regarding loop unrolling after periodic register allocation. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates reuse circuits spanning multiple loop iterations. These circuits require periodic register allocation, which in turn yields a code generation challenge, generally addressed through: (1) hardware support (rotating register files), deemed too expensive for embedded processors, (2) insertion of register moves, with a high risk of reducing the computation throughput (initiation interval, II) of software pipelining, and (3) post-pass loop unrolling that does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers are sufficient for periodic register allocation (2; 3; 5); yet the only heuristic to control the amount of post-pass loop unrolling does not achieve this bound and leads to undesired register spills (4; 7). We propose a periodic register allocation technique allowing a software-only code generation that does not trade the optimality of the II for compactness of the generated code. Our idea is based on using the remaining registers: calling Rarch the number of architectural registers of the target processor, the number of remaining registers that can be used for minimising the unrolling degree is equal to Rarch - MAXLIVE. We provide a complete formalisation of the problem and algorithm, followed by extensive experiments. We achieve practical loop unrolling degrees in most cases, with no increase of the II, while state-of-the-art techniques would either induce register spilling, degrade the II or lead to unacceptable code growth.
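To make the role of the Rarch - MAXLIVE remaining registers concrete, here is a hedged sketch. It assumes, as is common in reuse-circuit formulations, that the post-pass unroll degree equals the least common multiple of the reuse-circuit weights (the number of registers each circuit uses), and that spare registers may be added to individual circuits to shrink that LCM. The exhaustive search and the names `min_unroll`, `weights` and `remaining` are illustrative assumptions, not the paper's algorithm.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a // gcd(a, b) * b

def min_unroll(weights, remaining):
    """Distribute at most `remaining` spare registers over reuse circuits
    so that the LCM of the enlarged circuit weights is minimal.

    weights:   registers used by each reuse circuit (positive integers).
    remaining: spare registers, e.g. Rarch - MAXLIVE.
    Returns (unroll_degree, enlarged_weights).
    """
    best_lcm = reduce(lcm, weights, 1)      # unroll degree with no spare register
    best_weights = tuple(weights)

    def search(i, budget, acc, chosen):
        nonlocal best_lcm, best_weights
        if acc >= best_lcm:                 # the LCM can only grow: prune
            return
        if i == len(weights):
            best_lcm, best_weights = acc, tuple(chosen)
            return
        for extra in range(budget + 1):     # spare registers given to circuit i
            w = weights[i] + extra
            search(i + 1, budget - extra, lcm(acc, w), chosen + [w])

    search(0, remaining, 1, [])
    return best_lcm, best_weights

# Hypothetical circuits of weights 3, 4 and 5 with 0, 2 and 3 spare registers.
for spare in (0, 2, 3):
    print(spare, min_unroll([3, 4, 5], spare))
```

With these hypothetical weights, the unroll degree drops from 60 with no spare register to 12 with two and to 5 with three. The search is exponential in the number of circuits, so it only serves to illustrate the objective.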
- Research Article
1
- 10.1145/1379023.1375677
- Jun 12, 2008
- ACM SIGPLAN Notices
This paper solves an open problem regarding loop unrolling after periodic register allocation. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates reuse circuits spanning multiple loop iterations. These circuits require periodic register allocation, which in turn yields a code generation challenge, generally addressed through: (1) hardware support (rotating register files), deemed too expensive for embedded processors, (2) insertion of register moves, with a high risk of reducing the computation throughput (initiation interval, II) of software pipelining, and (3) post-pass loop unrolling that does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers are sufficient for periodic register allocation (2; 3; 5); yet the only heuristic to control the amount of post-pass loop unrolling does not achieve this bound and leads to undesired register spills (4; 7). We propose a periodic register allocation technique allowing a software-only code generation that does not trade the optimality of the II for compactness of the generated code. Our idea is based on using the remaining registers: calling Rarch the number of architectural registers of the target processor, the number of remaining registers that can be used for minimising the unrolling degree is equal to Rarch - MAXLIVE. We provide a complete formalisation of the problem and algorithm, followed by extensive experiments. We achieve practical loop unrolling degrees in most cases, with no increase of the II, while state-of-the-art techniques would either induce register spilling, degrade the II or lead to unacceptable code growth.
- Conference Article
1
- 10.1109/hpcsim.2012.6266972
- Jul 1, 2012
Software pipelining is a powerful technique to expose fine-grain parallelism, but it results in variables staying alive across more than one kernel iteration. It requires periodic register allocation and is challenging for code generation: the lack of a reliable solution currently restricts the applicability of software pipelining. The classical software solution that does not alter the computation throughput consists in unrolling the loop a posteriori [11], [10]. However, the resulting unrolling degree is often unacceptable and may reach absurd levels. Alternatively, loop unrolling can be avoided thanks to software register renaming. This is achieved through the insertion of move operations, but this may increase the initiation interval (II), which nullifies the benefits of software pipelining. This article aims at tightly controlling the post-pass loop unrolling necessary to generate code. We study the potential of live range splitting to reduce kernel loop unrolling, introducing additional move instructions without increasing the II. We provide a complete formalisation of the problem, an algorithm, and extensive experiments. Our algorithm yields low unrolling degrees in most cases, with no increase of the II.
- Conference Article
18
- 10.5555/977395.977656
- Mar 20, 2004
Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to the outer loops. In a companion paper, we proposed a scheduling method, called Single-dimension Software Pipelining (SSP), to software pipeline a multi-dimensional loop nest at an arbitrary loop level. In this paper, we describe our solution to SSP code generation. In contrast to traditional software pipelining, SSP handles two distinct repetitive patterns, and thus requires new code generation algorithms. Further, these two distinct repetitive patterns complicate register assignment and require two levels of register renaming. As rotating registers support renaming at only one level, our solution is based on a combination of dynamic register renaming (using rotating registers) and static register renaming (using code replication). Finally, code size increase, an even more important issue for SSP than for traditional software-pipelining, is also addressed. Optimizations are proposed to reduce code size without significant performance degradation. We first present a code generation scheme and subsequently implement it for the IA-64 architecture, making effective use of rotating registers and predicated execution. We present some initial experimental results, which demonstrate not only the feasibility and correctness of our code generation scheme, but also its code quality.
- Conference Article
11
- 10.1109/cgo.2004.1281673
- Jun 10, 2004
Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to the outer loops. In a companion paper, we proposed a scheduling method, called Single-dimension Software Pipelining (SSP), to software pipeline a multi-dimensional loop nest at an arbitrary loop level. In this paper, we describe our solution to SSP code generation. In contrast to traditional software pipelining, SSP handles two distinct repetitive patterns, and thus requires new code generation algorithms. Further, these two distinct repetitive patterns complicate register assignment and require two levels of register renaming. As rotating registers support renaming at only one level, our solution is based on a combination of dynamic register renaming (using rotating registers) and static register renaming (using code replication). Finally, code size increase, an even more important issue for SSP than for traditional software-pipelining, is also addressed. Optimizations are proposed to reduce code size without significant performance degradation. We first present a code generation scheme and subsequently implement it for the IA-64 architecture, making effective use of rotating registers and predicated execution. We present some initial experimental results, which demonstrate not only the feasibility and correctness of our code generation scheme, but also its code quality.
- Research Article
- 10.1002/(sici)1520-684x(199808)29:9<62::aid-scj7>3.0.co;2-h
- Aug 1, 1998
- Systems and Computers in Japan
A considerable part of program execution time is consumed by loops, so that loop optimization is highly effective especially for the innermost loops of a program. Software pipelining and loop unrolling are known methods for loop optimization. Software pipelining is advantageous in that the code becomes only slightly longer. This method, however, is difficult to apply if the loop includes branching when the parallelism is limited. On the other hand, loop unrolling, while being free of such limitations, suffers from a number of drawbacks. In particular the code size grows substantially and it is difficult to determine the optimal number of body replications. In order to solve these problems, it seems important to combine software pipelining with loop unrolling so as to utilize the advantages of both techniques while paying due regard to properties of programs under consideration and to the machine resources available. This paper describes a method for applying optimal loop unrolling and effective software pipelining to achieve this goal. Program characteristics obtained by means of an extended PDG (program dependence graph) are taken into consideration as well as machine resources. © 1998 Scripta Technica, Syst Comp Jpn, 29(9): 62–73, 1998
- Conference Article
123
- 10.1145/263272.263286
- Jan 1, 1997
Techniques for low energy software. Authors: Huzefa Mehta (Equator Technologies, Campbell, CA); Robert Michael Owens, Mary Jane Irwin, Rita Chen, and Debashree Ghosh (Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA). ISLPED '97: Proceedings of the 1997 international symposium on Low power electronics and design, August 1997, pages 72–75. https://doi.org/10.1145/263272.263286
- Research Article
9
- 10.1109/tc.2002.1032620
- Sep 1, 2002
- IEEE Transactions on Computers
Enhanced pipeline scheduling (EPS) is a software pipelining technique which can achieve a variable initiation interval (II) for loops with control flow via its code motion pipelining. EPS, however, leaves behind many renaming copy instructions that cannot be coalesced due to interferences. These copies take resources and, more seriously, they may cause a stall if they rename a multilatency instruction whose latency is longer than the II aimed for by EPS. This paper proposes a code transformation technique based on loop unrolling which makes those copies coalescible. Two unique features of the technique are its method of determining the precise unroll amount, based on an idea of extended live ranges, and its insertion of special bookkeeping copies at loop exits. The proposed technique enables EPS to avoid a serious slowdown from latency handling and resource pressure, while keeping its variable II and other advantages. In fact, renaming through copies, followed by unroll-based copy elimination, is EPS's solution to the cross-iteration register overwrite problem in software pipelining. It works for loops with arbitrary control flow that EPS must deal with, as well as for straightline loops. Our empirical study performed on a VLIW testbed with a two-cycle load latency shows that 86 percent of the otherwise uncoalescible copies in innermost loops become coalescible when unrolled 2.2 times on average. In addition, it is demonstrated that the unroll amount obtained is precise and the most efficient. The unrolled version of the VLIW code includes fewer no-op VLIW caused by stalls, improving the performance by a geometric mean of 18 percent on a 16-ALU machine.
- Book Chapter
3
- 10.1007/978-3-642-13374-9_19
- Jan 1, 2010
This paper improves our previous research effort [1] by providing an efficient method for kernel loop unrolling minimisation in the case of already scheduled loops, where circular lifetime intervals are known. When loops are software pipelined, the number of values simultaneously alive becomes exactly known, giving better opportunities for kernel loop unrolling. Furthermore, fixing circular lifetime intervals allows us to reduce the algorithmic complexity of our method compared to [1] by computing a new search space for minimal kernel loop unrolling. The meeting graph (MG) is one of the frameworks proposed in the literature [3] which models loop unrolling and register allocation together in a common formal framework for software pipelined loops. Although MG significantly improves loop register allocation, the computed loop unrolling may lead to impractical code growth. This work proposes to minimise the loop unrolling degree in the meeting graph by adapting the approach described in [1]. We explain how to reduce the search space for minimal kernel loop unrolling in the context of MG, yielding a reduced algorithmic complexity. Furthermore, our experiments on SPEC2000, SPEC2006, MEDIABENCH and FFMPEG show that in concrete cases the loop unrolling minimisation is very fast and the minimal loop unrolling degree for 75% of the optimised loops is equal to 1 (i.e. no unroll), while it is equal to 7 when the software pipelining (SWP) schedule is not fixed.
- Research Article
38
- 10.1145/36177.36191
- Oct 1, 1987
- ACM SIGARCH Computer Architecture News
This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.
- Research Article
21
- 10.1145/79505.79508
- Sep 1, 1990
- ACM Transactions on Mathematical Software
This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Our study indicates that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler. Finally, we show that the combination of loop unrolling and dynamic software pipelining, as implemented by a decoupled computer, substantially outperforms the vector CRAY-1S.
- Research Article
63
- 10.1145/36204.36191
- Oct 1, 1987
- ACM SIGOPS Operating Systems Review
This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.
- Research Article
7
- 10.1145/36205.36191
- Oct 1, 1987
- ACM SIGPLAN Notices
This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.
- Conference Article
33
- 10.1109/micro.1997.645802
- Nov 23, 2002
The performance of Very Long Instruction Word (VLIW) microprocessors depends on the close cooperation between the compiler and the architecture. This paper evaluates a set of important compilation techniques and related architectural features for VLIW machines. The evaluation is performed on a SPARC-based VLIW testbed where gcc-generated optimized SPARC code is scheduled into high-performance VLIW code. As a base scheduling compiler, we experiment with three core scheduling techniques including enhanced pipeline scheduling, all-path speculation, and renaming. We analyze the characteristics of the useful and useless ALUs in each cycle to see how many of those ALUs execute non-speculative operations, speculative operations, and copies, respectively. Then, we evaluate the following compilation techniques: software pipelining, loop unrolling, non-greedy enhanced pipeline scheduling, profile-based all-path speculation, trace-based speculation, renaming, restricted speculative loads, and memory disambiguation. Since we experiment on a uniform testbed based on a detailed analysis of ALUs, our evaluation provides a useful insight into the performance impact of these techniques.
- Conference Article
- 10.23919/elinfocom.2018.8330584
- Jan 1, 2018
To improve the overall performance of computer systems, instruction-level parallelism (ILP) has been widely exploited. However, branch hazards, conditional and unconditional, still limit the efficiency of most ILP techniques. Compiler techniques such as loop unrolling, software pipelining, and trace scheduling are being used to increase the amount of parallelism available in systems with fairly predictable branches, while predicated instructions have been useful in eliminating branch hazards in specific cases. The limitations imposed on ILP by branch hazards, however, are significant in large blocks of code or, at best, hidden at the expense of processor resources. As a result, researchers are exploring the techniques of approximate computing, which, when applied, would be suitable only for fault-tolerant systems. Some are also working on methods of code approximation, which mainly involve hazard minimization by distribution over specific parts of code segments. In this work, we propose and demonstrate a novel branch hazard distribution technique, Symbolic Execution using Approximate Computing (SEAC). We applied the proposed technique to a test program and ran simulation experiments using the Detailed CPU model in the gem5 simulator. Simulation results show that SEAC is 3.57, 1.95 and 1.32 times better than the best among the tested conventional ILP techniques, based on speedup, energy saving, and branch hazard distribution coefficient respectively.