LLM-Generated Invariants for Bounded Model Checking Without Loop Unrolling
We investigate a modification of the classical Bounded Model Checking (BMC) procedure that does not handle loops through unrolling but via modifications to the control flow graph (CFG). A portion of the CFG representing a loop is replaced by a node asserting invariants of the loop. We generate these invariants using Large Language Models (LLMs) and use a first-order theorem prover to ensure the correctness of the generated statements. We thus transform programs to loop-free variants in a sound manner. Our experimental results show that the resulting tool, ESBMC ibmc, is competitive with state-of-the-art formal verifiers for programs with unbounded loops, significantly improving the number of programs verified by the industrial-strength software verifier ESBMC and verifying programs that state-of-the-art software verifiers such as SeaHorn and VeriAbs could not.
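The transformation described above can be illustrated with a minimal sketch of the standard havoc/assume loop abstraction; this is our illustration, not ESBMC's actual encoding, and the invariant, function names, and the `nondet()` stub are all hypothetical (a real BMC tool leaves `nondet()` unconstrained):

```c
#include <assert.h>

// Stand-in for an unconstrained ("havocked") value; a verifier would
// explore all values, this stub just picks one so the sketch runs.
static unsigned nondet(void) { return 7; }

// Original program: the loop sums 1..n, then checks a property.
unsigned original(unsigned n) {
    unsigned i = 0, sum = 0;
    while (i < n) { i++; sum += i; }
    assert(sum == n * (n + 1) / 2);   // property to verify
    return sum;
}

// Loop-free variant, sound for the inductive invariant
// i <= n && sum == i*(i+1)/2 -- the kind of fact an LLM could
// propose and a first-order prover confirm.
unsigned transformed(unsigned n) {
    unsigned i = 0, sum = 0;
    assert(i <= n && sum == i * (i + 1) / 2); // invariant on entry
    i = nondet(); sum = nondet();             // havoc modified vars
    if (i <= n && sum == i * (i + 1) / 2 && !(i < n)) {
        // Every post-loop state satisfies invariant + negated guard,
        // so checking the property here covers all loop executions.
        assert(sum == n * (n + 1) / 2);
        return sum;
    }
    return 0; // state infeasible under this particular nondet stub
}
```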
- Book Chapter
34
- 10.1007/3-540-61053-7_53
- Jan 1, 1996
A well-known code transformation for improving the run-time performance of a program is loop unrolling. The most obvious benefit of unrolling a loop is that the transformed loop usually requires fewer instruction executions than the original loop. The reduction in instruction executions comes from two sources: the number of branch instructions executed is reduced, and the control variable is modified fewer times. In addition, for architectures with features designed to exploit instruction-level parallelism, loop unrolling can expose greater levels of instruction-level parallelism. Loop unrolling is an effective code transformation, often improving the execution performance of programs that spend much of their execution time in loops by 10 to 30 percent. Possibly because of the effectiveness of a simple application of loop unrolling, it has not been studied as extensively as other code improvements such as register allocation or common subexpression elimination. The result is that many compilers employ simplistic loop unrolling algorithms that miss many opportunities for improving run-time performance. This paper describes how aggressive loop unrolling is done in a retargetable optimizing compiler. Using a set of 32 benchmark programs, the effectiveness of this more aggressive approach to loop unrolling is evaluated. The results show that aggressive loop unrolling can yield an additional performance increase of 10 to 20 percent over the simple, naive approaches employed by many production compilers.
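The two sources of savings named in the abstract above, fewer branch tests and fewer control-variable updates, can be seen in a minimal sketch (the function names are illustrative):

```c
// Rolled loop: one branch test and one counter increment per element.
int sum_rolled(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

// Unrolled by 4: roughly one quarter of the branch tests and counter
// updates; the second loop handles the n % 4 residual iterations.
int sum_unrolled4(const int *a, int n) {
    int s = 0, i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)   // cleanup for leftover iterations
        s += a[i];
    return s;
}
```

Both versions compute the same sum; the unrolled body also gives the scheduler four independent loads to overlap, which is the instruction-level-parallelism benefit the abstract mentions.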
- Book Chapter
3
- 10.1007/978-3-642-13374-9_19
- Jan 1, 2010
This paper improves our previous research effort [1] by providing an efficient method for kernel loop unrolling minimisation in the case of already scheduled loops, where circular lifetime intervals are known. When loops are software pipelined, the number of values simultaneously alive becomes exactly known, giving better opportunities for kernel loop unrolling. Furthermore, fixing circular lifetime intervals allows us to reduce the algorithmic complexity of our method compared to [1] by computing a new search space for minimal kernel loop unrolling. The meeting graph (MG) [3] is one of the frameworks proposed in the literature that models loop unrolling and register allocation together in a common formal framework for software pipelined loops. Although the MG significantly improves loop register allocation, the computed loop unrolling may lead to impractical code growth. This work proposes to minimise the loop unrolling degree in the meeting graph by adapting the approach described in [1]. We explain how to reduce the search space for minimal kernel loop unrolling in the context of the MG, yielding a reduced algorithmic complexity. Furthermore, our experiments on SPEC2000, SPEC2006, MEDIABENCH and FFMPEG show that in concrete cases the loop unrolling minimisation is very fast, and that the minimal loop unrolling degree is equal to 1 (i.e. no unrolling) for 75% of the optimised loops, while it is equal to 7 when the software pipelining (SWP) schedule is not fixed.
- Conference Article
4
- 10.1109/cisp.2008.211
- Jan 1, 2008
Window operations, which are computationally intensive and data intensive, are frequently used in image compression, pattern recognition and digital signal processing. Reconfigurable hardware boards provide a convenient and flexible solution to speed up these algorithms. This paper studies the effect of loop unrolling on area, clock speed and throughput, based on a data schedule method, to find the latent connections between these three qualities and loop unrolling. Our results indicate that, due to the unique design of the compilation framework, inner loop unrolling makes the controllers more complicated than outer loop unrolling does and increases the area requirements at the same time. However, outer loop unrolling demands more memory elements to keep the reused data. The clock speed begins to decrease when the number of RAM modules grows beyond a certain size, and the throughput increases to different degrees for different operations.
- Conference Article
- 10.1109/iwia.2003.1262786
- Jul 17, 2003
Loop unrolling is today one of the most effective optimizations for modern architectures. To give an analytical model of loop unrolling performance, the unrolling shape was proposed. It was applied to in-order processors and was shown to give an accurate performance model of loop unrolling in terms of software pipelining and cache miss alleviation. In this paper, we apply the unrolling shape to out-of-order processors. A scheme for calculating PL_OOO, the pipelining term of a loop unrolled by factor l, is presented as PL_OOO(l) = (N_ins(l)/F + N_Occpy(l))/l, where N_ins(l) is the number of instructions in the loop unrolled by factor l, F is the fetch rate of the architecture, and N_Occpy(l) is the number of store instructions scheduled after the (N_ins(l)/F)-th cycle. The pipelining term for in-order processors is essential for calculating N_Occpy(l); it is to be noted that the scheme for out-of-order processors reuses the unrolling shape for in-order processors. Experiments show that our scheme precisely captures the behaviour of loop unrolling on out-of-order processors. We show that our scheme quantitatively matches the effect of loop unrolling with that of infinitely unrolled loops on in-order processors. Furthermore, we reveal that the old folklore that loop unrolling reduces the loop overhead has been revived on out-of-order processors as the performance improvement factor d PL_OOO(l)/dl (Aho et al., 1986).
- Research Article
- 10.1002/(sici)1520-684x(199808)29:9<62::aid-scj7>3.0.co;2-h
- Aug 1, 1998
- Systems and Computers in Japan
A considerable part of program execution time is consumed by loops, so that loop optimization is highly effective, especially for the innermost loops of a program. Software pipelining and loop unrolling are known methods for loop optimization. Software pipelining is advantageous in that the code becomes only slightly longer. This method, however, is difficult to apply if the loop includes branching or when the parallelism is limited. On the other hand, loop unrolling, while being free of such limitations, suffers from a number of drawbacks. In particular, the code size grows substantially and it is difficult to determine the optimal number of body replications. In order to solve these problems, it seems important to combine software pipelining with loop unrolling so as to utilize the advantages of both techniques while paying due regard to properties of programs under consideration and to the machine resources available. This paper describes a method for applying optimal loop unrolling and effective software pipelining to achieve this goal. Program characteristics obtained by means of an extended PDG (program dependence graph) are taken into consideration as well as machine resources. © 1998 Scripta Technica, Syst Comp Jpn, 29(9): 62–73, 1998
- Conference Article
48
- 10.2514/6.1990-1149
- Apr 2, 1990
A fast, accurate Choleski method for the solution of symmetric systems of linear equations is presented. This direct method is based on a variable-band storage scheme and takes advantage of column heights to reduce the number of operations in the Choleski factorization. The method employs parallel computation in the outermost DO-loop and vector computation via the 'loop unrolling' technique in the innermost DO-loop. The method avoids computations with zeros outside the column heights and, as an option, zeros inside the band. The close relationship between Choleski and Gauss elimination methods is examined. The minor changes required to convert the Choleski code to a Gauss code to solve non-positive-definite symmetric systems of equations are identified. The results for two large-scale structural analyses performed on supercomputers demonstrate the accuracy and speed of the method.
- Conference Article
- 10.1117/12.527107
- Apr 19, 2004
- Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
As wireless video products evolve, they demand more sophisticated processing at higher resolutions and frame rates. Computational performance and energy efficiency have become critical design issues. This paper presents the Quantized Color Pack eXtension (QCPX) combined with a loop unrolling (LU) technique to improve execution performance and energy efficiency of color image and video processing applications. QCPX applied to a 32-bit datapath processor supports parallel operations on two packed 16-bit YCbCr (Y: luminance, Cr and Cb: chrominance) color pixels, providing greater subword-level parallelism by increasing the number of smaller color pixels packed into a word. Instruction-level parallelism can be further enhanced through loop unrolling. These techniques provide greater performance and efficiency for multimedia workloads on mobile systems. Experimental results on a set of media benchmark applications indicate that the LU plus QCPX-optimized version achieves a speedup ranging from 3.8 to 7.9 while reducing the energy consumption from 76% to 87% over the baseline version on identically configured, dynamically scheduled ILP superscalar processors. The LU plus QCPX-optimized version also outperforms the LU plus MDMX-like (MIPS’s multimedia extension) version.
- Conference Article
1
- 10.1109/iciinfs.2009.5429845
- Dec 1, 2009
Application Specific Instruction-set Processor (ASIP) design is one of the popular processor design techniques for embedded systems, allowing customizability in processor design without overly hindering design flexibility. Multi-pipeline ASIPs were proposed to improve the performance of such systems by trading off speed against processor area. One of the problems in multi-pipeline design is the limited inherent instruction-level parallelism (ILP) available in applications. The ILP of application programs can be improved via a compiler optimization technique known as loop unrolling. In this paper, we present how loop unrolling affects the performance of multi-pipeline ASIPs. The improvements in performance average around 15% for a number of benchmark applications, with a maximum improvement of around 30%. In addition, we analyzed the variation of performance with the loop unrolling factor, i.e., the amount of unrolling performed.
- Research Article
63
- 10.1145/36204.36191
- Oct 1, 1987
- ACM SIGOPS Operating Systems Review
This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Software pipelining requires less hardware but also achieves less speedup. Finally, we show that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.
- Conference Article
70
- 10.1109/ipdps.2010.5470423
- Jan 1, 2010
Graphics Processing Units (GPUs) are massively parallel, many-core processors with tremendous computational power and very high memory bandwidth. With the advent of general purpose programming models such as NVIDIA's CUDA and the new standard OpenCL, general purpose programming using GPUs (GPGPU) has become very popular. However, the GPU architecture and programming model have brought along with it many new challenges and opportunities for compiler optimizations. One such classical optimization is loop unrolling. Current GPU compilers perform limited loop unrolling. In this paper, we attempt to understand the impact of loop unrolling on GPGPU programs. We develop a semi-automatic, compile-time approach for identifying optimal unroll factors for suitable loops in GPGPU programs. In addition, we propose techniques for reducing the number of unroll factors evaluated, based on the characteristics of the program being compiled and the device being compiled to. We use these techniques to evaluate the effect of loop unrolling on a range of GPGPU programs and show that we correctly identify the optimal unroll factors. The optimized versions run up to 70% faster than the unoptimized versions.
- Conference Article
11
- 10.1109/mipro.2014.6859582
- May 1, 2014
Loop unrolling is a well-known technique which usually results in a speedup of programs that contain loops. The effect is obtained by reducing the counter-increment and branch operations at the end of the loop. This paper analyzes the impact of loop unrolling on various processor types and memory access patterns. The experiments show a high correlation between the cache and the problem size: loop unrolling yields a higher speedup for smaller problem sizes, while it has no impact for problems whose size exceeds the last-level cache capacity, due to the huge number of cache misses. Another important result is that loop unrolling achieves a greater speedup on Intel than on AMD CPUs. In this paper we analyze and discuss these various behaviors of loop unrolling.
- Research Article
- 10.3724/sp.j.1016.2008.00989
- Oct 10, 2009
- Chinese Journal of Computers
Window operations, which are computationally intensive and data intensive, are frequently used in image compression, pattern recognition and digital signal processing. Reconfigurable hardware boards provide a convenient and flexible solution to speed up these algorithms. Based on a memory and data schedule method as well as a data-path generation method, this paper studies the effect of loop unrolling on the area, clock speed and throughput of sliding window operations. The results indicate that, due to the unique design of the compilation framework, inner loop unrolling makes the controllers more complicated than outer loop unrolling does and increases the area requirements at the same time. However, outer loop unrolling demands more memory elements to keep the reused data. The clock speed begins to decrease when the number of RAM modules grows beyond a certain size, and the throughput increases to different degrees for different operations.
- Conference Article
18
- 10.1145/3377555.3377890
- Feb 22, 2020
Loop unrolling is a widely adopted loop transformation, commonly used for enabling subsequent optimizations. Straight-line-code vectorization (SLP) is an optimization that benefits from unrolling. SLP converts isomorphic instruction sequences into vector code. Since unrolling generates repeated isomorphic instruction sequences, it enables SLP to vectorize more code. However, most production compilers apply these optimizations independently and in an uncoordinated manner. Unrolling is commonly tuned to avoid code bloat, not to maximize the potential for vectorization, leading to missed vectorization opportunities. We propose VALU, a novel loop unrolling heuristic that takes vectorization into account when making unrolling decisions. Our heuristic is powered by an analysis that estimates the potential benefit of SLP vectorization for the unrolled version of the loop. The heuristic then selects the unrolling factor that maximizes the utilization of the vector units. VALU also forwards the vectorizable code to SLP, allowing it to bypass its greedy search for vectorizable seed instructions, exposing more vectorization opportunities. Our evaluation on a production compiler shows that VALU uncovers many vectorization opportunities that were missed by the default loop unroller and vectorizers. This results in more vectorized code and significant performance speedups for 17 of the kernels of the TSVC benchmark suite, reaching up to a 2× speedup over the already highly optimized -O3. Our evaluation on full benchmarks from FreeBench and MiBench shows that VALU achieves a geo-mean speedup of 1.06×.
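The interaction between unrolling and SLP described above can be illustrated with a minimal sketch (our own, not VALU's analysis): after unrolling by 4, the loop body contains four isomorphic statements on adjacent memory, exactly the seed pattern an SLP vectorizer packs into one vector operation.

```c
// AXPY-style loop, hand-unrolled by 4. The four statements in the
// main body share the same opcode pattern (load, multiply, add,
// store) on consecutive addresses, so SLP can fuse them into a
// single 4-lane SIMD operation.
void axpy_unrolled4(float *y, const float *x, float a, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     += a * x[i];       // isomorphic statement 1
        y[i + 1] += a * x[i + 1];   // isomorphic statement 2
        y[i + 2] += a * x[i + 2];   // isomorphic statement 3
        y[i + 3] += a * x[i + 3];   // isomorphic statement 4
    }
    for (; i < n; i++)              // scalar cleanup iterations
        y[i] += a * x[i];
}
```

A heuristic in the spirit of the abstract would pick the unroll factor so that the number of isomorphic statements matches the machine's vector width.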
- Book Chapter
2
- 10.1007/978-3-540-39920-9_9
- Jan 1, 2003
Loops in programs are the source of many optimizations for improving program performance, particularly on modern high-performance architectures as well as vector and multithreaded systems. Techniques such as loop invariant code motion, loop unrolling and loop peeling have demonstrated their utility in compiler optimizations. However, many of these techniques can only be used in very limited cases when the loops are "well-structured" and easy to analyze. For instance, loop invariant code motion works only when invariant code is inside loops; loop unrolling and loop peeling work effectively when the array references are either constants or affine functions of the index variable. It is our contention that there are many opportunities overlooked by limiting the optimizations to well-structured loops. In many cases, even "badly-structured" loops may be transformed into well-structured loops. As a case in point, we show how some loop-dependent code can be transformed into loop-invariant code by transforming the loops. The technique described in this paper relies on unfolding the loop for several initial iterations such that more opportunities may be exposed for many other existing compiler optimization techniques such as loop invariant code motion, loop peeling, loop unrolling, and so on.
- Research Article
10
- 10.1145/967278.967284
- Feb 1, 2004
- ACM SIGPLAN Notices
Loops in programs are the source of many optimizations for improving program performance, particularly on modern high-performance architectures as well as vector and multithreaded systems. Techniques such as loop invariant code motion, loop unrolling and loop peeling have demonstrated their utility in compiler optimizations. However, many of these techniques can only be used in very limited cases when the loops are "well-structured" and easy to analyze. For instance, loop invariant code motion works only when invariant code is inside loops; loop unrolling and loop peeling work effectively when the array references are either constants or affine functions of index variable. It is our contention that there are many opportunities overlooked by limiting the optimizations to "well structured" loops. In many cases, even "badly-structured" loops may be transformed into "well structured" loops. As a case in point, we show how some loop-dependent code can be transformed into loop-independent code by transforming the loops. Our technique described in this paper relies on unfolding the loop for several initial iterations such that more opportunities may be exposed for many other existing compiler optimization techniques such as loop invariant code motion, loop peeling, loop unrolling and so on.
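The unfolding of initial iterations described in the two entries above can be illustrated with a minimal sketch (ours, not the paper's algorithm): a body whose branch depends only on the first iteration becomes branch-free, and thus "well-structured", once that iteration is peeled.

```c
// Original "badly-structured" loop: the i == 0 test inside the body
// is loop-dependent, even though it fires only on the first iteration.
int first_plus_rest(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (i == 0) s = a[0] * 2;   // special case for iteration 0
        else        s += a[i];
    }
    return s;
}

// After peeling the first iteration, the remaining loop body is
// branch-free and amenable to unrolling, vectorization, etc.
int first_plus_rest_peeled(const int *a, int n) {
    if (n == 0) return 0;
    int s = a[0] * 2;               // peeled iteration 0
    for (int i = 1; i < n; i++)     // now a well-structured loop
        s += a[i];
    return s;
}
```

Both functions compute the same result; the peeled form is what downstream optimizations such as those listed in the abstract can actually work on.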