Vectorization-aware loop unrolling with seed forwarding

Abstract

Loop unrolling is a widely adopted loop transformation, commonly used to enable subsequent optimizations. Straight-line-code vectorization (SLP) is one optimization that benefits from unrolling. SLP converts isomorphic instruction sequences into vector code. Since unrolling generates repeated isomorphic instruction sequences, it enables SLP to vectorize more code. However, most production compilers apply these optimizations independently and without coordination. Unrolling is commonly tuned to avoid code bloat rather than to maximize the potential for vectorization, leading to missed vectorization opportunities. We propose VALU, a novel loop unrolling heuristic that takes vectorization into account when making unrolling decisions. Our heuristic is powered by an analysis that estimates the potential benefit of SLP vectorization for the unrolled version of the loop. The heuristic then selects the unrolling factor that maximizes the utilization of the vector units. VALU also forwards the vectorizable code to SLP, allowing it to bypass its greedy search for vectorizable seed instructions and exposing more vectorization opportunities. Our evaluation on a production compiler shows that VALU uncovers many vectorization opportunities that were missed by the default loop unroller and vectorizers. This results in more vectorized code and significant performance speedups for 17 of the kernels of the TSVC benchmark suite, reaching up to 2× speedup over the already highly optimized -O3. Our evaluation on full benchmarks from FreeBench and MiBench shows that VALU results in a geo-mean speedup of 1.06×.
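
The idea of picking an unrolling factor that keeps the vector units busy can be illustrated with a toy model. The sketch below is not VALU's analysis, which the abstract does not specify. It assumes that each unrolled copy of a statement contributes one scalar lane, that only complete groups of `lanes` copies can be packed into a vector, and that a hypothetical `max_unroll` cap stands in for the code-bloat limit. The names `lane_utilization` and `choose_unroll_factor` are invented for this illustration.

```python
def lane_utilization(unroll_factor: int, lanes: int) -> float:
    """Fraction of the unrolled copies that land in completely filled vectors.

    Toy assumption: copies that do not fill a whole vector stay scalar.
    """
    full_vectors = unroll_factor // lanes
    return (full_vectors * lanes) / unroll_factor


def choose_unroll_factor(lanes: int, max_unroll: int) -> int:
    """Pick the smallest factor up to max_unroll with the best utilization."""
    best_factor, best_util = 1, 0.0
    for factor in range(1, max_unroll + 1):
        util = lane_utilization(factor, lanes)
        if util > best_util:  # strict comparison: ties keep the smaller factor
            best_factor, best_util = factor, util
    return best_factor


if __name__ == "__main__":
    # With 4-lane vectors and a cap of 8, unrolling by 4 already fills a vector.
    print(choose_unroll_factor(lanes=4, max_unroll=8))
    # If the cap is below the vector width, the model leaves the loop rolled.
    print(choose_unroll_factor(lanes=8, max_unroll=4))
```

Preferring the smallest factor among ties reflects the tension the abstract describes: more unrolling than the vector width requires only adds code size in this model.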

Similar Papers
  • Book Chapter
  • Cite Count Icon 34
  • 10.1007/3-540-61053-7_53
Aggressive loop unrolling in a retargetable, optimizing compiler
  • Jan 1, 1996
  • Jack W Davidson + 1 more

A well-known code transformation for improving the run-time performance of a program is loop unrolling. The most obvious benefit of unrolling a loop is that the transformed loop usually requires fewer instruction executions than the original loop. The reduction in instruction executions comes from two sources: the number of branch instructions executed is reduced, and the control variable is modified fewer times. In addition, for architectures with features designed to exploit instruction-level parallelism, loop unrolling can expose greater levels of instruction-level parallelism. Loop unrolling is an effective code transformation often improving the execution performance of programs that spend much of their execution time in loops by 10 to 30 percent. Possibly because of the effectiveness of a simple application of loop unrolling, it has not been studied as extensively as other code improvements such as register allocation or common subexpression elimination. The result is that many compilers employ simplistic loop unrolling algorithms that miss many opportunities for improving run-time performance. This paper describes how aggressive loop unrolling is done in a retargetable optimizing compiler. Using a set of 32 benchmark programs, the effectiveness of this more aggressive approach to loop unrolling is evaluated. The results show that aggressive loop unrolling can yield additional performance increase of 10 to 20 percent over the simple, naive approaches employed by many production compilers.
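
The two sources of savings named above (fewer branch tests and fewer control-variable updates) can be counted directly. The following is a minimal sketch rather than the compiler transformation from the chapter: it unrolls a summation loop by a fixed factor of 4 by hand, adds a remainder loop for leftover elements, and counts loop-condition tests. The function names and the counting scheme are assumptions made for this illustration.

```python
def sum_rolled(values):
    """Plain loop: one condition test and one index update per element."""
    total, i, tests = 0, 0, 0
    n = len(values)
    while True:
        tests += 1              # loop-condition test (includes the exit test)
        if i >= n:
            break
        total += values[i]
        i += 1
    return total, tests


def sum_unrolled_by_4(values):
    """Same sum, unrolled by 4, with a remainder loop for leftovers."""
    total, i, tests = 0, 0, 0
    n = len(values)
    while True:                 # main loop: four elements per trip
        tests += 1
        if i + 4 > n:
            break
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4                  # one index update covers four elements
    while True:                 # remainder loop: at most three elements
        tests += 1
        if i >= n:
            break
        total += values[i]
        i += 1
    return total, tests


if __name__ == "__main__":
    data = list(range(100))
    print(sum_rolled(data))         # (4950, 101)
    print(sum_unrolled_by_4(data))  # (4950, 27)
```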

  • Conference Article
  • Cite Count Icon 20
  • 10.1109/ecrts.2009.9
Combining Worst-Case Timing Models, Loop Unrolling, and Static Loop Analysis for WCET Minimization
  • Jul 1, 2009
  • Paul Lokuciejewski + 1 more

Program loops are notorious for their optimization potential on modern high-performance architectures. Compilers aim at their aggressive transformation to achieve large improvements of the program performance. In particular, the optimization loop unrolling has shown in the past decades to be highly effective achieving significant increases of the average-case performance. In this paper, we present loop unrolling that is tailored towards real-time systems. Our novel optimization is driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. To exploit maximal optimization potential, the determination of a suitable unrolling factor is based on precise loop iteration counts provided by a static loop analysis. In addition, our heuristics avoid adverse effects of unrolling which result from instruction cache overflows and the generation of additional spill code. Results on 45 real-life benchmarks demonstrate that aggressive loop unrolling can yield WCET reductions of up to 13.7% over simple, naive approaches employed by many production compilers.

  • Conference Article
  • Cite Count Icon 13
  • 10.1109/hipc.2009.5433205
CellMT: A cooperative multithreading library for the Cell/B.E.
  • Dec 1, 2009
  • Vicenc Beltran + 3 more

The Cell BE processor has proved that heterogeneous multi-core systems can provide huge computational power with high efficiency for a wide range of applications. The simple design of the computational units and the use of small managed local memories are the key to achieving high efficiency and performance at the same time. However, this simple and efficient hardware design comes at the price of higher code complexity. Code written to run on this kind of processor must deal with several issues such as code vectorization, loop unrolling, or the explicit management of local memories. Some of these issues, such as vectorization or loop unrolling, can be partially solved by the compiler, but the overlapping of data transfer and computation times must be manually addressed by the programmer with techniques such as double buffering that increase the code complexity. In this paper we present a user-level threading library called CellMT that effectively hides memory latencies. The concurrent execution of several threads inside each SPU naturally overlaps computation and data transfer times without increasing the code complexity. To prove the suitability and feasibility of our multi-threaded library, we perform an exhaustive performance evaluation with a synthetic benchmark and a real application. The experimental results show that the multithreaded approach can outperform a hand-coded double buffering scheme, with speedups from 0.96x to 3.2x, while maintaining the complexity of a naive buffering scheme.

  • Conference Article
  • Cite Count Icon 26
  • 10.1145/76263.76265
Vectorization on Monte Carlo particle transport: an architectural study using the LANL benchmark “GAMTEB”
  • Jan 1, 1989
  • P J Burns + 4 more

Fully vectorized versions of the Los Alamos National Laboratory benchmark code Gamteb, a Monte Carlo photon transport algorithm, were developed for the Cyber 205/ETA-10 and Cray X-MP/Y-MP architectures. Single-processor performance measurements of the vector and scalar implementations were modeled in a modified Amdahl's Law that accounts for additional data motion in the vector code. The performance and implementation strategy of the vector codes are related to architectural features of each machine. Speedups between fifteen and eighteen for Cyber 205/ETA-10 architectures, and about nine for CRAY X-MP/Y-MP architectures are observed. The best single processor execution time for the problem was 0.33 seconds on the ETA-10G, and 0.42 seconds on the CRAY Y-MP.

  • Conference Article
  • 10.1109/pdp.2012.49
On Optimizing the Longest Common Subsequence Problem by Loop Unrolling Along Wavefronts
  • Feb 1, 2012
  • Johann Steinbrecher + 1 more

Loop unrolling is a loop transformation where a few loop iterations are grouped as a super iteration for exploring more independent instructions and to decrease the total loop overhead. This paper characterizes loop unrolling by the unrolling factor, the number of iterations in a super iteration and the unrolling direction, the choice of iterations to be grouped to form the super iteration. We use loop unrolling for maximizing instruction-level parallelism in the longest common subsequence problem. To increase the number of independent instructions in the super iteration, we use a linear schedule to group iterations on the same wave front, a hyper plane in the loop iteration space. Then, the loop is unrolled along the wave front which guarantees all iterations in the same super iteration are independent. The selection of the optimal unrolling factor is based on the assumption that if all the pipelines are saturated, the performance should not be bad. Two necessary conditions and a sufficient condition for optimality are presented and used to find the optimal unrolling factor. The total execution time is expressed as a function of algorithm parameters, architecture parameters and the unrolling factor. A benchmark of the technique scores a 1.475 speed-up over traditional methods.
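
The independence property behind wavefront unrolling can be seen in a small longest-common-subsequence routine. This is a minimal sketch of the standard dynamic-programming recurrence traversed along anti-diagonals, not the paper's unrolled implementation or its unrolling-factor selection. The function name `lcs_length_wavefront` is an assumption made for this illustration.

```python
def lcs_length_wavefront(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed wavefront by wavefront.

    Cell (i, j) reads only (i-1, j), (i, j-1) and (i-1, j-1), which lie on
    the two previous anti-diagonals. All cells with the same i + j are
    therefore independent and could be grouped into one super iteration.
    """
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(2, m + n + 1):        # wavefront: all cells with i + j == d
        i_lo = max(1, d - n)             # keeps j = d - i within 1..n
        i_hi = min(m, d - 1)
        for i in range(i_lo, i_hi + 1):  # independent cells on this wavefront
            j = d - i
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


if __name__ == "__main__":
    print(lcs_length_wavefront("ABCBDAB", "BDCABA"))  # 4
```

Because the inner loop carries no dependence between its iterations, its body is the part that a wavefront-directed unroller would replicate.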

  • Conference Article
  • Cite Count Icon 32
  • 10.5555/2738600.2738625
PSLP: padded SLP automatic vectorization
  • Feb 7, 2015
  • Vasileios Porpodas + 2 more

The need to increase performance and power efficiency in modern processors has led to a wide adoption of SIMD vector units. All major vendors support vector instructions and the trend is pushing them to become wider and more powerful. However, writing code that makes efficient use of these units is hard and leads to platform-specific implementations. Compiler-based automatic vectorization is one solution for this problem. In particular the Superword-Level Parallelism (SLP) vectorization algorithm is the primary way to automatically generate vector code starting from straight-line scalar code. SLP is implemented in all major compilers, including GCC and LLVM. SLP relies on finding sequences of isomorphic instructions to pack together into vectors. However, this hinders the applicability of the algorithm as isomorphic code sequences are not common in practice. In this work we propose a solution to overcome this limitation. We introduce Padded SLP (PSLP), a novel vectorization algorithm that can vectorize code containing non-isomorphic instruction sequences. It injects a near-minimal number of redundant instructions into the code to transform non-isomorphic sequences into isomorphic ones. The padded instruction sequence can then be successfully vectorized. Our experiments show that PSLP improves vectorization coverage across a number of kernels and full benchmarks, decreasing execution time by up to 63%.

  • Conference Article
  • Cite Count Icon 38
  • 10.1109/cgo.2015.7054199
PSLP: Padded SLP automatic vectorization
  • Feb 1, 2015
  • Vasileios Porpodas + 2 more

  • Single Report
  • Cite Count Icon 2
  • 10.21236/ada326916
Effects of Loop Unrolling and Loop Fusion on Register Pressure and Code Performance.
  • Jun 1, 1997
  • Dale Shires

Many of today's high-performance computer processors are super-scalar. They can dispatch multiple instructions per cycle and, hence, provide what is commonly referred to as instruction-level parallelism. This super-scalar capability, combined with software pipelining, can increase processor throughput dramatically. Achieving maximum throughput, however, is nontrivial. Compilers must engage in aggressive optimization techniques, such as loop unrolling, speculative code motion, etc., to structure code to take full advantage of the underlying computer architecture. The phase-ordering implications of these optimizations are not well understood and are the subject of continuing research. Of particular interest are optimizations that enhance instruction-level parallelism. Two of these are loop unrolling and loop fusion. These are source-level optimizations that can be performed by either the programmer or the compiler. These optimizations have dramatic effects on the compiler's instruction scheduler. Performed too aggressively, these optimizations can increase register pressure and result in costly memory references. This paper details experiments performed to measure the effects of these source-level code transformations and how they influenced register pressure and code performance.

  • Book Chapter
  • Cite Count Icon 2
  • 10.1007/978-3-642-13374-9_7
Mapping Streaming Languages to General Purpose Processors through Vectorization
  • Jan 1, 2010
  • Raymond Manley + 1 more

Streaming languages were originally aimed at streaming architectures, but recent work has shown the stream programming model to be useful in exploiting parallelism on general purpose processors. Current research in mapping stream code onto GPPs deals with load balancing and generating threads based on hardware features. We look into improving problems associated with stream data locality and stream data parallelism on GPPs. We suggest that automatically generating vectorized code for these streaming operations is a potential solution. We use the Brook stream language as our syntax base and augment it to generate vector intrinsics targeting the x86 architecture. This compiler uses both existing and new strategies to transform high-level streaming kernel code into vector instructions without requiring additional annotations. We compare our system's results to existing mapping strategies aimed at using stream code on GPPs. When evaluating performance, we see a wide range of speedups from a few percent to over 2x and discuss the level of effectiveness of using vector code over scalar equivalents in specific application domains.

  • Research Article
  • Cite Count Icon 2
  • 10.1109/tpds.2021.3091015
Compiler-Assisted Compaction/Restoration of SIMD Instructions
  • Apr 1, 2022
  • IEEE Transactions on Parallel and Distributed Systems
  • Juan M Cebrian + 6 more

Vector processors (e.g., SIMD or GPUs) are ubiquitous in high performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution have significant challenges. Among these challenges, control flow divergence is one of the main performance limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption in predicated codes is usually insensitive to the number of active elements in a predicated mask. Since the trend is that vector register size increases, the energy efficiency of exascale computing systems will become sub-optimal. This article proposes a novel approach to improve execution efficiency in predicated vector codes, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline CR delays predicated SIMD instructions with inactive elements, compacting active elements from instances of the same instruction of consecutive loop iterations. Compacted elements form an equivalent dense vector instruction. After executing the dense instructions, their results are restored to the original instructions. However, CR has a significant performance and energy penalty when it fails to find active elements, either due to lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for key information required to configure CR. Then, it passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements in scenarios when it would fail to form dense instructions. Simulated results (gem5) show that CACR improves performance by up to 29 percent and reduces dynamic energy by up to 24.2 percent on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6 percent performance and 14 percent energy improvements for the same configuration and applications.

  • Book Chapter
  • Cite Count Icon 48
  • 10.1007/978-3-540-71351-7_28
A Rewriting System for the Vectorization of Signal Transforms
  • Jun 10, 2006
  • Franz Franchetti + 2 more

We present a rewriting system that automatically vectorizes signal transform algorithms at a high level of abstraction. The input to the system is a transform algorithm given as a formula in the well-known Kronecker product formalism. The output is a formula, which means it consists exclusively of constructs that can be directly mapped into short vector code. This approach obviates compiler vectorization, which is known to be limited in this domain. We included the formula vectorization into the Spiral program generator for signal transforms, which enables us to generate vectorized code and further optimize for the memory hierarchy through search over alternative algorithms. Benchmarks for the discrete Fourier transform (DFT) show that our generated floating-point code is competitive with and that our fixed-point code clearly outperforms the best available libraries.

  • Conference Article
  • Cite Count Icon 16
  • 10.5555/3039686.3039823
MDS code constructions with small sub-packetization and near-optimal repair bandwidth
  • Jan 16, 2017
  • arXiv (Cornell University)
  • Venkatesan Guruswami + 1 more

An (n, M) vector code $C \subseteq \mathbb{F}^n$ is a collection of M codewords where n elements (from the field $\mathbb{F}$) in each of the codewords are referred to as code blocks. Assuming that $\mathbb{F} \cong \mathbb{B}^e$, the code blocks are treated as e-length vectors over the base field $\mathbb{B}$. Equivalently, the code is said to have the sub-packetization level e. This paper addresses the problem of constructing MDS vector codes which enable exact reconstruction of each code block by downloading small amount of information from the remaining code blocks. The repair bandwidth of a code measures the information flow from the remaining code blocks during the reconstruction of a single code block. This problem naturally arises in the context of distributed storage systems as the node repair problem [4]. Assuming that $M = |\mathbb{B}|^{ke}$, the repair bandwidth of an MDS vector code is lower bounded by $\frac{n-1}{n-k} \cdot e$ symbols (over the base field $\mathbb{B}$) which is also referred to as the cut-set bound [4]. For all values of n and k, the MDS vector codes that attain the cut-set bound with the sub-packetization level $e = (n-k)^{\lceil n/(n-k) \rceil}$ are known in the literature [23,36]. This paper presents a construction for MDS vector codes which simultaneously ensures both small repair bandwidth and small sub-packetization level. The obtained codes have the smallest possible sub-packetization level $e = O(n-k)$ for an MDS vector code and the repair bandwidth which is at most twice the cut-set bound. The paper then generalizes this code construction so that the repair bandwidth of the obtained codes approach the cut-set bound at the cost of increased sub-packetization level. The constructions presented in this paper give MDS vector codes which are linear over the base field $\mathbb{B}$.
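
The quantities in this abstract are easy to evaluate for concrete parameters. The sketch below is only arithmetic on the stated formulas, not a code construction. It takes the sub-packetization level of the earlier constructions to be (n − k) raised to the power ⌈n/(n − k)⌉, and the example parameters n = 14, k = 10 are an arbitrary choice made for this illustration, as are the function names.

```python
from fractions import Fraction


def cut_set_bound(n: int, k: int, e: int) -> Fraction:
    """Lower bound on repair bandwidth: (n - 1)/(n - k) * e base-field symbols."""
    return Fraction(n - 1, n - k) * e


def known_subpacketization(n: int, k: int) -> int:
    """Sub-packetization (n - k) ** ceil(n / (n - k)) of bound-attaining codes."""
    r = n - k
    return r ** (-(-n // r))  # -(-n // r) is an integer ceiling division


if __name__ == "__main__":
    n, k = 14, 10
    e = known_subpacketization(n, k)
    print(e)                        # 4 ** 4 = 256
    print(cut_set_bound(n, k, e))   # 13/4 * 256 = 832
    print(k * e)                    # 2560 symbols when k whole blocks are read
```

For these parameters the bound is roughly a third of the 2560 symbols needed to read k complete code blocks, while the exponent in the sub-packetization level shows how quickly e grows in the earlier constructions.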

  • Book Chapter
  • Cite Count Icon 1
  • 10.1007/978-3-030-72789-5_2
PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized Code
  • Jan 1, 2021
  • Vasileios Porpodas + 1 more

Modern optimizing compilers rely on auto-vectorization algorithms for generating high-performance code. Both loop and straight-line code vectorization algorithms generate SIMD vector instructions out of scalar code, with no intervention from the programmer. In this work, we show that the existing auto-vectorization algorithms operate on restricted code regions and therefore miss vectorization opportunities, either by generating narrower vectors than those possible for the target architecture or by failing completely and leaving some of the code in scalar form. We show the need for a specialized post-processing re-vectorization pass, called PostSLP, that has the ability to span across multiple regions, and to generate more effective vector code. PostSLP is designed to convert already vectorized, or partially vectorized code into wider forms that perform better on the target architecture. We implemented PostSLP in LLVM and our evaluation shows significant performance improvements in SPEC CPU2006.

  • Conference Article
  • Cite Count Icon 12
  • 10.1109/isvlsi.2014.10
Swarm Intelligence Driven Simultaneous Adaptive Exploration of Datapath and Loop Unrolling Factor during Area-Performance Tradeoff
  • Jul 1, 2014
  • Anirban Sengupta + 1 more

Multi-objective (MO) design space exploration (DSE) in high level synthesis (HLS) is a tedious task that requires intelligent decision making strategies at multiple stages to yield quality results. The problem of DSE becomes intractable and intricate when an auxiliary variable such as the loop unrolling factor plays a vital role in the decision making process. This paper addresses the above problem by proposing a novel DSE approach for fully automated parallel (simultaneous) exploration of optimal datapath and unrolling factor (UF) during area-performance tradeoff in HLS. The proposed DSE approach is driven by hyper-dimensional particle swarm optimization (PSO). The major sub-contributions of the proposed algorithm include: a) deriving a model for computation of execution delay of a loop unrolled control data flow graph (CDFG) based on resource constraint, without the necessity of tediously unrolling the entire CDFG in most cases, b) consideration of loop unrolling and its impact on: i) control states and execution delay tradeoff during loop unrolling, ii) area-execution delay tradeoff during the DSE process, c) novel comparative results for area-performance tradeoff with respect to multiple DFG and CDFG benchmarks. Results of the proposed approach indicated an average improvement in Quality of Results (QoR) of > 30% and reduction in runtime of > 92% compared to recent approaches.

  • Conference Article
  • Cite Count Icon 3
  • 10.1145/3426430.3429451
Machine learning to ease understanding of data driven compiler optimizations
  • Nov 15, 2020
  • Raphael Mosaner

Optimizing compilers use - often hand-crafted - heuristics to control optimizations such as inlining or loop unrolling. These heuristics are based on data such as size and structure of the parts to be optimized. A compilation, however, produces much more (platform specific) data that one could use as input. We thus propose the use of machine learning (ML) to derive better optimization decisions from this wealth of data and to tackle the shortcomings of hand-crafted heuristics. Ultimately, we want to shed light on the quality and performance of optimizations by using empirical data with automated feedback and updates in a production compiler.
