The performance impact analysis of loop unrolling

Abstract

Loop unrolling is a well-known technique that usually speeds up programs containing loops. The speedup comes from reducing the number of counter increments and branch jumps executed at the end of each loop iteration. This paper analyzes the impact of loop unrolling on various processor types and memory patterns. The experiments show a high correlation between the cache size and the problem size. Loop unrolling yields a higher speedup for smaller problems, while it has no impact on problems whose size exceeds the capacity of the last-level cache, due to the huge number of cache misses. Another important result is that loop unrolling achieves a greater speedup on Intel CPUs than on AMD CPUs. In this paper we analyze and discuss the various behaviors of loop unrolling.
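
To make the mechanism concrete, here is a minimal sketch in C of a summation loop and a manually unrolled version of it. It is not taken from the paper: the function names `sum_rolled` and `sum_unrolled4` and the unroll factor of 4 are assumptions chosen for illustration only. The unrolled version performs one counter update and one branch test per four elements instead of per element, and a short cleanup loop handles lengths that are not a multiple of 4.

```c
#include <stddef.h>

/* Baseline: one counter increment and one branch test per element. */
long sum_rolled(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* Manually unrolled by a factor of 4 (the factor is an illustrative
 * assumption, not a value reported in the paper). The main loop does
 * one counter update and one branch test per four elements. */
long sum_unrolled4(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    /* Cleanup loop for the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        s += a[i];
    }
    return s;
}
```

Both functions return the same result for any input; only the amount of loop overhead differs. Optimizing compilers may already unroll such loops on their own, so whether the manual version is faster depends on the compiler, the flags, and, as the abstract notes, on whether the data fits in cache.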

Similar Papers
  • Research Article
  • Citations: 17
  • 10.14529/jsfi210205
Performance and Power Analysis of a Vector Computing System
  • Jun 1, 2021
  • Supercomputing Frontiers and Innovations
  • Kazuhiko Komatsu + 7 more

The performance of recent computing systems has drastically improved due to the increase in the number of cores. However, this approach is reaching its limits due to the power constraints of facilities. Instead, this paper focuses on vector processing with a long vector length, which has the potential to realize high performance and high power efficiency. This paper discusses the potential through the optimization of two benchmarks, the Himeno and HPCG benchmarks, for the latest vector computing system, SX-Aurora TSUBASA. The architecture of SX-Aurora TSUBASA owes its high efficiency to making good use of its long vector length. Considering these characteristics, various levels of optimization required for a large-scale vector computing system are examined, such as vectorization, loop unrolling, use of cache, domain decomposition, process mapping, and problem size tuning. The evaluation and analysis suggest that the optimizations improve the sustained performance, power efficiency, and scalability of both benchmarks. Therefore, it is clarified that the SX-Aurora TSUBASA architecture can achieve higher power efficiency due to its high sustained memory bandwidth paired with long vector computing.

  • Research Article
  • Citations: 3
  • 10.1007/s00607-016-0535-4
A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details
  • Jan 9, 2017
  • Computing
  • Vasilios Kelefouras

Today’s compilers have a plethora of optimizations and transformations to choose from, and the correct choice, order, and parameters of the transformations have a significant impact on performance. Choosing the correct order and parameters of optimizations has been a long-standing problem in compilation research, which remains unsolved; optimizing the sub-problems separately gives a different schedule/binary for each sub-problem, and these schedules cannot coexist, as refining one degrades the other. Researchers try to solve this problem by using iterative compilation techniques, but the search space is so big that it cannot be searched even by using modern supercomputers. Moreover, compiler transformations do not take into account the hardware architecture details and data reuse in an efficient way. In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude, and thus an efficient solution can now be found. The transformations are the following: loop tiling (including the number of levels of tiling), loop unroll, register allocation, scalar replacement, loop interchange, and data array layouts. The search space is reduced (a) by addressing the aforementioned transformations together as one problem and not separately, and (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated over iterative compilation and the gcc/icc compilers, on both embedded and general-purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time.

  • Conference Article
  • Citations: 1
  • 10.1145/2034751.2034753
Towards auto-tuning description language to heterogeneous computing environment
  • Sep 18, 2011
  • Takahiro Katagiri

Computer architectures are becoming more and more complex due to non-standardized memory accesses and hierarchical caches. It is very difficult for scientists and engineers to optimize their code to extract potential performance improvements on these architectures. Consequently, automatic performance tuning (AT) technology is a key technology for reducing the development cost of high-performance numerical software. In this talk, we have two aims. First, we introduce current AT studies. We focus on AT technology for numerical computations from the viewpoint of numerical libraries, languages, code generators, and OS run-time software. Second, we explain ABCLibScript [1], which is an auto-tuning description language for C and Fortran90 aimed at developers of numerical software. ABCLibScript provides automatic code generation functions for dedicated code optimization, such as loop unrolling, algorithm selection, and varying of specified variables described by the user. We also explain HxABCLibScript [2], which is an AT language that extends the original ABCLibScript to heterogeneous computing environments that include a CPU and a GPU (Graphics Processing Unit). An HxABCLibScript description can free users from selecting between the CPU and the GPU for arbitrary parts of a program. The preliminary results show that HxABCLibScript was highly efficient for simple kernels of typical numerical computations, such as a matrix-matrix multiplication or a stencil computation from a Poisson equation solver. The code automatically generated from an HxABCLibScript description can select the best computing resource between the CPU and the GPU according to the problem size or the number of iterations in the program.

  • Conference Article
  • Citations: 117
  • 10.1145/165123.165126
Working sets, cache sizes, and node granularity issues for large-scale multiprocessors
  • Jan 1, 1993
  • Edward Rothberg + 2 more

The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors. We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.

  • Research Article
  • Citations: 8
  • 10.1145/173682.165126
Working sets, cache sizes, and node granularity issues for large-scale multiprocessors
  • May 1, 1993
  • ACM SIGARCH Computer Architecture News
  • Edward Rothberg + 2 more

The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors. We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.

  • Research Article
  • 10.3390/app15042021
Optimizing Lattice Basis Reduction Algorithm on ARM V8 Processors
  • Feb 14, 2025
  • Applied Sciences
  • Ronghui Cao + 6 more

The LLL (Lenstra–Lenstra–Lovász) algorithm is an important method for lattice basis reduction and has broad applications in computer algebra, cryptography, number theory, and combinatorial optimization. However, current LLL algorithms face challenges such as inadequate adaptation to domestic supercomputers and low efficiency. To enhance the efficiency of the LLL algorithm in practical applications, this research focuses on parallel optimization of the LLL_FP (LLL double-precision floating-point type) algorithm from the NTL library on the domestic Tianhe supercomputer using the Phytium ARM V8 processor. The optimization begins with the vectorization of the Gram–Schmidt coefficient calculation and row transformation using the SIMD instruction set of the Phytium chip, which significantly improves computational efficiency. Further assembly-level optimization fully utilizes the low-level instructions of the Phytium processor, which increases execution speed. In terms of memory access, data prefetch techniques were then employed to load necessary data in advance of computation, which reduces cache misses and accelerates data processing. To further enhance performance, loop unrolling was applied to the core loop, which allows more operations per loop iteration. Experimental results show that the optimized LLL_FP algorithm achieves up to a 42% performance improvement, with a minimum improvement of 34% and an average improvement of 38% in single-core efficiency compared to the serial LLL_FP algorithm. This study provides a more efficient solution for large-scale lattice basis reduction and demonstrates the potential of the LLL algorithm in ARM V8 high-performance computing environments.

  • Conference Article
  • Citations: 14
  • 10.5555/266800.266811
Evaluation of scheduling techniques on a SPARC-based VLIW testbed
  • Dec 1, 1997
  • Seong-Bae Park + 2 more

The performance of Very Long Instruction Word (VLIW) microprocessors depends on the close cooperation between the compiler and the architecture. This paper evaluates a set of important compilation techniques and related architectural features for VLIW machines. The evaluation is performed on a SPARC-based VLIW testbed where gcc-generated optimized SPARC code is scheduled into high-performance VLIW code. As a base scheduling compiler, we experiment with three core scheduling techniques including enhanced pipeline scheduling, all-path speculation, and renaming. We analyze the characteristics of the useful and useless ALUs in each cycle to see how many of those ALUs execute non-speculative operations, speculative operations, and copies, respectively. Then, we evaluate the following compilation techniques: software pipelining, loop unrolling, non-greedy enhanced pipeline scheduling, profile-based all-path speculation, trace-based speculation, renaming, restricted speculative loads, and memory disambiguation. Since we experiment on a uniform testbed based on a detailed analysis of ALUs, our evaluation provides useful insight into the performance impact of these techniques.

  • Research Article
  • 10.1145/2666357.2597825
Improving performance of loops on DIAM-based VLIW architectures
  • May 5, 2014
  • ACM SIGPLAN Notices
  • Jinyong Lee + 3 more

Recent studies show that very long instruction word (VLIW) architectures, which inherently have a wide datapath (e.g. 128 or 256 bits for one VLIW instruction word), can benefit from dynamic implied addressing mode (DIAM) and can achieve lower power consumption and smaller code size with a small performance overhead. Such overhead, which is claimed to be small, is mainly caused by the execution of additionally generated special instructions for conveying information that cannot be encoded in the reduced instruction bit-width. In this paper, however, we show that the performance impact of applying DIAM to a VLIW architecture cannot be overlooked, especially when applications possess a high level of instruction-level parallelism (ILP), which is mostly the case for loops as a result of aggressive code scheduling. We also propose a way to relieve the performance degradation, focusing especially on loops, since loops account for almost 90% of total execution time in programs and tend to have high ILP. We first implement the original DIAM compilation technique in a compiler, and augment it with the proposed loop optimization scheme to show that ours can clearly alleviate the performance loss caused by the excessive number of additional instructions, with the help of slightly modified hardware. Moreover, the well-known loop unrolling scheme, which would produce denser code in loops at the cost of substantial code size bloating, is integrated into our compiler. The experimental results show that the loop unrolling technique, combined with our augmented DIAM scheme, produces far better code in terms of performance with quite an acceptable amount of code size increase.

  • Conference Article
  • 10.1145/2597809.2597825
Improving performance of loops on DIAM-based VLIW architectures
  • Jun 12, 2014
  • Jinyong Lee + 3 more

Recent studies show that very long instruction word (VLIW) architectures, which inherently have a wide datapath (e.g. 128 or 256 bits for one VLIW instruction word), can benefit from dynamic implied addressing mode (DIAM) and can achieve lower power consumption and smaller code size with a small performance overhead. Such overhead, which is claimed to be small, is mainly caused by the execution of additionally generated special instructions for conveying information that cannot be encoded in the reduced instruction bit-width. In this paper, however, we show that the performance impact of applying DIAM to a VLIW architecture cannot be overlooked, especially when applications possess a high level of instruction-level parallelism (ILP), which is mostly the case for loops as a result of aggressive code scheduling. We also propose a way to relieve the performance degradation, focusing especially on loops, since loops account for almost 90% of total execution time in programs and tend to have high ILP. We first implement the original DIAM compilation technique in a compiler, and augment it with the proposed loop optimization scheme to show that ours can clearly alleviate the performance loss caused by the excessive number of additional instructions, with the help of slightly modified hardware. Moreover, the well-known loop unrolling scheme, which would produce denser code in loops at the cost of substantial code size bloating, is integrated into our compiler. The experimental results show that the loop unrolling technique, combined with our augmented DIAM scheme, produces far better code in terms of performance with quite an acceptable amount of code size increase.

  • Conference Article
  • Citations: 33
  • 10.1109/micro.1997.645802
Evaluation of scheduling techniques on a SPARC-based VLIW testbed
  • Nov 23, 2002
  • Seongbae Park + 2 more

The performance of Very Long Instruction Word (VLIW) microprocessors depends on the close cooperation between the compiler and the architecture. This paper evaluates a set of important compilation techniques and related architectural features for VLIW machines. The evaluation is performed on a SPARC-based VLIW testbed where gcc-generated optimized SPARC code is scheduled into high-performance VLIW code. As a base scheduling compiler, we experiment with three core scheduling techniques including enhanced pipeline scheduling, all-path speculation, and renaming. We analyze the characteristics of the useful and useless ALUs in each cycle to see how many of those ALUs execute non-speculative operations, speculative operations, and copies, respectively. Then, we evaluate the following compilation techniques: software pipelining, loop unrolling, non-greedy enhanced pipeline scheduling, profile-based all-path speculation, trace-based speculation, renaming, restricted speculative loads, and memory disambiguation. Since we experiment on a uniform testbed based on a detailed analysis of ALUs, our evaluation provides useful insight into the performance impact of these techniques.

  • Research Article
  • Citations: 33
  • 10.1023/a:1007935215591
Partitioning Processor Arrays under Resource Constraints
  • Sep 1, 1997
  • Journal of VLSI signal processing systems for signal, image and video technology
  • Jürgen Teich + 2 more

A single integer linear programming model for optimally scheduling partitioned regular algorithms is presented. The herein presented methodology differs from existing methods in the following capabilities: 1) Not only constraints on the number of available processors and communication capabilities are taken into account, but also local memories and constraints on the size of available memories. 2) Different types of processors can be handled. 3) The size of the optimization model (number of integer variables) is independent of the size of the tiles to be executed. Hence, 4) the number of integer variables in the optimization model is greatly reduced such that problems of relevant size can be solved in practical execution time.

  • Conference Article
  • Citations: 46
  • 10.1109/asap.1996.542808
Scheduling of partitioned regular algorithms on processor arrays with constrained resources
  • Aug 19, 1996
  • J Teich + 2 more

A single integer linear programming model for optimally scheduling partitioned regular algorithms is presented. The herein presented methodology differs from existing methods in the following capabilities: (1) Not only constraints on the number of available processors and communication capabilities are taken into account, but also processor caches and constraints on the size of available memories are modeled and taken into account in the optimization model. (2) Different types of processors can be handled. (3) The size of the optimization model (number of integer variables) is independent of the size of the tiles to be executed. Hence, (4) the number of integer variables in the optimization model is greatly reduced such that problems of relevant size can be solved in practical execution time.

  • Research Article
  • Citations: 70
  • 10.1016/j.cpc.2015.12.006
Hybrid OpenMP/MPI programs for solving the time-dependent Gross–Pitaevskii equation in a fully anisotropic trap
  • Dec 22, 2015
  • Computer Physics Communications
  • Bogdan Satarić + 5 more

  • Conference Article
  • Citations: 9
  • 10.2118/18408-ms
Improving the Performance of Parallel (and Serial) Reservoir Simulators
  • Feb 6, 1989
  • SPE Symposium on Reservoir Simulation
  • J Barua + 1 more

Parallel computers hold much promise for scientific computation. So a great deal of effort has been devoted to finding ways to parallelize linear equation solvers. However in fully implicit reservoir simulators the real problem is the solution of non-linear equations. This paper shows how a judicious combination of linear and non-linear solution techniques can lead to the fastest overall simulator. It uses a combination of an approximate iterative solution of the Jacobian and a Quasi-Newton method. The proposed method makes it possible to use the highly parallelizable Jacobi matrix solution techniques, which are poorly convergent, and still get good serial performance. Experiments on a parallel computer show that even with a highly parallel method, problem sizes need to be quite large to get good efficiency. The proposed method can also be used to speed up serial programs by simply using a good serial technique to iteratively solve the linear equations.

  • Conference Article
  • Citations: 2
  • 10.1109/icaee48663.2019.8975563
Performance Analysis of Cache Size and Set-Associativity using simpleScalar Benchmark
  • Sep 1, 2019
  • Zahid Ullah + 4 more

The ever-growing gap between the high-speed processor and slower main memory has always remained a performance bottleneck. Attempts have been made to address this challenge through the advent of smarter memories, an improved memory hierarchy, and deployment of high-speed bus controllers. Another important dimension is to add cache units to the processor in order to benefit from temporal, spatial, and even instruction locality, and therefore reduce the processor-memory gap, which ultimately results in a performance boost. In this paper, we demonstrate that increasing the cache size in order to reduce capacity misses further optimizes performance. We likewise increase the associativity and block size and thus reduce conflict and compulsory misses. We discuss the cache parameters set to achieve the minimum miss rate for the simpleScalar suite, including the Rijndael, Sha, Compress, Go, and Dijkstra benchmarks. Explicitly, we reduce the miss rate by using different cache configurations such as increasing the cache size, altering the block size, and varying the associativity. Notably, we achieve a minimal possible miss rate for the Rijndael benchmark, which is 104. The results show the validity of the 2:1 cache rule of thumb: a cache of size n and associativity x is equivalent to a cache of size n/2 and associativity 2x.
