Combining the quantized color instruction set and loop unrolling on portable video processing systems
As wireless video products evolve, they demand more sophisticated processing at higher resolutions and frame rates. Computational performance and energy efficiency have become critical design issues. This paper presents the Quantized Color Pack eXtension (QCPX) combined with a loop unrolling (LU) technique to improve execution performance and energy efficiency of color image and video processing applications. QCPX applied to a 32-bit datapath processor supports parallel operations on two packed 16-bit YCbCr (Y: luminance, Cb and Cr: chrominance) color pixels, providing greater subword-level parallelism by increasing the number of smaller color pixels packed into a word. Instruction-level parallelism can be further enhanced through loop unrolling. These techniques provide greater performance and efficiency for multimedia workloads on mobile systems. Experimental results on a set of media benchmark applications indicate that the LU plus QCPX-optimized version achieves a speedup ranging from 3.8 to 7.9 while reducing energy consumption by 76% to 87% over the baseline version on identically configured, dynamically scheduled ILP superscalar processors. The LU plus QCPX-optimized version also outperforms the LU plus MDMX-like (MIPS’s multimedia extension) version.
- Research Article
- 10.1145/2666357.2597825
- May 5, 2014
- ACM SIGPLAN Notices
Recent studies show that very long instruction word (VLIW) architectures, which inherently have wide datapaths (e.g., 128 or 256 bits for one VLIW instruction word), can benefit from dynamic implied addressing mode (DIAM) and can achieve lower power consumption and smaller code size with a small performance overhead. Such overhead, which is claimed to be small, is mainly caused by the execution of additionally generated special instructions for conveying information that cannot be encoded in reduced instruction bit-width. In this paper, however, we show that the performance impact of applying DIAM on VLIW architecture cannot be overlooked, especially when applications possess a high level of instruction-level parallelism (ILP), which is mostly the case for loops as a result of aggressive code scheduling. We also propose a way to relieve the performance degradation, focusing especially on loops, since programs spend almost 90% of their total execution time in loops and loops tend to have high ILP. We first implement the original DIAM compilation technique in a compiler, and augment it with the proposed loop optimization scheme to show that ours can clearly alleviate the performance loss caused by the excessive number of additional instructions, with the help of slightly modified hardware. Moreover, the well-known loop unrolling scheme, which would produce denser code in loops at the cost of substantial code size bloating, is integrated into our compiler. The experimental results show that the loop unrolling technique, combined with our augmented DIAM scheme, produces far better code in terms of performance with quite an acceptable amount of code increase.
- Conference Article
- 10.1145/2597809.2597825
- Jun 12, 2014
Recent studies show that very long instruction word (VLIW) architectures, which inherently have wide datapaths (e.g., 128 or 256 bits for one VLIW instruction word), can benefit from dynamic implied addressing mode (DIAM) and can achieve lower power consumption and smaller code size with a small performance overhead. Such overhead, which is claimed to be small, is mainly caused by the execution of additionally generated special instructions for conveying information that cannot be encoded in reduced instruction bit-width. In this paper, however, we show that the performance impact of applying DIAM on VLIW architecture cannot be overlooked, especially when applications possess a high level of instruction-level parallelism (ILP), which is mostly the case for loops as a result of aggressive code scheduling. We also propose a way to relieve the performance degradation, focusing especially on loops, since programs spend almost 90% of their total execution time in loops and loops tend to have high ILP. We first implement the original DIAM compilation technique in a compiler, and augment it with the proposed loop optimization scheme to show that ours can clearly alleviate the performance loss caused by the excessive number of additional instructions, with the help of slightly modified hardware. Moreover, the well-known loop unrolling scheme, which would produce denser code in loops at the cost of substantial code size bloating, is integrated into our compiler. The experimental results show that the loop unrolling technique, combined with our augmented DIAM scheme, produces far better code in terms of performance with quite an acceptable amount of code increase.
- Book Chapter
19
- 10.1016/b978-0-12-384988-5.00034-6
- Jan 1, 2011
- GPU Computing Gems Emerald Edition
Chapter 34 - Experiences on Image and Video Processing with CUDA and OpenCL
- Book Chapter
34
- 10.1007/3-540-61053-7_53
- Jan 1, 1996
A well-known code transformation for improving the run-time performance of a program is loop unrolling. The most obvious benefit of unrolling a loop is that the transformed loop usually requires fewer instruction executions than the original loop. The reduction in instruction executions comes from two sources: the number of branch instructions executed is reduced, and the control variable is modified fewer times. In addition, for architectures with features designed to exploit instruction-level parallelism, loop unrolling can expose greater levels of instruction-level parallelism. Loop unrolling is an effective code transformation often improving the execution performance of programs that spend much of their execution time in loops by 10 to 30 percent. Possibly because of the effectiveness of a simple application of loop unrolling, it has not been studied as extensively as other code improvements such as register allocation or common subexpression elimination. The result is that many compilers employ simplistic loop unrolling algorithms that miss many opportunities for improving run-time performance. This paper describes how aggressive loop unrolling is done in a retargetable optimizing compiler. Using a set of 32 benchmark programs, the effectiveness of this more aggressive approach to loop unrolling is evaluated. The results show that aggressive loop unrolling can yield additional performance increase of 10 to 20 percent over the simple, naive approaches employed by many production compilers.
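The two sources of savings named in this abstract (fewer branches and fewer control-variable updates) can be seen in a minimal hand-written sketch. This is not the paper's compiler algorithm; the function names and the unroll factor of 4 are assumptions chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled loop: one compare-and-branch and one index update per element. */
int64_t sum_rolled(const int32_t *a, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 (illustrative factor): one compare-and-branch and one
   index update per four elements. */
int64_t sum_unrolled4(const int32_t *a, size_t n)
{
    int64_t s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions return the same sum for any input. The remainder loop is what keeps the unrolled version correct when the trip count is not a multiple of the unroll factor.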
- Book Chapter
5
- 10.1007/11572961_10
- Jan 1, 2005
Application-specific extensions of a processor provide an efficient mechanism to meet the growing performance demands of multimedia applications. This paper presents a color-aware instruction set extension (CAX) for embedded multimedia systems that supports vector processing of color image sequences. CAX supports parallel operations on two-packed 16-bit (6:5:5) YCbCr (luminance-chrominance) data in a 32-bit datapath processor, providing greater concurrency and efficiency for color image and video processing. Unlike typical multimedia extensions (e.g., MMX, VIS, and MDMX), CAX harnesses parallelism within the human perceptual YCbCr space, rather than depending solely on generic subword parallelism. Experimental results on an identically configured, dynamically scheduled 4-way superscalar processor indicate that CAX outperforms MDMX (a representative MIPS multimedia extension) in terms of speedup (3.9× with CAX, but only 2.1× with MDMX over the baseline performance) and energy reduction (68% to 83% reduction with CAX, but only 39% to 69% reduction with MDMX over the baseline). More exhaustive simulations are conducted to provide an in-depth analysis of CAX on machines with varying issue widths, ranging from 1 to 16 instructions per cycle. The impact of the CAX plus loop unrolling is also presented.
Keywords: Video Processing; Loop Unroll; Instruction Count; Superscalar Processor; Vector Median Filter
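To make the 6:5:5 packing concrete, the following sketch quantizes an 8-bit Y, Cb, Cr triple to 6, 5 and 5 bits and places two such 16-bit pixels in one 32-bit word. The abstract does not specify the bit layout, so the field order used here (Y in the top 6 bits, then Cb, then Cr) and all function names are assumptions for illustration only, not the CAX encoding.

```c
#include <stdint.h>

/* Quantize 8-bit Y, Cb, Cr to 6, 5, 5 bits and pack them into 16 bits.
   Assumed layout: Y in bits 15..10, Cb in bits 9..5, Cr in bits 4..0. */
uint16_t pack_655(uint8_t y, uint8_t cb, uint8_t cr)
{
    return (uint16_t)(((y >> 2) << 10) | ((cb >> 3) << 5) | (cr >> 3));
}

/* Place two 16-bit pixels in one 32-bit word: p0 in the low half
   (lane 0), p1 in the high half (lane 1). */
uint32_t pack_two(uint16_t p0, uint16_t p1)
{
    return ((uint32_t)p1 << 16) | p0;
}

/* Extract the 6-bit Y field of the pixel in lane 0 or lane 1. */
uint8_t y_of(uint32_t word, int lane)
{
    return (uint8_t)((word >> (lane * 16 + 10)) & 0x3F);
}

/* Extract the 5-bit Cb field of the pixel in lane 0 or lane 1. */
uint8_t cb_of(uint32_t word, int lane)
{
    return (uint8_t)((word >> (lane * 16 + 5)) & 0x1F);
}

/* Extract the 5-bit Cr field of the pixel in lane 0 or lane 1. */
uint8_t cr_of(uint32_t word, int lane)
{
    return (uint8_t)((word >> (lane * 16)) & 0x1F);
}
```

With this layout a single 32-bit register holds both pixels, which is what would let one instruction operate on all six color fields at once. The plain C above only shows the data format and does not emulate such an instruction.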
- Conference Article
- 10.1109/sips.2003.1235647
- Oct 14, 2003
High instruction throughput and energy efficiency are becoming increasingly important design requirements for embedded and mobile computing systems. The paper presents the quantized color pack extension (QCPX) ISA to improve execution performance of multimedia processing applications on programmable superscalar processors while reducing the energy consumption for these applications. QCPX exploits parallelism within the color space representation (YCbCr - luminance-chrominance) in addition to generic subword parallelism exploited by existing multimedia instruction set extensions (e.g., MMX, SSE, MDMX). We evaluate the performance (execution time in cycles) and energy consumption using QCPX on a media benchmark suite that includes vector median filter, scalar median filter, edge detection, and vector quantization. Our experimental results indicate that a 32-bit QCPX version achieves speedups ranging from 205% to 562% over a 32-bit baseline RISC version and 90% to 100% over the 32-bit MDMX-like version on identically configured, dynamically scheduled ILP superscalar processors. In addition, the QCPX version reduces energy consumption by 69% to 83% over the baseline version and by 47% to 50% over the MDMX-like version due to the significant reduction of executed instructions and cache accesses.
- Conference Article
27
- 10.5555/225160.225184
- Dec 1, 1995
Exploitation of instruction-level parallelism is an effective mechanism for improving the performance of modern super-scalar/VLIW processors. Various software techniques can be applied to increase instruction-level parallelism. This paper describes and evaluates a software technique, dynamic memory disambiguation, that permits loops containing loads and stores to be scheduled more aggressively, thereby exposing more instruction-level parallelism. The results of our evaluation show that when dynamic memory disambiguation is applied in conjunction with loop unrolling, register renaming, and static memory disambiguation, the ILP of memory-intensive benchmarks can be increased by as much as 300 percent over loops where dynamic memory disambiguation is not performed. Our measurements also indicate that for the programs that benefit the most from these optimizations, the register usage does not exceed the number of registers on most high-performance processors.
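The general idea of a run-time alias check guarding an aggressively scheduled loop can be sketched at the source level as follows. This is a hand-written illustration under stated assumptions, not the paper's compiler technique: the function names are hypothetical, the unroll factor of 2 is arbitrary, and the overlap test compares addresses through `uintptr_t`, which is implementation-defined in C but works on common flat-address platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-element ranges starting at dst and src do not overlap.
   Assumes a flat address space (pointer-to-integer conversion is
   implementation-defined). */
int no_overlap(const int32_t *dst, const int32_t *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    size_t bytes = n * sizeof(int32_t);
    return d + bytes <= s || s + bytes <= d;
}

/* Computes dst[i] = src[i] + k for i in [0, n), with sequential semantics. */
void add_const(int32_t *dst, const int32_t *src, size_t n, int32_t k)
{
    size_t i = 0;
    if (no_overlap(dst, src, n)) {
        /* Fast path: the check proved the arrays are disjoint, so the
           load of src[i + 1] may be moved above the store to dst[i]. */
        for (; i + 2 <= n; i += 2) {
            int32_t x0 = src[i];
            int32_t x1 = src[i + 1];
            dst[i] = x0 + k;
            dst[i + 1] = x1 + k;
        }
    }
    /* Conservative path (and remainder): loads and stores stay in order. */
    for (; i < n; i++)
        dst[i] = src[i] + k;
}
```

When the ranges overlap, reordering the loads could read a value before the preceding store has updated it, so the conservative loop is the only safe choice. The check moves that decision from compile time to run time.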
- Conference Article
52
- 10.1109/micro.1995.476820
- Nov 1, 1995
Exploitation of instruction-level parallelism is an effective mechanism for improving the performance of modern super-scalar/VLIW processors. Various software techniques can be applied to increase instruction-level parallelism. This paper describes and evaluates a software technique, dynamic memory disambiguation, that permits loops containing loads and stores to be scheduled more aggressively, thereby exposing more instruction-level parallelism. The results of our evaluation show that when dynamic memory disambiguation is applied in conjunction with loop unrolling, register renaming, and static memory disambiguation, the ILP of memory-intensive benchmarks can be increased by as much as 300 percent over loops where dynamic memory disambiguation is not performed. Our measurements also indicate that for the programs that benefit the most from these optimizations, the register usage does not exceed the number of registers on most high-performance processors.
- Research Article
8
- 10.1287/ijoc.6.1.94
- Feb 1, 1994
- ORSA Journal on Computing
Interior point algorithms for linear programming achieve significant reductions in computer time over earlier methods for many large linear programming problems (LPs) and solve problems larger than previously possible. The most computationally intensive step in each iteration of any interior point algorithm is the numerical factorization of a sparse, symmetric, positive definite matrix. In large or relatively dense problems, 80–90% or more of computational time is spent in this step. This study describes our implementations of two algorithms for performing this factorization, the column Cholesky and the multifrontal methods, based on graph theory applied to sparse symmetric matrices. We use advanced techniques such as loop unrolling and equivalent sparse matrix reordering to improve the performance of the factorization step. Our studies are incorporated into an implementation of the primal-dual barrier algorithm. Computational experiments on relatively large LPs on a DECstation 3100 demonstrate that the primal-dual barrier algorithm using our advanced column Cholesky outperforms by 20–60% the same algorithm using a straightforward column Cholesky. Also, our multifrontal method using the loop unrolling technique in a partial inner product routine shows a 10–50% speedup compared with no loop unrolling. The two methods exhibit comparable overall performance. INFORMS Journal on Computing, ISSN 1091-9856, was published as ORSA Journal on Computing from 1989 to 1995 under ISSN 0899-1499.
- Conference Article
4
- 10.1109/cise.2009.5363077
- Dec 1, 2009
This paper first presents a new architecture of SHA-1, which achieved the theoretical upper bound on throughput in the iterative architecture. Then, based on the general proposed architecture, this paper implemented two other kinds of pipelined architectures, which are based on the iterative technique and the loop unrolling technique respectively. The latter, with a 40-stage pipeline, reached a throughput of up to 76.195 Gbps on an Altera Stratix II GX EP2SGX90FF FPGA. To the authors' knowledge, this is the fastest published FPGA-based design at the time of writing. Finally, the proposed designs are compared with other published SHA-1 designs; the designs in this paper have obvious advantages in both speed and area. Based on the analysis of the SHA-1 algorithm and the previous publications, this paper optimized the critical path of SHA-1 first, whose delay was equal to the iteration round of SHA-1 proposed in (8). Then, based on the general proposed architecture, this paper presented two other kinds of pipelined architectures, which were based on the iterative technique and the loop unrolling technique. Compared with other publications, the presented designs in this paper have obvious advantages in both speed and logic requirements. The paper is organized as follows: A description of SHA-1 is presented in Section 2. The optimized architecture of SHA-1 is introduced and analyzed in Section 3. The loop unrolling and pipelining techniques are analyzed in Section 4. Two different kinds of pipelined architectures are implemented on FPGA in Section 5, and the results of the proposed designs are described and compared to other published implementations in Section 6.
- Conference Article
152
- 10.1145/300979.300990
- May 1, 1999
This paper aims to provide a quantitative understanding of the performance of image and video processing applications on general-purpose processors, without and with media ISA extensions. We use detailed simulation of 12 benchmarks to study the effectiveness of current architectural features and identify future challenges for these workloads. Our results show that conventional techniques in current processors to enhance instruction-level parallelism (ILP) provide a factor of 2.3X to 4.2X performance improvement. The Sun VIS media ISA extensions provide an additional 1.1X to 4.2X performance improvement. The ILP features and media ISA extensions significantly reduce the CPU component of execution time, making 5 of the image processing benchmarks memory-bound. The memory behavior of our benchmarks is characterized by large working sets and streaming data accesses. Increasing the cache size has no impact on 8 of the benchmarks. The remaining benchmarks require relatively large cache sizes (dependent on the display sizes) to exploit data reuse, but derive less than 1.2X performance benefits with the larger caches. Software prefetching provides 1.4X to 2.5X performance improvement in the image processing benchmarks where memory is a significant problem. With the addition of software prefetching, all our benchmarks revert to being compute-bound.
- Conference Article
- 10.23919/elinfocom.2018.8330584
- Jan 1, 2018
To improve the overall performance of computer systems, instruction-level parallelism (ILP) has been widely exploited. However, branch hazards, conditional and unconditional, still limit the efficiency of most ILP techniques. Compiler techniques such as loop unrolling, software pipelining, and trace scheduling are being used to increase the amount of parallelism available in systems with fairly predictable branches, while predicated instructions have been useful in eliminating branch hazards in specific cases. The limitations imposed on ILP by branch hazards, however, are significant in large blocks of codes or, at best, hidden at the expense of processor resources. As a result, researchers are exploring the techniques of approximate computing, which when applied, would be suitable for only fault-tolerant systems. Some are also working on the methods of code approximation, which mainly involves hazard minimization by distribution over specific parts of code segments. In this work, we propose and demonstrate a novel branch hazard distribution technique - Symbolic Execution using Approximate Computing (SEAC). We applied the proposed technique to a test program and ran simulation experiments using the Detailed CPU model in gem5 simulator. Simulation results show that SEAC is 3.57, 1.95 and 1.32 times better than the best, among the tested conventional ILP techniques, based on speedup, energy saving, and branch hazard distribution coefficient respectively.
- Research Article
- 10.36948/ijfmr.2021.v03i04.37540
- Jul 8, 2021
- International Journal For Multidisciplinary Research
Modern computational workloads demand exceptional performance and efficiency, necessitating the effective utilization of advanced CPU features such as SIMD (Single Instruction Multiple Data), instruction-level parallelism (ILP), and branch prediction. This paper explores optimization techniques that address inefficiencies at the algorithmic, architectural, and system levels, enabling software to align with hardware capabilities. Key techniques include resolving data dependencies, enhancing memory locality, utilizing compiler intrinsics, applying tail call optimizations, and employing strategies like loop unrolling, blocking, vectorization, and function inlining. Tail call optimization and breaking dependency chains are analyzed to improve parallelism and reduce processing overhead. Both manual and compiler-driven approaches are evaluated, providing insights into their trade-offs and synergies. Experimental results from benchmarks, such as matrix multiplication and particle simulations, demonstrate significant gains, with up to a 3x increase in instructions per cycle (IPC) and a 40% reduction in execution time. These findings highlight the critical role of optimizing software for architectural features like cache hierarchies, pipelining, and vector widths. This study provides techniques to maximize CPU efficiency, bridging the gap between hardware potential and software performance. Future directions include extending these methodologies to hybrid architectures like GPUs and integrating machine learning models for dynamic runtime optimization.
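One common way to break a dependency chain, combined with loop unrolling, is to split a single accumulator into several independent ones. The sketch below is a generic illustration, not code from the paper; the function names and the choice of four accumulators are assumptions.

```c
#include <stddef.h>

/* Single accumulator: every addition depends on the previous one,
   forming one long dependency chain. */
double dot_chain(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Four independent accumulators (illustrative count): the four
   additions in each iteration do not depend on one another. */
double dot_split4(const double *x, const double *y, size_t n)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i] * y[i];
        a1 += x[i + 1] * y[i + 1];
        a2 += x[i + 2] * y[i + 2];
        a3 += x[i + 3] * y[i + 3];
    }
    /* Remainder elements go into the first accumulator. */
    for (; i < n; i++)
        a0 += x[i] * y[i];
    return (a0 + a1) + (a2 + a3);
}
```

Because floating-point addition is not associative, the split version can differ from the single-accumulator version in the last bits for general inputs. That is one reason compilers usually apply this reassociation only when explicitly permitted.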
- Single Report
2
- 10.21236/ada326916
- Jun 1, 1997
Many of today's high-performance computer processors are super-scalar. They can dispatch multiple instructions per cycle and, hence, provide what is commonly referred to as instruction-level parallelism. This super-scalar capability, combined with software pipelining, can increase processor throughput dramatically. Achieving maximum throughput, however, is nontrivial. Compilers must engage in aggressive optimization techniques, such as loop unrolling, speculative code motion, etc., to structure code to take full advantage of the underlying computer architecture. The phase-ordering implications of these optimizations are not well understood and are the subject of continuing research. Of particular interest are optimizations that enhance instruction-level parallelism. Two of these are loop unrolling and loop fusion. These are source-level optimizations that can be performed by either the programmer or the compiler. These optimizations have dramatic effects on the compiler's instruction scheduler. Performed too aggressively, these optimizations can increase register pressure and result in costly memory references. This paper details experiments performed to measure the effects of these source-level code transformations and how they influenced register pressure and code performance.
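Loop fusion, the second source-level transformation named in this abstract, can be illustrated with a small hand-written example. The function names and the particular computation are assumptions made for illustration, and the arrays are assumed not to overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Two separate loops: a[] is written in full by the first loop and
   then read back from memory by the second. */
void unfused(int32_t *a, int32_t *b, const int32_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = c[i] * 2;
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] + 1;
}

/* Fused loop: one pass over the data; the intermediate value stays in
   the local variable t instead of being reloaded from a[]. */
void fused(int32_t *a, int32_t *b, const int32_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t t = c[i] * 2;
        a[i] = t;
        b[i] = t + 1;
    }
}
```

The fused body keeps more values live at once. Fusing many loops, or combining fusion with a large unroll factor, can therefore raise register pressure, which is the cost the report sets out to measure.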
- Research Article
- 10.3390/electronics13081425
- Apr 10, 2024
- Electronics
Loop unrolling can provide more instruction-level parallelism opportunities for code and enables a greater range of instruction pipeline scheduling. In high-performance very-long-instruction-word (VLIW) digital signal processors (DSPs), there are special registers to address. To further improve the instruction-level parallelism of code for such DSPs by making full use of these registers, in this paper, we propose a more effective loop unrolling approach through extending memory accessing (LUAEMA). In this approach, the final unrolling factor is computed by a model in which every register kind and every memory accessing operation are considered. For basic digital signal processing algorithms, the unrolling factor under the LUAEMA is larger than that under the conventional loop unrolling approach. We also provide the opportunity to reduce the number of instructions in a loop during the code transformation of loop unrolling. The experimental results show that the loop unrolling approach proposed in this paper can achieve an average speedup ratio ranging from 1.14 to 1.81 compared with the conventional loop unrolling approach. For some algorithms, the peak speedup ratio is up to 2.11.