Impact of Loop Unrolling on Area, Throughput and Clock Frequency for Window Operations Based on a Data Schedule Method
Window operations, which are both computationally intensive and data intensive, are frequently used in image compression, pattern recognition and digital signal processing. Reconfigurable hardware boards provide a convenient and flexible way to speed up these algorithms. This paper studies the effect of loop unrolling on area, clock speed and throughput, based on a data schedule method, to find the latent connections between these three metrics and loop unrolling. Our results indicate that, due to the unique design of the compilation framework, inner loop unrolling makes the controllers more complicated than outer loop unrolling does and also increases the area requirement. However, outer loop unrolling demands more memory elements to hold the reused data. The clock speed begins to decrease once the number of RAM modules grows to a certain size, and the throughput increases to different degrees for different operations.
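As an illustration of the two transformations the abstract compares, the following is a minimal software sketch, not the paper's hardware flow. It assumes a 3x3 window sum over a 2D image, with the outer loop iterating over output rows and the inner loop over output columns; the function names (`window_sum`, `rolled`, `inner_unrolled_by_2`, `outer_unrolled_by_2`) and the unroll factor of 2 are hypothetical choices made for the example.

```python
def window_sum(img, r, c, k=3):
    """Sum of the k x k window whose top-left corner is at (r, c)."""
    return sum(img[r + i][c + j] for i in range(k) for j in range(k))


def _output_shape(img, k):
    # Number of valid window positions along rows and columns.
    return len(img) - k + 1, len(img[0]) - k + 1


def rolled(img, k=3):
    """Reference version: no unrolling."""
    rows, cols = _output_shape(img, k)
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = window_sum(img, r, c, k)
    return out


def inner_unrolled_by_2(img, k=3):
    """Inner (column) loop unrolled by 2: two adjacent windows per iteration."""
    rows, cols = _output_shape(img, k)
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        c = 0
        while c + 1 < cols:
            out[r][c] = window_sum(img, r, c, k)
            out[r][c + 1] = window_sum(img, r, c + 1, k)
            c += 2
        if c < cols:  # remainder iteration when cols is odd
            out[r][c] = window_sum(img, r, c, k)
    return out


def outer_unrolled_by_2(img, k=3):
    """Outer (row) loop unrolled by 2 and jammed: two output rows per pass."""
    rows, cols = _output_shape(img, k)
    out = [[0] * cols for _ in range(rows)]
    r = 0
    while r + 1 < rows:
        for c in range(cols):
            out[r][c] = window_sum(img, r, c, k)
            out[r + 1][c] = window_sum(img, r + 1, c, k)
        r += 2
    if r < rows:  # remainder row when rows is odd
        for c in range(cols):
            out[r][c] = window_sum(img, r, c, k)
    return out
```

All three functions compute the same output; they differ only in how many windows are evaluated per loop iteration. In this sketch, the two windows of the inner-unrolled version overlap horizontally, while those of the outer-unrolled version overlap vertically, so reusing their shared data means keeping values across rows.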
- Book Chapter
25
- 10.1007/978-3-540-71431-6_11
- Mar 27, 2007
Window operations, which are both computationally intensive and data intensive, are frequently used in image compression, pattern recognition and digital signal processing. The efficiency of memory access often dominates overall computation performance, and the problem becomes increasingly crucial in reconfigurable systems. The challenge is to intelligently exploit data reuse on the reconfigurable fabric (FPGA) to minimize the required memory or memory bandwidth while maximizing parallelism. In this paper, we present a universal memory structure for high-level synthesis that automatically generates the hardware frames for all window processing applications. Compared with related work, our approach raises the clock frequency from 69 MHz to 238.7 MHz.
- Research Article
1
- 10.1080/1206212x.2008.11441881
- Jan 1, 2008
- International Journal of Computers and Applications
Window operations, which are both computationally intensive and data intensive, are frequently used in image compression, pattern recognition and digital signal processing. Reconfigurable hardware boards provide a convenient and flexible way to speed up these algorithms. In this paper, we design a three-level memory structure that fully realizes inner-loop and outer-loop data reuse in window operations, and use shift registers to simplify the hardware design. We then present a design space exploration algorithm that yields a high-performance design without going through the time-consuming hardware design process for each different algorithm. By finding three upper bounds, set by the area constraints, the memory bandwidth constraints and the on-chip memory constraints, we determine the block structure of the design that can fully utilize the available resources on the board.
- Conference Article
81
- 10.1145/997163.997199
- Jun 11, 2004
Balancing computation with I/O is considered a critical factor in the overall performance of embedded systems in general and of reconfigurable computing systems in particular. Data I/O often dominates the overall computation performance of window operations, which are frequently used in image processing, image compression, pattern recognition and digital signal processing. This problem is more acute in reconfigurable systems, since the compiler must generate both the data path and the sequence of operations. The challenge is to intelligently exploit data reuse on the reconfigurable fabric (FPGA) to minimize the required memory or I/O bandwidth while maximizing parallelism. In this paper, we present a compile-time approach to reusing data in window-based codes. The compiler, called ROCCC, first analyzes and optimizes the window operation in C. It then computes the size of the hardware buffer and defines three sets of data values for each window: the window set, the managed set and the killed set. This compile-time analysis simplifies the HDL code generation and improves the resulting hardware performance. We also discuss in-place window operations.
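The benefit of reusing window data can be shown with a small software sketch. It is an illustration only and does not reproduce ROCCC's analysis or its window, managed and killed sets. It assumes a 1D window sum of width `k`, and models the hardware buffer as a fixed-length `collections.deque` acting as a shift register; the function names and the read counters are hypothetical.

```python
from collections import deque


def naive_window_sums(data, k):
    """Re-read all k elements from 'memory' for every window position."""
    reads = 0
    out = []
    for i in range(len(data) - k + 1):
        window = data[i:i + k]  # k fresh reads per output
        reads += k
        out.append(sum(window))
    return out, reads


def reuse_window_sums(data, k):
    """Read each element once; keep the last k values in a shift register."""
    buf = deque(maxlen=k)  # the oldest value is dropped automatically
    reads = 0
    out = []
    for x in data:
        buf.append(x)  # exactly one new read per step
        reads += 1
        if len(buf) == k:  # buffer full: one window is available
            out.append(sum(buf))
    return out, reads
```

For an input of length n, the naive version performs k * (n - k + 1) reads, while the buffered version performs n reads and produces the same outputs. The price is k storage elements for the buffer.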
- Research Article
18
- 10.1145/998300.997199
- Jun 11, 2004
- ACM SIGPLAN Notices
Balancing computation with I/O is considered a critical factor in the overall performance of embedded systems in general and of reconfigurable computing systems in particular. Data I/O often dominates the overall computation performance of window operations, which are frequently used in image processing, image compression, pattern recognition and digital signal processing. This problem is more acute in reconfigurable systems, since the compiler must generate both the data path and the sequence of operations. The challenge is to intelligently exploit data reuse on the reconfigurable fabric (FPGA) to minimize the required memory or I/O bandwidth while maximizing parallelism. In this paper, we present a compile-time approach to reusing data in window-based codes. The compiler, called ROCCC, first analyzes and optimizes the window operation in C. It then computes the size of the hardware buffer and defines three sets of data values for each window: the window set, the managed set and the killed set. This compile-time analysis simplifies the HDL code generation and improves the resulting hardware performance. We also discuss in-place window operations.
- Conference Article
- 10.1145/2483028.2483134
- May 2, 2013
Many high-level algorithms in image compression, pattern recognition and digital signal processing consist of multiple loop nests. FPGAs provide a convenient and flexible way to speed up these loop-intensive algorithms. However, time-consuming FPGA reconfiguration is inevitable when switching between loop nests. This paper presents a parameterized pipeline template that executes all the loop nests in sequence without FPGA reconfiguration. Five steps are designed to decide the parameters. Experiments show that, for a given loop, the pipeline template achieves a cycle count comparable to that of a special-purpose hardware structure.
- Conference Article
4
- 10.1109/icfcc.2010.5497316
- Jan 1, 2010
A memory architecture is proposed to automatically explore the design space of data reuse for sliding window applications, in the context of FPGA-targeted hardware compilation. The sliding window operator is widely used in typical applications on reconfigurable systems, such as image processing, pattern recognition and digital signal processing. However, the sliding window circuit generated by a reconfigurable compiler system is inefficient, limited by redundant storage, waiting operations and so on. In this paper, we present a block-based storage data reuse method to increase data throughput in sliding window applications. By accessing the window data in parallel, our method reduces memory access time and improves the performance of the hardware circuit. Experiments show that, in three typical sliding window applications, this approach accelerates the sliding window circuit and improves program performance by a factor of 6.5 to 7.9.
- Conference Article
2
- 10.1109/icassp.1994.389587
- Apr 19, 1994
Neural networks have been successfully applied in many fields thanks to their learning and generalization capabilities and to their parallel processing and fault tolerance properties. Typical applications include image processing, pattern recognition and digital signal processing, such as adaptive filtering and channel equalization. The authors propose the use of neural networks as digital receivers for continuous phase modulation (CPM). Simulation results refer to the European GSM digital cellular radio system. The neural receiver performance has been evaluated for coherent detection, considering an additive white Gaussian noise (AWGN) channel, and compared with a maximum likelihood sequence estimator (MLSE) receiver based on the Viterbi algorithm. The paper also presents a hardware implementation of the proposed network based on a digital signal processor (DSP) and on a programmable gate array (PGA).
- Research Article
- 10.3390/electronics13081425
- Apr 10, 2024
- Electronics
Loop unrolling can provide more instruction-level parallelism opportunities for code and enables a greater range of instruction pipeline scheduling. In high-performance very-long-instruction-word (VLIW) digital signal processors (DSPs), there are special registers to address. To further improve the instruction-level parallelism of code for such DSPs by making full use of these registers, in this paper, we propose a more effective loop unrolling approach through extending memory accessing (LUAEMA). In this approach, the final unrolling factor is computed by a model in which every register kind and every memory accessing operation are considered. For basic digital signal processing algorithms, the unrolling factor under the LUAEMA is larger than that under the conventional loop unrolling approach. We also provide the opportunity to reduce the number of instructions in a loop during the code transformation of loop unrolling. The experimental results show that the loop unrolling approach proposed in this paper can achieve an average speedup ratio ranging from 1.14 to 1.81 compared with the conventional loop unrolling approach. For some algorithms, the peak speedup ratio is up to 2.11.
- Book Chapter
54
- 10.1007/11802839_48
- Jan 1, 2006
Loop unrolling is the main compiler technique that allows reconfigurable architectures to achieve large degrees of parallelism. However, loop unrolling increases the area and can potentially have a negative impact on clock cycle time. In most embedded applications, the critical parameter is the throughput. Loop unrolling can therefore have contradictory effects on the throughput. As a consequence, there exists, in general, a degree of unrolling that maximizes the throughput per unit area.
- Single Book
- 10.3390/books978-3-0365-1475-8
- Nov 1, 2021
Modern computer technology has opened up new opportunities for the development of digital signal processing methods. The applications of digital signal processing have expanded significantly and today include audio and speech processing, sonar, radar, and other sensor array processing, spectral density estimation, statistical signal processing, digital image processing, signal processing for telecommunications, control systems, biomedical engineering, and seismology, among others. This Special Issue is aimed at wide coverage of the problems of digital signal processing, from mathematical modeling to the implementation of problem-oriented systems. The basis of digital signal processing is digital filtering. Wavelet analysis implements multiscale signal processing and is used to solve applied problems of de-noising and compression. Processing of visual information, including image and video processing and pattern recognition, is actively used in robotic systems and industrial processes control today. Improving digital signal processing circuits and developing new signal processing systems can improve the technical characteristics of many digital devices. The development of new methods of artificial intelligence, including artificial neural networks and brain-computer interfaces, opens up new prospects for the creation of smart technology. This Special Issue contains the latest technological developments in mathematics and digital signal processing. The stated results are of interest to researchers in the field of applied mathematics and developers of modern digital signal processing systems.
- Book Chapter
- 10.1201/9781003127598-1-1
- Jul 14, 2021
The Internet of Things (IoT) has become an integral part of modern life. IoT-oriented platforms comprise digital signal processing (DSP) coprocessors, which suit low-power, high-performance applications better than traditional counterparts such as microprocessors. However, DSP coprocessors are not entirely designed in-house because of the global design supply chain, resulting in security threats at the hardware level. Prominent hardware security threats for such devices in IoT-oriented platforms include backdoor Trojan insertion and reverse engineering. This chapter discusses some of the standard structural obfuscation approaches used for securing dedicated DSP coprocessors, as well as the structural obfuscation approaches that make the DSP hardware unobvious (and uninterpretable) from an attacker’s perspective. More explicitly, state-of-the-art structural obfuscation approaches such as compiler-driven transformation techniques, hybrid transformation techniques, hologram-based obfuscation techniques and key-based structural obfuscation techniques are discussed. Adopting a distinct and integrated approach, the chapter elaborates on the transformation processes for structural obfuscation, such as logic transformation, tree height transformation, partitioning, loop unrolling, loop invariant code motion, folding knob, redundant operation elimination, and so on. Demonstrations use DSP applications such as the finite impulse response filter, the discrete cosine transform and other digital filters. A comparative analysis of the structural obfuscation approaches used for DSP applications is also presented.
- Dissertation
3
- 10.37099/mtu.dc.etds/181
- Jan 1, 2002
With increasing demands for performance by embedded systems, especially by digital signal processing (DSP) applications, embedded processors must increase available instruction-level parallelism (ILP) within significant constraints on power consumption and chip cost. Unfortunately, supporting a large amount of ILP on a processor while maintaining a single register file increases chip cost and potentially decreases overall performance due to increased cycle time. To address this problem, some modern embedded processors partition the register file into multiple low-ported register files, each directly connected with one or more functional units. These functional unit/register file groups are called clusters. Clustered VLIW (very long instruction word) architectures need extra copy operations or delays to transfer values among clusters. To take advantage of clustered architectures, the compiler must expose parallelism for maximal functional-unit utilization, and schedule instructions to reduce intercluster communication overhead. High-level loop transformations offer an excellent opportunity to enhance the abilities of low-level optimizers to generate code for clustered architectures. This dissertation investigates the effects of three loop transformations, i.e., loop fusion, loop unrolling, and unroll-and-jam, on clustered VLIW architectures. The objective is to achieve high performance with low communication overhead. This dissertation discusses the following techniques. Loop fusion: this research examines the impact of loop fusion on clustered architectures. A metric based upon communication costs for guiding loop fusion is developed and tested on DSP benchmarks. Unroll-and-jam and loop unrolling: a new method that integrates a communication cost model with an integer-optimization problem is developed to determine unroll amounts for loop unrolling and unroll-and-jam automatically for a specific loop on a specific architecture.
These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures.
- Research Article
7
- 10.1007/s40799-020-00395-4
- Aug 12, 2020
- Experimental Techniques
This research focuses on the verification of the viability of image compression in infrared thermography in order to address the problem of data storage. Specifically, images from vibrothermographic tests were utilized due to their special characteristics compared to the results from alternative infrared thermography techniques, which are able to introduce additional uncertainties to the compression process. In this research, an adaptive algorithm based on the lifting discrete wavelet transform and set-partitioning embedded blocks was used for image compression. Five different methods, namely the compression ratio, mean squared error, peak signal-to-noise ratio, structural similarity index and coordinate modal assurance criterion, were applied to evaluate the performance of the compression process while identifying and locating the regions affected more significantly after image compression. Feature extraction through the independent component analysis was then applied to the images to separate the features such as the hot spots so that the influence from the image compression process on each important feature could be evaluated independently. In this article, the theoretical background of the applied data processing techniques is firstly presented. Through two sets of data acquired from vibrothermographic tests on an aerospace-grade composite plate containing delamination, the effects of the image compression process on the relevant hot spots are evaluated, and the effectiveness of the compression process is verified. It is demonstrated that the compression process was able to reduce the size of the images significantly without adversely affecting the quality of the important features indicating the presence of damage. The major characteristics of the key features have been successfully preserved after effective image compression.
- Conference Article
7
- 10.1109/iscas.1991.176317
- Jan 1, 1991
The authors describe code optimization techniques newly implemented for a DSP (digital signal processor) knowledge-based compiler which effectively enhances the DSP's internal parallelism. A novel systematic parallel memory allocation method is introduced, and an efficient register allocation method is newly implemented in the rule-based code generator. Efficient loop code can be generated by using the developed code generator with a loop optimization method such as loop unrolling. With these code optimization methods for DSPs, the compiler generates code comparable to code written by hand by programmers for the NEC77230/240 families.
- Conference Article
- 10.1109/miel.2012.6222893
- May 1, 2012
Digital signal processing (DSP) is one of the fastest growing fields in modern electronics. Today DSP processors appear in numerous application areas, such as communications, electro-medicine and multimedia. One of the crucial problems to be solved in these applications is filtering, for which efficient algorithms are of paramount importance. This paper considers the creation of time-efficient algorithms. To this end, several FIR filtering algorithms were written in the high-level language C and in assembly (ASM) and optimized with respect to execution time. The optimizations used include loop unrolling, software pipelining and code reordering. All proposed code optimization techniques are implemented on the Texas Instruments (TI) TMS320C6713 DSP processor, with Code Composer Studio as the software tool. Performance is evaluated by the speedup factor. The results show speedups ranging from 1.97 to 5.36.
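The loop unrolling transformation applied to an FIR filter can be sketched as follows. This is an illustration of the code shape only, written in Python rather than the C and assembly used in the paper, so it will not reproduce the reported speedups. The function names, the unroll factor of 4 and the direct-form structure are assumptions made for the example.

```python
def fir(x, h):
    """Direct-form FIR filter: y[n] = sum over k of h[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def fir_unrolled4(x, h):
    """Same filter with the tap loop unrolled by a factor of 4."""
    y = []
    for n in range(len(x)):
        acc = 0
        kmax = min(len(h), n + 1)  # taps for which n - k >= 0
        k = 0
        while k + 3 < kmax:
            # Four multiply-accumulates per iteration: fewer loop tests,
            # and four independent products a DSP could schedule in parallel.
            acc += (h[k] * x[n - k]
                    + h[k + 1] * x[n - k - 1]
                    + h[k + 2] * x[n - k - 2]
                    + h[k + 3] * x[n - k - 3])
            k += 4
        while k < kmax:  # remainder taps
            acc += h[k] * x[n - k]
            k += 1
        y.append(acc)
    return y
```

With integer inputs the two functions return identical results. With floating-point inputs the unrolled version adds the products in a different order, so the outputs may differ in the last bits.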