Loop unrolling in multi-pipeline ASIP design
Application Specific Instruction-set Processor (ASIP) design is one of the popular processor design techniques for embedded systems, allowing customizability in processor design without overly hindering design flexibility. Multi-pipeline ASIPs were proposed to improve the performance of such systems by compromising between speed and processor area. One of the problems in multi-pipeline design is the limited inherent instruction level parallelism (ILP) available in applications. The ILP of application programs can be improved via a compiler optimization technique known as loop unrolling. In this paper, we present how loop unrolling affects the performance of multi-pipeline ASIPs. The improvements in performance average around 15% for a number of benchmark applications, with a maximum improvement of around 30%. In addition, we analyze the variation of performance against the loop unrolling factor, which is the amount of unrolling we perform.
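As a rough illustration of the transformation the abstract refers to (not the authors' code or benchmarks), the sketch below unrolls a dot-product loop by a factor of 4. The function names, the choice of kernel, and the factor of 4 are all assumptions made for this example. In Python the rewrite only shows the shape of the transformation; any ILP gain would come from a compiler scheduling the four independent multiply-accumulates onto parallel pipelines.

```python
def dot_rolled(a, b):
    """Reference loop: one multiply-accumulate per iteration."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Same computation unrolled by a factor of 4 (hypothetical factor)."""
    n = len(a)
    # Four separate accumulators remove the dependence between
    # consecutive multiply-accumulates inside one unrolled iteration.
    s0 = s1 = s2 = s3 = 0
    main_end = n - n % 4  # last index covered by full groups of 4
    for i in range(0, main_end, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    total = s0 + s1 + s2 + s3
    # Remainder loop handles the elements left when n is not a multiple of 4.
    for i in range(main_end, n):
        total += a[i] * b[i]
    return total
```

The unrolled body exposes four operations with no dependence on one another, and the loop counter and branch are executed a quarter as often. A larger factor exposes more independent work but also grows the code, which is one reason performance can vary with the unrolling factor.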
- Conference Article
- 10.1109/iscas.2007.378780
- May 1, 2007
This paper describes the design of application specific instruction-set processors (ASIPs). We implement three ASIPs: a signal processor for OFDM communication systems (SPOCS), a video specific instruction-set processor (VSIP) and a digital audio specific instruction-set processor (DASIP). Our ASIPs have novel instructions and optimized hardware architectures for specific applications. Compared with commercial DSP chips, our ASIPs can have a much smaller area and dramatically fewer memory accesses, which results in low power consumption. All of the proposed ASIPs have been thoroughly verified using the Xilinx XC2v6000 FPGA, and one of the implemented ASIPs has already been employed in a digital home theater SoC.
- Conference Article
8
- 10.1109/iscas.2005.1465387
- May 23, 2005
Until recently, application specific instruction-set processor (ASIP) design was very costly and complex. Now, ASIP circuits are much easier to develop with technologies like Tensilica and Altera configurable processors that provide tools enabling effective generation of RTL (register transfer level) code for ASIPs. On the other hand, the design of effective ASIPs is still time-consuming, because existing methodologies largely rely on designers' knowledge for design space exploration. The paper describes a methodology to help design ASIPs. An iterative profiling-driven method based on detection and acceleration of application bottlenecks with specialized instructions is proposed. This method is applied to the design of an ASIP adapted for a video processing algorithm - the Wiener filter. The acceleration reached with our method on this application is very significant, with a speedup factor larger than 10 over optimized software code.
- Research Article
2
- 10.1080/21681724.2018.1477182
- May 25, 2018
- International Journal of Electronics Letters
Hash functions are vital in many applications such as digital fingerprinting, Internet communications, bank transactions and so forth. RACE Integrity Primitives Evaluation Message Digest-160 (RIPEMD-160) is one of the most widely applicable hash functions, and several structures for implementing it with an Application-Specific Integrated Circuit (ASIC) approach have been reported in the literature. Application-Specific Instruction Set Processor (ASIP) design makes a compromise between ASIC and Digital Signal Processing approaches with respect to speed, cost and flexibility. Because of this unique property of the ASIP method, an ASIP for the RIPEMD-160 hash algorithm is presented in this article for the first time. A special Register Configuration (RC) for the RIPEMD-160 hash algorithm is developed, and its Instruction Set Architecture (ISA) includes 12 specific and 35 general instructions. The proposed ASIP is simulated in VHDL at the behavioural level of abstraction, and a typical assembly code is written to show how the proposed ASIP performs the hash function. Moreover, implementation results on a Virtex5 Field Programmable Gate Array (FPGA) platform show the superiority of the proposed processor in terms of performance against its counterparts.
- Conference Article
13
- 10.1109/rsp.2006.21
- Jun 14, 2006
Nowadays, architecture description languages (ADLs) are becoming popular as a way to achieve quick and optimal design convergence during the development of application specific instruction-set processors (ASIPs). Verification, in various stages of such ASIP development, is a major bottleneck hindering widespread acceptance of the ADL-based processor design approach. Traditional verification of processors is only applied at register transfer level (RTL) or below. In the context of ADL-based ASIP design, this verification approach is often inconvenient and error-prone, since design and verification are done at different levels of abstraction. In this paper, this problem is addressed by presenting an integrated verification approach during ADL-driven processor design. Our verification flow includes the idea of automatic assertion generation during high-level synthesis and support for automatic test generation utilizing the ADL framework for ASIP design. We show the benefit of our approach by trapping errors in a pipelined SPARC-compliant processor architecture.
- Research Article
7
- 10.1016/j.mejo.2008.05.009
- Jul 7, 2008
- Microelectronics Journal
Integrated verification approach during ADL-driven processor design
- Conference Article
1
- 10.5555/517554.825764
- Oct 15, 2000
Application-specific instruction set processor (ASIP) design is a promising technique to meet the performance and cost goals of high-performance systems. ASIPs are especially valuable for embedded computing (e.g., digital cameras, color printers, cellular phones, etc.) where a small increase in performance and decrease in cost can have a large impact on a product's viability. Sutherland, Sproull, and Molnar have proposed a processor organization called the counterflow pipeline (CFP) that is appropriate for ASIP design due to its simple and regular structure, local control and communication, and high degree of modularity. This paper describes a new CFP architecture, called the wide counterflow pipeline (WCFP) that extends the original proposal to be better suited for custom embedded instruction-level parallel processors. This work presents a novel and practical application of the CFP to automatic and quick turn-around design of ASIPs. The paper introduces the WCFP architecture and describes several microarchitecture enhancements needed to get good performance from custom WCFPs. We demonstrate that custom WCFPs have performance that is up to four times better than that of ASIPs based on the original CFP.
- Conference Article
1
- 10.1109/pact.2000.888331
- Nov 8, 2002
Application-specific instruction set processor (ASIP) design is a promising technique to meet the performance and cost goals of high-performance systems. ASIPs are especially valuable for embedded computing (e.g., digital cameras, color printers, cellular phones, etc.) where a small increase in performance and decrease in cost can have a large impact on a product's viability. Sutherland, Sproull, and Molnar have proposed a processor organization called the counterflow pipeline (CFP) that is appropriate for ASIP design due to its simple and regular structure, local control and communication, and high degree of modularity. This paper describes a new CFP architecture, called the wide counterflow pipeline (WCFP), that extends the original proposal to be better suited for custom embedded instruction-level parallel processors. This work presents a novel and practical application of the CFP to automatic and quick turnaround design of ASIPs. The paper introduces the WCFP architecture and describes several microarchitecture enhancements needed to get good performance from custom WCFPs. We demonstrate that custom WCFPs have performance that is up to 4 times better than that of ASIPs based on the original CFP.
- Research Article
8
- 10.1109/tc.2004.1261825
- Feb 1, 2004
- IEEE Transactions on Computers
Application-specific instruction set processor (ASIP) design is a promising technique to meet the performance and cost goals of high-performance systems. ASIPs are especially valuable for embedded computing applications (e.g., digital cameras, color printers, cellular phones, etc.) where a small increase in performance and decrease in cost can have a large impact on a product's viability. Sutherland, Sproull, and Molnar originally proposed a processor organization called the counterflow pipeline (CFP) as a general-purpose architecture. We observed that the CFP is appropriate for ASIP design due to its simple and regular structure, local control and communication, and high degree of modularity. We describe a new CFP architecture, called the wide counterflow pipeline (WCFP), that extends the original proposal to be better suited for custom embedded instruction-level parallel processors. This presents a novel and practical application of the CFP to automatic and quick turnaround design of ASIPs. We introduce the WCFP architecture and describe several microarchitecture capabilities needed to get good performance from custom WCFPs. We demonstrate that custom WCFPs have performance that is up to four times better than that of ASIPs based on the CFP. Using an analytic cost model, we show that custom WCFPs do not unduly increase the cost of the original counterflow pipeline architecture, yet they retain the simplicity of the CFP. We also compare custom WCFPs to custom VLIW architectures and demonstrate that the WCFP is performance competitive with traditional VLIWs without requiring complicated global interconnection of functional devices.
- Conference Article
14
- 10.1109/recosoc.2013.6581520
- Jul 1, 2013
Application Specific Instruction Set Processor (ASIP) design methodologies have not been significantly altered during the past decade, and are still based on a highly manual and iterative process. Profiling has been established as a first step to prune the design space, and gain a deep understanding of the algorithms that underpin the application for which an ASIP is to be tailored. Independently of the profiling strategy, none of the existing ASIP-oriented profiling technologies enables on-the-loop application optimization or algorithmic exploration, which are mandatory steps throughout ASIP design. An innovative multi-grained approach that enables multiple levels of profiling detail according to the ASIP design stage (i.e. hot spot identification, application optimization, algorithmic exploration and architectural design) is presented. To validate our multi-grained profiling approach, the design of an ASIP for Marker-Based Augmented Reality was undertaken, achieving a 6x speedup in application execution in two days of design time.
- Conference Article
- 10.1109/icitaet47105.2019.9170210
- Dec 1, 2019
An Application Specific Instruction Set Processor (ASIP) is widely used as a System on a Chip (SoC) component. ASIPs possess an instruction set which is tailored to benefit a specific application. Such specialization allows ASIPs to serve as an intermediate between two dominant processor design styles: ASICs, which have high processing abilities at the cost of limited programmability, and programmable solutions such as FPGAs, which provide programming flexibility at the cost of lower energy efficiency. In this dissertation the goal is to design an ASIP for a temperature sensor system. The platform used for processor design is the LISA 2.0 description language and the processor design environment from CoWare. CoWare Processor Designer allows the processor architecture to be defined at an abstract level and automatically generates a chain of software tools such as an assembler, linker and simulator for functional verification, followed by an RTL description. The RTL description is used to generate a synthesis report of the design using RTL Compiler, and finally the layout is created using Cadence Encounter.
- Conference Article
12
- 10.1109/isvlsi.2014.10
- Jul 1, 2014
Multi objective (MO) design space exploration (DSE) in high level synthesis (HLS) is a tedious task which requires intelligent decision making strategies at multiple stages to yield quality results. The problem of DSE becomes intractable and intricate when an auxiliary variable such as the loop unrolling factor plays a vital role in the decision making process. This paper addresses the above problem by proposing a novel DSE approach for fully automated parallel (simultaneous) exploration of the optimal datapath and unrolling factor (UF) during area-performance tradeoff in HLS. The proposed DSE approach is driven by hyper-dimensional particle swarm optimization (PSO). The major sub-contributions of the proposed algorithm include: a) deriving a model for computation of the execution delay of a loop unrolled control data flow graph (CDFG) based on resource constraints, without the necessity of tediously unrolling the entire CDFG in most cases, b) consideration of loop unrolling and its impact on: i) the tradeoff between control states and execution delay during loop unrolling, ii) the area-execution delay tradeoff during the DSE process, c) novel comparative results for area-performance tradeoff with respect to multiple DFG and CDFG benchmarks. Results of the proposed approach indicated an average improvement in Quality of Results (QoR) of > 30% and a reduction in runtime of > 92% compared to recent approaches.
- Conference Article
8
- 10.1109/date.2006.243908
- Jan 1, 2006
This paper presents an Application Specific Instruction Set Processor (ASIP) design for the implementation of a class of nonlinear image processing algorithms, the Retinex-like filters. Starting from high level descriptions, algorithmic optimization is accomplished first. Then a processor architecture and an instruction set are customized with special respect to the algorithmic computations in order to achieve the specified timing at reasonable complexity. Taking advantage of the programmability of processor architectures, the flexibility of the system is increased, involving e.g. dynamic parameter adjustment and color treatment. ASIP implementation results in 0.13 µm CMOS technology are presented.
- Conference Article
70
- 10.1109/iccad.2001.968726
- Nov 13, 2002
The development of application specific instruction set processors (ASIPs) is currently the exclusive domain of the semiconductor houses and core vendors. This is because building such an architecture is a difficult task that requires expert knowledge in different domains: application software development tools, processor hardware implementation, and system integration and verification. This paper presents a retargetable framework for ASIP design which is based on machine descriptions in the LISA language. From these descriptions, software development tools can be automatically generated, including an HLL C-compiler, assembler, linker, simulator and debugger frontend. Moreover, synthesizable HDL code can be derived, which can then be processed by standard synthesis tools. Implementation results for a low-power ASIP for DVB-T acquisition and tracking algorithms designed with the presented methodology will be given.
- Research Article
- 10.1142/s0218126625500355
- Sep 23, 2024
- Journal of Circuits, Systems and Computers
In contemporary compilers, the determination of the loop unrolling factor is traditionally based on manually crafted heuristic rules. This approach heavily relies on human intuition, which limits its ability to achieve optimized performance across diverse architectures and can sometimes even lead to performance declines. Additionally, developers face challenges in achieving cross-platform compatibility, often necessitating extensive redesign efforts. In response, this study introduces a method leveraging the XGBoost algorithm to predict the optimal loop unrolling factor for compiler optimization, thereby aiming to replace human thinking with machine learning methods and standardize development processes. Initially, the study gathers data on the loop unrolling factors as determined by profile guided optimization technology, analyzes program-specific loop feature vectors and employs cross-validation, including the Pearson correlation coefficient and feature importance ranking, to construct a dataset. Subsequent use of XGBoost to train this dataset models the decision-making process for selecting the most effective loop unrolling factor. The final step involves integrating XGBoost’s trained decision tree model into GCC to calculate the optimal loop unrolling factor during actual compilation. Empirical results on the RISC-V platform indicate that this new method, when tested against the SPEC CPU 2006 benchmark suite, offers up to 6.18% improvement in performance over the existing heuristic approach. It provides a new method for loop unrolling in compilers, and provides an innovative guide for the application of machine learning in compilers.
- Conference Article
3
- 10.1109/rsp.2007.32
- May 1, 2007
- Proceedings
Modern application specific instruction-set processors (ASIPs) face the demanding task of delivering high performance for a wide range of applications. To enhance performance, architectural features such as pipelining and VLIW are often employed in ASIPs, leading to high design complexity. Integrated ASIP design environments like template-based approaches [1] and language-driven approaches [2][3] provide an answer to this growing design complexity. At the same time, increasing hardware design costs have motivated processor designers to introduce high flexibility in the processor. Flexibility, in its most effective form, can be introduced to the ASIP by coupling a re-configurable unit to the base processor. Due to the obvious benefits, several re-configurable ASIPs (rASIPs) have been designed in recent years. These rASIP designs lacked a generic flow from high-level specification, resulting in intuitive design decisions and hard-to-retarget processor design tools. Although a template-based approach for rASIP design exists, a clear design methodology, especially for pre-fabrication architecture exploration, is not present. In order to address this issue, a high-level specification and design methodology for partially re-configurable VLIW processors is proposed in this paper. To show the benefit of this approach, a commercial VLIW processor is used as the base architecture and two domains of applications are studied for potential performance gain.