Loop unrolling in multi-pipeline ASIP design

Abstract

Application Specific Instruction-set Processor (ASIP) design is a popular processor design technique for embedded systems that allows customizability in processor design without overly hindering design flexibility. Multi-pipeline ASIPs were proposed to improve the performance of such systems by compromising between speed and processor area. One of the problems in multi-pipeline design is the limited inherent instruction level parallelism (ILP) available in applications. The ILP of application programs can be improved via a compiler optimization technique known as loop unrolling. In this paper, we present how loop unrolling affects the performance of multi-pipeline ASIPs. The improvements in performance average around 15% for a number of benchmark applications, with a maximum improvement of around 30%. In addition, we analyze the variation of performance against the loop unrolling factor, which is the amount of unrolling performed.
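As a minimal sketch of the transformation the abstract refers to, the example below unrolls a dot-product loop by a factor of 4. It is not taken from the paper: the function names `dot_rolled` and `dot_unrolled4`, the choice of kernel, and the unrolling factor are assumptions made for illustration. Python is used only to show the shape of the rewritten loop; the ILP benefit would appear in compiled code on hardware with several pipelines, not in the Python interpreter.

```python
def dot_rolled(a, b):
    """Reference loop: one multiply-accumulate per iteration."""
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc


def dot_unrolled4(a, b):
    """Same computation with the loop body unrolled by a factor of 4."""
    n = len(a)
    # Four separate accumulators remove the dependency chain through a
    # single variable, so the four multiply-accumulates in one iteration
    # are independent of each other.
    acc0 = acc1 = acc2 = acc3 = 0
    main_end = n - n % 4  # largest multiple of 4 that is <= n
    for i in range(0, main_end, 4):
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
    acc = acc0 + acc1 + acc2 + acc3
    # Epilogue: handle the leftover 0 to 3 elements when n is not a
    # multiple of the unrolling factor.
    for i in range(main_end, n):
        acc += a[i] * b[i]
    return acc
```

With integer inputs both functions return the same value. With floating-point inputs the unrolled version adds terms in a different order, so results may differ slightly. A larger unrolling factor exposes more independent operations per iteration but also increases code size, which is one reason performance is studied as a function of the factor.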

Similar Papers
  • Conference Article
  • 10.1109/iscas.2007.378780
Low Power ASIP Architecture Optimization based on Target Application Profiling
  • May 1, 2007
  • Sung Dae Kim + 1 more

This paper describes design of application specific instruction-set processors (ASIP). We implement three ASIPs including signal processor for OFDM communication systems (SPOCS), video specific instruction-set processor (VSIP) and digital audio specific instruction-set processor (DASIP). Our ASIPs have novel instructions and optimized hardware architectures for specific applications. Our ASIPs can have much smaller area and dramatically reduce the numbers of memory accesses compared with commercial DSP chips, which result in low power consumption. All of the proposed ASIPs have been thoroughly verified using the Xilinx XC2v6000 FPGA and one of the implemented ASIPs has been already employed in a digital home theater SoC.

  • Conference Article
  • Cited by 8
  • 10.1109/iscas.2005.1465387
Application Specific Instruction-Set Processor Generation for Video Processing Based on Loop Optimization
  • May 23, 2005
  • M Mbaye + 3 more

Until recently, application specific instruction-set processor (ASIP) design was very costly and complex. Now, ASIP circuits are much easier to develop with technologies like Tensilica and Altera configurable processors that provide tools enabling effective generation of RTL (register transfer level) code for ASIPs. On the other hand, the design of effective ASIPs is still time-consuming, because existing methodologies largely rely on designers' knowledge for design space exploration. The paper describes a methodology to help design ASIPs. An iterative profiling-driven method based on detection and acceleration of application bottlenecks with specialized instructions is proposed. This method is applied to the design of an ASIP adapted for a video processing algorithm - the Wiener filter. The acceleration reached with our method on this application is very significant, with a speedup factor larger than 10 over optimized software code.

  • Research Article
  • Cited by 2
  • 10.1080/21681724.2018.1477182
Synthesis of an Application Specific Instruction Set Processor (ASIP) for RIPEMD-160 Hash Algorithm
  • May 25, 2018
  • International Journal of Electronics Letters
  • Yavar Safaei Mehrabani

Hash functions are vital tasks in many applications such as digital fingerprinting, Internet communications, bank transactions and so forth. RACE Integrity Primitives Evaluation Message Digest-160 (RIPEMD-160) is one of the most applicable hash functions that there have been several structures for designing it based on Application-Specific Integrated Circuit (ASIC) approach in the literature. Application-Specific Instruction Set Processor (ASIP) design makes compromise between ASIC and Digital Signal Processing approaches with respect to speed, cost and flexibility. Because of this unique property of ASIP method, an ASIP processor for RIPEMD-160 hash algorithm is presented in this article for the first time. A special Register Configuration (RC) for RIPEMD-160 hash algorithm is developed which its Instruction Set Architecture (ISA) includes 12 specific and 35 general instructions. Proposed ASIP is simulated with VHDL language in the behavioural level of abstraction, and a typical assembly code is written to show how the proposed ASIP performs hash function. Moreover, implementation results on Virtex5 Field Programmable Gate Array (FPGA) platform shows the superiority of the proposed processor in terms of performance against its counterparts.

  • Conference Article
  • Cited by 13
  • 10.1109/rsp.2006.21
Integrated Verification Approach during ADL-Driven Processor Design
  • Jun 14, 2006
  • A Chattopadhyay + 5 more

Nowadays, architecture description languages (ADLs) are getting popular to achieve quick and optimal design convergence during the development of application specific instruction-set processors (ASIPs). Verification, in various stages of such ASIP development, is a major bottleneck hindering widespread acceptance of ADL-based processor design approach. Traditional verification of processors is only applied at register transfer level (RTL) or below. In the context of ADL-based ASIP design, this verification approach is often inconvenient and error-prone, since design and verification are done at different levels of abstraction. In this paper, this problem is addressed by presenting an integrated verification approach during ADL-driven processor design. Our verification flow includes the idea of automatic assertion generation during high-level synthesis and support for automatic test-generation utilizing the ADL-framework for ASIP design. We show the benefit of our approach by trapping errors in a pipelined SPARC-compliant processor architecture.

  • Research Article
  • Cited by 7
  • 10.1016/j.mejo.2008.05.009
Integrated verification approach during ADL-driven processor design
  • Jul 7, 2008
  • Microelectronics Journal
  • Anupam Chattopadhyay + 5 more

  • Conference Article
  • Cited by 1
  • 10.5555/517554.825764
Custom Wide Counterflow Pipelines for High-Performance Embedded Applications
  • Oct 15, 2000
  • Bruce R Childers + 1 more

Application-specific instruction set processor (ASIP) design is a promising technique to meet the performance and cost goals of high-performance systems. ASIPs are especially valuable for embedded computing (e.g., digital cameras, color printers, cellular phones, etc.) where a small increase in performance and decrease in cost can have a large impact on a product's viability. Sutherland, Sproull, and Molnar have proposed a processor organization called the counterflow pipeline (CFP) that is appropriate for ASIP design due to its simple and regular structure, local control and communication, and high degree of modularity. This paper describes a new CFP architecture, called the wide counterflow pipeline (WCFP) that extends the original proposal to be better suited for custom embedded instruction-level parallel processors. This work presents a novel and practical application of the CFP to automatic and quick turn-around design of ASIPs. The paper introduces the WCFP architecture and describes several microarchitecture enhancements needed to get good performance from custom WCFPs. We demonstrate that custom WCFPs have performance that is up to four times better than that of ASIPs based on the original CFP.

  • Conference Article
  • Cited by 1
  • 10.1109/pact.2000.888331
Custom wide counterflow pipelines for high performance embedded applications
  • Nov 8, 2002
  • B.R Childers + 1 more

Application-specific instruction set processor (ASIP) design is a promising technique to meet the performance and cost goals of high-performance systems. ASIPs are especially valuable for embedded computing (e.g., digital cameras, color printers, cellular phones, etc.) where a small increase in performance and decrease in cost can have a large impact on a product's viability. Sutherland, Sproull, and Molnar have proposed a processor organization called the counterflow pipeline (CFP) that is appropriate for ASIP design due to its simple and regular structure, local control and communication, and high degree of modularity. This paper describes a new CFP architecture, called the wide counterflow pipeline (WCFP), that extends the original proposal to be better suited for custom embedded instruction-level parallel processors. This work presents a novel and practical application of the CFP to automatic and quick turnaround design of ASIPs. The paper introduces the WCFP architecture and describes several microarchitecture enhancements needed to get good performance from custom WCFPs. We demonstrate that custom WCFPs have performance that is up to 4 times better than that of ASIPs based on the original CFP.

  • Research Article
  • Cited by 8
  • 10.1109/tc.2004.1261825
Custom wide counterflow pipelines for high-performance embedded applications
  • Feb 1, 2004
  • IEEE Transactions on Computers
  • B.R Childers + 1 more

Application-specific instruction set processor (ASIP) design is a promising technique to meet the performance and cost goals of high-performance systems. ASIPs are especially valuable for embedded computing applications (e.g., digital cameras, color printers, cellular phones, etc.) where a small increase in performance and decrease in cost can have a large impact on a product's viability. Sutherland, Sproull, and Molnar originally proposed a processor organization called the counterflow pipeline (CFP) as a general-purpose architecture. We observed that the CFP is appropriate for ASIP design due to its simple and regular structure, local control and communication, and high degree of modularity. We describe a new CFP architecture, called the wide counterflow pipeline (WCFP), that extends the original proposal to be better suited for custom embedded instruction-level parallel processors. This presents a novel and practical application of the CFP to automatic and quick turnaround design of ASIPs. We introduce the WCFP architecture and describe several microarchitecture capabilities needed to get good performance from custom WCFPs. We demonstrate that custom WCFPs have performance that is up to four times better than that of ASIPs based on the CFP. Using an analytic cost model, we show that custom WCFPs do not unduly increase the cost of the original counterflow pipeline architecture, yet they retain the simplicity of the CFP. We also compare custom WCFPs to custom VLIW architectures and demonstrate that the WCFP is performance competitive with traditional VLIWs without requiring complicated global interconnection of functional devices.

  • Conference Article
  • Cited by 14
  • 10.1109/recosoc.2013.6581520
CoEx: A novel profiling-based algorithm/architecture co-exploration for ASIP design
  • Jul 1, 2013
  • Juan Fernando Eusse + 2 more

Application Specific Instruction Set Processor (ASIP) design methodologies have not been significantly altered during the past decade, and are still based on a highly manual and iterative process. Profiling has been established as a first step to prune the design space, and gain a deep understanding of the algorithms that underpin the application for which an ASIP is to be tailored. Independently of the profiling strategy, none of the existing ASIP-oriented profiling technologies enables on-the-loop application optimization or algorithmic exploration, which are mandatory steps throughout ASIP design. An innovative multi-grained approach that enables multiple levels of profiling detail according to the ASIP design stage (i.e. hot spot identification, application optimization, algorithmic exploration and architectural design) is presented. To validate our multi-grained profiling approach, the design of an ASIP for Marker-Based Augmented Reality was undertaken, achieving a 6x speedup in application execution in two days of design time.

  • Conference Article
  • 10.1109/icitaet47105.2019.9170210
Application Specific Instruction Set Processor Design for Embedded Application Using The CoWare Tool
  • Dec 1, 2019
  • Lopamudra Samal + 1 more

An Application Specific Instruction Set Processor (ASIP) is widely used as a System on a Chip (SoC) Component. ASIPs possess an instruction set which is tailored to benefit a specific application. Such specialization allows ASIPs to serve as an intermediate between two dominant processor designs styles-ASICs which has high processing abilities at the cost of limited programmability and Programmable solutions such as FPGAs that provide programming flexibility at the cost of less energy efficiency. In this dissertation the goal is to design ASIP, keeping in mind a temperature sensor system. The platform used for processor design is LISA 2.0 description language and processor designing environment from CoWare. CoWare processor designer allows processor architecture to be defined at an abstract level and automatic generation of chain of software tools like assembler, linker and simulator for functional verification followed by RTL level description. RTL level description is used to generate synthesized report of the design using RTL compiler and finally the layout is created using Cadence encounter.

  • Conference Article
  • Cited by 12
  • 10.1109/isvlsi.2014.10
Swarm Intelligence Driven Simultaneous Adaptive Exploration of Datapath and Loop Unrolling Factor during Area-Performance Tradeoff
  • Jul 1, 2014
  • Anirban Sengupta + 1 more

Multi objective (MO) design space exploration (DSE) in high level synthesis (HLS) is a tedious task which administers the usage of intelligent decision making strategies at multiple stages to yield quality results. The problem of DSE becomes intractable and intricate when an auxiliary variable such as loop unrolling factor plays a vital role in the decision making process. This paper successfully solves the above problem by proposing the novel DSE approach for fully automated parallel (simultaneous) exploration of optimal datapath and unrolling factor (UF) during area-performance tradeoff in HLS. The proposed DSE approach is driven by hyper-dimensional particle swarm optimization (PSO). The major sub-contributions of this proposed algorithm includes: a) deriving a model for computation of execution delay of a loop unrolled control data flow graph (CDFG) based on resource constraint, without the necessity of tediously unrolling the entire CDFG in most cases, b) Consideration of loop unrolling and its impact on: i) control states and execution delay tradeoff during loop unrolling ii) area-execution delay tradeoff during the DSE process, c) novel comparative results for area-performance tradeoff with respect to multiple DFG and CDFG benchmarks. Results of the proposed approach indicated an average improvement in Quality of Results (QoR) of > 30% and reduction in runtime of > 92% compared to recent approaches.

  • Conference Article
  • Cited by 8
  • 10.1109/date.2006.243908
ASIP Design and Synthesis for Non Linear Filtering in Image Processing
  • Jan 1, 2006
  • L Fanucci + 8 more

This paper presents an Application Specific Instruction Set Processor (ASIP) design for the implementation of a class of nonlinear image processing algorithms, the Retinex-like filters. Starting from high level descriptions, first algorithmic optimization is accomplished. Then a processor architecture and an instruction set are customized with special respect to the algorithmic computations in order to achieve the specified timing at reasonable complexity. Taking advantage of the programmability of processor architectures, the flexibility of the system is increased, involving e. g. dynamic parameter adjustment and color treatment. ASIP implementation results in 0.13µm CMOS technology are presented.

  • Conference Article
  • Cited by 70
  • 10.1109/iccad.2001.968726
A methodology for the design of application specific instruction set processors (ASIP) using the machine description language LISA
  • Nov 13, 2002
  • A Hoffmann + 5 more

The development of application specific instruction set processors (ASIP) is currently the exclusive domain of the semiconductor houses and core vendors. This is due to the fact that building such an architecture is a difficult task that requires expertise knowledge in different domains: application software development tools, processor hardware implementation, and system integration and verification. This paper presents a retargetable framework for ASIP design which is based on machine descriptions in the LISA language. From that, software development tools can be automatically generated including HLL C-compiler, assembler, linker, simulator and debugger frontend. Moreover, synthesizable HDL code can be derived which can then be processed by standard synthesis tools. Implementation results for a low-power ASIP for DVB-T acquisition and tracking algorithms designed with the presented methodology will be given.

  • Research Article
  • 10.1142/s0218126625500355
Machine Learning-Driven GCC Loop Unrolling Optimization: Compiler Performance Enhancement Strategy Based on XGBoost
  • Sep 23, 2024
  • Journal of Circuits, Systems and Computers
  • Zhaoyi Shi + 2 more

In contemporary compilers, the determination of the loop unrolling factor is traditionally based on manually crafted heuristic rules. This approach heavily relies on human intuition, which limits its ability to achieve optimized performance across diverse architectures and can sometimes even lead to performance declines. Additionally, developers face challenges in achieving cross-platform compatibility, often necessitating extensive redesign efforts. In response, this study introduces a method leveraging the XGBoost algorithm to predict the optimal loop unrolling factor for compiler optimization, thereby aiming to replace human thinking with machine learning methods and standardize development processes. Initially, the study gathers data on the loop unrolling factors as determined by profile guided optimization technology, analyzes program-specific loop feature vectors and employs cross-validation, including the Pearson correlation coefficient and feature importance ranking, to construct a dataset. Subsequent use of XGBoost to train this dataset models the decision-making process for selecting the most effective loop unrolling factor. The final step involves integrating XGBoost’s trained decision tree model into GCC to calculate the optimal loop unrolling factor during actual compilation. Empirical results on the RISC-V platform indicate that this new method, when tested against the SPEC CPU 2006 benchmark suite, offers up to 6.18% improvement in performance over the existing heuristic approach. It provides a new method for loop unrolling in compilers, and provides an innovative guide for the application of machine learning in compilers.

  • Conference Article
  • Cited by 3
  • 10.1109/rsp.2007.32
Pre- and Post-Fabrication Architecture Exploration for Partially Reconfigurable VLIW Processors
  • May 1, 2007
  • Proceedings
  • A Chattopadhyay + 6 more

Modern application specific instruction-set processors (ASIPs) face the demanding task of delivering high performance for a wide range of applications. For enhancing the performance, architectural features e.g. pipelining, VLIW etc are often employed in ASIPs, leading to high design complexity. Integrated ASIP design environments like template-based approaches [1] and language-driven approaches [2][3] provide an answer to this growing design complexity. At the same time, increasing hardware design costs have motivated the processor designers to introduce high flexibility in the processor. Flexibility, in its most effective form, can be introduced to the ASIP by coupling a re-configurable unit to the base processor. Due to its obvious benefits, several re-configurable ASIPs (rASIPs) have been designed in the recent years. These rASIP designs lacked a generic flow from high-level specification, resulting into intuitive design decisions and hard-to-retarget processor design tools. Although a template-based approach for rASIP design is existent, a clear design methodology especially for the pre-fabrication architecture exploration is not present. In order to address this issue, a high-level specification and design methodology for partially re-configurable VLIW processors is proposed in this paper. To show the benefit of this approach a commercial VLIW processor is used as the base architecture and two domains of applications are studied for potential performance gain.
