Innovative Hardware Accelerator Architecture for FPGA‐Based General‐Purpose RISC Microprocessors

Abstract

Reconfigurable computing (RC) aims to combine the flexibility of general-purpose processors (GPPs) with the performance of application-specific integrated circuits (ASICs). Numerous RC architectures have been proposed since the 1960s, but none has become mainstream. The main factor that prevents RC from being used in general-purpose CPUs, GPUs, and mobile devices is that it requires extensive knowledge of digital circuit design, which most software programmers lack. In an RC system, a processor cooperates with a reconfigurable hardware accelerator (HA), usually implemented on a field-programmable gate array (FPGA) chip, that can be reconfigured dynamically. The HA implements crucial portions of software (kernels) in hardware to increase overall performance, and its design requires substantial knowledge of digital circuit design. In this paper, a novel RC architecture is proposed that provides exactly the same instruction set as a standard general-purpose RISC microprocessor (e.g., ARM Cortex-M0) while automating the generation of a tightly coupled RC component to improve system performance. This approach keeps decades-old assemblers, compilers, debuggers, library components, and programming practices intact while exploiting the advantages of RC. The proposed architecture employs the LLVM compiler infrastructure to translate an algorithm written in a high-level language (e.g., C/C++) into machine code. It then finds the most frequent instruction pairs and generates an equivalent RC circuit, called a miniature accelerator (MA). The MA executes these instruction pairs in parallel with the consecutive instructions. Several kernel algorithms alongside EEMBC CoreMark are used to assess the performance of the proposed architecture. Performance improvements from 4.09% to 14.17% are recorded when the HA is enabled.
There is a trade-off between core performance and the combined compilation time, die area, and program startup load time, which includes the time required to partially reconfigure an FPGA chip.
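The pair-selection step that drives MA generation can be illustrated with a short sketch. This is not the paper's implementation: the trace format, the pair-counting heuristic, and the `top_n` cutoff are all illustrative assumptions.

```python
from collections import Counter

def find_frequent_pairs(instructions, top_n=4):
    """Count adjacent opcode pairs in a linear instruction trace and
    return the most frequent ones -- the candidates that a miniature
    accelerator (MA) could execute as a single fused operation."""
    pairs = Counter(zip(instructions, instructions[1:]))
    return [pair for pair, _ in pairs.most_common(top_n)]

# Toy opcode trace (hypothetical):
trace = ["LDR", "ADD", "LDR", "ADD", "STR", "LDR", "ADD", "CMP", "BNE"]
print(find_frequent_pairs(trace, top_n=1))  # [('LDR', 'ADD')]
```

In the actual architecture this analysis would run over LLVM-generated machine code rather than a string trace, and the selected pairs would be synthesized into hardware rather than merely reported.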

Similar Papers
  • Conference Article
  • Cited by 2
  • 10.1109/secon.2011.5752950
Achieving true parallelism on a High Performance Heterogeneous Computer via a threaded programming model
  • Mar 1, 2011
  • Antoinette R Anderson + 2 more

As Reconfigurable Computing (RC) closes its sixth decade, significant improvements have been made to make this technology a competitor to application-specific integrated circuits (ASICs). With FPGA (field programmable gate array) computing fabric operating at significantly lower clock speeds than a general purpose processor (GPP), the developer must exploit every avenue possible to attain a speedup on a heterogeneous computer. Achieving a significant speedup is what makes the RC application development process worthwhile: the developer may reap the benefits of better computational power at a lower cost than using a traditional ASIC. This occurs primarily through efforts to pipeline and parallelize processes on an FPGA. In addition to the traditional "three P's," this paper highlights another speedup avenue via true multilevel parallelism. In particular, it demonstrates this concept using a threaded programming model that allows the GPP and the FPGA to run simultaneously. The method is realized through a threaded dot product on a heterogeneous computer.
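The threaded programming model the paper describes can be sketched in miniature. Here a second CPU thread merely stands in for the FPGA, and the split point and vector values are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def dot(xs, ys):
    return sum(a * b for a, b in zip(xs, ys))

def threaded_dot(x, y, split):
    # One slice of the vector is handled by the "GPP" thread and the
    # other by the "FPGA" thread (here just another CPU thread standing
    # in for the hardware); both partial products run concurrently and
    # are summed at the end.
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpp = pool.submit(dot, x[:split], y[:split])
        fpga = pool.submit(dot, x[split:], y[split:])
        return gpp.result() + fpga.result()

x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 6.0, 7.0, 8.0]
print(threaded_dot(x, y, split=2))  # 70.0
```

On a real heterogeneous machine the second branch would dispatch to the FPGA board and the split would be chosen to balance the two devices' throughputs.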

  • Research Article
  • 10.12694/scpe.v8i4.428
High performance reconfigurable computing
  • Jan 1, 2007
  • Scalable Computing Practice and Experience
  • Dorothy Bollman + 2 more

High performance reconfigurable computing has become a crucial tool for designing application-specific processors/cores in a number of areas. Improvements in reconfigurable devices such as FPGAs (field programmable gate arrays) and their inclusion in current computing products have created new opportunities, as well as new challenges, for high performance computing (HPC). An FPGA is an integrated circuit that contains tens of thousands of building blocks, known as configurable logic blocks (CLBs), connected by programmable interconnections. FPGAs tend to be an excellent choice when dealing with algorithms that can benefit from the high parallelism offered by the FPGA fine-grained architecture. In particular, one of the most valuable features of FPGAs is their reconfigurability, i.e., the fact that they can be used for different purposes at different stages of a computation and they can be, at least partially, reprogrammed at run-time. HPC applications with reconfigurable computing (RC) have the potential to deliver enormous performance, thus they are especially attractive when the main design goal is to obtain high performance at a reasonable cost. Furthermore, they are suitable for use in embedded systems, which is not the case for other alternatives such as grid computing. The problem of accelerating HPC applications with RC can be compared to that of porting uniprocessor applications to massively parallel processors (MPPs). However, MPPs are better understood by most software developers than reconfigurable devices. Moreover, tools for porting codes to reconfigurable devices are not yet as well developed as those for porting sequential code to parallel code.
Nevertheless, in recent years considerable progress has been made in developing HPC applications with RC in such areas as signal processing, robotics, graphics, cryptography, bioinformatics, evolvable and biologically-inspired hardware, network processors, real-time systems, rapid ASIC prototyping, interactive multimedia, machine vision, and embedded applications, to name a few. This special issue contains a sampling of the progress made in some of these areas. In the first paper, Lieu My Chuong, Lam Siew Kei, and Thambillai Srikanthan propose a framework that can rapidly and accurately estimate the hardware area-time measures for implementing C applications on FPGAs. Their method is able to predict the delays with an average accuracy of 97%, and the estimation can be computed in the order of milliseconds. This is an essential step toward rapid design exploration for FPGA implementations and significantly helps in implementing FPGA systems using high-level description languages. In the second paper, T. Hausert, A. Dsu, A. Sudarsanam, and S. Young design an FPGA-based system to solve linear systems for scientific applications. They analyze the FPGA performance per watt (MFLOPS/W) and compare the performance with microprocessor-based approaches. Finally, as the main outcome of this analysis, they propose helpful recommendations for speeding up FPGA computations with low power consumption. In the third paper, S. Mota, E. Ros, and F. de Toro describe a computing architecture that finely pipelines all the processing stages of a space-variant mapping strategy to reduce the distortion effect on a motion-detection-based vision system. As an example, they describe the results of correcting perspective distortion in a monitoring system for vehicle overtaking processes. In the fourth paper, Sadaf R. Alam, Pratul K. Agarwal, Melissa C. Smith, and Jeffrey S. Vetter describe an FPGA acceleration of molecular dynamics using the Particle-Mesh Ewald method. Their results show that the time-to-solution of medium-scale biological system simulations is reduced by a factor of 3X, and they predict that future FPGA devices will reduce the time-to-solution by a factor greater than 15X for large-scale biological systems. In the fifth paper, Nazar A. Saqib presents a space complexity analysis of two Karatsuba-Ofman multiplier variants. He studies the number of FPGA hardware resources employed by those two multipliers as a function of the operands' bitlength. He also provides a comparison table against the school (classical) multiplier method, where he shows that the Karatsuba-Ofman method is much more economical than the classical method for operand bitlengths greater than thirty-two bits. The complexity analysis presented in this paper is validated experimentally by implementing the multiplier designs on FPGA devices. The help of the following reviewers, who ensured the quality of this issue, is gratefully acknowledged: Mancia Anguita, University of Granada, Spain; Beatriz Aparico, Andalucia Astrophysics Institute, CSIC, Spain; AbdSamad Benkrid, Queen's University, Northern Ireland; Eunjung Cho, Georgia State University, USA; Nareli Cruz-Cortes, CIC-IPN, Mexico; Sergio Cuenca, University of Alicante, Spain; Jean-Pierre Deschamps, University Rey Juan Carlos, Spain; Edgar Ferrer, University of Puerto Rico at Mayaguez; Luis Gerardo de la Fraga, CINVESTAV-IPN, Mexico; Antonio Garcia, University of Granada, Spain; Javier Garrigos, University of Cartagena, Spain; Miguel Angel Leon-Chavez, BUAP, Mexico; Adriano de Luca-Pennacchia, CINVESTAV-IPN, Mexico; Antonio Martinez, University of Alicante, Spain; Christian Morillas, University of Granada, Spain; Daniel Ortiz-Arroyo, Aalborg University, Denmark; Dorothy Bollman, University of Puerto Rico at Mayaguez; Javier Diaz, University of Granada, Spain; and Francisco Rodriguez-Henriquez, Center for Research and Advanced Study, National Polytechnical Institute, Mexico.
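The space saving of the Karatsuba-Ofman method mentioned in the fifth paper stems from trading four half-width sub-multiplications for three. A minimal software sketch of the recursion (not Saqib's hardware design; the operand widths and base case are illustrative assumptions):

```python
def karatsuba(x, y, bits):
    """Karatsuba-Ofman multiplication of two `bits`-wide operands:
    three half-width products instead of the four required by the
    classical (schoolbook) method, which is where the hardware-area
    saving for long operands comes from."""
    if bits <= 8:  # small operands: multiply directly
        return x * y
    half = bits // 2
    mask = (1 << half) - 1
    xh, xl = x >> half, x & mask            # split x = xh*2^half + xl
    yh, yl = y >> half, y & mask
    p_hh = karatsuba(xh, yh, bits - half)   # high * high
    p_ll = karatsuba(xl, yl, half)          # low * low
    # middle term from ONE extra product instead of two:
    p_mid = karatsuba(xh + xl, yh + yl, half + 1) - p_hh - p_ll
    return (p_hh << (2 * half)) + (p_mid << half) + p_ll

print(karatsuba(0xDEAD, 0xBEEF, 16) == 0xDEAD * 0xBEEF)  # True
```

In an FPGA implementation each recursive product becomes a physical multiplier block, so reducing four products to three directly reduces resource usage for wide operands.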

  • Book Chapter
  • Cited by 2
  • 10.1007/978-3-030-54932-9_9
Reconfigurable Computing and Hardware Acceleration in Health Informatics
  • Oct 8, 2020
  • Mehdi Hasan Chowdhury + 1 more

Health informatics connects biomedical engineering with information technology to devise modern eHealth systems, which often require precise biosignal processing. A "biosignal" is essentially an electrophysiological signal from a living organism. In practice, these signals are frequently used to assess patients' health and to discover bio-physiological anomalies. However, as most biosignal processing units are multichannel systems with extensive datasets, conventional computation techniques often fail to offer immediate execution of data processing. Reconfigurable architecture offers a tangible solution to this problem by utilizing fast parallel computation based on the Field Programmable Gate Array (FPGA). This computation technique ensures "hardware acceleration," which essentially means the exclusive utilization of hardware resources to expedite computational tasks: designing application-specific circuits rather than using general purpose processors to do the signal processing. Because of its low cost and fast computation, reconfigurable architecture is characteristically suitable for health informatics and has become one of the fastest growing research fields of recent years. In the literature, several works focus on the efficient use of FPGAs as biomedical computation units. Some of these studies involve fundamental spatiotemporal signal analysis such as the Fourier transform, power spectral density measurement, and identifying significant signal peaks. In other studies, hardware acceleration is used to compress and predict the signal for data storage, processing, and transmission. Some works include digital filter design for denoising the acquired signal, while a few advanced research projects incorporate reconfigurable architectures to develop artificial bio-organs and high-level prostheses as a part of rehabilitation.
In this chapter, these works will be briefly reviewed to find out the state-of-the-art research trends in this research field.

  • Conference Article
  • Cited by 10
  • 10.1109/iwrsp.2001.933836
Singular value decomposition on distributed reconfigurable systems
  • Jun 25, 2001
  • C Bobda + 1 more

The use of FPGAs (field programmable gate arrays) in the area of rapid prototyping and reconfigurable computing has been successful in the past. Although many experiments have shown FPGAs to be faster than general-purpose processors and more flexible than ASICs (application-specific integrated circuits) on some classes of problems, few experiments have offered a computing platform which exploits the reconfigurability aspect of FPGAs and combines FPGAs and processors to provide better solutions on applications. This paper shows, through an efficient implementation of the singular value decomposition (SVD) of very large matrices, the possibility of integrating FPGAs as part of a distributed reconfigurable system (DRS). A cluster of eight workstations with two FPGA boards was built for this purpose. The algorithm is currently running as a pure software solution, but we are working to integrate the FPGAs in the computation. First results are encouraging, showing that the performance of the new platform can be high compared to pure software solutions.

  • Conference Article
  • Cited by 3
  • 10.1109/icetc.2010.5529949
Reconfigurable computing technology used for modern scientific applications
  • Jun 1, 2010
  • M Aqeel Iqbal + 2 more

The inventions of modern scientific applications in the recent era have introduced many new dimensions to computing technology. Conventionally, two basic types of computing technology have been in use: computing based on extremely high-speed application specific integrated circuits (ASICs) and computing based on highly flexible programmable general purpose processors (GPPs). So far these domains have targeted two different types of applications: high-speed applications using ASICs and versatile programmable applications using GPPs. But recent scientific research has introduced many new application areas that require both the high speed and the flexibility of the computing platforms. To fulfill these requirements, reconfigurable computing technology has been introduced. Reconfigurable computing is intended to fill the gap between ASICs and GPPs by integrating both technologies on a single chip. This paper introduces the role of reconfigurable computing in the execution of these newly emerging scientific applications.

  • Conference Article
  • Cited by 1
  • 10.1145/1117201.1117260
A Performance model for accelerating scientific applications on reconfigurable computers
  • Feb 22, 2006
  • Ronald Scrofano + 1 more

With advances in reconfigurable hardware, especially field-programmable gate arrays (FPGAs), it has become possible to use reconfigurable hardware to accelerate complex applications, such as those in scientific computing. There has been a resulting development of reconfigurable computers: computers which have both general purpose processors and reconfigurable hardware, as well as memory and high-performance interconnection networks. Oftentimes, reconfigurable hardware can provide fantastic speed-ups for kernels in a scientific application, but when the kernels are integrated back into the complete application, the overall speed-up is not very impressive. To address this problem, we have developed a simple performance model for reconfigurable computers to facilitate an accurate evaluation of whether or not the speed-up that will be achieved by implementing an application on a reconfigurable computer justifies the implementation effort and cost. The proposed performance model captures the main features of reconfigurable computers: one or more general purpose processors; reconfigurable hardware for acceleration; memory, spread over multiple banks, that is local to the reconfigurable hardware; and limited bandwidth between the general purpose processors and the reconfigurable hardware. It also captures issues of concern for using the reconfigurable hardware, such as the relationship between off-chip memory and the amount of parallelism in a design. We have used the model to predict the performance of an implementation of a molecular dynamics simulation on an SRC 6e MAPstation. The error between the predicted performance and the actual performance is only 3.5%.
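The general shape of such a performance model resembles Amdahl's law extended with a data-transfer term: the accelerated kernel shrinks, but traffic over the limited GPP-FPGA link adds back overhead. The formula and the numbers below are illustrative assumptions, not the model from the paper:

```python
def rc_speedup(t_sw, kernel_frac, hw_speedup, bytes_moved, bandwidth):
    """Overall application speedup when one kernel is moved to
    reconfigurable hardware: the kernel time is divided by hw_speedup,
    the rest of the program is unchanged, and transfers over the
    limited GPP<->FPGA link add overhead."""
    t_kernel = t_sw * kernel_frac          # software time spent in kernel
    t_rest = t_sw - t_kernel               # unaccelerated remainder
    t_transfer = bytes_moved / bandwidth   # data movement cost (seconds)
    t_rc = t_rest + t_kernel / hw_speedup + t_transfer
    return t_sw / t_rc

# 80% of a 10 s run is kernel time, accelerated 20x in hardware, but
# moving 400 MB over a 1 GB/s link costs 0.4 s:
print(round(rc_speedup(10.0, 0.8, 20.0, 400e6, 1e9), 2))  # 3.57
```

Even with a 20x kernel speedup, the whole application improves only about 3.6x here, which is exactly the kernel-versus-application gap the abstract describes.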

  • Dissertation
  • 10.6092/polito/porto/2616951
High-performance hardware accelerators for image processing in space applications
  • Jan 1, 2015
  • Daniele Rolfo


  • Research Article
  • Cited by 7
  • 10.3390/electronics10010073
Mutual Impact between Clock Gating and High Level Synthesis in Reconfigurable Hardware Accelerators
  • Jan 3, 2021
  • Electronics
  • Francesco Ratto + 3 more

With the diffusion of cyber-physical systems and the Internet of Things, adaptivity and low power consumption have become of primary importance in digital systems design. Reconfigurable heterogeneous platforms seem to be one of the most suitable choices to cope with such a challenging context. However, their development and power optimization are not trivial, especially considering hardware acceleration components. On the one hand, high-level synthesis can simplify the design of such systems; on the other hand, it can limit the positive effects of the adopted power-saving techniques. In this work, the mutual impact of different high-level synthesis tools and the application of the well-known clock gating strategy in the development of reconfigurable accelerators is studied. The aim is to optimize the clock gating application according to the chosen high-level synthesis engine and target technology (Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA)). Different levels of application of clock gating are evaluated, including a novel multi-level solution. Besides assessing the benefits and drawbacks of applying clock gating at different levels, hints for future design automation of low-power reconfigurable accelerators through high-level synthesis are also derived.

  • Single Report
  • 10.21236/ada387158
Reconfigurable and Adaptive Computing Environments
  • Mar 1, 2000
  • Dinesh Bhatia

This Reconfigurable Computing (RC) or Adaptive Computing System (ACS) program focused on the development of both reconfigurable computing platforms and the associated programming support environments to demonstrate the viability of RC. This was demonstrated by exploring the ability to program RCs from a main/integrated C application program and by investigating new, partially reconfigurable technology. A Xilinx 4000 Field Programmable Gate Array (FPGA) series board and a Xilinx 6200 FPGA-based board were developed as part of this effort. The C compiler technology was developed more toward a hardware pragma-based implementation, which leveraged hardware macro libraries and worked quite effectively. The Xilinx 6200 board was interesting from the standpoint that the 6200 FPGAs are partially reconfigurable. The shortfalls of this product family include poor chip design/manufacture, resulting in the inability to utilize a good portion of the FPGA logic resources, and a lack of functional programming tools. In spite of these problems, the team was able to exercise the partial reconfigurability of the devices by developing programming tools of their own.

  • Conference Article
  • Cited by 12
  • 10.1109/fpl.2005.1515764
Towards a reconfigurable tracking system
  • Jan 1, 2005
  • Sebastien C Wong + 2 more

Robust real-time automatic detection, tracking, and classification of objects in imagery is one of the most computationally demanding tasks in computer vision. Historically, the field of computer vision has been limited by computing power. In particular, algorithms that require multiple correlations, convolutions, and other complex operations can be prohibitive to implement on a microprocessor. Part of the poor performance of microprocessors is their serial nature, while many of these operations are inherently parallel. One approach to implementing these operations in parallel is to build them in hardware using application specific integrated circuits (ASICs). Another approach is to use Field Programmable Gate Arrays (FPGAs) and reconfigurable computing, which offers a trade-off between the speed of hardware and the flexibility of software. This paper describes two computationally intensive tracking algorithms, investigates their implementation on a reconfigurable computer, and benchmarks their performance. From our preliminary results we find that reconfigurable computing is well suited to the implementation of real-time tracking systems.

  • Book Chapter
  • Cited by 6
  • 10.1016/b978-075067604-5/50002-7
Chapter 1 - Introduction
  • Jan 1, 2004
  • The Design Warrior's Guide to FPGAs
  • Clive “Max” Maxfield


  • Conference Article
  • Cited by 5
  • 10.1145/1531743.1531764
Wave field synthesis for 3D audio
  • May 18, 2009
  • Dimitris Theodoropoulos + 2 more

In this paper, we compare the architectural perspectives of the Wave Field Synthesis (WFS) 3D-audio algorithm mapped on three different platforms: a General Purpose Processor (GPP), a Graphics Processing Unit (GPU), and a Field Programmable Gate Array (FPGA). Previous related work reveals that, up to now, WFS sound systems have been based on standard PCs. However, on one hand, contemporary GPUs consist of many multiprocessors that can process data concurrently. On the other hand, recent FPGAs provide huge levels of parallelism and reasonably high performance potential, which can be exploited very efficiently by smart designers. Furthermore, new parallel programming environments, such as the Compute Unified Device Architecture (CUDA) from NVidia and Stream from ATI, give researchers full access to GPU resources. We use CUDA to map the WFS kernel on a GeForce 8600GT GPU. Additionally, we implement a reconfigurable and scalable hardware accelerator for the same kernel and map it onto Virtex4 FPGAs. We compare both architectural approaches against a baseline GPP implementation on a Pentium D at 3.4 GHz. Our conclusion is that in highly demanding WFS-based audio systems, a low-cost GeForce 8600GT desktop GPU can achieve a speedup of up to 8x compared to a modern Pentium D implementation. An FPGA-based WFS hardware accelerator consisting of a single rendering unit (RU) can provide a speedup of up to 10x compared to the Pentium D approach; it can fit into small FPGAs and consumes approximately 3 Watts. Furthermore, cascading multiple RUs into a larger FPGA can boost processing throughput to more than two orders of magnitude higher than a GPP-based implementation and an order of magnitude better than a low-cost GPU one.

  • Research Article
  • Cited by 17
  • 10.1142/s0218126609005034
MODERN ARCHITECTURES FOR EMBEDDED RECONFIGURABLE SYSTEMS — A SURVEY
  • Apr 1, 2009
  • Journal of Circuits, Systems and Computers
  • Lech Józwiak + 1 more

Reconfigurable systems, exploiting a mixture of the traditional CPU-centric instruction-stream-based processing with decentralized parallel application-specific data-dominated processing, provide drastically higher performance and lower power consumption than traditional CPU-centric systems. They do so at much lower cost and shorter time to market than non-reconfigurable hardware solutions. They also provide the flexibility that is often required for engineering modern robust and adaptive systems. Due to their heterogeneity, flexibility and potential for highly optimized application-specific instantiation, reconfigurable computing (RC) systems are adequate for a very broad class of applications across different industry sectors. In this paper, the basic definitions, concepts and features of reconfigurable systems are discussed, as well as their role, the purposes they serve and the applications that can significantly benefit from them. Also, a comparison of hardwired systems, RC systems and CPU-centric systems is made, and some main concepts of effective and efficient reconfigurable computing are discussed. Subsequently, a classification of RC systems is introduced and their various architecture classes are overviewed. This is followed by a discussion of some major drivers and requirements of the recent and future developments in the modern RC system area, and an overview of the recent and future development trends in RC architectures. The reconfigurable system area is very promising, but quite a new field. New opportunities have been opened for this field through the introduction of system-on-a-chip technology, and big progress has been made in recent years. Many different reconfigurable devices and computers have become commercially available. Nevertheless, multiple aspects of RC systems, their development and their supporting tools still belong to the open research and development topics.

  • Book Chapter
  • Cited by 1
  • 10.1016/b978-012370522-8.50015-7
Chapter 10 - Programming Data Parallel FPGA Applications Using the SIMD/Vector Model
  • Jan 1, 2008
  • Reconfigurable Computing
  • Maya B Gokhale


  • Research Article
  • Cited by 7
  • 10.1016/j.jpdc.2008.03.004
A pipelined-loop-compatible architecture and algorithm to reduce variable-length sets of floating-point data on a reconfigurable computer
  • Mar 23, 2008
  • Journal of Parallel and Distributed Computing
  • Gerald R Morris + 1 more

