AppAxO: Designing Application-specific Approximate Operators for FPGA-based Embedded Systems
This paper introduces a generic FPGA-based methodology for designing application-specific approximate arithmetic operators using lookup tables and carry-chains, enabling tailored accuracy-performance trade-offs. Evaluation on biomedical, image processing, and neural network benchmarks shows the approach yields more non-dominated multipliers with improved hypervolume contributions compared to state-of-the-art designs.
Approximate arithmetic operators, such as adders and multipliers, are increasingly used to satisfy the energy and performance requirements of resource-constrained embedded systems. However, most of the available approximate operators have an application-agnostic design methodology, and the efficacy of these operators can only be evaluated by employing them in the applications. Furthermore, the various available libraries of approximate operators do not share any standard approximation-induction policy to design new operators according to an application’s accuracy and performance constraints. These limitations also hinder the utilization of machine learning models to explore and determine approximate operators according to an application’s requirements. In this work, we present a generic design methodology for implementing FPGA-based application-specific approximate arithmetic operators. Our proposed technique utilizes lookup tables and carry-chains of FPGAs to implement approximate operators according to the input configurations. For instance, for an \( \text{M}\times \text{N} \) accurate multiplier utilizing K lookup tables, our methodology utilizes K-bit configurations to design \( 2^K \) approximate multipliers. We then utilize various machine learning models to evaluate and select configurations satisfying application accuracy and performance constraints. We have evaluated our proposed methodology for three benchmark applications, i.e., biomedical signal processing, image processing, and ANNs. With the proposed design methodology, we report more non-dominated approximate multipliers with better hypervolume contribution than state-of-the-art designs for these benchmark applications.
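As an illustration of the configuration space described above, the sketch below treats each bit of a K-bit configuration as enabling or disabling one partial-product "LUT" of a small array multiplier, so an all-ones configuration reproduces the exact product and every other configuration yields one of the \( 2^K \) approximate variants. The function name and the pruning scheme are illustrative assumptions, not the paper's actual LUT encoding.

```python
def approx_multiply(a: int, b: int, config: int, n: int = 4) -> int:
    """Illustrative n x n approximate multiplier: each bit of `config`
    enables (1) or disables (0) one AND-gate partial-product bit,
    mimicking how a K-bit configuration selects which LUT outputs
    survive. `config` has n*n bits, one per partial product."""
    result = 0
    k = 0
    for i in range(n):          # bits of a
        for j in range(n):      # bits of b
            pp = ((a >> i) & 1) & ((b >> j) & 1)
            if (config >> k) & 1:   # configuration bit keeps this "LUT"
                result += pp << (i + j)
            k += 1
    return result

# The all-ones configuration is the exact 4x4 multiplier.
assert approx_multiply(7, 9, (1 << 16) - 1) == 63
```

Disabling a single configuration bit removes one partial-product term, trading a small, analyzable error for the resources that term would have consumed.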
- Conference Article
9
- 10.1145/3583781.3590222
- Jun 5, 2023
The run-time reconfigurability and high parallelism offered by Field Programmable Gate Arrays (FPGAs) make them an attractive choice for implementing hardware accelerators for Machine Learning (ML) algorithms. In the quest for designing efficient FPGA-based hardware accelerators for ML algorithms, the inherent error-resilience of ML algorithms can be exploited to implement approximate hardware accelerators that trade output accuracy for better overall performance. As multiplication and addition are the two main arithmetic operations in ML algorithms, most state-of-the-art approximate accelerators have considered approximate architectures for these operations. However, these works have mainly considered the exploration and selection of approximate operators from an existing set of operators. To this end, we provide an efficient methodology for synthesizing and implementing novel approximate operators. Specifically, we propose a novel operator synthesis approach that supports multiple operator algorithms to provide new approximate multiplier and adder designs for AI inference applications. We report up to 27% and 25% lower power than state-of-the-art approximate designs, with equivalent error behavior, for 8-bit unsigned adders and 4-bit signed multipliers, respectively. Further, we propose a correlation-aware Design Space Exploration (DSE) method that can improve the efficacy of randomized search algorithms in synthesizing novel approximate operators.
- Conference Article
- 10.23919/date51398.2021.9474239
- Feb 1, 2021
In this paper, we propose TruLook, a framework that employs approximate computing techniques for GPU acceleration through computation reuse as well as approximate arithmetic operations to eliminate redundant and unnecessary exact computations. To enable computational reuse, the GPU is enhanced with small lookup tables placed close to the stream cores, returning already-computed values for exact and potentially inexact matches. Inexact matching is subject to a threshold controlled by the number of mantissa bits involved in the search. Approximate arithmetic is provided by a configurable approximate multiplier that dynamically detects and approximates operations which are not significantly affected by approximation. TruLook guarantees the accuracy bound required for an application by configuring the hardware at runtime. We have evaluated TruLook's efficiency on a wide range of multimedia and deep learning applications. Our evaluation shows that with 0% and less than 1% quality loss budget, TruLook yields on average 2.1× and 5.6× energy-delay product improvement over four popular networks on the ImageNet dataset.
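The mantissa-bit threshold described above can be pictured with a small sketch: keeping only the top few mantissa bits of a float32 value yields a lookup key under which nearby operands collide, so a cached result can be reused for either. The key function and bit counts here are illustrative assumptions, not TruLook's actual hardware matching logic.

```python
import struct

def match_key(x: float, mantissa_bits: int) -> int:
    """Keep the sign, exponent, and top `mantissa_bits` of the 23-bit
    float32 mantissa, discarding the rest, so that nearby values map to
    the same lookup-table entry (an illustrative stand-in for
    threshold-controlled inexact matching)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    drop = 23 - mantissa_bits       # low mantissa bits to discard
    return bits >> drop

# With only 8 mantissa bits kept, these two operands collide, so a
# previously computed product could be returned for either of them.
assert match_key(1.000001, 8) == match_key(1.0000015, 8)
```

Raising `mantissa_bits` tightens the match (fewer reuse hits, lower error); lowering it loosens it, which is exactly the accuracy-versus-reuse knob the abstract describes.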
- Research Article
- 10.1109/mpul.2013.2296811
- Mar 1, 2014
- IEEE Pulse
It is difficult to find a book that covers both biomedical image processing and biomedical signal processing that also has good coverage of the physics of biomedical signal acquisition techniques and biomedical imaging. This book, which contains much information in the field of signal and image processing techniques (the first edition appeared in 2005), has perfectly achieved this goal. The authors introduce concepts at a level appropriate for senior undergraduate- or graduate-level students. Additionally, the text has many homework and programming questions to review the explained concepts. The publisher also provides a Web site where more than 200 examples of signals, images, and exercises can be downloaded.
- Conference Article
1
- 10.1109/smc53654.2022.9945091
- Oct 9, 2022
The non-local means image filter is a non-trivial denoising algorithm for color images utilizing floating-point arithmetic operations in its reference software implementation. In order to simplify this algorithm for an on-chip implementation, we investigate the impact of various number representations and approximate arithmetic operators on the quality of image filtering. We employ Cartesian Genetic Programming (CGP) to evolve approximate implementations of a 20-bit signed multiplier which is then applied in the image filter instead of the conventional 32-bit floating-point multiplier. In addition to using several techniques that reduce the huge design cost, we propose a new mutation operator for CGP to improve the search quality and obtain better approximate multipliers than with CGP utilizing the standard mutation operator. Image filters utilizing evolved approximate multipliers can save 35% in power consumption of multiplication operations for a negligible drop in the image filtering quality.
- Conference Article
7
- 10.1109/aqtr55203.2022.9801944
- May 19, 2022
In the last decade, Approximate Computing (AxC) has been extensively employed to improve the energy efficiency of computing systems at different abstraction levels. The main AxC goal is reducing the energy budget used to execute error-tolerant applications, at the cost of a controlled and intrinsically tolerable quality degradation. A substantial amount of work has been done in proposing approximate versions of basic operations, using fewer resources. From a hardware standpoint, several approximate arithmetic operations have been proposed. Although effective, such approximate hardware operators are not tailored to a specific final application. Thus, their effectiveness depends on the actual application using them. Taking into account the target application and the related input data distribution, the final energy efficiency can be pushed further. In this paper, we showcase the advantage of considering the data distribution by designing an input-aware approximate multiplier specifically intended for a high-pass FIR filter, where the input distribution pattern for one operand is not uniform. Experimental results show that we can significantly reduce the power consumption while keeping an error rate lower than state-of-the-art approximate multipliers.
- Conference Article
15
- 10.1109/isocc50952.2020.9333013
- Oct 21, 2020
Convolutional Neural Networks (CNNs) for Artificial Intelligence (AI) algorithms have been widely used in many applications, especially image recognition. However, the growth in CNN-based image recognition applications has raised the challenge of executing millions of Multiply and Accumulate (MAC) operations in state-of-the-art CNNs. Therefore, GPUs, FPGAs, and ASICs are the feasible solutions for balancing processing speed and power consumption. In this paper, we propose an efficient hardware architecture for CNNs that provides high speed, low power, and small area, targeting ASIC implementation of a CNN accelerator. To realize a low-cost inference accelerator, we introduce approximate arithmetic operators for the MAC operators, which comprise the key datapath components of CNNs. The proposed accelerator architecture exploits parallel memory access and N-way high-speed approximate MAC units in the convolutional layers as well as the fully connected layers. Since CNNs are tolerant to small errors due to the nature of convolutional filters, the approximate arithmetic operations incur little or no noticeable loss in the accuracy of the CNN, which we demonstrate in our test results. For the approximate MAC unit, we use the Dynamic Range Unbiased Multiplier (DRUM) approximate multiplier and the Approximate Adder with OR operations on LSBs (AOL), which can substantially reduce chip area and power consumption. The configuration of the approximate MAC units within each layer affects the overall accuracy of the CNN. We implemented various configurations of approximate MACs on an FPGA and evaluated the accuracy using an extended MNIST dataset. Our implementation and evaluation with selected approximate MACs demonstrate that the proposed CNN accelerator reduces the area of the CNN by 15% at the cost of a small accuracy loss of only 0.982% compared to the reference CNN.
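The AOL scheme named above is simple enough to sketch: the k least-significant bits are OR-ed instead of added, so no carry is ever generated from the low part, and only the upper bits pass through an exact adder. The function below is a bit-level model of that idea (our own naming; the paper's hardware realization is not shown here).

```python
def aol_add(a: int, b: int, k: int) -> int:
    """Approximate Adder with OR operations on the k LSBs (AOL):
    lower k bits are OR-ed (carry-free), upper bits added exactly."""
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)       # carry-free OR on the LSBs
    high = ((a >> k) + (b >> k)) << k   # exact addition on the MSBs
    return high | low

# When both operands have set bits in the low part, the missing carry
# shows up as error: 3 + 3 is approximated as 3 | 3 = 3 with k = 2.
assert aol_add(3, 3, 2) == 3
```

The error is bounded by the weight of the dropped carries out of the low k bits, which is why the per-layer choice of k trades accuracy against adder area and power.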
- Conference Article
4
- 10.1145/3229631.3229646
- Jul 15, 2018
Approximate computing techniques have been proposed to decrease the energy consumption or improve the performance of computation. Several designs of approximate arithmetic operators have been proposed in the literature, suggesting that the operators' energy can be reduced by up to 46%. This paper puts existing work in perspective by studying the impact of these operators in a processor targeting embedded systems. We augment a RISC-V processor with approximate addition and multiplication that support a variable bit-width in the range of 1 to 16 bits. We investigate the error-energy trade-off for a Sobel filter application. The results indicate that, for an acceptable output image degradation, 7.5% of the energy is saved. We conclude that the benefits of using approximate operators in embedded systems processors have yet to be proven.
- Research Article
3
- 10.1145/3564243
- Jul 31, 2023
- ACM Transactions on Embedded Computing Systems
Domain-specific accelerators for signal processing, image processing, and machine learning are increasingly being implemented on SRAM-based field-programmable gate arrays (FPGAs). Owing to the inherent error tolerance of such applications, approximate arithmetic operations, in particular, the design of approximate multipliers, have become an important research problem. Truncation of lower bits is a widely used approximation approach; however, analyzing and limiting the effects of carry-propagation due to this approximation has not been explored in detail yet. In this article, an optimized carry-aware approximate radix-4 Booth multiplier design is presented that leverages the built-in slice look-up tables (LUTs) and carry-chain resources in a novel configuration. The proposed multiplier simplifies the computation of the upper and lower bits and provides significant benefits in terms of FPGA resource usage (LUTs saving 38.5%–42.9%), Power Delay Product (PDP saving 49.4%–53%), performance metric (LUTs × critical path delay (CPD) × PDP saving 68.9%–73.1%) and errors (70% improvement in mean relative error distance) compared to the latest state-of-the-art designs. Therefore, the proposed designs are an attractive choice to implement multiplication on FPGA-based accelerators.
- Research Article
- 10.1007/s13369-014-1539-z
- Dec 28, 2014
- Arabian Journal for Science and Engineering
Floating-point (FP) multipliers are the main energy consumers in many FP applications. Recently, several FP multipliers with multiple-precision modes have been designed to trade the energy consumption of an FP multiplication operation (MOP) against its output accuracy. To effectively apply these multi-mode multipliers to FP applications, this paper presents a fast instruction precision assignment method for reducing energy consumption under accuracy and performance constraints. To easily set and check the accuracy constraint, we first build an affine-arithmetic-based error model to evaluate the overall output accuracy loss caused by inaccurate FP MOPs. Moreover, a simplified instruction scheduling method is also developed to quickly check the performance constraint. Based on these two check functions and the characteristics of the proposed multi-mode multiplier, a fast Tabu search (TS) algorithm is then proposed to assign the precision mode of each FP MOP under the accuracy and performance constraints imposed on the given application. Experimental results show that the proposed fast TS algorithm can find precision assignments with greater energy savings and less search time compared to previous methods.
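The affine-arithmetic error model mentioned above rests on representing each quantity as \( x_0 + \sum_i x_i \varepsilon_i \) with \( |\varepsilon_i| \le 1 \), so error terms from different inexact operations stay symbolically separate and partially cancel. The minimal class below shows only affine addition and the resulting error radius; the class name and interface are our own sketch, not the paper's model (affine multiplication, which introduces a fresh noise term for the nonlinear part, is omitted).

```python
class Affine:
    """Affine form x0 + sum(xi * eps_i), |eps_i| <= 1: `center` is x0,
    `terms` maps a noise-symbol id i to its coefficient xi."""

    def __init__(self, center: float, terms=None):
        self.center = center
        self.terms = dict(terms or {})

    def __add__(self, other: "Affine") -> "Affine":
        # Addition is exact in affine arithmetic: matching noise
        # symbols combine coefficient-wise, so correlated errors cancel.
        t = dict(self.terms)
        for i, xi in other.terms.items():
            t[i] = t.get(i, 0.0) + xi
        return Affine(self.center + other.center, t)

    def radius(self) -> float:
        """Worst-case deviation from the center over all eps_i in [-1, 1]."""
        return sum(abs(xi) for xi in self.terms.values())

# Two values carrying independent error terms from inexact multiplies:
a = Affine(1.0, {1: 0.1})
b = Affine(2.0, {2: 0.2})
assert (a + b).radius() == 0.1 + 0.2
```

Checking an accuracy constraint then amounts to verifying that the radius of the output's affine form stays below the allowed accuracy loss, which is what makes such a model fast to evaluate inside a Tabu search loop.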
- Conference Article
4
- 10.1145/3218603.3218612
- Jul 23, 2018
Major advancements in building general-purpose and customized hardware have been one of the key enablers of versatility and pervasiveness of machine learning models such as deep neural networks. To sustain this ubiquitous deployment of machine learning models and cope with their computational and storage complexity, several solutions such as low-precision representation of model parameters using fixed-point representation and deploying approximate arithmetic operations have been employed. Studying the potency of such solutions in different applications requires integrating them into existing machine learning frameworks for high-level simulations as well as implementing them in hardware to analyze their effects on power/energy dissipation, throughput, and chip area. Lop is a library for design space exploration that bridges the gap between machine learning and efficient hardware realization. It comprises a Python module, which can be integrated with some of the existing machine learning frameworks and implements various customizable data representations including fixed-point and floating-point as well as approximate arithmetic operations. Furthermore, it includes a highly-parameterized Scala module, which allows synthesizing hardware based on the said data representations and arithmetic operations. Lop allows researchers and designers to quickly compare quality of their models using various data representations and arithmetic operations in Python and contrast the hardware cost of viable representations by synthesizing them on their target platforms (e.g., FPGA or ASIC). To the best of our knowledge, Lop is the first library that allows both software simulation and hardware realization using customized data representations and approximate computing techniques.
- Research Article
17
- 10.1016/j.ijar.2008.01.007
- Feb 29, 2008
- International Journal of Approximate Reasoning
Approximation by Shepard type pseudo-linear operators and applications to Image Processing
- Conference Article
14
- 10.1109/icip46576.2022.9897629
- Oct 16, 2022
Artificial intelligence has become pervasive across disciplines and fields, and biomedical image and signal processing is no exception. The growing and widespread interest in the topic has triggered a vast research activity that is reflected in an exponential growth in research effort. Through the study of massive and diverse biomedical data, machine and deep learning models have revolutionized various tasks such as modeling, segmentation, registration, classification, and synthesis, outperforming traditional techniques. However, the difficulty of translating the results into biologically/clinically interpretable information is preventing their full exploitation in the field. Explainable AI (XAI) attempts to fill this translational gap by providing means to make the models interpretable and to provide explanations. Different solutions have been proposed so far and are gaining increasing interest from the community. This paper aims to provide an overview of XAI in biomedical data processing and points to an upcoming Special Issue on Deep Learning in Biomedical Image and Signal Processing of the IEEE Signal Processing Magazine that is going to appear in March 2022.
- Research Article
1
- 10.1002/cnm.2574
- Aug 8, 2013
- International Journal for Numerical Methods in Biomedical Engineering
This special issue of the journal includes 11 articles related to biomedical image processing: two concern brain segmentation from MRI images; two use medical imaging to build patient-customized 3D finite element models; one applies segmentation algorithms to CT images of the spine; one addresses the simulation of middle cerebral artery Doppler signals; two are related to the study of red blood cells from images; one uses classification procedures on cell tissue images; one addresses Monte Carlo simulations of PET imaging to estimate the activity correction according to patient-specific weight; and the last uses electromyography (EMG) and wavelet functions to study diabetic neuropathy in the lower limb muscles during gait. The main objective of this special issue on “Computational Methods for Biomedical Image Processing and Analysis” is to disseminate recent advances in the related fields and to identify areas of potential collaboration among researchers of different sciences. The issue comprises 11 contributions from nine countries, selected from 15 works previously presented at the “III ECCOMAS Thematic Conference on Computational Vision and Medical Image Processing (VipIMAGE 2011)”, held in the Algarve, Portugal, on 12–14 October 2011, and substantially extended for this special issue. The articles included address different topics and applications related to Biomedical Image Processing and Analysis, including medical imaging, image segmentation, modeling and simulation, biomedical signal and image processing and analysis, biomechanics, 3D reconstruction, motion tracking and analysis, optimization, software development, assisted diagnosis, and virtual reality. Computational methods of signal processing and analysis, particularly for 2D, 3D, and 4D images, are commonly used in many applications across society.
For instance, fully automated or semi-automated systems based on Image Processing and Analysis algorithms have been increasingly used in surveillance, recognition, inspection, human-machine interfaces, 3D vision, and motion and deformation analysis. One of the main characteristics of the Image Processing and Analysis domain is its multidisciplinary nature. In fact, methodologies from several sciences, including Informatics, Mathematics, Statistics, Psychology, Mechanics, and Physics, can usually be found in this domain. Beyond this multidisciplinarity, one of the main reasons for the continual effort invested in this domain is the number of applications readily found in medicine. Examples include the use of statistical, geometrical, or physics-based procedures on medical images to model the imaged structures and achieve different goals, such as image segmentation, image registration, shape reconstruction, simulation, motion and deformation analysis, virtual reality, computer-assisted therapy, or tissue characterization.
- Research Article
64
- 10.1109/jetcas.2020.3032495
- Oct 21, 2020
- IEEE Journal on Emerging and Selected Topics in Circuits and Systems
Libraries of approximate circuits are composed of fully characterized digital circuits that can be used as building blocks of energy-efficient implementations of hardware accelerators. They can be employed not only to speed up the accelerator development but also to analyze how an accelerator responds to introducing various approximate operations. In this paper, we present a methodology that automatically builds comprehensive libraries of approximate circuits with desired properties. Target approximate circuits are generated using Cartesian genetic programming. In addition to extending the EvoApprox8b library that contains common approximate arithmetic circuits, we show how to generate more specific approximate circuits; in particular, MxN-bit approximate multipliers that exhibit promising results when deployed in convolutional neural networks. By means of the evolved approximate multipliers, we perform a detailed error resilience analysis of five different ResNet networks. We identify the convolutional layers that are good candidates for adopting the approximate multipliers and suggest particular approximate multipliers whose application can lead to the best trade-offs between the classification accuracy and energy requirements. Experiments are reported for CIFAR-10 and CIFAR-100 data sets.
- Front Matter
2
- 10.1016/j.future.2021.05.027
- May 25, 2021
- Future Generation Computer Systems
Editorial: Special issue on Advancing on Approximate Computing: Methodologies, Architectures and Algorithms