The continuing rise of DNN usage in distributed and embedded applications demands more efficient hardware execution in the field. Low-precision GeMMs with optimized data formats play a key role in making networks more memory- and compute-efficient. A recent trend, driven by tight HW-SW co-optimization, is block-scaled representations, which compress network size by sharing exponents per data block. Prior work mostly deploys such block-scaled GeMM operations on domain-specific accelerators for maximum efficiency, at the cost of flexibility and ease of deployment. In this work, we exploit and optimize the deployment of block-scaled GeMMs on fully programmable in-order vector processors using ARM SVE. We define a systematic design-space-exploration methodology that optimally matches workload specifications with processor vector lengths, microkernel variants, and block sizes and shapes. We introduce efficient intrinsics-based microkernels with effective loop unrolling, as well as data-transfer-efficient fused requantization strategies, to maximize kernel performance while supporting several deployment configurations. Tunable block sizes and shapes enable generalized block-scaled kernel deployments, accommodating different accuracy-speed trade-off requirements. Using 2D activation blocks instead of conventional 1D blocks, the static and dynamic BS-INT8 configurations yield average speedups of 3.8x and 2.9x over FP32 models, respectively, with no accuracy loss on CNN classification tasks on the CIFAR-10/100 datasets.
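To make the block-scaled idea concrete, the sketch below quantizes a matrix into INT8 values where each 2D block shares a single power-of-two exponent, and dequantizes it back. This is only an illustrative NumPy model of the data format, not the paper's SVE implementation; the function names, the default 4x4 block shape, and the symmetric INT8 range are assumptions.

```python
import numpy as np

def quantize_block_scaled(x, block_shape=(4, 4)):
    """Illustrative block-scaled INT8 quantization: each 2D block of
    `block_shape` elements shares one power-of-two exponent (the shared
    scale). Assumes x.shape is divisible by block_shape."""
    qmax = 127  # symmetric INT8 range [-127, 127]
    br, bc = block_shape
    rows, cols = x.shape
    q = np.zeros((rows, cols), dtype=np.int8)
    exps = np.zeros((rows // br, cols // bc), dtype=np.int32)
    for bi in range(rows // br):
        for bj in range(cols // bc):
            blk = x[bi * br:(bi + 1) * br, bj * bc:(bj + 1) * bc]
            amax = float(np.abs(blk).max())
            # smallest exponent e with amax / 2**e <= qmax
            e = 0 if amax == 0.0 else int(np.ceil(np.log2(amax / qmax)))
            exps[bi, bj] = e
            q[bi * br:(bi + 1) * br, bj * bc:(bj + 1) * bc] = np.clip(
                np.round(blk / 2.0 ** e), -qmax, qmax).astype(np.int8)
    return q, exps

def dequantize_block_scaled(q, exps, block_shape=(4, 4)):
    """Expand each block's shared exponent back to per-element scales."""
    br, bc = block_shape
    scales = np.repeat(np.repeat(2.0 ** exps, br, axis=0), bc, axis=1)
    return q.astype(np.float32) * scales
```

A 1D block variant simply uses a block shape like (1, 32); the 2D activation blocks highlighted in the results amortize one exponent over a row-column tile instead of a single row segment.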