Abstract

The block-matching and 3D filtering (BM3D) denoising algorithm has been employed in many application fields because of its superior image processing quality. Because of its heavy computational workload, real-time implementation of this algorithm is very challenging. Recent studies on accelerating the BM3D algorithm on GPUs have reported impressive speedups over CPU-based implementations. However, GPU devices are generally energy-inefficient and are therefore poorly suited to embedded application scenarios. In this paper, we propose a dedicated hardware accelerator design that speeds up the BM3D algorithm on an FPGA device with reduced power consumption. The proposed design is based on a deeply pipelined OpenCL kernel architecture that accelerates the compute-intensive procedures of the denoising algorithm by exploiting their intrinsic parallelism and maximizing data reuse. The final design was implemented on Intel's Arria-10 GX1150 FPGA and achieved an average 1.2× performance boost and an 8.3× reduction in energy dissipation compared with a state-of-the-art GPU-based software design.

Highlights

  • Image denoising plays an important role in image and video processing and has become one of the most fundamental technologies in many fields, such as digital cameras [1], medical image processing [2], and computer vision [3]

  • We propose a performance-improved field-programmable gate array (FPGA) accelerator design for real-time processing of the block-matching and 3D filtering algorithm, based on our previous study [17]

  • Experimental setup: to evaluate the performance of the proposed accelerator, we implemented the design on Intel’s A10 FPGA development board

Summary

INTRODUCTION

Image denoising plays an important role in image and video processing and has become one of the most fundamental technologies in many fields, such as digital cameras [1], medical image processing [2], and computer vision [3]. The detailed contributions of this study are as follows: (1) we present a quantitative analysis of the complexity of each function of the BM3D algorithm and propose an accelerator architecture based on deeply pipelined OpenCL kernels to implement the partitioned sub-algorithms; (2) we develop a dedicated systolic-like array architecture for parallel block-matching that exploits the fine-grained data-level parallelism of the algorithm through pipelining and, at the same time, saves a large amount of hardware resources by avoiding the very wide data buses otherwise needed for high-throughput computation; (3) we introduce a parallel line-buffer-based on-chip data caching scheme that maximizes data reuse and greatly reduces the demand on external memory bandwidth; (4) we implemented the proposed design on Intel’s Arria-10 GX1150 FPGA device, and experimental results showed that our design gained more than 20% in performance while achieving a significant 8.3× advantage in power consumption over a state-of-the-art GPU-based design. We have verified that this algorithm optimization has no obvious impact on denoising quality.
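
To make the block-matching step mentioned in contribution (2) more concrete, the following is a minimal sequential sketch in plain Python. It is not the systolic-like OpenCL design described in the paper; it only illustrates the kind of work that design parallelizes, namely an exhaustive search for the blocks most similar to a reference block. The function names, the sum-of-squared-differences distance, and the parameters `n`, `search`, and `max_matches` are assumptions chosen for illustration and are not taken from the source.

```python
def block_distance(img, r0, c0, r1, c1, n):
    """Sum of squared differences between two n x n blocks of img.

    The blocks are identified by their top-left corners (r0, c0) and (r1, c1).
    """
    total = 0
    for dr in range(n):
        for dc in range(n):
            d = img[r0 + dr][c0 + dc] - img[r1 + dr][c1 + dc]
            total += d * d
    return total


def match_blocks(img, ref_r, ref_c, n=4, search=3, max_matches=4):
    """Exhaustively search a window around the reference block.

    Returns up to max_matches tuples (distance, row, col), most similar first.
    The window covers top-left corners within +/- search pixels of the
    reference corner, clipped so every candidate block lies inside the image.
    """
    height, width = len(img), len(img[0])
    row_lo, row_hi = max(0, ref_r - search), min(height - n, ref_r + search)
    col_lo, col_hi = max(0, ref_c - search), min(width - n, ref_c + search)

    candidates = []
    for r in range(row_lo, row_hi + 1):
        for c in range(col_lo, col_hi + 1):
            dist = block_distance(img, ref_r, ref_c, r, c, n)
            candidates.append((dist, r, c))

    # Sorting by distance keeps the reference block itself (distance 0) first.
    candidates.sort()
    return candidates[:max_matches]
```

Every candidate position recomputes n × n pixel differences, and neighbouring candidates overlap heavily, which hints at why pipelining the distance computation and caching image rows on chip, as the contributions describe, can pay off.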

COLLABORATIVE DENOISE FILTERING
AGGREGATION
AGGREGATION KERNEL
CONCLUSION