Abstract

Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. A direct implementation of filtering with a k × k kernel requires k² − 1 comparisons per sample, so the cost grows quadratically with the kernel size k. Faster computation can be achieved by kernel decomposition and constant-time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture uses fewer computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture computes max/min filters on 1024 × 1024 images with kernels up to 255 × 255 in around 8.4 milliseconds (120 frames per second) at a clock frequency of 250 MHz. The implementation is highly scalable with the kernel size and offers a good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is demonstrated for local adaptive image thresholding.
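The kernel decomposition mentioned above exploits the separability of rectangular max/min filters: a k × k max filter equals a 1-D horizontal pass followed by a 1-D vertical pass. A minimal Python sketch of this idea (a software illustration only, not the paper's hardware design; the function names are made up for this example, and the naive inner `max` is used for brevity rather than a constant-time algorithm):

```python
def max_filter_1d(row, k):
    # Naive 1-D running max over all length-k windows of `row`.
    return [max(row[x:x + k]) for x in range(len(row) - k + 1)]

def max_filter_2d(img, k):
    # Separable decomposition: a k x k max filter is a row pass
    # followed by a column pass, reducing k*k - 1 comparisons per
    # sample to roughly 2*(k - 1).
    rows = [max_filter_1d(r, k) for r in img]          # horizontal pass
    cols = [max_filter_1d(list(c), k) for c in zip(*rows)]  # vertical pass
    return [list(r) for r in zip(*cols)]               # transpose back
```

For example, `max_filter_2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 2)` yields `[[5, 6], [8, 9]]`, each output being the max of one 2 × 2 window. A min filter follows by replacing `max` with `min`.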

Highlights

  • Running max/min filtering is an important operation that aims at selecting the maximum or minimum value from a set of signal elements

  • This paper proposes an efficient coarse-grain pipelined implementation of the HGW algorithm as a building block, with reduced memory usage based on the distributed memory available on Field Programmable Gate Array (FPGA) devices, compared to previous architectures that use dedicated embedded Block RAM

  • The proposed architecture is very regular and scalable, with a good resource-performance tradeoff suitable for embedding in low-cost FPGA devices

Summary

Introduction

Running max/min filtering is an important operation that selects the maximum or minimum value from a set of signal elements. The design's memory consumption is independent of the image size, but further increasing the parallelism, for instance to process several image lines concurrently, is limited by the number of Block RAMs available on FPGAs. In [14], another implementation of erosion/dilation based on SE decomposition and/or efficient 1-D algorithms is proposed. The HGW algorithm is amenable to parallelism and coarse-grain pipelining; the large data buffers required to store g(x) and h(x), together with the pipelined computation of r(x), are identified as the most challenging aspects for a hardware implementation. In this sense, the solution proposed in [13] is not fully adequate for embedded scenarios, as its memory requirements are substantially high. The goal is an optimized embedded implementation with high throughput that reduces dedicated on-chip Block RAM through efficient use of the distributed memory resources available in current FPGA devices.
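The g(x), h(x), and r(x) buffers mentioned above come from the HGW recurrences: the signal is partitioned into blocks of length k, g(x) is a running (prefix) max within each block, h(x) is a backward (suffix) max, and each output is the max of one g value and one h value. A minimal Python sketch of the 1-D algorithm (a software reference for the recurrences only, not the proposed pipelined hardware; boundary handling by edge replication is an assumption for this example):

```python
def hgw_max_filter(f, k):
    # 1-D van Herk/Gil-Werman running max over all length-k windows of f.
    n = len(f)
    f = list(f) + [f[-1]] * ((-n) % k)  # pad to a multiple of k (edge replication)
    m = len(f)
    g = [0] * m  # forward (prefix) max within each k-block
    h = [0] * m  # backward (suffix) max within each k-block
    for b in range(0, m, k):
        g[b] = f[b]
        for x in range(b + 1, b + k):
            g[x] = max(g[x - 1], f[x])
        h[b + k - 1] = f[b + k - 1]
        for x in range(b + k - 2, b - 1, -1):
            h[x] = max(h[x + 1], f[x])
    # The window starting at x covers f[x .. x+k-1]; it spans at most two
    # blocks, so its max is the suffix max of the first block combined with
    # the prefix max of the second: r(x) = max(h(x), g(x+k-1)).
    return [max(h[x], g[x + k - 1]) for x in range(n - k + 1)]
```

For instance, `hgw_max_filter([3, 1, 4, 1, 5, 9, 2, 6], 3)` returns `[4, 4, 5, 9, 9, 9]`. Note that each output costs roughly three comparisons regardless of k, which is the constant-time property the hardware architecture exploits; in a pipelined implementation the g and h streams must be buffered for about one block length, which is exactly the memory pressure the paper addresses.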

Proposed Architecture
Implementation and Experimental Results
Conclusions
