Parallelization Techniques for Error Diffusion with GPU Implementations

Akihiko Kasagi, Koji Nakano, Yasuaki Ito

doi:10.1109/candar.2015.95

Abstract

Error diffusion is a classical but still popular method for generating a binary image that reproduces an original gray-scale image. In error diffusion, pixel values are rounded to binary in raster scan order, and the rounding error is distributed to neighboring pixels that have not yet been processed. The main contribution of this paper is to show several parallel algorithms and implementation techniques for error diffusion. We first present error collection, which collects the quantization error from neighboring pixels that have already been processed. Error collection outputs the same binary image as error diffusion but performs fewer memory write operations, and is thus more efficient than error diffusion. We also present parallel implementations of error diffusion and error collection on the asynchronous CRCW-PRAM. From the theoretical analysis, we show that parallel error diffusion must use one of three costly workarounds: lower parallelism, atomic addition operations, or extra barrier synchronization steps, while parallel error collection needs none of them. We have implemented the parallel error diffusion and parallel error collection algorithms designed for the asynchronous CRCW-PRAM on the GPU. Experimental results show that parallel error collection runs the fastest on the GPU. Further, we have designed parallel algorithms for error diffusion and error collection optimized for CUDA-enabled GPUs using various implementation techniques. From the theoretical point of view, our parallel algorithms are global memory access optimal. Our parallel error collection algorithm for 256M pixels on a GeForce GTX 780 Ti runs in only 46.75 ms and achieves a speedup factor of 43.9 over the best sequential error collection algorithm running on an Intel Core i7-3770K CPU.
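The relationship between the two sequential formulations described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The Floyd-Steinberg weights, the threshold of 128, the 0..255 gray range, and all function names are assumptions made for illustration, since the abstract does not name a specific diffusion kernel. Exact rational arithmetic is used so that both variants can be compared bit for bit.

```python
from fractions import Fraction

# Assumed kernel: Floyd-Steinberg weights as (dy, dx, weight) offsets from
# the pixel being quantized. The abstract does not specify a kernel.
WEIGHTS = [(0, 1, Fraction(7, 16)), (1, -1, Fraction(3, 16)),
           (1, 0, Fraction(5, 16)), (1, 1, Fraction(1, 16))]


def quantize(value):
    """Round a gray level (0..255 scale) to a bit; return (bit, rounding error)."""
    bit = 1 if value >= 128 else 0
    return bit, value - 255 * bit


def error_diffusion(image):
    """Push each pixel's rounding error forward to not-yet-processed neighbors."""
    h, w = len(image), len(image[0])
    work = [[Fraction(p) for p in row] for row in image]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x], err = quantize(work[y][x])
            for dy, dx, wt in WEIGHTS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    work[ny][nx] += err * wt  # up to four writes per pixel
    return out


def error_collection(image):
    """Pull stored rounding errors from already-processed neighbors."""
    h, w = len(image), len(image[0])
    err = [[Fraction(0)] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            value = Fraction(image[y][x])
            for dy, dx, wt in WEIGHTS:
                sy, sx = y - dy, x - dx  # mirrored offset gives the source pixel
                if 0 <= sy < h and 0 <= sx < w:
                    value += err[sy][sx] * wt
            out[y][x], err[y][x] = quantize(value)  # one error write per pixel
    return out
```

In this sketch, the diffusion variant updates up to four neighboring working values per pixel, whereas the collection variant only reads neighbors and stores a single error value per pixel. This read-only access to neighbors is what lets a parallel version avoid concurrent additions to the same memory location.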

Full Text