Abstract

The main drawback of conventional tools for digital image processing is the long processing time caused by the high complexity of their algorithms. This worsens when these algorithms must be applied sequentially to large image sets. To alleviate this problem, this paper introduces a general-purpose tool for massively processing large digital image sets using Apache Spark. The proposed tool allows users to extract image rasters and store them in either of Spark's basic distributed data representations, namely Resilient Distributed Datasets (RDDs) and DataFrames (DFs), so that all subsequent image operations can be treated as RDD/DF transformations. Our experiments reveal that, with our proposal, distributed image processing tasks can be scheduled and executed in less time than with another Spark-based massive image processing tool. In these experiments, we applied several algorithms to 25,000 images (the MIRFLICKR-25000 set), reaching a maximum speedup of 54x. In addition, we observed that the number of images also influences the speedup once the cluster memory is fully occupied. We can therefore claim that, using our proposal, more complex image processing workflows can be built and applied massively to large image sets, achieving competitive speedups.
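
To make the RDD/DF pattern described above concrete, the sketch below shows one way to express an image operation as an ordinary Spark transformation. This is an illustrative example under our own assumptions, not the paper's actual tool: the HDFS path, the pixel-inversion operation, and the choice of binaryFiles plus java.awt decoding are hypothetical, standing in for whatever raster extraction and operations the tool implements.

```scala
import org.apache.spark.sql.SparkSession
import javax.imageio.ImageIO

object DistributedImageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DistributedImageSketch")
      .getOrCreate()

    // Load raw image files as an RDD of (path, binary stream) pairs.
    // binaryFiles is a standard SparkContext API; the path is hypothetical.
    val images = spark.sparkContext
      .binaryFiles("hdfs:///data/mirflickr/*.jpg")

    // An image operation expressed as a plain RDD transformation:
    // decode each file and invert its RGB channels (example operation only).
    val inverted = images.mapValues { stream =>
      val in = stream.open()
      val img = try ImageIO.read(in) finally in.close()
      for (x <- 0 until img.getWidth; y <- 0 until img.getHeight)
        img.setRGB(x, y, img.getRGB(x, y) ^ 0x00FFFFFF) // flip RGB bits, keep alpha
      img
    }

    // Actions trigger the lazily built distributed computation.
    println(s"Processed ${inverted.count()} images")
    spark.stop()
  }
}
```

The DataFrame counterpart would start from Spark's built-in image data source (available since Spark 2.4), spark.read.format("image").load(path), which yields a struct column containing the decoded raster (origin, height, width, nChannels, mode, data), over which image operations can then be written as DF transformations.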

