Abstract

The steadily improving resolution of sensors results in larger and larger data objects, which cannot be analysed in a reasonable amount of time on single workstations. To speed up the analysis, the Divide and Conquer method can be used: large data objects are split into smaller pieces, each piece is analysed on a single node, and finally the partial results are collected and combined. We apply this method to the validated bio-medical framework Ki67-Analysis, which determines the amount of cancer cells in high-resolution images from breast examinations. In previous work, we observed an anomalous behaviour when the framework is applied to subtiles of an image. For each subtile we determined a so-called Ki67-Analysis score parameter, given by the ratio of the number of identified cancer cells to the total number of cells. This parameter turns out to be increasingly underestimated as the subtiles become smaller. The anomaly prevents a direct application of the Divide and Conquer method. In this work, we suggest a novel grey-box testing method for understanding the origin of the anomaly. It allows us to identify a class of subtiles for which the Ki67-Analysis score parameter can be determined reasonably well, i.e. for which the Divide and Conquer method can be applied. By demanding stability of the framework under small additive noise in brightness, “ghost cells” are identified that turn out to be an artefact of the framework. Finally, the challenge of analysing huge single data objects is considered. The upcoming observatory Square Kilometre Array (SKA) will consist of thousands of antennas and telescopes. Due to the exceptional resolving power of SKA, single images of the Universe may be as large as one Petabyte. “Data monsters” of that size cannot be analysed reasonably fast on traditional computing architectures. The relatively low throughput when reading data from disks is a serious bottleneck (the memory-wall problem). Memory-based computing offers a change of paradigm: the current processor-centric architecture is replaced by a memory-centric one. Hewlett Packard Enterprise (HPE) developed a prototype with 48 Terabytes of memory, called the Sandbox. Counting words in large files can be considered a first step towards simulating the image processing of “data monsters” at SKA. We run the big data framework Thrill on the Sandbox and determine the speedup of different setups for distributed word counting.
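
Below is a minimal sketch of the divide-and-conquer scoring described above: the image is split into subtiles, cells are counted per subtile, and the partial counts are combined into one Ki67 score (cancer cells divided by all cells). The two detector callbacks are hypothetical placeholders, not the validated Ki67-Analysis framework, and in the real setting each subtile would be analysed on its own compute node.

```python
import numpy as np

def divide_and_conquer_score(image, tile_size, count_total, count_positive):
    """Split the image into subtiles, count cells per subtile, combine the counts."""
    cancer, total = 0, 0
    h, w = image.shape[:2]
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            total += count_total(tile)        # all cells found in this subtile
            cancer += count_positive(tile)    # Ki67-positive (cancer) cells in this subtile
    # Ki67 score: ratio of identified cancer cells to the total number of cells,
    # computed from the summed partial counts rather than from per-tile ratios.
    return cancer / total if total else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(1024, 1024), dtype=np.uint8)  # stand-in "image"
    count_total = lambda tile: int((tile > 50).sum())       # placeholder "cell" detector
    count_positive = lambda tile: int((tile > 200).sum())   # placeholder "cancer cell" detector
    print(divide_and_conquer_score(image, 256, count_total, count_positive))
```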
