Performance-efficient distributed transfer and transformation of big spatial histopathology datasets in the cloud

Esma Yildirim

doi:10.1186/s40537-021-00546-3

Abstract

Whole Slide Image (WSI) datasets are giga-pixel resolution, unstructured histopathology datasets that consist of extremely large files (each can be multiple GBs in compressed format). These datasets have utility in a wide range of diagnostic and investigative pathology applications. However, the datasets present unique challenges: the size of the files, proprietary data formats, and the lack of efficient parallel data access libraries limit the scalability of these applications. Commercial clouds provide dynamic, cost-effective, scalable infrastructure to process these datasets; however, we lack the tools and algorithms that will transfer and transform them onto the cloud seamlessly, providing faster speeds and scalable formats. In this study, we present novel algorithms that transfer these datasets onto the cloud while at the same time transforming them into symmetric scalable formats. Our algorithms use intelligent file size distribution and pipeline the transfer and transformation tasks without introducing extra overhead to the underlying system. The algorithms, tested in the Amazon Web Services (AWS) cloud, outperform the widely used transfer tools and algorithms, and also outperform our previous work. Data access to the transformed datasets also performs better than in related work. The transformed symmetric datasets are fed into three different analytics applications: a distributed implementation of a content-based image retrieval (CBIR) application for prostate carcinoma datasets, a deep convolutional neural network application for classification of breast cancer datasets, and, to show that the algorithms can work with any spatial dataset, a Canny Edge Detection application on satellite image datasets. Although different in nature, all of the applications work easily with our new symmetric data format, and performance results show near-linear speed-ups as the number of processors increases.

Highlights

Whole Slide Image (WSI) datasets are very large tissue slide images in multi-giga-pixel resolution, produced by digital scanners, and they have utility in a wide range of diagnostic and investigative pathology applications [1].
In the MPI-based approach of Bueno et al. [2], parallelism is provided for a single WSI at a time, without considering large datasets that consist of several WSIs.
The Amazon Web Services (AWS) S3 storage system and the AWS EMR service are used for the experiments. Our algorithms will work on any system with a Hadoop installation, and we support different URL types (s3://, hdfs://, http://).
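
As an illustration of what supporting several URL types can look like in practice, the following minimal Python sketch validates the scheme of a dataset URL before any transfer is attempted. It is not taken from the paper's implementation: the function name `storage_backend`, the constant `SUPPORTED_SCHEMES`, and the example URL are hypothetical, and only the three schemes listed above are assumed to be accepted.

```python
from urllib.parse import urlparse

# Schemes named in the highlights above; anything else is rejected.
# (Hypothetical constant, not part of the paper's code.)
SUPPORTED_SCHEMES = {"s3", "hdfs", "http"}


def storage_backend(url: str) -> str:
    """Return the lower-cased scheme of a dataset URL.

    Raises ValueError when the scheme is not one of SUPPORTED_SCHEMES,
    so that an unsupported source fails before any data is moved.
    """
    scheme = urlparse(url).scheme.lower()
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    return scheme


if __name__ == "__main__":
    # Hypothetical bucket and object name, used only for illustration.
    print(storage_backend("s3://example-bucket/slides/slide-001.svs"))
```

Checking the scheme up front keeps the choice of storage system in one place, so the rest of a transfer routine can stay independent of where the files live.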

Summary

Introduction

WSI datasets are very large tissue slide images in multi-giga-pixel resolution, produced by digital scanners, and they have utility in a wide range of diagnostic and investigative pathology applications [1]. Bueno et al. [2] used an MPI-based parallel data access method to bring the images into the memory of a node. Parallelism is provided for a single WSI at a time, and the method does not consider large datasets that consist of several WSIs. The amount of data loaded into memory depends on the size of the node memory. Fixing the tile size according to the size of the WSI and the memory limit of the node may increase algorithm overhead. Their performance results show that the approach scales only up to 17 cores. Another downside is that the analytics application has to be written in MPI to use this access method.
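
To make the dependence of tile size on node memory concrete, the sketch below computes the largest square tile that fits a given memory budget and the resulting tile grid for one image. It is an illustrative calculation only, not the method of Bueno et al. [2] or of this paper: the function names, the default of 3 bytes per pixel (uncompressed RGB), and the `tiles_in_flight` parameter are all assumptions.

```python
import math


def max_square_tile_side(mem_limit_bytes: int, width: int, height: int,
                         bytes_per_pixel: int = 3,
                         tiles_in_flight: int = 1) -> int:
    """Largest square tile side (in pixels) that fits the memory budget.

    Assumes uncompressed pixels and that `tiles_in_flight` tiles are held
    in memory at once; both are illustrative assumptions.
    """
    if min(mem_limit_bytes, width, height,
           bytes_per_pixel, tiles_in_flight) <= 0:
        raise ValueError("all arguments must be positive")
    # Pixels that one tile may occupy under the budget.
    pixels_per_tile = mem_limit_bytes // (bytes_per_pixel * tiles_in_flight)
    side = math.isqrt(pixels_per_tile)
    if side == 0:
        raise ValueError("memory budget too small for a single pixel")
    # A tile never needs to be larger than the image itself.
    return min(side, width, height)


def tile_grid(width: int, height: int, side: int) -> tuple[int, int]:
    """Number of tile columns and rows needed to cover the image."""
    cols = -(-width // side)   # ceiling division
    rows = -(-height // side)
    return cols, rows


if __name__ == "__main__":
    # Hypothetical 100,000 x 80,000 pixel slide and a 1 GiB budget.
    side = max_square_tile_side(1 << 30, 100_000, 80_000)
    print(side, tile_grid(100_000, 80_000, side))
```

A smaller memory budget yields smaller tiles and therefore more of them per image, which is one way a tile size tied to node memory can add per-tile overhead.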

Objectives

Results

Conclusion