Abstract

Digital pathology images potentially contain novel patterns that may be perceived by modern deep learning models, but not by humans. Prior unsupervised pattern recognition approaches have been used to reveal prognostically relevant subtypes of glioblastoma (PMID: 28984190) and to segment breast density (PMID: 26915120), and may complement supervised machine learning models trained on labeled data. In the Cancer Prevention Study II (CPS-II) cohort (PMID: 12015775), high-resolution, digitized hematoxylin and eosin diagnostic slides are available for approximately 1,700 breast cancer cases, providing an opportunity to perform unsupervised pattern recognition image analysis for epidemiologic breast cancer studies. Given the size of the dataset and the complexity of the models, we constructed an end-to-end analytical pipeline, including preprocessing, feature engineering, and clustering, using cloud-based technologies that enable analysis at scale.

Prior to training the unsupervised models, we encountered issues converting the raw images with open-source software. Specifically, OpenSlide could not open the Leica Versa SCN files due to their proprietary format, while Bio-Formats inverted colors. To fix these issues, we modified the Bio-Formats library to successfully convert the files to TIFF format. Because this issue likely affects other researchers, we are in discussions to provide the fix under a public license.

The TIFF-formatted images were then denoised through color normalization, to reduce hue variance, and artifact detection, to remove unwanted features such as pathologist annotations. Because analyzing a full image is computationally expensive, each image was padded with white space to ensure divisibility and broken into nine tiles of a predefined size. To further reduce computation time, uninformative tiles were filtered out based on a predetermined threshold of artifact and white-space composition. The remaining tiles were input to the unsupervised models.
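The padding, tiling, and white-space filtering steps described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the tile size (64 px) and the near-white pixel and tile thresholds are hypothetical values chosen for the example, and the real pipeline also filters on artifact composition, which is omitted here.

```python
import numpy as np

TILE = 64           # hypothetical tile edge length; the pipeline uses a predefined size
WHITE_THRESH = 0.9  # hypothetical cutoff: drop tiles that are >90% near-white

def pad_to_multiple(img, size):
    """Pad an H x W x 3 image with white so both dimensions divide evenly by `size`."""
    h, w = img.shape[:2]
    pad_h = (-h) % size
    pad_w = (-w) % size
    return np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)),
                  mode="constant", constant_values=255)

def tile_and_filter(img, size=TILE, white_thresh=WHITE_THRESH):
    """Split a padded slide image into size x size tiles, keeping informative ones."""
    img = pad_to_multiple(img, size)
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, h, size):
        for x in range(0, w, size):
            tile = img[y:y + size, x:x + size]
            # Fraction of pixels that are near-white (background or padding)
            white_frac = np.mean(np.all(tile > 230, axis=-1))
            if white_frac < white_thresh:
                tiles.append(tile)
    return tiles
```

On a mostly white synthetic image with one tissue-like dark region, only the tile covering the dark region survives the filter, mirroring how uninformative background tiles are discarded before model training.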
We used convolutional autoencoders, specifically a modified VGG-16 model without pretrained weights, together with a deep embedded clustering algorithm. These models learn representations of the images, called ‘feature vectors’, that encode the images’ salient patterns. The final model was chosen through iterative testing on a subsample of 100 images (N=21,472 tiles) and performance comparisons among various VGG-inspired autoencoders. The feature vectors were clustered with K-means to summarize the information in a format suitable for statistical analyses. Our initial results show that the system captures macro-scale tissue patterns at lower magnifications (1x and 5x) and produces clusters that can be integrated into epidemiological studies of breast cancer etiology and prognosis.

Citation Format: Jacob L. Evans, William Seo, Mary Macheski-Preston, Michelle Fritz, Samantha Puvanesarajah, James Hodge, Ted Gansler, Susan Gapstur, Mia M. Gaudet, Michelle Yi. A scalable, cloud-based, unsupervised deep learning system for identification, extraction, and summarization of potentially imperceptible patterns in whole-slide images of breast cancer tissue [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 1635.
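The clustering step that summarizes the learned feature vectors can be sketched with a minimal Lloyd's K-means in numpy. This is an illustrative stand-in, not the study's code: the feature dimensionality, number of clusters, and initialization are assumptions, and the actual pipeline clusters autoencoder-derived embeddings rather than the toy vectors used here.

```python
import numpy as np

def kmeans(features, k, n_iter=50, seed=0):
    """Minimal Lloyd's K-means: group tile feature vectors into k clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct feature vectors
    centroids = features[rng.choice(len(features), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each vector to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned vectors
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return labels, centroids
```

On two well-separated synthetic groups of feature vectors, the algorithm recovers the two groups; the resulting cluster labels are the per-tile summary that can then be carried into downstream statistical analyses.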