Picasso: An Open-Source Machine Learning Schema for Annotating Images in Hematology

Vikram Dhillon, Suresh Kumar Balasubramanian

doi:10.1182/blood-2022-171089

Abstract

Background: Prediction models that support clinical decision-making are an integral part of medicine. Image recognition is essential to diagnosis in many diseases, especially malignant hematologic diseases, where histopathological images are still reviewed manually and there is an opportunity to automate image analysis with diagnosis-supporting tools. Machine learning (ML) is a process in which an algorithm creates a predictive model by learning from training data and uncovering relationships between input variables (X) and output variables (Y). This requires a training dataset of labeled examples. Once learning has occurred, the resulting model can make predictions on new input that it has never seen before. To train a machine learning model appropriately, all input should be annotated in a standard manner, and this process is very time-consuming. One way to simplify it is to use a schema: a set of rules that must be followed each time the relevant features of an image are described. For every training image in a dataset, the relevant components are labeled manually to fit the schema and reviewed by trained human examiners, which makes the process tedious, time-consuming, and difficult to standardize. We created a programmatic approach to annotation in which a machine learning algorithm detects significant features in acute myeloid leukemia (AML) images, annotates the images with features that fit a schema, and feeds them forward into training data. Moreover, we built this on an open-source platform that can be widely used in resource-limited settings across the globe.

Methods: A supervised machine learning approach with a convolutional neural network (CNN) was used for image processing, based on a dataset of 270 pathological images (confirmed AML) and 30 non-pathological images (no AML) from the public Munich AML Morphology Dataset.
The pathological images were from peripheral blood smears of patients diagnosed with AML at Munich University Hospital between 2014 and 2017. The non-pathological controls were taken from patients without hematological malignancy. Initially, for training, all annotation features from the Munich AML dataset were included for both pathological and non-pathological images. The neural network was written and trained with open-source software, TensorFlow with the Keras library, using parallel threading on a multicore GPU machine. The resulting predictive model answers two questions: does a given image contain cells with blast-like character, and does it contain cells that belong to a non-pathological blood smear?

Results: The neural network was trained on 300 images in total. For external validation, we used a further subset of 100 unlabeled images from the Munich AML Morphology Dataset. Our model predicted blast-like features accurately in 89% of the new images; 6% of the new images were unclassifiable, the false positive rate was 3%, and the false negative rate was 2%. After initial prediction, the neural network accurately annotated unlabeled images 86% of the time in pathological samples and 96% of the time in non-pathological samples. Our training model achieved an AUC of 0.82.

Conclusion: A CNN trained on a modest dataset with supervised learning, enhanced with ensemble learning and K-fold cross-validation, can recognize features such as blast cells in histopathological images and label images with a high degree of accuracy. In a data-driven machine learning algorithm such as a neural network, classification performance increases significantly as more training images become available. Therefore, a more extensive training dataset and more robust hardware are necessary to generate a more sophisticated predictive model. Using CNN-enhancing methodologies can allow for model training in a resource-limited setting.
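As an illustration of the K-fold cross-validation mentioned in the conclusion, the following is a minimal sketch in plain Python of how a 300-image training set might be partitioned into folds. The number of folds (k = 5), the fixed shuffle seed, and the helper name `k_fold_indices` are assumptions made for this example; the abstract does not state the value of k or how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# 270 pathological + 30 non-pathological images, as in the Methods.
N_IMAGES = 300
K = 5  # assumed; the abstract does not report k

folds = list(k_fold_indices(N_IMAGES, K))
for number, (train, validation) in enumerate(folds, start=1):
    print(f"fold {number}: {len(train)} training, {len(validation)} validation")
```

Each image serves as validation data exactly once, so every image contributes to the performance estimate without being scored by a model that was trained on it. With only 30 non-pathological images, a stratified split that preserves the 9:1 class ratio in each fold would probably be preferable in practice.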
Future work will focus on using a more extensive pre-trained database to evaluate the performance of our network in a real-world setting. The annotation framework can be expanded to include disease-associated features for use in other domains, such as hematological education, patient resources, and patient education. In the future, this open-source platform can support several niches that are currently served only by expensive commercial applications.

Figure 1
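To make the idea of an annotation schema concrete, the sketch below shows one way a fixed set of rules could be enforced on each image annotation in plain Python. The field names, the label vocabulary, and the `validate_annotation` helper are hypothetical and chosen only for illustration; the abstract does not publish the actual Picasso schema.

```python
from dataclasses import dataclass

# Hypothetical label vocabulary; the real schema is not given in the abstract.
ALLOWED_LABELS = {"blast_like", "non_pathological", "unclassifiable"}


@dataclass(frozen=True)
class Annotation:
    image_id: str
    label: str
    confidence: float        # model score, expected in [0, 1]
    reviewed_by_human: bool


def validate_annotation(annotation):
    """Return a list of schema violations (empty if the annotation conforms)."""
    errors = []
    if not annotation.image_id:
        errors.append("image_id must be non-empty")
    if annotation.label not in ALLOWED_LABELS:
        errors.append(f"unknown label: {annotation.label!r}")
    if not 0.0 <= annotation.confidence <= 1.0:
        errors.append("confidence must lie in [0, 1]")
    return errors


good = Annotation("img_0001", "blast_like", 0.93, reviewed_by_human=False)
bad = Annotation("img_0002", "blast", 1.4, reviewed_by_human=False)

print(validate_annotation(good))
print(validate_annotation(bad))
```

Under a design like this, expanding the framework to other disease-associated features would amount to adding labels or fields to the schema, while the validation step applied to every annotation stays the same.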
