Abstract

Background: Supervised classifiers for digital pathology can improve the ability of physicians to detect and diagnose diseases such as cancer. Generating training data for classifiers is problematic, since only domain experts (e.g. pathologists) can correctly label ground truth data. Additionally, digital pathology datasets suffer from the "minority class problem", in which exemplars from the non-target class outnumber target class exemplars; this can bias the classifier and reduce accuracy. In this paper, we develop a training strategy combining active learning (AL) with class-balancing. AL identifies unlabeled samples that are "informative" (i.e. likely to increase classifier performance) for annotation, avoiding non-informative samples. This yields high accuracy with a smaller training set size compared with random learning (RL). Previous AL methods have not explicitly accounted for the minority class problem in biomedical images. Pre-specifying a target class ratio mitigates the problem of training bias. Finally, we develop a mathematical model to predict the number of annotations (cost) required to achieve balanced training classes. In addition to predicting training cost, the model reveals the theoretical properties of AL in the context of the minority class problem.

Results: Using this class-balanced AL training strategy (CBAL), we build a classifier to distinguish cancer from non-cancer regions on digitized prostate histopathology. Our dataset consists of 12,000 image regions sampled from 100 biopsies (58 prostate cancer patients). We compare CBAL against: (1) unbalanced AL (UBAL), which uses AL but ignores class ratio; (2) class-balanced RL (CBRL), which uses RL with a specific class ratio; and (3) unbalanced RL (UBRL). The CBAL-trained classifier yields 2% greater accuracy and 3% higher area under the receiver operating characteristic curve (AUC) than alternatively trained classifiers. Our cost model accurately predicts the number of annotations necessary to obtain balanced classes; the prediction is verified against empirically observed costs. Finally, we find that over-sampling the minority class yields a marginal improvement in classifier accuracy, but the improved performance comes at the expense of greater annotation cost.

Conclusions: We have combined AL with class balancing to yield a general training strategy applicable to most supervised classification problems in which the dataset is expensive to obtain and suffers from the minority class problem. An intelligent training strategy is a critical component of supervised classification, and the integration of AL with an intelligent choice of class ratios, together with a general cost model, will help researchers plan the training process more quickly and effectively.
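To make the link between class balancing and annotation cost concrete, the following is a minimal Python sketch. It is not the paper's cost model, which is not given in this summary. It assumes every inspected sample costs one annotation and that a sample is kept only while the quota for its class is unfilled. The function names `annotate_until_balanced` and `expected_random_cost` are hypothetical. The second function gives the textbook expectation for purely random sampling (the mean of a negative binomial), which holds when the target-class quota is the last one to fill.

```python
def annotate_until_balanced(labels, order, n_target, n_nontarget):
    """Walk candidates in `order`, annotating each one, and keep a sample
    only while the quota for its class is unfilled.

    labels: sequence of 0 (non-target) or 1 (target); stands in for the expert.
    Returns (kept_indices, cost), where cost counts every annotated sample,
    including those discarded because their class quota was already full.
    """
    quota = {1: n_target, 0: n_nontarget}
    kept, cost = [], 0
    for i in order:
        if quota[0] == 0 and quota[1] == 0:
            break                      # both class quotas are satisfied
        cost += 1                      # the expert had to look at this sample
        label = labels[i]
        if quota[label] > 0:
            quota[label] -= 1
            kept.append(i)
    return kept, cost


def expected_random_cost(n_target, prevalence):
    """Expected number of random draws needed to see `n_target` target-class
    samples when each draw is a target with probability `prevalence`.
    Assumption: random sampling, not active learning."""
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    return n_target / prevalence
```

Under these assumptions, if about 10% of candidate regions belong to the target class, collecting 50 target samples by random selection would be expected to require roughly 500 annotations. A selection strategy that surfaces target-class samples more often than their prevalence would lower that figure.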

Highlights

  • Supervised classifiers for digital pathology can improve the ability of physicians to detect and diagnose diseases such as cancer

  • Class ratios are addressed in this training strategy to prevent the training set from being biased toward the majority class

  • We applied these techniques to the quantitative analysis of digitized prostate tissue samples for the presence of cancer. The class-balanced AL (CBAL) training strategy yielded a classifier whose accuracy and area under the curve (AUC) were similar to those obtained with the full training set, while using fewer samples than unbalanced active learning (AL), class-balanced random learning, or unbalanced random learning


Summary

Introduction

Supervised classifiers for digital pathology can improve the ability of physicians to detect and diagnose diseases such as cancer. AL identifies unlabeled samples that are “informative” (i.e. likely to increase classifier performance) for annotation, avoiding non-informative samples. This yields high accuracy with a smaller training set size compared with random learning (RL). In this case, the goal of the supervised classifier is to identify regions of carcinoma of the prostate (CaP, the target class). CaP often appears within and around non-CaP areas, and the boundary between these regions is not always clear (even to a trained expert). These factors increase the time, effort, and overall cost associated with training a supervised classifier in the context of digital pathology. A strategy known as active learning (AL) was developed to select only “informative” exemplars for annotation [9,10].
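As an illustration of what selecting "informative" exemplars can look like, here is a minimal Python sketch of uncertainty sampling, one common AL criterion. This summary does not state which informativeness measure the authors use, so the criterion is an assumption, and the function name `most_uncertain` is hypothetical. The sketch also assumes a binary classifier that outputs a target-class probability for each unlabeled sample.

```python
def most_uncertain(probabilities, batch_size):
    """Return the indices of the `batch_size` unlabeled samples whose
    predicted target-class probability lies closest to 0.5, i.e. the
    samples the current classifier is least sure about."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),  # distance from the decision boundary
    )
    return ranked[:batch_size]


# Hypothetical classifier outputs for five unlabeled image regions.
predicted = [0.95, 0.52, 0.10, 0.45, 0.70]
query = most_uncertain(predicted, batch_size=2)  # regions to send to the pathologist
```

In this toy input, the regions scored 0.52 and 0.45 are chosen for annotation, while the confidently classified regions (0.95 and 0.10) are skipped.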

