Abstract

Background: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model.

Results: We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples at each iteration of active learning), there was no clear winner among the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples at each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that, compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes.

Conclusions: Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.
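
As background for the uncertainty-sampling strategies named above, the sketch below shows how marginal (margin) and ratio uncertainty scores are commonly computed from a classifier's softmax output; the function names and NumPy-based interface are illustrative assumptions rather than the study's actual implementation.

```python
import numpy as np

def margin_uncertainty(probs: np.ndarray) -> np.ndarray:
    """Marginal (margin) uncertainty: difference between the two most
    probable classes; smaller margins mean more uncertain samples."""
    sorted_probs = np.sort(probs, axis=1)[:, ::-1]  # descending per sample
    return sorted_probs[:, 0] - sorted_probs[:, 1]

def ratio_uncertainty(probs: np.ndarray) -> np.ndarray:
    """Ratio uncertainty: second most probable class divided by the most
    probable class; values near 1 mean more uncertain samples."""
    sorted_probs = np.sort(probs, axis=1)[:, ::-1]
    return sorted_probs[:, 1] / np.clip(sorted_probs[:, 0], 1e-12, None)

def select_by_margin(unlabelled_probs: np.ndarray, k: int) -> np.ndarray:
    """Pick the k most uncertain unlabelled samples under margin sampling.
    `unlabelled_probs` (n_samples x n_classes) would come from the CNN's
    softmax output over the unlabelled pool."""
    margins = margin_uncertainty(unlabelled_probs)
    return np.argsort(margins)[:k]  # smallest margins first
```

Ratio sampling would instead select the samples whose ratio scores are largest, i.e., closest to 1.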

Highlights

  • Automated text classification has many important applications in the clinical setting; obtaining labelled data for training machine learning and deep learning models is often difficult and expensive

  • After accounting for the confidence intervals, all active learning strategies implemented in this paper except for the diversity-based methods performed significantly better than the baseline of no active learning, i.e., random sampling

  • These results suggest that the document embeddings generated by the Convolutional Neural Networks (CNNs), which are optimized for classification, may not adequately capture the information necessary to distinguish informative documents
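
To make the last highlight concrete, below is a minimal sketch of one common embedding-based diversity-sampling formulation, in which the unlabelled pool's document embeddings are clustered and one representative is drawn per cluster; this is a generic illustration under assumed names (e.g., `diversity_sample`), not the specific diversity strategy evaluated in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_sample(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Diversity sampling sketch: cluster the unlabelled pool's document
    embeddings (e.g., the CNN's penultimate-layer outputs) into k clusters
    and select the sample nearest each centroid, so the chosen batch spreads
    across the embedding space rather than concentrating on uncertain regions."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    selected = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])
    return np.array(selected)
```

If the embeddings are optimized purely for classification, as the highlight notes, geometrically diverse samples need not be the most informative ones to label.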

Summary

Introduction

Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning (DL) models is often difficult and expensive. A common drawback of DL models is that they tend to require a large amount of training data to achieve high performance. This is a significant problem in clinical applications, where obtaining gold-standard labels is difficult and subject to constraints. Compared to randomly labelling additional data, active learning enables the model to reach higher performance using fewer additional labelled samples, thereby increasing the efficiency and effectiveness of human annotators [4]. This approach is especially useful for applications such as clinical text classification, where annotated data is expensive and time-consuming to obtain.
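
To illustrate the workflow described above, the following sketch outlines a generic pool-based active learning loop into which strategies such as uncertainty sampling can be plugged; the `fit`-style model interface, the `score_fn` argument, and the simulated oracle labels are assumptions for illustration, not the paper's code.

```python
import numpy as np

def active_learning_loop(model, X_labelled, y_labelled, X_pool, y_oracle,
                         batch_size, n_iterations, score_fn):
    """Pool-based active learning sketch: repeatedly (1) train on the current
    labelled set, (2) score the unlabelled pool for informativeness (higher
    score = more informative), (3) send the top-scoring samples to the
    annotator (simulated here by `y_oracle`), and (4) add them to the
    labelled set before the next iteration."""
    pool_idx = np.arange(len(X_pool))
    for _ in range(n_iterations):
        model.fit(X_labelled, y_labelled)                    # (1) retrain
        scores = score_fn(model, X_pool[pool_idx])           # (2) score pool
        chosen = pool_idx[np.argsort(scores)[-batch_size:]]  # (3) most informative
        X_labelled = np.concatenate([X_labelled, X_pool[chosen]])
        y_labelled = np.concatenate([y_labelled, y_oracle[chosen]])
        pool_idx = np.setdiff1d(pool_idx, chosen)            # (4) shrink the pool
    return model, X_labelled, y_labelled
```

Random sampling, the baseline in this study, corresponds to replacing the scoring step with a uniformly random choice of `batch_size` samples from the pool.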
