A Comparative Analysis of Active Learning for Biomedical Text Mining

Usman Naseem, Kamran Shaukat, Matloob Khushi, Shah Khalid Khan, Mohammad Ali Moni

doi:10.3390/asi4010023

Abstract

An enormous amount of clinical free-text information, such as pathology reports, progress reports, clinical notes and discharge summaries, has been collected at hospitals and medical care clinics. These data offer an opportunity to develop many useful machine learning (ML) applications if they can be converted into a learnable structure with appropriate labels for supervised learning. The annotation of these data has to be performed by qualified clinical experts, and the resulting high cost of annotation limits their use. An underutilised machine learning technique called active learning (AL), which can label new data, is a promising candidate to address the high cost of labelling. AL has been successfully applied to labelling in speech recognition and text classification; however, there is a lack of literature investigating its use for clinical purposes. We performed a comparative investigation of various AL techniques using ML and deep learning (DL)-based strategies on three unique biomedical datasets. We investigated the random sampling (RS), least confidence (LC), informative diversity and density (IDD), margin and maximum representativeness-diversity (MRD) AL query strategies. Our experiments show that AL has the potential to significantly reduce the cost of manual labelling. Furthermore, pre-labelling performed using AL expedites the labelling process by reducing the time required for labelling.
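As a rough illustration of how three of the query strategies named above rank unlabelled samples, the following minimal sketch implements RS, LC and margin sampling using their common textbook definitions. It is not the authors' implementation: the function names, the `pool_probs` values and the batch size are hypothetical, and the IDD and MRD strategies are not covered because the abstract does not define them.

```python
import random
from typing import List, Sequence


def least_confidence_scores(probs: Sequence[Sequence[float]]) -> List[float]:
    """Uncertainty = 1 - probability of the most likely class."""
    return [1.0 - max(p) for p in probs]


def margin_scores(probs: Sequence[Sequence[float]]) -> List[float]:
    """Gap between the two most likely classes (small gap = uncertain)."""
    scores = []
    for p in probs:
        top_two = sorted(p, reverse=True)[:2]
        scores.append(top_two[0] - top_two[1])
    return scores


def select_queries(probs, strategy: str, batch_size: int, seed: int = 0) -> List[int]:
    """Return indices of the unlabelled samples to send to the annotator."""
    indices = list(range(len(probs)))
    if strategy == "rs":
        # Random sampling: the passive-learning baseline.
        return random.Random(seed).sample(indices, batch_size)
    if strategy == "lc":
        # Least confidence: query the samples with the highest uncertainty.
        scores = least_confidence_scores(probs)
        return sorted(indices, key=lambda i: scores[i], reverse=True)[:batch_size]
    if strategy == "margin":
        # Margin: query the samples with the smallest top-two gap.
        scores = margin_scores(probs)
        return sorted(indices, key=lambda i: scores[i])[:batch_size]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical predicted class probabilities for four unlabelled sentences.
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.70, 0.20, 0.10],
]
lc_picks = select_queries(pool_probs, "lc", batch_size=2)
margin_picks = select_queries(pool_probs, "margin", batch_size=2)
```

In this toy pool, both uncertainty-based strategies pick the second and third sentences but rank them differently: LC looks only at the top class probability, whereas margin sampling looks at the gap between the top two classes.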

Highlights

The widespread adoption of storage and digitisation technologies, together with the digitisation of clinical records, presents numerous data analysis opportunities. However, to reach their full potential, such analysis systems need to extract structured information from unstructured text reports
The results (Tables 2–13) show that the DDI dataset, with BERT used for feature extraction, achieves the best accuracy when a support vector machine (SVM) algorithm is applied within an active learning (AL) framework built on the maximum representativeness-diversity (MRD) query strategy
Our results show that most AL algorithms outperform the passive learning method when an equal annotation cost is assumed for each sentence

Summary

Introduction

The widespread adoption of storage and digitisation technologies, together with the digitisation of clinical records, presents numerous data analysis opportunities. However, to reach their full potential, such analysis systems need to extract structured information from unstructured text reports. An increasing volume of unstructured clinical information about patients is stored electronically by clinics and medical services. Structured data is essential for applications such as reporting, reasoning, and retrieval, for instance, malignancy observations from medical reports and death certificates [1], checking radiology reports to prevent missed fractures [2], and clinical data retrieval [3]. Recent developments in natural language processing (NLP) and information extraction (IE) have faced fundamental difficulties in adequately capturing valuable data from these free-text resources [4]. IE is a nontrivial process for extracting useful, structured data, such as patterns and other relationships, from unstructured input text.

Objectives

Methods

Conclusion