Active Learning for Biomedical Text Classification Based on Automatically Generated Regular Expressions

Christopher A Flores, Rosa L Figueroa, Jorge E Pezoa

doi:10.1109/access.2021.3064000


Open Access

https://doi.org/10.1109/access.2021.3064000


Journal: IEEE Access	Publication Date: Jan 1, 2021
Citations: 56	License type: CC BY 4.0

Affiliation: University of Concepción

Abstract

Biomedical text classification algorithms, which currently support clinical decision-making processes, call for expensive training texts due to the low availability of labeled corpora and the cost of manual annotation by specialized professionals. The active learning (AL) approach to classification substantially lessens this cost by reducing the number of labeled documents required to achieve a specified performance. This article introduces a query strategy and a stopping criterion that transform CREGEX, a regular-expression-based text classification algorithm, into an AL biomedical text classifier. The query strategy samples the training dataset, trading off the greedy learning achieved by the regular expressions classification precision against the conservative learning induced by text sequence alignment classification. A sustained reduction in the variance of the query strategy scores is used as the stopping criterion. The AL classifier was compared with Support Vector Machine (SVM), Naïve Bayes (NB), and a classifier based on Bidirectional Encoder Representations from Transformers (BERT), using three datasets with biomedical information in Spanish on smoking habits, obesity, and obesity types. The learning curve results indicate that AL in CREGEX efficiently reduced the number of training examples needed to match the performance of the other classifiers, obtaining areas under the learning curve greater than 85% in all cases. With the stopping criterion applied, the AL process used, on average, approximately 32% to 50% of the total training examples, with performance differing from the maximum value of the learning curve by no more than 2%. This performance demonstrates the effectiveness of using AL in a biomedical text classifier based on regular expressions, which is attributable to the ability of such expressions to represent intricate sequential patterns in the training texts considered most informative.
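The abstract states only that a sustained reduction in the variance of the query strategy scores triggers the stop; it does not say how "sustained" is measured. The sketch below is one possible reading, not the authors' implementation. The function name `should_stop`, the `patience` parameter, and the use of population variance are all assumptions made for illustration.

```python
from statistics import pvariance
from typing import List, Sequence


def should_stop(score_history: Sequence[Sequence[float]], patience: int = 3) -> bool:
    """Return True once the score variance has fallen `patience` times in a row.

    score_history[i] holds the query strategy scores computed at AL iteration i.
    `patience` is a hypothetical knob; the source does not define one.
    """
    # One variance value per AL iteration (population variance is an assumption).
    variances: List[float] = [pvariance(scores) for scores in score_history]
    if len(variances) <= patience:
        return False  # not enough iterations to observe a sustained trend
    recent = variances[-(patience + 1):]
    # "Sustained" is read here as strictly decreasing over the last `patience` steps.
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```

In an AL loop, such a check would typically run after each batch of queried documents is labeled, so that annotation ends as soon as the scores stop discriminating between the remaining candidates.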

Highlights

Text classification has become one of the most widely used machine learning techniques to organize the growing accumulation of unstructured digital information [1]–[3].
The active learning (AL) query strategy samples the training dataset, trading off the greedy learning achieved by the regular expressions classification precision against the conservative learning induced by text sequence alignment classification.
It has been shown that Bidirectional Encoder Representations from Transformers (BERT) may not represent numbers properly, whereas regular expressions can represent complex sequential patterns, including numerical attributes [18], [22], [23].
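To make the point about numerical attributes concrete, here is a small hand-written regular expression that captures a body mass index value from Spanish clinical text. It is purely illustrative: the pattern, the name `BMI_PATTERN`, the helper `extract_bmi`, and the sample phrasing are assumptions, and this is not an expression generated by CREGEX.

```python
import re
from typing import Optional

# Hypothetical pattern written by hand for illustration only.
# It matches "IMC" or "índice de masa corporal", an optional connector,
# and a two-digit number with an optional decimal part (comma or dot).
BMI_PATTERN = re.compile(
    r"\b(?:IMC|índice de masa corporal)\s*(?:de|:|=)?\s*(\d{2}(?:[.,]\d{1,2})?)",
    re.IGNORECASE,
)


def extract_bmi(text: str) -> Optional[float]:
    """Return the first BMI value mentioned in `text`, or None if absent."""
    match = BMI_PATTERN.search(text)
    if match is None:
        return None
    # Spanish notes often use a decimal comma; normalize before converting.
    return float(match.group(1).replace(",", "."))
```

The captured number stays available as a numeric value, so a downstream rule could compare it against a threshold rather than treating it as an opaque token.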

Summary

Introduction

Text classification has become one of the most widely used machine learning techniques to organize the growing accumulation of unstructured digital information [1]–[3]. Classification algorithms such as Support Vector Machine (SVM) and Naïve Bayes (NB) have been extensively used due to the simplicity of their implementation and the accurate results obtained [4]. Resources and specialized annotators are needed to carry out the labeling tasks [8]. In this scenario, the active learning (AL) approach to classification offers an alternative for reducing annotation efforts.



Similar Papers

A Generic Semi-Supervised and Active Learning Framework for Biomedical Text Classification.
Christopher A Flores ... Rodrigo Verschae
Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference | VOL. 2022
11 Jul 2022

When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification
Xuedong Li ... Qiaozhu Mei
BMC Medical Informatics and Decision Making | VOL. 21
01 Nov 2021

Bidirectional encoders to state-of-the-art: a review of BERT and its transformative impact on natural language processing
Rajesh Gupta
Информатика. Экономика. Управление - Informatics. Economics. Management | VOL. 3
02 Mar 2024

5 - Utilizing BERT for biomedical and clinical text mining
Runjie Zhu ... Jimmy Xiangji Huang
Data Analytics in Biomedical Engineering and Healthcare | VOL. -
23 Oct 2020
