Abstract
Biomedical text classification algorithms, which currently support clinical decision-making processes, require training texts that are expensive to obtain because labeled corpora are scarce and manual annotation by specialized professionals is costly. The active learning (AL) approach to classification substantially lowers this cost by reducing the number of labeled documents required to achieve a specified performance. This article introduces a query strategy and a stopping criterion that transform CREGEX, a regular-expression-based text classification algorithm, into an AL biomedical text classifier. The query strategy samples the training dataset, trading off the greedy learning achieved through the classification precision of regular expressions against the conservative learning induced by classification based on text sequence alignment. The sustained reduction in the variance of the query strategy scores is used as a stopping criterion. The AL classifier was compared with Support Vector Machine (SVM), Naïve Bayes (NB), and a classifier based on Bidirectional Encoder Representations from Transformers (BERT), using three Spanish-language biomedical datasets on smoking habits, obesity, and obesity types. The learning curve results indicate that AL in CREGEX efficiently reduced the number of training examples needed to match the performance of the other classifiers, obtaining areas under the learning curve greater than 85% in all cases. With the stopping criterion applied, the AL process used, on average, approximately 32% to 50% of the total training examples, while performance differed from the maximum value of the learning curve by no more than 2%. This performance demonstrates the effectiveness of using AL in a biomedical text classifier based on regular expressions, which is attributable to the ability of such expressions to represent intricate sequential patterns in the training texts considered most informative.
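As an illustration of the stopping criterion described in the abstract, the following minimal sketch halts AL once the variance of the query strategy scores has fallen over several consecutive iterations. The abstract does not specify how "sustained" is measured, so the function name `should_stop`, the `patience` parameter, and the strictly-decreasing rule are assumptions made for this example and are not the authors' implementation.

```python
from statistics import pvariance
from typing import Sequence


def should_stop(score_history: Sequence[Sequence[float]], patience: int = 3) -> bool:
    """Hypothetical stopping rule: True when the variance of the query
    scores has strictly decreased over the last `patience` AL iterations.

    score_history holds one sequence of query strategy scores per iteration.
    """
    # We need patience + 1 iterations to observe `patience` consecutive drops.
    if len(score_history) < patience + 1:
        return False
    recent = score_history[-(patience + 1):]
    variances = [pvariance(scores) for scores in recent]
    # Assumption: "sustained reduction" means every step lowers the variance.
    return all(later < earlier for earlier, later in zip(variances, variances[1:]))
```

In an AL loop, the scores assigned to the unlabeled pool would be appended to `score_history` after each query round, and labeling would end at the first round for which `should_stop` returns `True`. A larger `patience` makes the rule more conservative, at the cost of labeling more examples.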
Highlights
Text classification has become one of the most widely used machine learning techniques to organize the growing accumulation of unstructured digital information [1]–[3]
The active learning (AL) query strategy samples the training dataset, trading off the greedy learning achieved through the classification precision of regular expressions against the conservative learning induced by classification based on text sequence alignment
It has been shown that Bidirectional Encoder Representations from Transformers (BERT) may not represent numbers properly, whereas regular expressions can represent complex sequential patterns, including numerical attributes [18], [22], [23]
Summary
Text classification has become one of the most widely used machine learning techniques to organize the growing accumulation of unstructured digital information [1]–[3]. Classification algorithms such as Support Vector Machine (SVM) and Naïve Bayes (NB) have been extensively used due to the simplicity of their implementation and the accurate results obtained [4]. Resources and specialized annotators are needed to carry out the labeling tasks [8]. In this scenario, the active learning (AL) approach to classification offers an alternative for reducing annotation efforts.