Screening PubMed abstracts: is class imbalance always a challenge to machine learning?

Corrado Lanera,Paola Berchialla,Ileana Baldi,Abhinav Sharma,Clara Minto,Dario Gregori

doi:10.1186/s13643-019-1245-8

Open Access

Abstract

Background: The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques and data preprocessing for class imbalance to identify the best-performing strategy to screen articles in PubMed for inclusion in systematic reviews.

Methods: We trained four binary text classifiers (support vector machines, k-nearest neighbor, random forest, and elastic-net regularized generalized linear models) in combination with four techniques for class imbalance (random undersampling and random oversampling, each with 50:50 and 35:65 positive-to-negative class ratios), with no resampling as a benchmark. We used textual data of 14 systematic reviews as case studies. The difference between the cross-validated area under the receiver operating characteristic curve (AUC-ROC) for machine learning techniques with and without preprocessing (delta AUC) was estimated within each systematic review, separately for each classifier. Meta-analytic fixed-effect models were used to pool delta AUCs separately by classifier and strategy.

Results: Cross-validated AUC-ROC for machine learning techniques (excluding k-nearest neighbor) without preprocessing was mostly above 90%. Except for k-nearest neighbor, machine learning techniques achieved the best improvement in conjunction with random oversampling 50:50 and random undersampling 35:65.

Conclusions: Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred.
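The pooling step described in the Methods can be illustrated with a minimal sketch. It assumes the common inverse-variance form of a fixed-effect model, which the abstract does not spell out, and the function name `pool_fixed_effect` and all numbers are hypothetical illustrations rather than study data.

```python
import math
from typing import Sequence, Tuple


def pool_fixed_effect(deltas: Sequence[float],
                      std_errors: Sequence[float]) -> Tuple[float, float]:
    """Inverse-variance fixed-effect pooling (an assumed estimator).

    deltas:     per-review delta AUC (AUC with resampling minus AUC without).
    std_errors: standard error of each delta.
    Returns the pooled delta and its standard error.
    """
    if not deltas or len(deltas) != len(std_errors):
        raise ValueError("need at least one delta and one standard error per delta")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    # Each review is weighted by its precision (1 / variance).
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, deltas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Made-up values for three reviews, for illustration only.
example_deltas = [0.02, 0.01, 0.03]
example_ses = [0.01, 0.01, 0.01]

pooled, pooled_se = pool_fixed_effect(example_deltas, example_ses)
low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled delta AUC = {pooled:.3f} (95% CI {low:.3f} to {high:.3f})")
```

In the study design, a pooling of this kind would be repeated separately for each classifier and each resampling strategy, with one delta AUC per systematic review.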

Highlights

The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews (SRs)
This study examines the extent to which class imbalance challenges the performance of four traditional machine learning techniques (MLT) for automatic binary text classification of PubMed abstracts
Without any balancing technique, performance was high for all classifiers except k-nearest neighbors (k-NN)

Summary

Introduction

The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews (SRs). When searching PubMed with keyword queries, researchers usually retrieve a small number of papers relevant to the review question and a much larger number of irrelevant ones. In such a situation of imbalance, most common machine learning classifiers, used to differentiate relevant and irrelevant texts without human assistance, are biased towards the majority class and perform poorly on the minority one [8, 9]. A third group of approaches is represented by ensemble methods, which apply both resampling techniques and penalties for misclassification of minority-class cases to boosting and bagging classifiers [12, 13].
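The random undersampling and oversampling strategies examined in this study (50:50 and 35:65 positive-to-negative ratios) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function name `resample_to_ratio`, its parameters, and the toy label vector are assumptions made for the example.

```python
import random
from typing import List, Sequence


def resample_to_ratio(labels: Sequence[int], positive_share: float,
                      method: str, seed: int = 0) -> List[int]:
    """Return row indices of a resampled set with the requested positive share.

    labels:         1 for relevant (minority) and 0 for irrelevant abstracts.
    positive_share: target fraction of positives, e.g. 0.50 or 0.35.
    method:         "under" drops random negatives,
                    "over" duplicates random positives.
    """
    if not 0.0 < positive_share < 1.0:
        raise ValueError("positive_share must lie strictly between 0 and 1")
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    if method == "under":
        # Keep every positive; sample negatives without replacement.
        target_neg = round(len(pos) * (1 - positive_share) / positive_share)
        neg = rng.sample(neg, min(len(neg), target_neg))
    elif method == "over":
        # Keep every negative; add positives sampled with replacement.
        target_pos = round(len(neg) * positive_share / (1 - positive_share))
        extra = max(0, target_pos - len(pos))
        pos = pos + rng.choices(pos, k=extra)
    else:
        raise ValueError("method must be 'under' or 'over'")

    indices = pos + neg
    rng.shuffle(indices)
    return indices


# Toy example: 7 relevant and 93 irrelevant abstracts (made-up counts).
labels = [1] * 7 + [0] * 93
for share in (0.50, 0.35):
    for method in ("under", "over"):
        idx = resample_to_ratio(labels, share, method)
        n_pos = sum(labels[i] for i in idx)
        print(method, share, "positives:", n_pos, "negatives:", len(idx) - n_pos)
```

In a cross-validated evaluation, resampling of this kind would normally be applied to the training folds only, so that the held-out fold keeps the original class distribution. The sketch also hints at why undersampling is cheaper: it shrinks the training set, whereas oversampling enlarges it.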

Journal: Systematic Reviews	Publication Date: Dec 1, 2019
Citations: 23	License type: open-access
