Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature

Jose Dixon, Md Rahman

doi:10.3390/make5040095

Abstract

This paper demonstrates how data preprocessing, training size variation, and subsampling change the performance metrics of imbalanced text classification. The methodology compares two supervised learning approaches to feature engineering and data preprocessing, using five machine learning (ML) classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, and statistical analysis in R with the tidyverse. The dataset consists of 1000 portable document format (PDF) files divided into five labels, drawn from the World Health Organization Coronavirus Research Downloadable Articles database (COVID-19 papers) and the PubMed Central database (non-COVID-19 papers), and is used for binary classification. Performance is measured by precision, recall, area under the receiver operating characteristic curve (ROC AUC), and accuracy. The first approach labels rows of sentences based on regular expressions; the second automatically labels sentences according to how the documents are organized into positive and negative classes. Compared with the second approach, the regular-expression approach significantly improved the performance of the imbalanced sampling techniques, as verified by a t-test on the performance metrics recorded across iterations. The study demonstrates the effectiveness of ML classifiers and sampling techniques on text classification datasets, with different performance levels and class imbalance issues observed between the manual and automatic methods of data processing.
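The two labeling approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expression, the function names, and the sample sentences are all hypothetical, since the abstract does not give the actual patterns or code. It only shows how the same sentences can end up with different class balances depending on whether labels come from the document's class or from a per-sentence pattern match.

```python
import re
from collections import Counter

# Hypothetical pattern: the paper's actual regular expressions are not
# given in the abstract, so this one is an assumption for illustration.
COVID_PATTERN = re.compile(r"\b(covid-19|sars-cov-2|coronavirus)\b", re.IGNORECASE)


def label_by_document(sentences, doc_is_positive):
    """Automatic approach: every sentence inherits its document's class."""
    label = 1 if doc_is_positive else 0
    return [label for _ in sentences]


def label_by_regex(sentences, pattern=COVID_PATTERN):
    """Regex approach: a sentence is positive only if it matches the pattern."""
    return [1 if pattern.search(sentence) else 0 for sentence in sentences]


def class_counts(labels):
    """Count how many rows fall into each class, to expose class imbalance."""
    return dict(Counter(labels))


# Made-up sentences standing in for rows extracted from one COVID-19 paper.
sentences = [
    "COVID-19 spread rapidly in 2020.",
    "Methods are described below.",
    "We thank the reviewers.",
]

auto_labels = label_by_document(sentences, doc_is_positive=True)
regex_labels = label_by_regex(sentences)

print(class_counts(auto_labels))
print(class_counts(regex_labels))
```

In this toy case the document-level labels mark every sentence positive, while the pattern-based labels mark only the sentence that mentions the topic, so the two approaches hand the classifiers and sampling techniques differently balanced training rows.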

Journal: Machine Learning and Knowledge Extraction
Publication Date: Dec 11, 2023
License type: CC BY 4.0

Similar Papers

Exploring Symmetry of Binary Classification Performance Metrics
Amalia Luque ... Alejandro Carrasco
03 Jan 2019
Symmetry | VOL. 11

Computational Identification of Lungs Cancer Causing Genes by Machine Learning (Ml) Classifiers
Muntaha Saleem ... Muhammad Sohaib Akram
30 Mar 2021
VFAST Transactions on Software Engineering | VOL. 9

Towards Predicting Student’s Dropout in University Courses Using Different Machine Learning Techniques
Janka Kabathova ... Martin Drlik
01 Apr 2021
Applied Sciences | VOL. 11

Analysis of Colorectal and Gastric Cancer Classification: A Mathematical Insight Utilizing Traditional Machine Learning Classifiers
Hari Mohan Rai ... Joon Yoo
12 Dec 2023
Mathematics | VOL. 11
