Feature selection for classification using WGCNA and Spread Sub-Sample for an imbalanced rheumatoid arthritis RNASEQ data

Consolata Gakii, Victoria Mukami, Boaz Too

doi:10.1016/j.imu.2023.101402

Open Access

https://doi.org/10.1016/j.imu.2023.101402

Journal: Informatics in Medicine Unlocked
Publication Date: Jan 1, 2023
License type: cc-by-nc-nd

Affiliation: University of Embu

Abstract

An imbalanced classification problem occurs when the distribution of samples among different classes is uneven or biased. Handling small and imbalanced training datasets poses a notable challenge in machine learning, especially in domains such as bioinformatics and medical research. These challenges can result in biased models that perform poorly on under-represented classes and overemphasize specific features, failing to capture the genuine patterns present in the data. The present study proposes a feature selection approach based on gene connectivity, combined with a class balancing technique, for building a machine learning model from imbalanced gene expression data. A rheumatoid arthritis dataset comprising 28 normal samples and 152 rheumatoid arthritis samples was used to test the proposed model. Through the weighted gene co-expression network analysis (WGCNA) approach, the features were reduced from 27,991 to 601. The reduced features were used to build machine learning classification models, first with the imbalanced classes and later with classes balanced using the Spread Sub-Sample technique. According to our findings, two classifiers reported higher accuracy with the imbalanced data than with the balanced data set, an indication that most classifiers are biased when trained on an imbalanced dataset. Logistic regression returned an improved accuracy of 95%. The other two machine learning algorithms used in this study, decision tree and IBk, returned reduced accuracies of 81% and 91%, respectively. In conclusion, feature selection and class balancing approaches are important for reducing model execution time and improving accuracy, especially for RNASeq gene expression data.
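
The class balancing step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it only approximates the idea of spread sub-sampling, in which larger classes are randomly under-sampled until no class exceeds a chosen multiple of the rarest class. The function name `spread_subsample`, the `max_spread` parameter, and the label strings are hypothetical names introduced for illustration; only the class sizes (28 and 152) come from the abstract.

```python
import random
from collections import Counter


def spread_subsample(labels, max_spread=1.0, seed=0):
    """Return sorted indices to keep so that no class has more than
    max_spread times as many samples as the rarest class.

    Hypothetical helper for illustration; not the authors' code.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    # Largest number of samples any single class is allowed to keep.
    cap = int(min(counts.values()) * max_spread)
    keep = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        if len(idx) > cap:
            # Randomly under-sample classes that exceed the cap.
            idx = rng.sample(idx, cap)
        keep.extend(idx)
    return sorted(keep)


# Class sizes from the abstract: 28 normal and 152 rheumatoid arthritis samples.
# The label strings themselves are placeholders.
labels = ["normal"] * 28 + ["RA"] * 152
kept = spread_subsample(labels, max_spread=1.0)
balanced = Counter(labels[i] for i in kept)
print(balanced)
```

With a spread of 1.0, the sketch keeps all 28 normal samples and a random 28 of the 152 rheumatoid arthritis samples, so the classifier is trained on 56 samples in total. The cost of this kind of balancing is that most majority-class samples are discarded, which is one plausible reason accuracy can drop for some classifiers after balancing.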

Similar Papers


Two novelty learning models developed based on deep cascade forest to address the environmental imbalanced issues: A case study of drinking water quality prediction
Xingguo Chen ... Da Chen
Environmental Pollution | VOL. 291
11 Sep 2021

A Method for Analyzing the Performance Impact of Imbalanced Binary Data on Machine Learning Models
Ming Zheng ... Yuhao Miao
Axioms | VOL. 11
01 Nov 2022

AI federated learning based improvised random Forest classifier with error reduction mechanism for skewed data sets
Anjali More ... Dipti Rana
International Journal of Pervasive Computing and Communications | VOL. 20
19 Aug 2022

Learning from Imbalanced Multi-label Data Sets by Using Ensemble Strategies
Fatemeh Shamsezat ... Mohammad Masoud Javidi
Computer Engineering and Applications Journal | VOL. 4
18 Feb 2015
