Abstract

An imbalanced classification problem arises when samples are unevenly distributed among classes. Small, imbalanced training datasets pose a notable challenge in machine learning, especially in domains such as bioinformatics and medical research. They can produce biased models that perform poorly on under-represented classes and overemphasize specific features, failing to capture the genuine patterns in the data. This study proposes a feature selection approach based on gene connectivity, combined with a class balancing technique, for building a machine learning model from imbalanced gene expression data. Rheumatic arthritis data comprising 28 normal samples and 152 rheumatic samples were used to test the proposed model. Weighted gene co-expression network analysis (WGCNA) reduced the feature set from 27,991 original features to 601. The reduced features were used to build machine learning classification models, first with the imbalanced classes and then with classes balanced using the Spread Sub-Sample technique. Two classifiers reported higher accuracy with the imbalanced data than with the balanced data, an indication that most classifiers are biased when trained on an imbalanced dataset. Logistic regression returned an improved accuracy of 95%. The other two algorithms used in this study, decision tree and IBK, returned reduced accuracies of 81% and 91%, respectively. In conclusion, feature selection and class balancing are important for reducing model execution time and improving accuracy, especially for RNASeq gene expression data.
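
The class balancing step can be illustrated with a minimal sketch. The abstract does not say how Spread Sub-Sample was configured, so this assumes it behaves like random undersampling with a maximum class spread of 1.0: every class is capped at the size of the rarest one. The function name `spread_subsample`, the `max_spread` parameter, and the integer placeholder samples are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def spread_subsample(samples, labels, max_spread=1.0, seed=0):
    """Randomly undersample so no class exceeds max_spread times the rarest class.

    Hypothetical helper: an assumed reading of "Spread Sub-Sample",
    not the study's actual implementation.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Cap every class at max_spread times the size of the rarest class.
    smallest = min(len(indices) for indices in by_class.values())
    cap = int(smallest * max_spread)

    kept = []
    for indices in by_class.values():
        if len(indices) > cap:
            indices = rng.sample(indices, cap)  # draw without replacement
        kept.extend(indices)

    kept.sort()  # preserve the original sample order
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy labels mirroring the class counts reported in the abstract.
labels = ["normal"] * 28 + ["rheumatic"] * 152
samples = list(range(len(labels)))  # placeholders for expression profiles

balanced_samples, balanced_labels = spread_subsample(samples, labels)
counts = {c: balanced_labels.count(c) for c in set(balanced_labels)}
```

With the reported class sizes, this setting keeps all 28 normal samples and a random 28 of the 152 rheumatic samples, so the balanced training set has 56 samples. A larger `max_spread` would discard fewer majority-class samples at the cost of leaving some imbalance.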
