Abstract

Background

Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. The Synthetic Minority Oversampling Technique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.

Results

While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is reduced by performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data.

Conclusions

In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
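To make the mechanism under discussion concrete, the following is a minimal numpy sketch of the SMOTE interpolation step, not the authors' implementation: the function name smote_sketch, the simulated Gaussian data and the parameter values are illustrative assumptions. Each synthetic sample is drawn on the segment between a minority-class sample and one of its k nearest minority neighbors, with neighbors found by Euclidean distance. The final lines illustrate the property stated above: the class-specific mean is essentially preserved, while the variability of the synthetic samples is smaller.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling sketch (illustrative only).

    X_min : (n, p) array of minority-class samples.
    n_new : number of synthetic samples to generate.
    k     : number of nearest minority-class neighbors to interpolate with.
    """
    rng = np.random.default_rng(rng)
    n, p = X_min.shape
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # exclude each sample itself
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest minority neighbors
    synthetic = np.empty((n_new, p))
    for i in range(n_new):
        a = rng.integers(n)                # a random minority sample...
        b = nn[a, rng.integers(k)]         # ...and one of its k neighbors
        gap = rng.random()                 # interpolation weight in [0, 1]
        synthetic[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synthetic

# Simulated minority class with many more variables than samples (p >> n).
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 1000))
X_syn = smote_sketch(X_min, n_new=200, k=5, rng=1)
print(X_min.mean(), X_syn.mean())          # means are close
print(X_min.var(),  X_syn.var())           # synthetic variance is smaller
```

Because each synthetic sample is a convex combination of two existing minority samples, it can only shrink the spread around the class mean, and every synthetic sample is, by construction, correlated with the original samples it interpolates.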

Highlights

  • Classification using class-imbalanced data is biased in favor of the majority class

  • The Synthetic Minority Oversampling Technique (SMOTE) did not seem to impact the performance of some classifiers, while it reduced the bias towards the majority class for the nearest neighbor classifier with k neighbors (k-NN), penalized logistic regression (PLR-L1 and PLR-L2) and prediction analysis of microarrays (PAM), performing well when the sample size was small (n = 40) and increasing the overall predictive accuracy (PA) otherwise

  • A similar but attenuated effect was observed for the other classifiers (classification and regression trees (CART), support vector machines (SVM) and random forests (RF)), where SMOTE decreased the difference between class-specific PA, most notably for large sample sizes, but did not remove it


Introduction

Classification using class-imbalanced data is biased in favor of the majority class, and the bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem of learning from class-imbalanced data has been receiving growing attention in many different fields [2]. Data are nowadays increasingly often high-dimensional. High-throughput technologies are popular in the biomedical field, where it is possible to measure simultaneously the expression of all the known genes (>20,000), while the number of subjects included in a study is rarely larger than a few hundred. Many papers attempted to develop classification rules using high-dimensional gene expression data that were class-imbalanced (see, for example, [4,5,6]).

