Abstract

Background

Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. The Synthetic Minority Oversampling Technique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.

Results

While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is reduced by performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data.

Conclusions

In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
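To make the mechanism under discussion concrete, the following is a minimal numpy sketch of the SMOTE interpolation step, not the authors' implementation: the function name smote_sketch, the simulated Gaussian data and the parameter values are illustrative assumptions. Each synthetic sample is drawn on the segment between a minority-class sample and one of its k nearest minority neighbors, with neighbors found by Euclidean distance. The final lines illustrate the property stated above: the class-specific mean is essentially preserved, while the variability of the synthetic samples is smaller.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling sketch (illustrative only).

    X_min : (n, p) array of minority-class samples.
    n_new : number of synthetic samples to generate.
    k     : number of nearest minority-class neighbors to interpolate with.
    """
    rng = np.random.default_rng(rng)
    n, p = X_min.shape
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # exclude each sample itself
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest minority neighbors
    synthetic = np.empty((n_new, p))
    for i in range(n_new):
        a = rng.integers(n)                # a random minority sample...
        b = nn[a, rng.integers(k)]         # ...and one of its k neighbors
        gap = rng.random()                 # interpolation weight in [0, 1]
        synthetic[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synthetic

# Simulated minority class with many more variables than samples (p >> n).
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 1000))
X_syn = smote_sketch(X_min, n_new=200, k=5, rng=1)
print(X_min.mean(), X_syn.mean())          # means are close
print(X_min.var(),  X_syn.var())           # synthetic variance is smaller
```

Because each synthetic sample is a convex combination of two existing minority samples, it can only shrink the spread around the class mean, and every synthetic sample is, by construction, correlated with the original samples it interpolates.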

Highlights

  • Classification using class-imbalanced data is biased in favor of the majority class

  • The Synthetic Minority Oversampling Technique (SMOTE) did not seem to impact the performance of some classifiers, while it reduced the bias towards the majority class for the nearest neighbor classifier with k neighbors (k-NN), penalized logistic regression (PLR-L1 and PLR-L2) and prediction analysis of microarrays (PAM), performing well when the sample size was small (n = 40) and increasing the overall predictive accuracy (PA) otherwise

  • A similar but attenuated effect was observed for the other classifiers (classification and regression trees (CART), support vector machines (SVM) and random forests (RF)), where SMOTE decreased the difference between class-specific PA, most notably for large sample sizes, but did not remove it


Introduction

Classification using class-imbalanced data is biased in favor of the majority class, and the bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem of learning from class-imbalanced data has been receiving growing attention in many different fields [2]. Data are nowadays increasingly often high-dimensional. High-throughput technologies are popular in the biomedical field, where it is possible to measure simultaneously the expression of all the known genes (>20,000), while the number of subjects included in a study is rarely larger than a few hundred. Many papers attempted to develop classification rules using high-dimensional gene expression data that were class-imbalanced (see, for example, [4,5,6]).

