Class prediction for high-dimensional class-imbalanced data

Rok Blagus, Lara Lusa

doi:10.1186/1471-2105-11-523

Abstract

Background: The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance.

Results: Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers.

Conclusions: Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.
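The down-sizing strategy named in the abstract can be illustrated with a short sketch. It assumes that down-sizing means randomly discarding samples from the larger class until all classes have the same size as the smallest one; the function name `downsize`, the seed, and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def downsize(samples, labels, seed=0):
    """Randomly under-sample every class to the size of the smallest class.

    Illustrative sketch only: the function name and the seeding scheme
    are assumptions, not the authors' implementation.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Size of the minority class.
    n_min = min(len(idxs) for idxs in by_class.values())

    # Draw n_min indices without replacement from each class.
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_min))
    keep.sort()

    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy training set: 8 samples in Class 1 and 2 samples in Class 2.
X = [[float(i)] for i in range(10)]
y = [1] * 8 + [2] * 2

X_bal, y_bal = downsize(X, y)
print(y_bal)
```

After down-sizing, both classes contribute two samples to the training set. The price is that six of the eight majority-class samples are discarded, which is one reason the abstract restricts its recommendation to settings where the imbalance is not too severe.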

Highlights

The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples
The classifiers were developed on the training sets, while the predictive accuracy (PA overall, PA1 for Class 1 and PA2 for Class 2), the predictive values (PV1 for Class 1 and PV2 for Class 2) and the area under the receiver operating characteristic (ROC) curve (AUC) were evaluated on the test sets
The extent of the bias towards the majority class depends on the classification method, on the magnitude of the difference between classes, and on the level of class imbalance, and it is further increased when variable selection methods are used; variable normalization generally increases the bias and should be avoided, unless the class imbalance is equal in the training and test sets
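The class-specific measures listed in the highlights can be computed from true and predicted labels as in the following minimal sketch. It assumes that PA1 and PA2 are the fractions of Class 1 and Class 2 test samples that are classified correctly, and that PV1 and PV2 are the fractions of samples predicted as Class 1 and Class 2 that truly belong to that class. The function name `class_specific_metrics` and the toy labels are hypothetical, and AUC is left out because it requires continuous scores rather than class labels.

```python
def class_specific_metrics(y_true, y_pred, classes=(1, 2)):
    """Overall PA plus class-specific PA and PV for each class.

    Illustrative sketch; the definitions are the assumptions stated
    in the accompanying text.
    """
    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    metrics = {"PA": ratio(correct, len(y_true))}

    for c in classes:
        n_true = sum(t == c for t in y_true)   # test samples in class c
        n_pred = sum(p == c for p in y_pred)   # samples predicted as c
        n_hit = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        metrics[f"PA{c}"] = ratio(n_hit, n_true)
        metrics[f"PV{c}"] = ratio(n_hit, n_pred)

    return metrics


# Toy test set: 8 Class 1 samples and 2 Class 2 samples.
# The classifier assigns 9 of the 10 samples to the majority class.
y_true = [1] * 8 + [2] * 2
y_pred = [1] * 9 + [2]

metrics = class_specific_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the overall PA is 0.9 while PA2 is only 0.5, which shows why a single overall accuracy can hide poor prediction of the minority class.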

Summary

Introduction

The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Microarrays are frequently used for class prediction (classification). In these studies the goal is to develop a rule based on the measurements (variables) obtained from the microarrays from samples (observations) that belong to distinct and well-defined groups (classes); these rules can be used to predict the class membership of new samples for which the values of the variables are known but the class membership is unknown. Some of the classification methods most frequently used for microarray data are discriminant analysis methods, nearest neighbor (k-NN, [6]) and nearest centroid classifiers [7], classification trees [8], random forests (RF, [9]) and support vector machines (SVM, [10]) (see [11] or [12] for an introduction to these methods).
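To make the idea of a classification rule concrete, the sketch below implements a plain nearest centroid classifier: each class is summarized by the mean of its training samples, and a new sample is assigned to the class with the closest mean. This is a simplified illustration using Euclidean distance, not necessarily the variant cited as [7]; the function names `fit_centroids` and `predict`, as well as the toy data, are hypothetical.

```python
import math


def fit_centroids(X, y):
    """Compute the per-class mean of each variable from training data."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in sums[label]]
        for label in sums
    }


def predict(centroids, row):
    """Assign a new sample to the class whose centroid is nearest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


# Toy training set with two variables: 3 samples in Class 1, 1 in Class 2.
X_train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 1.1]]
y_train = [1, 1, 1, 2]

centroids = fit_centroids(X_train, y_train)
print(predict(centroids, [0.9, 1.0]))
print(predict(centroids, [0.0, 0.0]))
```

With real microarray data the number of variables would be far larger than the number of samples, and the rule would usually be combined with variable selection.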

Journal: BMC Bioinformatics
Publication Date: Oct 20, 2010
Citations: 232
License type: CC BY 2.0
