Abstract

Background: PAM, a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. NSC methods base their classification rules on shrunken centroids; in practice, the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate.

Results: We show that when data are class-imbalanced, all three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables is larger, the class imbalance is larger, and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers, we propose estimating the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means).

Conclusions: The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy based on minimizing the overall error rate whenever NSC classifiers are biased towards the majority class, and that it includes far fewer variables in the classifier than the original approach.
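The g-means criterion proposed in the Results section can be illustrated with a short sketch. This is our own minimal implementation of the metric, not the authors' code; the function name `g_means` is an assumption:

```python
import math

def g_means(y_true, y_pred):
    """Geometric mean of the class-specific predictive accuracies.

    Unlike overall accuracy, g-means is low whenever any single class
    is predicted poorly, so it penalizes classifiers that ignore the
    minority class on imbalanced data.
    """
    classes = sorted(set(y_true))
    prod = 1.0
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        prod *= correct / len(idx)  # class-specific accuracy
    return prod ** (1.0 / len(classes))

# A majority-class classifier on 9:1 imbalanced data: overall accuracy
# is 0.9, but g-means is 0 because the minority class is never predicted.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(g_means(y_true, y_pred))  # -> 0.0
```

Choosing the shrinkage threshold by maximizing this CV quantity, rather than minimizing the overall error rate, is what keeps the classifier from collapsing onto the majority class.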

Highlights

  • Prediction analysis of microarrays (PAM), a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data

  • We present a modified approach for threshold estimation aimed at reducing the class-imbalance problem for NSC classifiers, and show its effectiveness on simulated and real high-dimensional class-imbalanced data

  • For the sake of simplicity, let us focus on diagonal linear discriminant analysis (DLDA); we consider a two-class classification problem and assume that there is no real difference between the classes and that class 1 is the minority class
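The DLDA classifier mentioned in the last highlight can be sketched as follows. This is a simplified illustration under our own naming, not the authors' implementation: DLDA keeps per-class centroids and a pooled per-variable variance (the covariance matrix is assumed diagonal), and assigns a sample to the class with the nearest variance-scaled centroid.

```python
def dlda_fit(X, y):
    """Fit DLDA: per-class centroids plus pooled per-variable variances
    (off-diagonal covariances are assumed to be zero)."""
    classes = sorted(set(y))
    p = len(X[0])
    centroids = {}
    for c in classes:
        rows = [x for x, yi in zip(X, y) if yi == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    # pooled within-class variance for each variable
    var = [0.0] * p
    for x, yi in zip(X, y):
        for j in range(p):
            var[j] += (x[j] - centroids[yi][j]) ** 2
    var = [v / (len(X) - len(classes)) for v in var]
    return classes, centroids, var

def dlda_predict(model, x):
    """Assign x to the class whose centroid is nearest in
    variance-scaled (diagonal Mahalanobis) distance."""
    classes, centroids, var = model
    def dist(c):
        return sum((x[j] - centroids[c][j]) ** 2 / var[j]
                   for j in range(len(x)))
    return min(classes, key=dist)

# Two well-separated classes in two variables: DLDA recovers them.
X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 5.2]]
y = [0, 0, 1, 1]
model = dlda_fit(X, y)
print(dlda_predict(model, [0.05, 0.1]))  # -> 0
```

When there is no real difference between the classes, the estimated centroids differ only by noise, and the highlighted bias towards the majority class emerges through the threshold estimation step rather than the distance rule itself.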


Introduction

The objective of class prediction (classification) is to develop a rule based on variables measured on a group of samples with known class membership (training set), which can be used to assign class membership to new samples (test set). Nowadays classification rules are increasingly often developed using data that are high-dimensional (the number of variables greatly exceeds the number of samples) and class-imbalanced (the number of samples belonging to each class is not the same). For example, many researchers have attempted to develop gene-expression classifiers based on microarray experiments for prognostic and predictive purposes in breast cancer [2]. The NSC methods base their classification rules on shrunken centroids; in practice, the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate.
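The shrunken centroids mentioned above are typically obtained by soft-thresholding the standardized differences between each class centroid and the overall centroid; a minimal sketch of that operation, with `delta` standing for the amount of shrinkage (a simplified illustration, not the authors' code):

```python
def soft_threshold(d, delta):
    """Shrink a standardized centroid difference d toward zero by delta.

    Differences with |d| <= delta become exactly zero, so the
    corresponding variables drop out of the classification rule;
    this is how the NSC methods perform variable selection.
    """
    if d > delta:
        return d - delta
    if d < -delta:
        return d + delta
    return 0.0

# Shrinking a vector of standardized differences: with delta = 1.0,
# only the variables whose signal exceeds the threshold survive.
d = [2.5, -0.4, 0.9, -3.0]
print([soft_threshold(dj, 1.0) for dj in d])  # -> [1.5, 0.0, 0.0, -2.0]
```

Larger values of `delta` shrink more differences to zero and retain fewer variables, which is why the choice of the threshold (by CV) governs both the classifier's accuracy and its size.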


