Abstract

For statistical classification problems where the total sample size is only slightly greater than the feature dimension, regularized statistical discriminant rules may reduce classification error rates. We review ten dispersion-matrix regularization approaches: four for the pooled sample covariance matrix, four for the inverse pooled sample covariance matrix, and two for a diagonal covariance matrix, all for use in Anderson’s (1951) linear discriminant function (LDF). We compare these regularized classifiers against the traditional LDF for a variety of parameter configurations and use the estimated expected error rate (EER) to assess performance. We also apply the regularized LDFs to a well-known real-data example on colon cancer. We found that no regularized classifier uniformly outperformed the others. However, the more contemporary classifiers (e.g., Thomaz and Gillies, 2005; Tong et al., 2012; and Xu et al., 2009) tended to outperform the older ones, and certain simple methods (e.g., Pang et al., 2009; Thomaz and Gillies, 2005; and Tong et al., 2012) performed very well, calling into question the need for involved cross-validation when estimating regularization parameters. Nonetheless, an older regularized classifier proposed by Smidt and McDonald (1976) yielded consistently low misclassification rates across all scenarios, regardless of the shape of the true covariance matrix. Finally, our simulations showed that regularized classifiers relying primarily on asymptotic approximations with respect to the training sample size rarely outperformed the traditional LDF and are thus not recommended. We discuss our results as they pertain to the effect of high dimension, and offer general guidelines for choosing a regularization method for poorly-posed problems.
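To make the setting concrete, the sketch below (in Python with NumPy) builds an Anderson-type LDF after shrinking the pooled sample covariance toward a scaled identity so that it remains invertible when the sample size barely exceeds the dimension. This is a generic ridge-type illustration only; the mixing weight `gamma` and the identity shrinkage target are assumptions for the example, not any specific estimator from the survey.

```python
import numpy as np

def regularized_ldf(X1, X2, gamma=0.1):
    """Anderson-type LDF with a ridge-regularized pooled covariance.

    Shrinks the pooled sample covariance toward a scaled identity,
    S_reg = (1 - gamma) * S + gamma * (trace(S)/p) * I,
    so the inverse is stable even when n1 + n2 - 2 is close to p.
    The weight `gamma` is an assumed tuning parameter for illustration.
    """
    n1, p = X1.shape
    n2 = X2.shape[0]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    S_reg = (1 - gamma) * S + gamma * (np.trace(S) / p) * np.eye(p)
    w = np.linalg.solve(S_reg, m1 - m2)   # discriminant direction
    c = w @ (m1 + m2) / 2                 # midpoint cutoff
    # assign x to class 1 if w @ x exceeds the cutoff, else class 2
    return lambda x: np.where(x @ w > c, 1, 2)
```

With, say, 20 observations per class and p = 30 features, the pooled covariance is singular, yet the shrunken version can still be inverted and used in the usual linear rule.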

Highlights

  • For classification problems with high dimensional sets of features, feature selection methods are often used to locate the features most important for separating classes

  • Feature selection methods can be very successful in reducing the classification problem to a size where conventional methods may be used

  • We examine the performance of ten dispersion-matrix regularization techniques, along with the traditional linear discriminant function (LDF), for the two-class problem using several types of mean configurations and covariance matrices, for both small and moderate training-sample sizes relative to the feature dimension p


Introduction

For classification problems with high-dimensional sets of features, feature selection methods are often used to locate the features most important for separating classes. It can also be important to apply a regularization method after feature selection when building classifiers for high-dimensional data sets. In Guo et al.’s (2007) “Shrunken Centroids” regularized classifier, the class means of each gene are shrunk toward their grand means based on a threshold, to the extent that the class-specific means may be determined not to differ across classes. This procedure selects the most important genes for classification.
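The shrink-toward-the-grand-mean step can be sketched with simple soft thresholding, as below. This is a minimal illustration of the idea only: the threshold `delta` is an assumed constant, and the per-feature standardization used in the actual shrunken-centroids estimator of Guo et al. (2007) is omitted.

```python
import numpy as np

def shrink_centroids(X, y, delta=0.5):
    """Soft-threshold each class mean toward the grand mean, per feature.

    Features whose class-specific offset from the grand mean is at most
    `delta` in absolute value collapse to the grand mean and therefore
    carry no class information; larger offsets are shrunk by `delta`.
    """
    grand = X.mean(axis=0)
    shrunk = {}
    for k in np.unique(y):
        d = X[y == k].mean(axis=0) - grand                   # per-feature offset
        d = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)  # soft threshold
        shrunk[k] = grand + d
    return shrunk
```

Features whose shrunken class means coincide across all classes drop out of the discriminant rule, which is how the procedure doubles as gene selection.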

Notational Background
A Poorly-posed Linear Discriminant Function
Covariance Matrix Estimator Regularization
Shrinkage Estimator Regularization
Diagonal Estimators of the Covariance Matrix
Simulation Description and Results
Configuration A
Configuration B
Configuration C
Configuration D
Colon Cancer Example
Yang and Wu’s Regularized Complete Linear Discriminant Classifier
Discussion
General Guidelines
Other Types of Regularization
10 Appendix
10.1.2. Configuration B
10.1.3. Configuration C
10.1.4. Configuration D
10.2.1. Configuration A
10.2.2. Configuration B
10.2.3. Configuration C
10.2.4. Configuration D
