Abstract

Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier, and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, which is most directly analyzed via the deviation distribution of the estimator, that is, the distribution of the difference between the estimated and true errors. Past studies indicate that, given a prior set of features, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify the degree to which feature selection increases the variation of the deviation distribution beyond the variation present in the absence of feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance using feature selection as opposed to using an optimal feature set without feature selection. The contribution of feature selection to the variance of the deviation distribution can be significant, accounting for over half of the variance in many of the cases studied. We consider linear-discriminant analysis, 3-nearest-neighbor, and linear support vector machines for classification; sequential forward selection, sequential forward floating selection, and the t-test for feature selection; and k-fold and leave-one-out cross-validation for error estimation. We apply these to three feature-label models and patient data from a breast cancer study. In sum, the cross-validation deviation distribution is significantly flatter when there is feature selection, compared with the case when cross-validation is performed on a given feature set. This is reflected by the observed positive values of the CRIDD, which is defined to quantify the contribution of feature selection towards the deviation variance.

Highlights

  • Given the relatively small number of microarrays typically used in expression-based classification for diagnosis and prognosis, all the data must be used to train a classifier and the same training data is used for error estimation

  • We have introduced the coefficient of relative increase in deviation dispersion to quantify the effect of feature selection on cross-validation error estimation

  • The coefficient measures the relative increase in the variance of the deviation distribution due to feature selection
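
As a rough illustration of how such a coefficient could be computed from simulation output, the sketch below takes two sets of deviations (estimated error minus true error), one obtained with feature selection and one with a fixed feature set. The function name `cridd`, the normalization by the with-selection variance, and the sample numbers are assumptions made for this sketch; the text above only states that the coefficient measures the relative increase in deviation variance.

```python
from statistics import pvariance


def cridd(dev_with_fs, dev_without_fs):
    """Relative increase in deviation variance attributable to feature selection.

    Each argument is a sequence of deviations (estimated error minus true
    error), one per simulated sample. Dividing by the with-selection variance
    is an assumption of this sketch, chosen so that a value above 0.5 means
    feature selection accounts for over half of the variance.
    """
    var_fs = pvariance(dev_with_fs)
    var_no_fs = pvariance(dev_without_fs)
    if var_fs == 0:
        raise ValueError("deviation variance with feature selection is zero")
    return (var_fs - var_no_fs) / var_fs


# Hypothetical deviations, for illustration only (not data from the study).
dev_fs = [-0.10, 0.05, 0.12, -0.08, 0.01]
dev_no_fs = [-0.04, 0.02, 0.05, -0.03, 0.00]
kappa = cridd(dev_fs, dev_no_fs)
print(f"CRIDD = {kappa:.3f}")
```

Under this normalization, a positive value indicates that the deviation distribution is more dispersed with feature selection than without it, and a value of zero indicates no added dispersion.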


Summary

Introduction

Given the relatively small number of microarrays typically used in expression-based classification for diagnosis and prognosis, all the data must be used to train a classifier and the same training data is used for error estimation. A classifier is designed according to a classification rule, with the rule being applied to sample data to yield a classifier. There are two possibilities: either the features are given prior to the data, in which case the classification rule yields a classifier with the given features constituting its argument, or both the features and classifier are determined by the classification rule. If cross-validation error estimation is used, the approximate unbiasedness of cross-validation applies to the classification rule, and since feature selection is part of the classification rule, feature selection must be accounted for within the cross-validation procedure to maintain the approximate unbiasedness [1]. This paper concerns the quality of such a cross-validation estimation procedure.
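
The point that feature selection belongs inside the cross-validation loop can be sketched as follows. This is a minimal illustration, not the study's procedure: the helper names (`t_scores`, `select_features`, `fit_nearest_mean`, `cv_error`), the nearest-mean classifier used as a stand-in, the t-like score built on population variances, and the synthetic data are all assumptions of the sketch. It also assumes every training fold contains both classes.

```python
import random
from statistics import mean, pvariance


def t_scores(X, y):
    """Absolute t-like statistic per feature for two classes labelled 0 and 1."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        se = (pvariance(a) / len(a) + pvariance(b) / len(b)) ** 0.5
        scores.append(abs(mean(a) - mean(b)) / se if se > 0 else 0.0)
    return scores


def select_features(X, y, d):
    """Indices of the d features with the largest scores."""
    scores = t_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:d]


def fit_nearest_mean(X, y, feats):
    """Stand-in classifier: assign a point to the class with the closer centroid."""
    cents = {}
    for c in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == c]
        cents[c] = [mean(r[j] for r in rows) for j in feats]

    def predict(row):
        dist = {c: sum((row[j] - m) ** 2 for j, m in zip(feats, cents[c]))
                for c in cents}
        return min(dist, key=dist.get)

    return predict


def cv_error(X, y, k, d, seed=0):
    """k-fold cross-validation error with feature selection redone in every fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for fold in folds:
        held = set(fold)
        Xtr = [X[i] for i in idx if i not in held]
        ytr = [y[i] for i in idx if i not in held]
        feats = select_features(Xtr, ytr, d)  # uses training-fold data only
        predict = fit_nearest_mean(Xtr, ytr, feats)
        errors += sum(predict(X[i]) != y[i] for i in fold)
    return errors / len(X)


# Synthetic data for illustration: 40 samples, 20 features, 2 informative.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(1.0 * lab if j < 2 else 0.0, 1.0) for j in range(20)] for lab in y]
err_5fold = cv_error(X, y, k=5, d=2)
err_loo = cv_error(X, y, k=len(X), d=2)  # leave-one-out as the k = n case
```

Selecting features once on the full sample and then cross-validating only the classifier would let the held-out points influence the feature set, which is the situation the paragraph above warns against.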

