Abstract

In the context of classification using high-dimensional data such as microarray gene expression data, it is often useful to perform preliminary variable selection. For example, the k-nearest-neighbors classification procedure yields a much higher accuracy when applied to variables with high discriminatory power. Typical (univariate) variable selection methods for binary classification are, e.g., the two-sample t-statistic or the Mann-Whitney test. In small sample settings, the classification error rate is often estimated using cross-validation (CV) or related approaches. The variable selection procedure then has to be applied to each considered training set anew, i.e. for each CV iteration successively. Performing variable selection based on the whole sample before the CV procedure would yield a downwardly biased error rate estimate. CV may also be used to tune parameters involved in a classification method. For instance, the penalty parameter in penalized regression or the cost in support vector machines is most often selected using CV. This type of CV is usually denoted as internal CV, in contrast to the external CV performed to estimate the error rate, while the term nested CV refers to the whole procedure embedding the two CV loops. While variable selection and parameter tuning have been widely investigated in the context of high-dimensional classification, it is still unclear how they should be combined if a classification method involves both variable selection and parameter tuning. For example, the k-nearest-neighbors method usually requires variable selection and involves a tuning parameter: the number k of neighbors. It is well known that variable selection should be repeated for each external CV iteration. But should variable selection also be repeated for each internal (tuning) iteration, or should tuning rather be performed on a fixed subset of variables selected once per external training set? While the first variant seems more natural, it implies a huge computational expense and its benefit in terms of error rate remains unknown. In this paper, we assess both variants quantitatively using real microarray data sets. We focus on two representative examples: k-nearest-neighbors (with k as tuning parameter) and Partial Least Squares dimension reduction followed by linear discriminant analysis (with the number of components as tuning parameter). We conclude that the more natural but computationally expensive variant with repeated variable selection does not necessarily lead to better accuracy, and we point out potential pitfalls of both variants.
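
To make the two variants concrete, the following Python sketch outlines nested CV for k-nearest-neighbors in scikit-learn style. It is an illustration only, not the authors' implementation: the function name, the fold counts, the number of selected genes, and the grid of k values are hypothetical, and the ANOVA F-test (f_classif) is used as a stand-in for a two-sample t-statistic ranking.

```python
# Illustrative sketch (assumed names and settings, not the paper's code) of the
# two nested-CV variants: repeating variable selection inside the internal
# (tuning) loop vs. tuning k on a subset fixed per external training set.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

def nested_cv_error(X, y, n_genes=100, k_grid=(1, 3, 5, 7), repeat_selection=True):
    """Estimate the misclassification rate by nested CV.

    repeat_selection=True  -> selection is redone in every internal (tuning)
                              iteration as well (the fully repeated variant).
    repeat_selection=False -> variables are selected once per external training
                              set and kept fixed while k is tuned.
    """
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in outer.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]

        if repeat_selection:
            # Selection is part of the pipeline, so it is refitted inside
            # every internal CV split during tuning of k.
            model = Pipeline([
                ("select", SelectKBest(f_classif, k=n_genes)),
                ("knn", KNeighborsClassifier()),
            ])
            grid = {"knn__n_neighbors": list(k_grid)}
        else:
            # Select variables once on the external training set, then tune k
            # on this fixed subset of variables.
            selector = SelectKBest(f_classif, k=n_genes).fit(X_tr, y_tr)
            X_tr, X_te = selector.transform(X_tr), selector.transform(X_te)
            model = KNeighborsClassifier()
            grid = {"n_neighbors": list(k_grid)}

        inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
        tuned = GridSearchCV(model, grid, cv=inner).fit(X_tr, y_tr)
        errors.append(1.0 - tuned.score(X_te, y_te))
    return float(np.mean(errors))
```

Note that in both variants the variable selection is refitted for each external training set, so the external error estimate remains (approximately) unbiased; the variants differ only in whether the internal tuning loop sees its own freshly selected variables.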
