High-dimensional Model Selection Research Articles

Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries made by such data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions on the exogeneity of covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable Y with the best s linear combinations of p covariates X, even when X and Y are independent. When the covariance matrix of X possesses the restricted eigenvalue property, we derive such distributions for both finite s and diverging s, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of X. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where residuals are from regularized fits. Our approach is then applied to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of covariates. The former provides a baseline for guarding against false discoveries due to data mining, and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated by both numerical examples and real data analysis.
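To make the notion of a maximum spurious correlation concrete, the following is a minimal Monte Carlo sketch of the simplest case, s = 1, where the best linear combination reduces to the single covariate most correlated with Y. It is not the multiplier bootstrap described in the abstract: it assumes independent standard Gaussian covariates with a known identity covariance, and the function names `max_spurious_correlation` and `simulate`, along with the sample sizes used, are illustrative choices rather than anything taken from the article.

```python
import numpy as np


def max_spurious_correlation(n, p, rng):
    """Largest absolute sample correlation between y and any column of X.

    X (n x p) and y (n,) are drawn independently, so every correlation
    observed here is spurious by construction.
    """
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)
    Xc = X - X.mean(axis=0)          # center each covariate
    yc = y - y.mean()                # center the response
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.max(np.abs(corr))


def simulate(n, p, reps=200, seed=0):
    """Monte Carlo draws of the maximum spurious correlation (s = 1)."""
    rng = np.random.default_rng(seed)
    return np.array([max_spurious_correlation(n, p, rng) for _ in range(reps)])


if __name__ == "__main__":
    # Assumed settings for illustration: n = 50 observations, growing p.
    for p in (10, 100, 1000):
        draws = simulate(n=50, p=p)
        print(f"p={p}: mean={draws.mean():.3f}, "
              f"95% upper limit={np.quantile(draws, 0.95):.3f}")
```

Under these assumptions, the simulated upper quantile plays the role of the baseline mentioned in the abstract: an observed correlation below it is no larger than what independence alone would typically produce at that n and p.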

DNA microarrays open up a broad new horizon for investigators interested in studying the genetic determinants of disease. The high throughput nature of these arrays, where differential expression for thousands of genes can be measured simultaneously, creates an enormous wealth of information, but also poses a challenge for data analysis because of the large multiple testing problem involved. The solution has generally been to focus on optimizing false-discovery rates while sacrificing power. The drawback of this approach is that more subtle expression differences will be missed that might give investigators more insight into the genetic environment necessary for a disease process to take hold. We introduce a new method for detecting differentially expressed genes based on a high-dimensional model selection technique, Bayesian ANOVA for microarrays (BAM), which strikes a balance between false rejections and false nonrejections. The basis of the new approach involves a weighted average of generalized ridge regression estimates that provides the benefits of using shrinkage estimation combined with model averaging. A simple graphical tool based on the amount of shrinkage is developed to visualize the trade-off between low false-discovery rates and finding more genes. Simulations are used to illustrate BAM's performance, and the method is applied to a large database of colon cancer gene expression data. Our working hypothesis in the colon cancer analysis is that large differential expressions may not be the only ones contributing to metastasis—in fact, moderate changes in expression of genes may be involved in modifying the genetic environment to a sufficient extent for metastasis to occur. A functional biological analysis of gene effects found by BAM, but not other false-discovery-based approaches, lends support to this hypothesis.
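The idea of a weighted average of generalized ridge regression estimates can be sketched as follows. This is only an illustration of the general technique, not the BAM estimator itself: the abstract does not specify how the penalties or weights are obtained, so the `penalty_sets` and `weights` arguments here are hypothetical inputs supplied by the caller, and the function names are assumptions.

```python
import numpy as np


def generalized_ridge(X, y, penalties):
    """Generalized ridge estimate with one penalty per coefficient.

    Solves (X'X + diag(penalties)) beta = X'y. A zero penalty vector
    gives ordinary least squares when X'X is invertible.
    """
    penalties = np.asarray(penalties, dtype=float)
    return np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)


def averaged_ridge(X, y, penalty_sets, weights):
    """Weighted average of generalized ridge fits.

    penalty_sets: sequence of penalty vectors (hypothetical candidates).
    weights: nonnegative weights, normalized here to sum to one.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fits = np.array([generalized_ridge(X, y, pen) for pen in penalty_sets])
    return weights @ fits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((40, 5))
    beta = np.array([2.0, 0.0, 0.0, 0.5, 0.0])
    y = X @ beta + rng.standard_normal(40)
    # Two illustrative penalty vectors: light and heavy shrinkage.
    penalty_sets = [np.full(5, 0.1), np.full(5, 50.0)]
    print(averaged_ridge(X, y, penalty_sets, weights=[0.7, 0.3]))
```

Averaging over several penalty vectors combines the shrinkage of each individual ridge fit with the stability of model averaging, which is the combination the abstract credits for the method's balance between false rejections and false nonrejections.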

Articles published on High-dimensional Model Selection

Information criteria for structured parameter selection in high-dimensional tree and graph models

Multiple-hypothesis testing rules for high-dimensional model selection and sparse-parameter estimation

Feature Selection in High-Dimensional Models via EBIC with Energy Distance Correlation.

An expectation maximization algorithm for high-dimensional model selection for the Ising model with misclassified states*

Variable Selection Using Nonlocal Priors in High-Dimensional Generalized Linear Models With Application to fMRI Data Analysis.

On model selection from a finite family of possibly misspecified time series models

ARE DISCOVERIES SPURIOUS? DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS AND THEIR APPLICATIONS.

Accumulation Tests for FDR Control in Ordered Hypothesis Testing

High-dimensional Bayesian Variable Selection Methods: A Comparison Study

Model selection of hierarchically structured covariates using elastic net

Spline estimation and variable selection for single-index prediction models with diverging number of index parameters

A Selective Review of Group Selection in High-Dimensional Models

Penalized Composite Quasi-Likelihood for Ultrahigh-Dimensional Variable Selection.

High-dimensional variable selection

Detecting Differentially Expressed Genes in Microarrays Using Bayesian Model Selection
