Thresholded Lasso for high dimensional variable selection
- Research Article
- 10.1111/rssc.12252
- Nov 17, 2017
- Journal of the Royal Statistical Society Series C: Applied Statistics
Summary: We discuss scalar-on-function regression models where all parameters of the assumed response distribution can be modelled depending on covariates. We thus combine signal regression models with generalized additive models for location, scale and shape. Our approach is motivated by a time series of stock returns, where it is of interest to model both the expectation and the variance depending on lagged response values and functional liquidity curves. We compare two fundamentally different methods for estimation, a gradient boosting and a penalized-likelihood-based approach, and address practically important points like identifiability and model choice. Estimation by a componentwise gradient boosting algorithm allows for high dimensional data settings and variable selection. Estimation by a penalized-likelihood-based approach has the advantage of directly provided statistical inference.
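The componentwise gradient boosting idea mentioned above can be illustrated on synthetic data: at each step, every covariate is fit to the current residuals as a candidate base learner, and only the best-fitting coefficient is nudged by a small step, which leaves unused covariates at exactly zero. This is a minimal numpy sketch of that idea (function name and data are illustrative, not the authors' implementation):

```python
import numpy as np

def componentwise_l2_boost(X, y, n_steps=300, nu=0.1):
    """Componentwise L2-boosting: at each step, fit each single covariate
    to the current residuals by least squares and update only the
    best-fitting coefficient by a small step nu. Covariates never
    selected keep a zero coefficient, giving built-in variable selection."""
    n, p = X.shape
    beta = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    denom = (X ** 2).sum(axis=0)
    for _ in range(n_steps):
        # least-squares fit of each covariate to the current residuals
        coefs = X.T @ resid / denom
        # pick the base learner with the smallest residual sum of squares
        sse = ((resid[:, None] - X * coefs) ** 2).sum(axis=0)
        j = int(np.argmin(sse))
        beta[j] += nu * coefs[j]                  # small step update
        resid = resid - nu * coefs[j] * X[:, j]
    return intercept, beta

# toy example: only the first two of ten covariates matter
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.standard_normal(200)
b0, beta = componentwise_l2_boost(X, y)
```

With a small step size and enough iterations, the two informative coefficients approach their true values while the noise covariates stay near zero.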
- Research Article
- 10.1080/00949655.2019.1584198
- Feb 27, 2019
- Journal of Statistical Computation and Simulation
Abstract: In many medical studies patients are nested or clustered within doctors. With many explanatory variables, variable selection with clustered data can be challenging. We propose a method for variable selection based on random forests that addresses clustered data through stratified binary splits. Our motivating example involves the detection of orthopedic device components from a large pool of candidates, where each patient belongs to a surgeon. Simulations compare the performance of survival forests grown using the stratified logrank statistic to conventional and robust logrank statistics, as well as a method that selects variables using a threshold based on each variable's empirical null distribution. The stratified logrank test performs better than conventional and robust methods when data are generated to have cluster-specific effects and, when cluster sizes are sufficiently large, performs comparably to the alternative splitting rules in the absence of cluster-specific effects. Thresholding was effective at distinguishing between important and unimportant variables.
- Research Article
- 10.1111/j.1467-9868.2011.01023.x
- Feb 15, 2012
- Journal of the Royal Statistical Society Series B: Statistical Methodology
Summary: The paper considers variable selection in linear regression models where the number of covariates is possibly much larger than the number of observations. High dimensionality of the data brings in many complications, such as (possibly spurious) high correlations between the variables, which result in marginal correlation being unreliable as a measure of association between the variables and the response. We propose a new way of measuring the contribution of each variable to the response which takes into account high correlations between the variables in a data-driven way. The proposed tilting procedure provides an adaptive choice between the use of marginal correlation and tilted correlation for each variable, where the choice is made depending on the values of the hard thresholded sample correlation of the design matrix. We study the conditions under which this measure can successfully discriminate between the relevant and the irrelevant variables and thus be used as a tool for variable selection. Finally, an iterative variable screening algorithm is constructed to exploit the theoretical properties of tilted correlation, and its good practical performance is demonstrated in a comparative simulation study.
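The hard-thresholded sample correlation of the design matrix, which drives the adaptive choice described above, is simple to compute: small off-diagonal correlations are zeroed out and only the "high" correlations survive. A short numpy sketch on hypothetical data (the threshold value here is illustrative, not the paper's data-driven choice):

```python
import numpy as np

def hard_threshold_correlation(X, threshold=0.3):
    """Hard-threshold the sample correlation matrix of the design:
    off-diagonal entries with absolute value below the threshold are
    set to zero, keeping only the high correlations that the tilting
    procedure is designed to account for."""
    C = np.corrcoef(X, rowvar=False)
    C_thr = np.where(np.abs(C) >= threshold, C, 0.0)
    np.fill_diagonal(C_thr, 1.0)
    return C_thr

# toy design: first two columns strongly correlated, third independent
rng = np.random.default_rng(1)
z = rng.standard_normal((500, 1))
X = np.hstack([z + 0.1 * rng.standard_normal((500, 2)),
               rng.standard_normal((500, 1))])
C_thr = hard_threshold_correlation(X, threshold=0.3)
```

Only the genuinely high correlation between the first two columns survives thresholding; the spurious small correlations with the independent column are set to zero.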
- Research Article
- 10.4172/2155-6180.s1-005
- Jan 1, 2013
- Journal of Biometrics & Biostatistics
In this article, we present a selective overview of some recent developments in Bayesian model and variable selection methods for high dimensional linear models. While most reviews in the literature cover conventional methods, we focus on recently developed methods that have proven successful in dealing with high dimensional variable selection. First, we give a brief overview of the traditional model selection methods (viz. Mallows' Cp, AIC, BIC, DIC), followed by a discussion of some recently developed methods (viz. EBIC, regularization) that have occupied the minds of many statisticians. Then, we review high dimensional Bayesian methods with a particular emphasis on Bayesian regularization methods, which have been used extensively in recent years. We conclude by briefly addressing the asymptotic behaviour of Bayesian variable selection methods for high dimensional linear models under different regularity conditions.
- Research Article
- 10.1155/2016/8209453
- Jan 1, 2016
- BioMed Research International
Background. Iterative sure independence screening (ISIS) is a popular method for selecting important variables while retaining most of the informative variables relevant to the outcome in high throughput data. However, it is not only computationally intensive but may also produce a high false discovery rate (FDR). We propose using the FDR as a screening step that reduces the high dimension to a lower one while controlling the FDR, in combination with three popular variable selection methods: LASSO, SCAD, and MCP. Method. The three methods with the proposed screenings were applied to prostate cancer data with the presence of metastasis as the outcome. Results. Simulations showed that the three variable selection methods with the proposed screenings controlled the predefined FDR and produced high area under the receiver operating characteristic curve (AUROC) scores. Applied to the prostate cancer example, LASSO and MCP selected 12 and 8 genes and produced AUROC scores of 0.746 and 0.764, respectively. Conclusions. We demonstrated that the variable selection methods with the sequential use of FDR and ISIS not only controlled the predefined FDR in the final models but also achieved relatively high AUROC scores.
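An FDR-based screening step of the kind described above can be sketched with marginal correlations and the Benjamini-Hochberg step-up rule. This is a minimal illustration on synthetic data, using Fisher's z-transform for the marginal p-values (a standard normal approximation, not necessarily the test used in the paper):

```python
import math
import numpy as np

def bh_fdr_screen(X, y, q=0.1):
    """Screen covariates by marginal correlation with the response.
    P-values come from Fisher's z-transform of the sample correlation
    (normal approximation); the Benjamini-Hochberg step-up rule then
    keeps the covariates passing the target FDR level q."""
    n, p = X.shape
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    z = np.sqrt(n - 3) * np.arctanh(r)          # Fisher z-transform
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    # Benjamini-Hochberg: find the largest k with p_(k) <= k*q/p
    order = np.argsort(pvals)
    passed = pvals[order] <= (np.arange(1, p + 1) * q / p)
    k = int(np.max(np.nonzero(passed)[0])) + 1 if passed.any() else 0
    keep = np.zeros(p, dtype=bool)
    keep[order[:k]] = True
    return keep

# toy data: 2 informative covariates out of 50
rng = np.random.default_rng(2)
X = rng.standard_normal((300, 50))
y = 1.5 * X[:, 0] + 1.0 * X[:, 1] + rng.standard_normal(300)
keep = bh_fdr_screen(X, y, q=0.1)
```

The surviving covariates would then be passed to LASSO, SCAD, or MCP, which now face a much lower-dimensional problem.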
- Research Article
- 10.1088/1742-6596/2026/1/012012
- Sep 1, 2021
- Journal of Physics: Conference Series
With the progress of science and technology, high dimensional data have become a major topic in scientific research. The Lasso is one of the most commonly used methods for high dimensional variable selection, typically solved with the least angle regression algorithm or the coordinate descent algorithm. In this paper, the Lasso is applied to reduce the dimensionality of a gasoline octane problem; three algorithms are compared, and the model obtained with the generalized path search algorithm is found to have the smallest mean squared error.
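Of the solution algorithms mentioned above, coordinate descent is the easiest to sketch: each coefficient is minimised exactly via soft-thresholding while the others are held fixed. A minimal numpy implementation on synthetic data (illustrative, not the paper's code):

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator, the closed-form coordinate update
    for the Lasso."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the Lasso objective
    (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove all effects except coordinate j
            resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

# toy data: 2 informative covariates out of 20
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + 0.5 * rng.standard_normal(100)
beta = lasso_coordinate_descent(X, y, lam=0.1)
```

The two informative coefficients are recovered (shrunk slightly toward zero by the penalty, as the Lasso always does), while most noise coefficients are set exactly to zero.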
- Dataset
- 10.3410/f.732703320.793559523
- May 8, 2019
Faculty Opinions recommendation of Panning for gold: ‘model‐X’ knockoffs for high dimensional controlled variable selection.
- Research Article
- 10.1007/s00180-023-01426-5
- Dec 9, 2023
- Computational Statistics
High dimensional controlled variable selection with model-X knockoffs in the AFT model
- Research Article
- 10.12677/sa.2019.86096
- Jan 1, 2019
- Statistics and Application
Research on the selection of the MCP regularization parameter in high dimensional variable selection
- Dissertation
- 10.31274/etd-180810-3479
- Sep 24, 2013
Two sample inference for high dimensional data and nonparametric variable selection for census data
- Research Article
- 10.1002/env.2852
- May 3, 2024
- Environmetrics
Summary: The current work is motivated by analysing the effect of chemical and local meteorological variables on the behaviour of concentrations in the Abruzzo region (Italy), with the objective of forecasting and controlling air quality. Given that the available data are curves representing day-to-day variations, a multiple function-on-function linear regression (MFFLR) model is considered. By assuming the Karhunen-Loève expansion, the MFFLR model can be reduced to a classical linear regression model for each principal component of the functional response in terms of all principal components (PCs) of the functional predictors. In this sense, a regularization approach for functional principal component regression is proposed that merges functional data analysis with the group Lasso. This novel methodology makes it possible to estimate the model and simultaneously select the functional predictors relevant to the functional response, where each functional independent variable is represented by a group of input variables derived from its PCs.
- Research Article
- 10.5555/2627435.2697054
- Jan 1, 2014
- Journal of Machine Learning Research
Consider a linear model Y = Xβ + ωz, where X has n rows and p columns and z ∼ N(0, I_n). We assume both p and n are large, including the case p ≫ n. The unknown signal vector β is assumed to be sparse…
- Research Article
- 10.1111/rssb.12279
- Jun 25, 2018
- Journal of the Royal Statistical Society Series B: Statistical Methodology
Missing data are frequently encountered in high dimensional problems, but they are usually difficult to handle with standard algorithms, such as the expectation-maximization algorithm and its variants. To tackle this difficulty, problem-specific algorithms have been developed in the literature, but a general algorithm is still lacking. This work fills that gap: we propose a general algorithm for high dimensional missing data problems. The algorithm iterates between an imputation step and a regularized optimization step. At the imputation step, the missing data are imputed conditionally on the observed data and the current parameter estimates; at the regularized optimization step, a consistent estimate of the minimizer of a Kullback-Leibler divergence defined on the pseudo-complete data is found via the regularization approach. For high dimensional problems, the consistent estimate can be found under sparsity constraints. Consistency of the averaged estimate for the true parameter can be established under quite general conditions. The algorithm is illustrated with high dimensional Gaussian graphical models, high dimensional variable selection and a random-coefficient model.
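The imputation-regularization iteration can be sketched for the simplest relevant case, a Gaussian model with entries missing at random. This toy version (not the paper's general algorithm) imputes each missing entry by its conditional mean given the observed entries and the current parameter estimates, then re-estimates the mean and a ridge-shrunk covariance from the pseudo-complete data:

```python
import numpy as np

def impute_regularize_gaussian(X, n_iter=20, shrink=0.1):
    """Toy imputation-regularization loop for N(mu, Sigma) with
    missing entries. Imputation step: conditional mean of the missing
    block given the observed block under the current (mu, Sigma).
    Regularization step: ridge-shrunk covariance re-estimate from the
    pseudo-complete data."""
    miss = np.isnan(X)
    Xc = np.where(miss, np.nanmean(X, axis=0), X)  # start from column means
    p = X.shape[1]
    for _ in range(n_iter):
        mu = Xc.mean(axis=0)
        Sigma = np.cov(Xc, rowvar=False) + shrink * np.eye(p)
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            # conditional mean of missing given observed under N(mu, Sigma)
            Soo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
            Xc[i, m] = mu[m] + Sigma[np.ix_(m, o)] @ Soo_inv @ (X[i, o] - mu[o])
    return Xc, mu, Sigma

# toy data: three strongly correlated columns, 20% entries missing
rng = np.random.default_rng(4)
z = rng.standard_normal((400, 1))
X = np.hstack([z, z, z]) + 0.3 * rng.standard_normal((400, 3))
X[rng.random(X.shape) < 0.2] = np.nan
Xc, mu, Sigma = impute_regularize_gaussian(X)
```

Because the columns are correlated, the conditional-mean imputation borrows strength across columns, and the final covariance estimate preserves the strong off-diagonal structure.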
- Research Article
- 10.3390/math11030759
- Feb 2, 2023
- Mathematics
In many applications, interest focuses on assessing relationships between covariates and the extremes of the distribution of a continuous response. For example, in climate studies, a usual approach to assess climate change has been based on the analysis of annual maximum data. Using the generalized extreme value (GEV) distribution, we can model trends in the annual maximum temperature using the high number of available atmospheric covariates. However, there is typically uncertainty about which of the many candidate covariates should be included. Bayesian methods for variable selection are very useful to identify important covariates, but such methods are currently very limited for moderately high dimensional variable selection in GEV regression. We propose a Bayesian method for variable selection based on a stochastic search variable selection (SSVS) algorithm for posterior computation. The method is applied to the selection of atmospheric covariates in annual maximum temperature series at three Spanish stations.
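The SSVS mechanism is easiest to see in a toy linear-regression Gibbs sampler (a deliberately simplified stand-in, with known noise variance, not the GEV model of the paper): each coefficient carries a spike-and-slab prior, and the sampler alternates between drawing the coefficients given the inclusion indicators and flipping each indicator from the spike/slab density ratio.

```python
import numpy as np

def ssvs_linear(X, y, n_iter=2000, tau0=0.2, tau1=10.0, sigma2=1.0):
    """Toy SSVS Gibbs sampler for linear regression with known noise
    variance. Spike-and-slab prior: beta_j ~ N(0, tau0^2) if
    gamma_j = 0 (spike) and N(0, tau1^2) if gamma_j = 1 (slab), with
    gamma_j ~ Bernoulli(0.5). Returns posterior inclusion probabilities.
    The internal seed makes the sampler reproducible."""
    n, p = X.shape
    gamma = np.ones(p, dtype=int)
    incl = np.zeros(p)
    XtX, Xty = X.T @ X, X.T @ y
    rng = np.random.default_rng(0)
    burn = n_iter // 2
    for it in range(n_iter):
        # 1) beta | gamma, y is multivariate normal (conjugate update)
        D_inv = np.diag(1.0 / np.where(gamma == 1, tau1**2, tau0**2))
        cov = np.linalg.inv(XtX / sigma2 + D_inv)
        beta = rng.multivariate_normal(cov @ Xty / sigma2, cov)
        # 2) gamma_j | beta_j from the spike/slab density ratio
        log_slab = -0.5 * beta**2 / tau1**2 - np.log(tau1)
        log_spike = -0.5 * beta**2 / tau0**2 - np.log(tau0)
        prob = 1.0 / (1.0 + np.exp(log_spike - log_slab))
        gamma = (rng.random(p) < prob).astype(int)
        if it >= burn:
            incl += gamma
    return incl / (n_iter - burn)

# toy data: covariates 0 and 3 are the only true predictors
rng = np.random.default_rng(5)
X = rng.standard_normal((100, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.standard_normal(100)
pip = ssvs_linear(X, y)
```

The posterior inclusion probabilities concentrate near one for the true predictors and near zero for the rest, which is the output one would rank covariates by.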
- Database
- 10.17863/cam.62808
- Apr 28, 2021
In molecular biology, advances in high-throughput technologies have made it possible to study complex multivariate phenotypes and their simultaneous associations with high-dimensional genomic and other omics data, a problem that can be studied with high-dimensional multi-response regression, where the response variables are potentially highly correlated. To this purpose, we recently introduced several multivariate Bayesian variable and covariance selection models, e.g., Bayesian estimation methods for sparse seemingly unrelated regression for variable and covariance selection. Several variable selection priors have been implemented in this context, in particular the hotspot detection prior for latent variable inclusion indicators, which results in sparse variable selection for associations between predictors and multiple phenotypes. We also propose an alternative, which uses a Markov random field (MRF) prior for incorporating prior knowledge about the dependence structure of the inclusion indicators. Inference of Bayesian seemingly unrelated regression (SUR) by Markov chain Monte Carlo methods is made computationally feasible by factorisation of the covariance matrix amongst the response variables. In this paper we present BayesSUR, an R package, which allows the user to easily specify and run a range of different Bayesian SUR models, which have been implemented in C++ for computational efficiency. The R package allows the specification of the models in a modular way, where the user chooses the priors for variable selection and for covariance selection separately. We demonstrate the performance of sparse SUR models with the hotspot prior and spike-and-slab MRF prior on synthetic and real data sets representing eQTL or mQTL studies and in vitro anti-cancer drug screening studies as examples for typical applications.