Robust Bayesian variable selection for the quantile varying coefficient model
- Research Article
57
- 10.1002/gepi.20353
- Jul 10, 2008
- Genetic Epidemiology
Variable selection is growing in importance with the advent of high throughput genotyping methods requiring analysis of hundreds to thousands of single nucleotide polymorphisms (SNPs) and the increased interest in using these genetic studies to better understand common, complex diseases. Up to now, the standard approach has been to analyze the genotypes for each SNP individually to look for an association with a disease. Alternatively, combinations of SNPs or haplotypes are analyzed for association. Another added complication in studying complex diseases or phenotypes is that genetic risk for the disease is often due to multiple SNPs in various locations on the chromosome with small individual effects that may have a collectively large effect on the phenotype. Hence, multi-locus SNP models, as opposed to single SNP models, may better capture the true underlying genotypic-phenotypic relationship. Thus, innovative methods for determining which SNPs to include in the model are needed. The goal of this article is to describe several methods currently available for variable and model selection using Bayesian approaches and to illustrate their application for genetic association studies using both real and simulated candidate gene data for a complex disease. In particular, Bayesian model averaging (BMA), stochastic search variable selection (SSVS), and Bayesian variable selection (BVS) using a reversible jump Markov chain Monte Carlo (MCMC) for candidate gene association studies are illustrated using a study of age-related macular degeneration (AMD) and simulated data.
- Research Article
- 10.1037/met0000837
- Apr 16, 2026
- Psychological methods
Modern regularization and variable selection methods, such as least absolute shrinkage and selection operator (lasso) and Bayesian variable selection, are important tools for psychological researchers to reduce the risk of overfitting, improve prediction in future samples, and increase model interpretability. Although missing data are common in psychological data, it is not straightforward to combine principled methods for addressing missing data with these modern variable selection methods. This challenge is well illustrated in a recent article by Gunn et al. (2023), which compares three approaches for combining lasso with multiple imputation to address missing data. The surveyed approaches yield markedly different sets of selected predictors. Their findings underscore limitations of the lasso for the purpose of variable selection. In this article, we show how to implement a Bayesian variable selection method, stochastic search variable selection (SSVS), with multiply imputed data. SSVS is a principled and consistent method for variable selection, and we demonstrate advantages relative to lasso in an example data set and simulation study. It is straightforward to apply an ITS strategy for SSVS using existing software.
- Research Article
18
- 10.1007/s00184-014-0491-y
- May 14, 2014
- Metrika
The varying coefficient model is widely used as an extension of the linear regression model. Many procedures have been developed for the model estimation, and recently efficient variable selection procedures for the varying coefficient model have been proposed as well. However, those variable selection approaches are mainly built on the least-squares (LS) type method. Although the LS method is a successful and standard choice in the varying coefficient model fitting and variable selection, it may suffer when the errors follow a heavy-tailed distribution or in the presence of outliers. To overcome this issue, we start by developing a novel robust estimator, termed rank-based spline estimator, which combines the ideas of rank inference and polynomial spline. Furthermore, we propose a robust variable selection method, incorporating the smoothly clipped absolute deviation penalty into the rank-based spline loss function. Under mild conditions, we theoretically show that the proposed rank-based spline estimator is highly efficient across a wide spectrum of distributions. Its asymptotic relative efficiency with respect to the LS-based method is closely related to that of the signed-rank Wilcoxon test with respect to the t test. Moreover, the proposed variable selection method can identify the true model consistently, and the resulting estimator can be as efficient as the oracle estimator. Simulation studies show that our procedure has better performance than the LS-based method when the errors deviate from normality.
- Research Article
6
- 10.1080/10618600.2015.1035438
- Apr 2, 2016
- Journal of Computational and Graphical Statistics
In this article, we propose a new Bayesian variable selection (BVS) approach via the graphical model and the Ising model, which we refer to as the “Bayesian Ising graphical model” (BIGM). The BIGM is developed by showing that the BVS problem based on the linear regression model can be considered as a complete graph and described by an Ising model with random interactions. There are several advantages of our BIGM: it is easy to (i) employ the single-site updating and cluster updating algorithm, both of which are suitable for problems with small sample sizes and a larger number of variables, (ii) extend this approach to nonparametric regression models, and (iii) incorporate graphical prior information. In our BIGM, the interactions are determined by the linear model coefficients, so we systematically study the performance of different scale normal mixture priors for the model coefficients by adopting the global-local shrinkage strategy. Our results indicate that the best prior for the model coefficients in terms of variable selection should place substantial weight on small, nonzero shrinkage. The methods are illustrated with simulated and real data. Supplementary materials for this article are available online.
- Database
3
- 10.17863/cam.62808
- Apr 28, 2021
- Apollo (University of Cambridge)
In molecular biology, advances in high-throughput technologies have made it possible to study complex multivariate phenotypes and their simultaneous associations with high-dimensional genomic and other omics data, a problem that can be studied with high-dimensional multi-response regression, where the response variables are potentially highly correlated. To this purpose, we recently introduced several multivariate Bayesian variable and covariance selection models, e.g., Bayesian estimation methods for sparse seemingly unrelated regression for variable and covariance selection. Several variable selection priors have been implemented in this context, in particular the hotspot detection prior for latent variable inclusion indicators, which results in sparse variable selection for associations between predictors and multiple phenotypes. We also propose an alternative, which uses a Markov random field (MRF) prior for incorporating prior knowledge about the dependence structure of the inclusion indicators. Inference of Bayesian seemingly unrelated regression (SUR) by Markov chain Monte Carlo methods is made computationally feasible by factorisation of the covariance matrix amongst the response variables. In this paper we present BayesSUR, an R package, which allows the user to easily specify and run a range of different Bayesian SUR models, which have been implemented in C++ for computational efficiency. The R package allows the specification of the models in a modular way, where the user chooses the priors for variable selection and for covariance selection separately. We demonstrate the performance of sparse SUR models with the hotspot prior and spike-and-slab MRF prior on synthetic and real data sets representing eQTL or mQTL studies and in vitro anti-cancer drug screening studies as examples for typical applications.
- Research Article
13
- 10.18637/jss.v100.i11
- Jan 1, 2021
- Journal of Statistical Software
In molecular biology, advances in high-throughput technologies have made it possible to study complex multivariate phenotypes and their simultaneous associations with high-dimensional genomic and other omics data, a problem that can be studied with high-dimensional multi-response regression, where the response variables are potentially highly correlated. To this purpose, we recently introduced several multivariate Bayesian variable and covariance selection models, e.g., Bayesian estimation methods for sparse seemingly unrelated regression for variable and covariance selection. Several variable selection priors have been implemented in this context, in particular the hotspot detection prior for latent variable inclusion indicators, which results in sparse variable selection for associations between predictors and multiple phenotypes. We also propose an alternative, which uses a Markov random field (MRF) prior for incorporating prior knowledge about the dependence structure of the inclusion indicators. Inference of Bayesian seemingly unrelated regression (SUR) by Markov chain Monte Carlo methods is made computationally feasible by factorisation of the covariance matrix amongst the response variables. In this paper we present BayesSUR, an R package, which allows the user to easily specify and run a range of different Bayesian SUR models, which have been implemented in C++ for computational efficiency. The R package allows the specification of the models in a modular way, where the user chooses the priors for variable selection and for covariance selection separately. We demonstrate the performance of sparse SUR models with the hotspot prior and spike-and-slab MRF prior on synthetic and real data sets representing eQTL or mQTL studies and in vitro anti-cancer drug screening studies as examples for typical applications.
- Research Article
5
- 10.1080/02664763.2013.804040
- Sep 1, 2013
- Journal of Applied Statistics
The varying coefficient model (VCM) is an important generalization of the linear regression model, and many existing estimation procedures for the VCM were built on L2 loss, which is popular for its mathematical beauty but is not robust to non-normal errors and outliers. In this paper, we address both the robustness and the efficiency of estimation and variable selection for the VCM by using a convex combination of L1 and L2 losses instead of quadratic loss alone. Using local linear modeling, the asymptotic normality of the estimator is derived, and a practical method is proposed for selecting the weight of the composite L1 and L2 loss. The variable selection procedure is then obtained by combining local kernel smoothing with the adaptive group LASSO. With tuning parameters selected appropriately by the Bayesian information criterion (BIC), the theoretical properties of the new procedure, including consistency in variable selection and the oracle property in estimation, are established. The finite sample performance of the new method is investigated through simulation studies and the analysis of body fat data. Numerical studies show that the new method is at least as good as the least squares-based method in terms of both robustness and efficiency for variable selection.
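The convex combination of L1 and L2 losses mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `composite_loss` and the weight parameter `w` are hypothetical, and the abstract does not give the exact parameterization used in the paper.

```python
def composite_loss(residual: float, w: float) -> float:
    """Convex combination of absolute (L1) and squared (L2) loss.

    `w` in [0, 1] is a hypothetical weight: w = 1 gives pure L1 loss,
    w = 0 gives pure L2 loss.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * abs(residual) + (1.0 - w) * residual ** 2


# A large residual (an outlier) is penalized less as w moves toward L1.
for w in (0.0, 0.5, 1.0):
    print(w, composite_loss(10.0, w))
```

Shifting weight toward the L1 term limits the influence of outliers, while the L2 term keeps efficiency when errors are close to normal.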
- Research Article
11
- 10.1016/j.csda.2023.107808
- Jun 23, 2023
- Computational statistics & data analysis
The Bayesian regularized quantile varying coefficient model
- Research Article
9
- 10.1080/02664763.2018.1432576
- Feb 7, 2018
- Journal of Applied Statistics
In many real applications, such as econometrics, biological sciences, radio-immunoassay, finance, and medicine, the usual assumption of constant error variance may be unrealistic. Ignoring heteroscedasticity (non-constant error variance), if it is present in the data, may lead to incorrect inferences and inefficient estimation. In this paper, a simple and efficient Gibbs sampling algorithm is proposed, based on a heteroscedastic linear regression model with an penalty. Then, a Bayesian stochastic search variable selection method is proposed for subset selection. Simulations and real data examples are used to compare the performance of the proposed methods with other existing methods. The results indicate that the proposal performs well in the simulations and real data examples. R code is available upon request.
- Research Article
3
- 10.1007/s00180-014-0540-z
- Dec 4, 2014
- Computational Statistics
Selecting a small number of relevant genes for classification has received a great deal of attention in microarray data analysis. While the development of methods for microarray data with only two classes is relevant, developing more efficient algorithms for classification with any number of classes is important. In this paper, we propose a Bayesian stochastic search variable selection approach for multi-class classification, which can identify relevant genes by assessing sets of genes jointly. We consider a multinomial probit model with a generalized $g$-prior for the regression coefficients. An efficient algorithm using simulation-based MCMC methods is developed for simulating parameters from the posterior distribution. This algorithm is robust to the choice of initial value, and produces posterior probabilities of relevant genes for biological interpretation. We demonstrate the performance of the approach with well-known gene expression profiling data sets: leukemia data, lymphoma data, SRBCTs data and NCI60 data. Compared with other classification approaches, our approach selects smaller numbers of relevant genes and obtains competitive classification accuracy.
- Research Article
25
- 10.1186/s12711-014-0057-5
- Oct 1, 2014
- Genetics, Selection, Evolution : GSE
Background: The prediction accuracy of several linear genomic prediction models, which have previously been used for within-line genomic prediction, was evaluated for multi-line genomic prediction. Methods: Compared to a conventional BLUP (best linear unbiased prediction) model using pedigree data, we evaluated the following genomic prediction models: genome-enabled BLUP (GBLUP), ridge regression BLUP (RRBLUP), principal component analysis followed by ridge regression (RRPCA), BayesC and Bayesian stochastic search variable selection. Prediction accuracy was measured as the correlation between predicted breeding values and observed phenotypes divided by the square root of the heritability. The data used concerned laying hens with phenotypes for number of eggs in the first production period and known genotypes. The hens were from two closely-related brown layer lines (B1 and B2), and a third distantly-related white layer line (W1). Lines had 1004 to 1023 training animals and 238 to 240 validation animals. Training datasets consisted of animals of either single lines, or a combination of two or all three lines, and had 30 508 to 45 974 segregating single nucleotide polymorphisms. Results: Genomic prediction models yielded 0.13 to 0.16 higher accuracies than pedigree-based BLUP. When excluding the line itself from the training dataset, genomic predictions were generally inaccurate. Use of multiple lines marginally improved prediction accuracy for B2 but did not affect or slightly decreased prediction accuracy for B1 and W1. Differences between models were generally small except for RRPCA which gave considerably higher accuracies for B2. Correlations between genomic predictions from different methods were higher than 0.96 for W1 and higher than 0.88 for B1 and B2. The greater differences between methods for B1 and B2 were probably due to the lower accuracy of predictions for B1 (~0.45) and B2 (~0.40) compared to W1 (~0.76). Conclusions: Multi-line genomic prediction did not affect or slightly improved prediction accuracy for closely-related lines. For distantly-related lines, multi-line genomic prediction yielded similar or slightly lower accuracies than single-line genomic prediction. Bayesian variable selection and GBLUP generally gave similar accuracies. Overall, RRPCA yielded the greatest accuracies for two lines, suggesting that using PCA helps to alleviate the “n ≪ p” problem in genomic prediction. Electronic supplementary material: The online version of this article (doi:10.1186/s12711-014-0057-5) contains supplementary material, which is available to authorized users.
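The accuracy measure defined in this abstract (the correlation between predicted breeding values and observed phenotypes, divided by the square root of the heritability) can be written out as a short sketch. The function names and the toy numbers below are assumptions for illustration only; they are not taken from the study.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prediction_accuracy(predicted: Sequence[float],
                        observed: Sequence[float],
                        heritability: float) -> float:
    """Correlation of predictions with phenotypes, scaled by sqrt(h^2)."""
    if not 0.0 < heritability <= 1.0:
        raise ValueError("heritability must lie in (0, 1]")
    return pearson(predicted, observed) / math.sqrt(heritability)


# Toy values (hypothetical, not from the paper).
predicted = [0.2, -0.1, 0.4, 0.0, 0.3]
observed = [1.1, 0.4, 0.9, 0.7, 1.3]
print(prediction_accuracy(predicted, observed, heritability=0.4))
```

Dividing by the square root of the heritability rescales the correlation with phenotypes toward a correlation with the underlying breeding values, since phenotypes are only a noisy proxy for them.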
- Research Article
54
- 10.1186/s12711-016-0225-x
- Jun 29, 2016
- Genetics Selection Evolution
Background: Use of whole-genome sequence data is expected to increase persistency of genomic prediction across generations and breeds but affects model performance and requires increased computing time. In this study, we investigated whether the split-and-merge Bayesian stochastic search variable selection (BSSVS) model could overcome these issues. BSSVS is performed first on subsets of sequence-based variants and then on a merged dataset containing variants selected in the first step. Results: We used a dataset that included 4,154,064 variants after editing and de-regressed proofs for 3415 reference and 2138 validation bulls for somatic cell score, protein yield and interval first to last insemination. In the first step, BSSVS was performed on 106 subsets each containing ~39,189 variants. In the second step, 1060 up to 472,492 variants, selected from the first step, were included to estimate the accuracy of genomic prediction. Accuracies were at best equal to those achieved with the commonly used Bovine 50k-SNP chip, although the number of variants within a few well-known quantitative trait loci regions was considerably enriched. When variant selection and the final genomic prediction were performed on the same data, predictions were biased. Predictions computed as the average of the predictions computed for each subset achieved the highest accuracies, i.e. 0.5 to 1.1 % higher than the accuracies obtained with the 50k-SNP chip, and yielded the least biased predictions. Finally, the accuracy of genomic predictions obtained when all sequence-based variants were included was similar or up to 1.4 % lower compared to that based on the average predictions across the subsets. By applying parallelization, the split-and-merge procedure was completed in 5 days, while the standard analysis including all sequence-based variants took more than three months. Conclusions: The split-and-merge approach splits one large computational task into many much smaller ones, which allows the use of parallel processing and thus efficient genomic prediction based on whole-genome sequence data. The split-and-merge approach did not improve prediction accuracy, probably because we used data on a single breed for which relationships between individuals were high. Nevertheless, the split-and-merge approach may have potential for applications on data from multiple breeds. Electronic supplementary material: The online version of this article (doi:10.1186/s12711-016-0225-x) contains supplementary material, which is available to authorized users.
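The two-step structure of the split-and-merge procedure described in this abstract can be sketched as below. This is only a structural illustration: the `select` callback is a hypothetical stand-in (here a simple top-two ranking on made-up scores), not BSSVS, and the interleaved way of forming subsets is an assumption, since the abstract does not say how the subsets were built.

```python
from typing import Callable, List, Sequence

# A selector takes a collection of variant names and returns the kept ones.
Selector = Callable[[Sequence[str]], List[str]]


def split_and_merge(variants: Sequence[str],
                    n_subsets: int,
                    select: Selector) -> List[str]:
    """Run `select` on each subset, then again on the merged survivors."""
    # Step 1: split the variants into subsets and select within each one.
    # Each subset is independent, so this loop could run in parallel.
    subsets = [variants[i::n_subsets] for i in range(n_subsets)]
    merged: List[str] = []
    for subset in subsets:
        merged.extend(select(subset))
    # Step 2: run the selection once more on the merged set.
    return select(merged)


# Hypothetical per-variant scores, used only by the stand-in selector.
scores = {"v1": 0.9, "v2": 0.1, "v3": 0.8, "v4": 0.2, "v5": 0.7, "v6": 0.3}


def top_two(subset: Sequence[str]) -> List[str]:
    """Stand-in selector: keep the two highest-scoring variants."""
    return sorted(subset, key=lambda v: scores[v], reverse=True)[:2]


selected = split_and_merge(list(scores), n_subsets=2, select=top_two)
print(selected)
```

Because step 1 treats each subset independently, the work parallelizes naturally, which is the source of the reduction in computing time the abstract reports.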
- Book Chapter
1
- 10.1057/9780230280830_39
- Jan 1, 2010
One of the most interesting forms of nonlinear regression models is the varying coefficient model (VCM). VCMs were introduced by Hastie and Tibshirani (1993) and, unlike the linear regression model, allow the regression coefficients to vary systematically and smoothly in more than one dimension. It is worth noting the distinction between the VCM and the so-called random coefficients model, which assumes that the coefficients vary non-systematically (randomly). Versions of the VCM are encountered in the literature as functional coefficient models (see Cai, Fan and Yao, 2000) and smooth coefficient models (see Li et al, 2002).
- Research Article
21
- 10.1080/10618600.2012.680826
- Jul 1, 2012
- Journal of Computational and Graphical Statistics
In this article, we consider nonparametric smoothing and variable selection in varying-coefficient models. Varying-coefficient models are commonly used for analyzing the time-dependent effects of covariates on responses measured repeatedly (such as longitudinal data). We present the P-spline estimator in this context and show its estimation consistency for a diverging number of knots (or B-spline basis functions). The combination of P-splines with nonnegative garrote (which is a variable selection method) leads to good estimation and variable selection. Moreover, we consider APSO (additive P-spline selection operator), which combines a P-spline penalty with a regularization penalty, and show its estimation and variable selection consistency. The methods are illustrated with a simulation study and real-data examples. The proofs of the theoretical results as well as one of the real-data examples are provided in the online supplementary materials.
- Research Article
43
- 10.1214/12-ejs709
- Jan 1, 2012
- Electronic Journal of Statistics
Varying coefficient (VC) models are commonly used to study dynamic patterns in many scientific areas. In particular, VC models in quantile regression are known to provide a more complete description of the response distribution than in mean regression. In this paper, we develop a variable selection method for VC models in quantile regression using a shrinkage idea. The proposed method is based on the basis expansion of each varying coefficient and the regularization penalty on the Euclidean norm of the corresponding coefficient vector. We show that our estimator is obtained as an optimal solution to the second order cone programming (SOCP) problem and that the proposed procedure has consistency in variable selection under suitable conditions. Further, we show that the estimated relevant coefficients converge to the true functions at the univariate optimal rate. Finally, the method is illustrated with numerical simulations including the analysis of forced expiratory volume (FEV) data.
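Quantile regression, which this abstract builds on, conventionally replaces squared error with the check (pinball) loss. A minimal sketch of that standard loss follows; the abstract itself does not write it out, and the function name `check_loss` is an assumption for illustration.

```python
def check_loss(u: float, tau: float) -> float:
    """Check (pinball) loss for a residual u at quantile level tau.

    Positive residuals are weighted by tau and negative residuals
    by (1 - tau), so minimizing it targets the tau-th quantile.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    indicator = 1.0 if u < 0 else 0.0
    return u * (tau - indicator)


# At tau = 0.25, under-prediction (u > 0) costs less than over-prediction.
print(check_loss(2.0, 0.25), check_loss(-2.0, 0.25))
```

At `tau = 0.5` the loss is half the absolute error, which recovers median regression as a special case.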