Abstract
We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model - the "Sum of Single Effects" (SuSiE) model - which comes from writing the sparse vector of regression coefficients as a sum of "single-effect" vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure - Iterative Bayesian Stepwise Selection (IBSS) - which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a Credible Set of variables for each selection. Our methods are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate their application to fine-mapping genetic variants influencing alternative splicing in human cell lines. We also discuss the potential and challenges for applying these methods to generic variable selection problems.
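The abstract's description of IBSS can be illustrated with a short sketch. The Python code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the columns of X and the response y are centered, that the residual variance `sigma2` and the prior effect variance `sigma0_2` are known and fixed, and that the prior over which variable carries each effect is uniform. The function names `single_effect_regression` and `ibss` are hypothetical. Each of the L single effects is refit in turn to the residuals left by the others, and each refit returns a probability vector over variables rather than a single selected variable.

```python
import numpy as np

def single_effect_regression(X, r, sigma2, sigma0_2):
    """Fit a regression with exactly one non-zero effect to residuals r.

    Returns alpha (posterior probability that each variable is the effect
    variable) and post_mean (posterior mean effect given that variable).
    Assumes a uniform prior over variables and a N(0, sigma0_2) effect prior.
    """
    xtx = np.sum(X * X, axis=0)            # x_j^T x_j for every column j
    betahat = X.T @ r / xtx                # per-variable least-squares estimate
    s2 = sigma2 / xtx                      # sampling variance of each estimate
    post_var = 1.0 / (1.0 / s2 + 1.0 / sigma0_2)
    post_mean = post_var / s2 * betahat
    # Log Bayes factor of "variable j has an effect" against the null model.
    log_bf = (0.5 * np.log(s2 / (s2 + sigma0_2))
              + 0.5 * betahat ** 2 / s2 * sigma0_2 / (s2 + sigma0_2))
    w = np.exp(log_bf - log_bf.max())      # subtract the max for stability
    alpha = w / w.sum()
    return alpha, post_mean

def ibss(X, y, L=5, sigma2=1.0, sigma0_2=1.0, n_iter=50):
    """Iteratively refit L single effects, each on the others' residuals."""
    n, p = X.shape
    alpha = np.full((L, p), 1.0 / p)       # start from uniform distributions
    mu = np.zeros((L, p))
    for _ in range(n_iter):                # fixed sweep count, for simplicity
        for l in range(L):
            # Posterior mean coefficients from all effects except effect l.
            b_other = (alpha * mu).sum(axis=0) - alpha[l] * mu[l]
            r = y - X @ b_other
            alpha[l], mu[l] = single_effect_regression(X, r, sigma2, sigma0_2)
    return alpha, mu
```

Under these assumptions, a posterior inclusion probability for variable j can be summarized as `1 - np.prod(1 - alpha[:, j])`. A fuller implementation would typically also estimate the variance parameters and monitor convergence instead of running a fixed number of sweeps.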
Highlights
The need to identify, or “select”, relevant variables in regression models arises in a diverse range of applications, and has spurred development of a correspondingly diverse range of methods.
We provide a principled justification for this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the “Sum of Single Effects” (SuSiE) model.
Some non-trivial differences in posterior inclusion probability (PIP) are clearly visible in Figure 2A. Visual inspection of these differences suggests that the SuSiE PIPs may better distinguish effect variables from non-effect variables, in that the ratio of red to gray points appears higher below the diagonal than above it.
Summary
The need to identify, or “select”, relevant variables in regression models arises in a diverse range of applications, and has spurred development of a correspondingly diverse range of methods (e.g., see O’Hara and Sillanpää, 2009; Fan and Lv, 2010; Desboulets, 2018; George and McCulloch, 1997, for reviews). Quantifying uncertainty in which variables should be selected requires methods that can draw conclusions such as “either x1 or x2 is relevant and we cannot decide which”, rather than methods that arbitrarily select one of the variables and ignore the other. While this may seem a simple goal, in practice most existing variable selection methods do not satisfactorily address this problem (see Section 2 for further discussion). A key feature of our method, which distinguishes it from most existing methods for Bayesian variable selection in regression (BVSR), is that it produces “Credible Sets” of variables that quantify uncertainty in which variable should be selected when multiple, highly correlated variables compete with one another. These Credible Sets are designed to be as small as possible while still each capturing a relevant variable. We end with a discussion highlighting avenues for further work.
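To make the idea of a Credible Set concrete, the sketch below builds one from a single vector of selection probabilities. It is an illustration under assumptions rather than the authors' code: `alpha` is assumed to be a probability vector over variables for one effect, the function name `credible_set` and the default coverage of 0.95 are hypothetical choices, and the set is formed greedily by adding variables in order of decreasing probability until the requested coverage is reached.

```python
import numpy as np

def credible_set(alpha, coverage=0.95):
    """Return indices of a smallest set of variables whose total probability
    under alpha is at least `coverage` (greedy, by decreasing probability)."""
    alpha = np.asarray(alpha, dtype=float)
    order = np.argsort(-alpha, kind="stable")   # most probable variables first
    cum = np.cumsum(alpha[order])
    # First position where the running total reaches the requested coverage.
    k = int(np.searchsorted(cum, coverage)) + 1
    k = min(k, alpha.size)                      # guard against rounding error
    return order[:k]

# Two highly correlated variables share the probability mass, so the set
# reports both of them instead of arbitrarily picking one.
alpha = np.array([0.01, 0.48, 0.49, 0.02])
cs = credible_set(alpha, coverage=0.95)
```

In this toy example the set contains the two variables with probabilities 0.48 and 0.49, which corresponds to a conclusion of the form “either x1 or x2 is relevant and we cannot decide which”.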
More From: Journal of the Royal Statistical Society Series B: Statistical Methodology