Abstract

We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model, the "Sum of Single Effects" (SuSiE) model, which comes from writing the sparse vector of regression coefficients as a sum of "single-effect" vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure, Iterative Bayesian Stepwise Selection (IBSS), which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a Credible Set of variables for each selection. Our methods are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate their application to fine-mapping genetic variants influencing alternative splicing in human cell lines. We also discuss the potential and the challenges of applying these methods to generic variable selection problems.
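To make the abstract's description concrete, the SuSiE model can be written as follows. This is a reconstruction from the description above (notation may differ slightly from the full text), with L the number of single effects, pi a vector of prior inclusion probabilities over the p variables, and per-effect prior variances sigma_{0l}^2:

    \begin{aligned}
    \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{e}, \qquad \mathbf{e} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_n), \\
    \mathbf{b} &= \textstyle\sum_{l=1}^{L} \mathbf{b}_l, \\
    \mathbf{b}_l &= \gamma_l \beta_l, \qquad \gamma_l \sim \mathrm{Mult}(1, \boldsymbol{\pi}), \quad \beta_l \sim N(0, \sigma_{0l}^2),
    \end{aligned}

where each gamma_l is a binary indicator vector selecting which single variable the l-th effect acts on, so each b_l is a "single-effect" vector with exactly one non-zero coordinate, and the regression coefficients are their sum.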

Highlights

  • The need to identify, or “select”, relevant variables in regression models arises in a diverse range of applications, and has spurred development of a correspondingly diverse range of methods

  • We provide a principled justification for the intuitive IBSS algorithm by showing that it optimizes a variational approximation to the posterior distribution under the "Sum of Single Effects" (SuSiE) model (a sketch of the algorithm follows this list)

  • Some non-trivial differences in posterior inclusion probability (PIP) are clearly visible in Figure 2A. Visual inspection of these differences suggests that the SuSiE PIPs may better distinguish effect variables from non-effect variables, in that the ratio of red to gray points appears higher below the diagonal than above it
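Since the abstract and highlights describe IBSS only at a high level, the following minimal Python sketch shows the kind of iteration involved: each of L single effects is refitted, in turn, by a Bayesian single-effect regression (SER) applied to the residuals computed from the other effects, yielding a distribution alpha[l] over which variable to select rather than a single pick. The fixed hyperparameters sigma_sq (residual variance) and sigma0_sq (prior effect variance), the uniform prior over variables, and the fixed iteration count are simplifying assumptions here; the actual method also estimates hyperparameters and assesses convergence, so treat this as an illustration rather than the authors' implementation.

    import numpy as np

    def single_effect_regression(X, r, sigma_sq, sigma0_sq):
        """Bayesian SER: posterior over which single variable has a
        (normally distributed) effect on the residual vector r."""
        xtx = np.sum(X * X, axis=0)           # x_j' x_j for each variable j
        xty = X.T @ r                         # x_j' r for each variable j
        betahat = xty / xtx                   # per-variable least-squares estimates
        s_sq = sigma_sq / xtx                 # their sampling variances
        # log Bayes factor for each variable under a N(0, sigma0_sq) effect prior
        lbf = 0.5 * np.log(s_sq / (s_sq + sigma0_sq)) \
            + 0.5 * betahat**2 / s_sq * (sigma0_sq / (sigma0_sq + s_sq))
        alpha = np.exp(lbf - lbf.max())
        alpha /= alpha.sum()                  # inclusion probabilities (uniform prior pi)
        post_var = 1.0 / (1.0 / sigma0_sq + xtx / sigma_sq)  # posterior var given inclusion
        post_mean = post_var * xty / sigma_sq                # posterior mean given inclusion
        return alpha, post_mean

    def ibss(X, y, L=5, sigma_sq=1.0, sigma0_sq=1.0, n_iter=100):
        """Iterative Bayesian stepwise selection: cycle through L single
        effects, refitting each SER on the residuals left by the others."""
        p = X.shape[1]
        alpha = np.full((L, p), 1.0 / p)
        mu = np.zeros((L, p))
        for _ in range(n_iter):
            for l in range(L):
                b_others = (alpha * mu).sum(axis=0) - alpha[l] * mu[l]
                r = y - X @ b_others          # residuals excluding effect l
                alpha[l], mu[l] = single_effect_regression(X, r, sigma_sq, sigma0_sq)
        return alpha, mu

Each row alpha[l] captures the uncertainty in the l-th selection, and supports a Credible Set as sketched in the Introduction below.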



INTRODUCTION

The need to identify, or “select”, relevant variables in regression models arises in a diverse range of applications, and has spurred development of a correspondingly diverse range of methods (e.g., see O’Hara and Sillanpää, 2009; Fan and Lv, 2010; Desboulets, 2018; George and McCulloch, 1997, for reviews). Our focus is on quantifying uncertainty in which variables should be selected, particularly when variables are highly correlated. This requires methods that can draw conclusions such as “either x1 or x2 is relevant and we cannot decide which” rather than methods that arbitrarily select one of the variables and ignore the other. While this may seem a simple goal, in practice most existing variable selection methods do not satisfactorily address this problem (see Section 2 for further discussion). A key feature of our method, which distinguishes it from most existing Bayesian variable selection regression (BVSR) methods, is that it produces “Credible Sets” of variables that quantify uncertainty in which variable should be selected when multiple, highly correlated variables compete with one another. These Credible Sets are designed to be as small as possible while still each capturing a relevant variable. We end with a discussion highlighting avenues for further work.
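The notion of a Credible Set has a simple greedy construction for a single effect: rank variables by their posterior inclusion probability and keep adding them until the cumulative probability reaches the target level rho, which yields the smallest such set. The Python sketch below (with a hypothetical helper name, credible_set) illustrates this construction; it is not the authors' code, and the full text describes additional refinements applied to each of the L single effects.

    import numpy as np

    def credible_set(alpha, rho=0.95):
        """Smallest set of variables whose posterior inclusion
        probabilities (for one single effect) sum to at least rho."""
        order = np.argsort(alpha)[::-1]       # variables, most probable first
        cumprob = np.cumsum(alpha[order])
        k = int(np.searchsorted(cumprob, rho)) + 1
        return order[:k]

    # Example: two highly correlated variables share most of the probability,
    # so the 95% credible set contains both rather than arbitrarily picking one.
    alpha = np.array([0.55, 0.42, 0.02, 0.01])
    print(credible_set(alpha, rho=0.95))      # -> [0 1]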

A motivating toy example
Credible Sets
The single effect regression model
Posterior under SER model
Empirical Bayes for SER model
THE SUM OF SINGLE EFFECTS REGRESSION MODEL
Fitting SuSiE
IBSS computes a variational approximation to the SuSiE posterior distribution
Contrast to previous variational approximations
Posterior inference: posterior inclusion probabilities and Credible Sets
Choice of L
Identifiability and label-switching
NUMERICAL COMPARISONS
Illustrative example
Posterior inclusion probabilities
IBSS after 10 iterations
APPLICATION TO FINE-MAPPING SPLICING QTLS
AN EXAMPLE BEYOND FINE-MAPPING
DISCUSSION
DATA AND RESOURCES
Bayesian simple linear regression
Computing Credible Sets
Estimating hyperparameters
Empirical Bayes as a single optimization problem
Variational approximation
The additive effects model
Special case of SuSiE model
Proof of Corollary 1
Proof of Proposition 2
Computing the evidence lower bound
C. CONNECTING SUSIE TO STANDARD BVSR
Simulated data
Software and hardware specifications for numerical comparisons study
Findings
E. FUNCTIONAL ENRICHMENT OF SPLICE QTL FINE MAPPING