A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.

Junyang Qian, Wenfei Du, Trevor Hastie, Yosuke Tanigawa, Manuel A Rivas, Chris Chang, Robert Tibshirani, Matthew Aguirre, Xiaofeng Zhu

doi:10.1371/journal.pgen.1009141


Open Access

https://doi.org/10.1371/journal.pgen.1009141


Journal: PLoS Genetics
Publication Date: Oct 23, 2020
Citations: 84
License type: CC BY 4.0

Affiliation: Stanford University, Grail (United States)

Abstract

The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension of the UK Biobank data pose new challenges for applying the lasso, as many existing algorithms and their implementations do not scale to such large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data larger than memory. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
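The batch screening idea can be illustrated with a small, self-contained sketch. It is not the snpnet implementation: it assumes a single fixed penalty λ rather than a path, screens by gradient magnitude |x_j^T r|/n, and verifies the fit with a KKT check on the excluded features. A hand-written coordinate descent routine stands in for the "existing lasso solver" (glmnet in the paper), the in-memory lists stand in for data that would be read from disk in batches, and all function names and parameters (`lasso_cd`, `batch_screening_lasso`, `batch_size`, `kkt_tol`) are hypothetical.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lasso_cd(cols, y, lam, n_sweeps=500, tol=1e-10):
    """Minimize (1/(2n))*||y - X b||^2 + lam*||b||_1 by coordinate descent.
    `cols` is a list of feature columns; this is the stand-in solver."""
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)
    norms = [dot(c, c) / n for c in cols]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j, c in enumerate(cols):
            if norms[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual, scaled by 1/n.
            rho = dot(c, resid) / n + norms[j] * beta[j]
            new = soft_threshold(rho, lam) / norms[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * x for r, x in zip(resid, c)]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta, resid

def batch_screening_lasso(cols, y, lam, batch_size=5, kkt_tol=1e-6, max_iter=50):
    """Simplified screen / fit / check loop for one value of lam (hypothetical)."""
    n, p = len(y), len(cols)
    active = []                      # indices handed to the solver so far
    beta_sub, resid = [], list(y)
    for _ in range(max_iter):
        # Checking and screening share one pass over all features.
        grad = [abs(dot(c, resid)) / n for c in cols]
        in_set = set(active)
        violators = [j for j in range(p)
                     if j not in in_set and grad[j] > lam + kkt_tol]
        if not violators:
            break                    # KKT holds for every excluded feature
        violators.sort(key=lambda j: grad[j], reverse=True)
        active.extend(violators[:batch_size])
        # Fitting: solve the lasso restricted to the screened features.
        beta_sub, resid = lasso_cd([cols[j] for j in active], y, lam)
    beta = [0.0] * p
    for j, b in zip(active, beta_sub):
        beta[j] = b
    return beta, active

# Synthetic demo: 3 true signals among 30 features, 100 samples.
random.seed(0)
n, p = 100, 30
cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2 * cols[0][i] - 1.5 * cols[1][i] + cols[2][i] + 0.1 * random.gauss(0, 1)
     for i in range(n)]

beta, active = batch_screening_lasso(cols, y, lam=0.1)
beta_full, _ = lasso_cd(cols, y, lam=0.1)
print("screened features:", sorted(active))
print("max |batch - full|:", max(abs(a - b) for a, b in zip(beta, beta_full)))
```

In this sketch the solver only ever sees the screened columns, while the full feature set is touched once per iteration to compute gradients. That split is what would allow the full data to stay on disk, under the assumptions stated above.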

Highlights

The past two decades have witnessed rapid growth in the amount of data available to us.
With the advent and evolution of large-scale, comprehensive biobanks, unprecedented opportunities have arisen for researchers to further uncover the complex landscape of human genetics.
We look at two quantitative traits, standing height and body mass index (BMI), each defined as the median of up to 3 non-NA measurements [25], and two qualitative traits, asthma and high cholesterol (HC) [22].

Summary

Introduction

The past two decades have witnessed rapid growth in the amount of data available to us. Many areas such as genomics, neuroscience, economics and Internet services are producing big datasets that have high dimension, large sample size, or both. In high-dimensional regression problems, we have a large number of predictors, and it is likely that only a subset of them have a relationship with the response and will be useful for prediction. Identifying such a subset is desirable both for scientific interest and for the ability to predict future outcomes. Given a continuous response y ∈ R^n and a model matrix X ∈ R^(n×p), the lasso solves a regularized regression problem.
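For reference, the lasso problem in its standard form can be written as below. The 1/(2n) scaling of the squared-error loss is one common convention and is assumed here rather than taken from the text above.

```latex
\hat{\beta}(\lambda)
  = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p}}
    \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2
    + \lambda \lVert \beta \rVert_1
```

Here λ ≥ 0 is the regularization parameter. Larger values of λ shrink the coefficients more strongly and set more of them exactly to zero, which is what makes the lasso a variable selection method as well as an estimator.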

Objectives

Methods

Results

Discussion

Conclusion