Abstract

The investigation of associations between rare genetic variants and diseases or phenotypes has two goals: first, identifying which genes or genomic regions are associated, and second, discriminating associated variants from background noise within each region. Over the last few years, many new methods have been developed that associate genomic regions with phenotypes. However, classical methods for high-dimensional data have received little attention. Here we investigate whether several classical statistical methods for high-dimensional data, namely ridge regression (RR), principal components regression (PCR), partial least squares regression (PLS), a sparse version of PLS (SPLS), and the LASSO, are able to detect associations with rare genetic variants. These approaches have been extensively used in statistics to identify the true associations in data sets containing many predictor variables. Using genetic variants identified in three genes that were Sanger sequenced in 1998 individuals, we simulated continuous phenotypes under several different models, and we show that these feature selection and feature extraction methods can substantially outperform several popular methods for rare variant analysis. Furthermore, these approaches can identify which variants contribute most to the model fit, so both goals of rare variant analysis can be achieved simultaneously with regression regularization methods. These methods are briefly illustrated with an analysis of adiponectin levels and variants in the ADIPOQ gene.

Highlights

  • New methods for the analysis of rare genetic variants are appearing rapidly

  • We explore whether several classic approaches for feature selection or extraction (ridge regression (RR) [22], LASSO [23], principal components regression (PCR) [24], partial least squares (PLS) regression [25,26], or sparse partial least squares regression (SPLS) [27]) can effectively identify associations between a genetic region and a continuous trait

  • Commonly-used methods for rare variants often pool rare alleles and fit simple regression models relating the phenotype to rare allele counts


Summary

Introduction

New methods for the analysis of rare genetic variants are appearing rapidly. Resequencing efforts are identifying numerous new variants, but the majority of these are seen in only a very small number of individuals [1]. How best to model the relationship between a phenotype and a large set of rare genetic variants is a problem of variable selection (or feature selection) and/or dimension reduction (or feature extraction) in a sparse covariate space. We explore whether several classic approaches for feature selection or extraction (ridge regression (RR) [22], LASSO [23], principal components regression (PCR) [24], partial least squares (PLS) regression [25,26], or sparse PLS (SPLS) [27]) can effectively identify associations between a genetic region and a continuous trait. Using genetic variants identified by Sanger sequencing of three genes in 1998 individuals, we simulated phenotypes under a range of models and compared the ability of these regression regularization methods to identify the causal variants. One additional advantage of feature selection methods is that they not only identify associations but can also point towards which variants are likely to be the truly associated ones.
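
To make the setting concrete, the following is a minimal sketch of the kind of experiment described above: sparse rare-variant genotypes, a simulated continuous phenotype, and a ridge regression fit whose coefficients are used to rank variants. It is an illustration under stated assumptions, not the authors' simulation design. The sample size of 1998 comes from the text; the number of variants, the allele frequency range, the effect sizes, the penalty values, and the function names `simulate_rare_variants` and `ridge_fit` are all hypothetical.

```python
import numpy as np


def simulate_rare_variants(n_individuals, n_variants, rng,
                           maf_low=0.001, maf_high=0.01):
    """Draw additive genotype counts (0, 1 or 2) with low minor-allele frequencies.

    The frequency range is an assumption chosen only to make variants rare.
    """
    maf = rng.uniform(maf_low, maf_high, size=n_variants)
    genotypes = rng.binomial(2, maf, size=(n_individuals, n_variants))
    return genotypes.astype(float)


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate on centered data: (X'X + lam*I)^-1 X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_variants = Xc.shape[1]
    gram = Xc.T @ Xc + lam * np.eye(n_variants)
    return np.linalg.solve(gram, Xc.T @ yc)


rng = np.random.default_rng(0)
n, p = 1998, 60                      # 1998 individuals as in the text; 60 variants is hypothetical
X = simulate_rare_variants(n, p, rng)

causal = np.arange(5)                # hypothetical: the first five variants are causal
beta_true = np.zeros(p)
beta_true[causal] = 1.0              # hypothetical effect size
y = X @ beta_true + rng.normal(size=n)

beta_weak = ridge_fit(X, y, lam=1.0)       # light shrinkage
beta_strong = ridge_fit(X, y, lam=100.0)   # heavy shrinkage

# Rank variants by absolute coefficient, largest first.
ranking = np.argsort(-np.abs(beta_weak))
print("Top-ranked variant indices:", ranking[:10])
```

The penalty keeps the linear system solvable even when a variant is monomorphic in the sample, which is common with rare alleles. Ranking by absolute coefficient is one simple way to point towards candidate variants; the other methods named in the text (LASSO, PCR, PLS, SPLS) would replace `ridge_fit` with a different estimator.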

