Abstract

Single-locus analysis is often used to analyze genome-wide association (GWA) data, but such analysis is subject to severe multiple comparisons adjustment. Multivariate logistic regression has been proposed to fit a multi-locus model for case-control data. However, when the sample size is much smaller than the number of single-nucleotide polymorphisms (SNPs) or when correlation among SNPs is high, traditional multivariate logistic regression breaks down. To accommodate the scale of data from a GWA while controlling for collinearity and overfitting in a high-dimensional predictor space, we propose a variable selection procedure using Bayesian logistic regression. We explored a connection between Bayesian regression with certain priors and L1 and L2 penalized logistic regression. After analyzing a large number of SNPs simultaneously in a Bayesian regression, we selected important SNPs for further consideration. With far fewer SNPs of interest, problems of multiple comparisons and collinearity are less severe. We conducted simulation studies to examine the probability of correctly selecting disease-contributing SNPs and applied the developed methods to analyze the Genetic Analysis Workshop 16 North American Rheumatoid Arthritis Consortium data.

Highlights

  • Single-locus analysis is a widely used approach to analyze genome-wide association (GWA) data, but it may not be adequate to capture the complex pattern of disease etiology [1] and is subject to severe multiple comparisons adjustment, especially in a GWA, in which the typical number of comparisons made is in the hundreds of thousands

  • A challenge of applying multi-SNP logistic regression to GWA data is that the sample size is usually much smaller than the number of single-nucleotide polymorphisms (SNPs)

  • Traditional multivariate logistic regression breaks down in this case. Another disadvantage of such an approach is that when the correlation between SNPs is high due to linkage disequilibrium (LD), the estimated coefficients are highly variable and the method performs poorly

Introduction

Single-locus analysis is a widely used approach to analyze genome-wide association (GWA) data, but it may not be adequate to capture the complex pattern of disease etiology [1] and is subject to severe multiple comparisons adjustment, especially in a GWA, in which the typical number of comparisons made is in the hundreds of thousands. Methods to handle a large number of single-nucleotide polymorphisms (SNPs) simultaneously are in demand. Logistic regression is a popular tool to assess association between a dichotomous trait and SNP genotypes. To analyze multiple SNPs simultaneously by logistic regression, one can include all SNPs of interest as predictors. A challenge of applying such approaches to GWA data is that the sample size is usually much smaller than the number of SNPs. Traditional multivariate logistic regression breaks down in this case. Another disadvantage of such an approach is that when the correlation between SNPs is high due to linkage disequilibrium (LD), the estimated coefficients are highly variable and the method performs poorly.
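
The connection between Bayesian regression and penalized logistic regression mentioned in the abstract can be sketched with the standard textbook correspondence. This is an illustration under stated assumptions, not necessarily the priors used in the paper: the coefficients are given independent zero-mean Gaussian or Laplace priors, the intercept is omitted for brevity, and the symbols $y_i$, $x_i$, $\beta$, $\sigma^2$, and $b$ are notation introduced here. Under those assumptions, the posterior mode coincides with an L2- or L1-penalized maximum-likelihood estimate:

```latex
% Logistic log-likelihood for n subjects with case-control status y_i in {0,1}
% and a vector x_i of p SNP genotype codes (notation assumed for illustration).
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \, x_i^{\top}\beta
              - \log\!\left(1 + e^{x_i^{\top}\beta}\right) \right]

% Assumed Gaussian prior: \beta_j \sim N(0, \sigma^2), independently for j = 1..p.
\log p(\beta \mid y) = \ell(\beta) - \frac{1}{2\sigma^{2}} \sum_{j=1}^{p} \beta_j^{2}
                       + \text{const}
% Maximizing this is L2-penalized (ridge) logistic regression
% with penalty weight \lambda = 1 / (2\sigma^2).

% Assumed Laplace prior: p(\beta_j) = \frac{1}{2b} e^{-|\beta_j| / b}, independently.
\log p(\beta \mid y) = \ell(\beta) - \frac{1}{b} \sum_{j=1}^{p} |\beta_j|
                       + \text{const}
% Maximizing this is L1-penalized (lasso) logistic regression
% with penalty weight \lambda = 1 / b.
```

In both cases a tighter prior (smaller $\sigma^2$ or $b$) corresponds to a larger penalty, which shrinks the coefficients more strongly. This shrinkage keeps the estimates stable when the number of SNPs exceeds the sample size or when SNPs are highly correlated, and the L1 form can additionally set some coefficients exactly to zero.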
