Association Test Based on SNP Set: Logistic Kernel Machine Based Test vs. Principal Component Analysis

Yang Zhao, Nancy Diao, David C Christiani, Xihong Lin, Feng Chen, Rihong Zhai, Frank Emmert-Streib

doi:10.1371/journal.pone.0044978

Abstract

Genome-wide association studies (GWAS) have greatly facilitated the discovery of risk SNPs associated with complex diseases. Traditional methods analyze SNPs individually and are limited by low power and poor reproducibility because correction for multiple comparisons is necessary. Several methods have been proposed that group SNPs into SNP sets using biological knowledge and/or genomic features. In this article, we compare the logistic kernel machine based test (LKM) and the principal components analysis based approach (PCA) using simulated datasets under scenarios of 0 to 3 causal SNPs, with simple and complex linkage disequilibrium (LD) structures in the simulated regions. Our simulation study demonstrates that both LKM and PCA control the type I error at the significance level of 0.05. If the causal SNP is in strong LD with the genotyped SNPs, both PCA with a small number of principal components (PCs) and LKM with a linear or identity-by-state kernel are valid tests. However, if the LD structure is complex, for example when the SNP set contains several LD blocks or when the causal SNP is not in the LD block in which most of the genotyped SNPs reside, more PCs should be included to capture the information of the causal SNP. The simulation studies also demonstrate the ability of LKM and PCA to combine information from multiple causal SNPs and to provide increased power over individual SNP analysis. We also apply LKM and PCA to two SNP sets extracted from an actual GWAS dataset on non-small cell lung cancer.

Highlights

Rapid progress in high-throughput genotyping technology has greatly facilitated the discovery of risk single-nucleotide polymorphisms (SNPs) associated with complex disease [1,2].
Results from scenarios A2 and A3 are presented in Figure 1. Both the logistic kernel machine based test (LKM) and the principal components analysis based approach (PCA) have power when the causal SNP is in high linkage disequilibrium (LD) with the genotyped SNPs, which demonstrates their ability to "borrow" information to increase statistical power.
For PCA, we present the power obtained using the principal components (PCs) that explain at least 80%, 60% and 40% of the total variation, respectively.
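The variance-explained thresholds mentioned above can be illustrated with a minimal sketch of how the number of PCs might be chosen for a SNP set. This is not the authors' code: the function name `n_components_for_variance`, the simulated genotype matrix, and the use of a plain SVD on column-centered genotypes are all assumptions made for illustration.

```python
import numpy as np

def n_components_for_variance(genotypes, threshold):
    """Smallest number of PCs whose cumulative share of the total
    variance is at least `threshold` (a fraction between 0 and 1).

    `genotypes` is an (individuals x SNPs) array of minor-allele
    counts coded 0/1/2. Hypothetical helper, for illustration only.
    """
    X = np.asarray(genotypes, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each SNP column
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    var = s ** 2                                 # proportional to PC variances
    ratio = np.cumsum(var) / var.sum()           # cumulative variance explained
    k = int(np.searchsorted(ratio, threshold)) + 1
    return min(k, len(ratio))

# Simulated SNP set (assumption): 200 individuals, 10 independent SNPs.
rng = np.random.default_rng(0)
snp_set = rng.binomial(2, 0.3, size=(200, 10))

for t in (0.8, 0.6, 0.4):
    k = n_components_for_variance(snp_set, t)
    print(f"PCs needed to explain at least {t:.0%} of variation: {k}")
```

In a PCA-based test, the retained PCs would then replace the original SNPs as covariates in a regression of the phenotype. A lower threshold keeps fewer PCs and uses fewer degrees of freedom, at the risk of discarding the component that carries the causal signal.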

Summary

Introduction

Rapid progress in high-throughput genotyping technology has greatly facilitated the discovery of risk single-nucleotide polymorphisms (SNPs) associated with complex disease [1,2]. The population-based case-control study is one of the most commonly used designs in genome-wide association studies (GWAS), with millions of SNPs being genotyped simultaneously in more than one thousand cases and controls. A standard approach to analyzing GWAS data is to regress the phenotype on each genotyped SNP. Because of the large number of SNPs, correction for multiple comparisons is necessary. Joint tests of multiple SNPs in linkage disequilibrium (LD) may be more powerful than testing each SNP individually, and the number of tests is reduced if SNPs are tested by group rather than individually. A joint test can also examine whether a set of biologically important SNPs is associated with the phenotype.
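To make the effect of reducing the number of tests concrete, the following sketch compares per-test significance thresholds for individual SNP tests and SNP-set tests. It assumes a Bonferroni correction, which the text does not specify, and the SNP and set counts are hypothetical values chosen only for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests

# Hypothetical counts, not taken from the study.
n_snps = 1_000_000   # one test per genotyped SNP
n_sets = 20_000      # one test per SNP set (for example, one set per gene)

per_snp = bonferroni_threshold(0.05, n_snps)
per_set = bonferroni_threshold(0.05, n_sets)

print(f"per-SNP threshold: {per_snp:.1e}")
print(f"per-set threshold: {per_set:.1e}")
```

Under these assumed counts, the set-based threshold is 50 times less stringent than the per-SNP threshold, which is one reason a set-based test can have higher power.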

Methods

Results

Conclusion