Abstract

Population stratification that is not accounted for can produce false-positive findings and can mask true association signals when identifying disease-related genetic variants. Principal component analysis (PCA) is widely used to adjust for population stratification because it is computationally simple. However, this approach has two limitations. First, genotype data are generally coded as 0, 1, or 2, the number of minor alleles, so they are more reasonably treated as categorical data. Because PCA is inherently suited only to continuous variables, applying it directly to genotype data is not appropriate. Second, although common variants have been studied extensively, little is known about the stratification of rare variants and its impact on association tests. Over the last decade, genome-wide association studies have shifted toward low-frequency (minor allele frequency [MAF] between 0.01 and 0.05) and rare (MAF less than 0.01) variants, which are now widely regarded as determinants of complex traits. Because rare variants are not stratified in the same way as common variants, statistical methods are needed that can capture stratification patterns in low-frequency and rare variants. To address these limitations, we investigate how well generalized PCA and similarity-matrix-based PCA methods detect underlying structure for rare and common variants. Using simulated and real datasets, we demonstrate that a special case of generalized PCA, logistic PCA, adjusts for population stratification in rare variants much more effectively than standard PCA, while the two methods perform comparably for common variants.
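As a minimal illustration of the variant classes defined above, the following Python sketch computes the MAF of a single variant from 0/1/2 genotype codes and assigns it to a frequency class. The function names, the treatment of the boundary values 0.01 and 0.05, and the label "common" for MAF of 0.05 or more are assumptions made for this example; they are not taken from the study.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF for one variant.

    genotypes: list of allele counts (0, 1, or 2), one per individual.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("need at least one genotype")
    # Each individual carries two alleles, so there are 2 * n alleles in total.
    freq = sum(genotypes) / (2 * n)
    # Fold the frequency so that the rarer allele is always the one reported.
    return min(freq, 1 - freq)


def classify_variant(maf):
    """Assign a frequency class using the thresholds quoted in the abstract.

    Assumption: boundaries are half-open, i.e. rare is MAF < 0.01 and
    low-frequency is 0.01 <= MAF < 0.05; everything else is "common".
    """
    if maf < 0.01:
        return "rare"
    if maf < 0.05:
        return "low-frequency"
    return "common"


# Hypothetical variants observed in 100 individuals.
rare_variant = [1] + [0] * 99              # one heterozygous carrier
low_freq_variant = [1] * 6 + [0] * 94      # six heterozygous carriers
common_variant = [2] * 5 + [1] * 30 + [0] * 65

for g in (rare_variant, low_freq_variant, common_variant):
    print(classify_variant(minor_allele_frequency(g)))
```

In practice, a split of this kind would be applied to every variant before comparing how each stratification method behaves on the rare and common subsets.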
