Abstract

Background: Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results. Principal component analysis (PCA) is widely applied to the analysis of population structure with common variants. However, how well it performs when rare variants are used remains unclear.

Results: We derive a mathematical expectation of the genetic relationship matrix. Variance and covariance elements of the expected matrix depend explicitly on the allele frequencies of the genetic markers used in the PCA. We show that inter-population variance is contained solely in K principal components (PCs), and mostly in the largest K-1 PCs, where K is the number of populations in the sample. We propose two measures of population divergence: FPC, the ratio of the inter-population variance to the intra-population variance in the K population-informative PCs, and d², the sum of squared distances among populations. We show analytically that as allele frequencies become small, the ratio FPC declines, the population distance d² decreases, and the portion of variance explained by the K PCs diminishes. The results are validated in an analysis of the 1000 Genomes Project data. With common variants with allele frequencies between 0.4 and 0.5, the ratio FPC is 93.85, the population distance d² is 444.38, and the variance explained by the largest five PCs is 17.09%. With rare variants with frequencies between 0.0001 and 0.01, these values decrease to 1.83, 17.83 and 0.74%, respectively.

Conclusions: PCA of population stratification performs worse with rare variants than with common ones. It is necessary to restrict the selection to common variants when analyzing population stratification with sequencing data.
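The workflow described in the abstract (build a genetic relationship matrix, run PCA on it, and ask how much variance the leading PCs carry) can be sketched on simulated data. This is a minimal illustration, not the authors' implementation. The two-population Balding-Nichols simulator, the standardization by sample allele frequency, and all function names and parameter values are assumptions introduced here. The measures FPC and d² are not computed, because their exact definitions are not given in this summary.

```python
import numpy as np


def simulate_two_populations(n_per_pop, n_snps, freq_low, freq_high, fst, rng):
    """Draw 0/1/2 genotypes for two populations (hypothetical simulator).

    Assumption: a Balding-Nichols model, in which each population's allele
    frequency is Beta-distributed around a shared ancestral frequency.
    """
    p_anc = rng.uniform(freq_low, freq_high, size=n_snps)
    a = p_anc * (1.0 - fst) / fst
    b = (1.0 - p_anc) * (1.0 - fst) / fst
    blocks = []
    for _ in range(2):
        p_pop = rng.beta(a, b)  # population-specific allele frequencies
        blocks.append(rng.binomial(2, p_pop, size=(n_per_pop, n_snps)))
    return np.vstack(blocks)  # rows: individuals, columns: SNPs


def genetic_relationship_matrix(genotypes):
    """Return Z Z^T / M, with Z the genotypes standardized per SNP.

    Assumption: standardization uses the sample allele frequency p,
    i.e. z = (g - 2p) / sqrt(2p(1 - p)); monomorphic SNPs are dropped.
    """
    p = genotypes.mean(axis=0) / 2.0
    keep = (p > 0.0) & (p < 1.0)
    g, p = genotypes[:, keep], p[keep]
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return z @ z.T / z.shape[1]


def principal_components(grm):
    """Eigen-decompose a symmetric matrix, largest eigenvalue first."""
    eigvals, eigvecs = np.linalg.eigh(grm)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


def variance_fraction(eigvals, n_pcs):
    """Share of total variance carried by the n_pcs largest PCs."""
    return eigvals[:n_pcs].sum() / eigvals.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Frequency ranges mirror those quoted in the abstract; sample size,
    # SNP count and Fst are arbitrary illustrative choices.
    for label, lo, hi in [("common", 0.4, 0.5), ("rare", 0.0001, 0.01)]:
        g = simulate_two_populations(50, 2000, lo, hi, fst=0.1, rng=rng)
        eigvals, _ = principal_components(genetic_relationship_matrix(g))
        # With K = 2 populations, the largest K-1 = 1 PC is inspected.
        print(label, variance_fraction(eigvals, n_pcs=1))
```

Running the script prints the share of variance carried by the first PC for each frequency range, so the two settings can be compared directly. How closely a simulation like this tracks the 1000 Genomes results depends on the assumed model and parameters.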

Highlights

  • Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results

  • We show that inter-population variance is contained solely in K principal components (PCs), and mostly in the largest K-1 PCs, where K is the number of populations in the sample

  • We show analytically that as allele frequencies become small, the ratio FPC declines, the population distance d² decreases, and the portion of variance explained by the K PCs diminishes


Summary

Introduction

Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results. The 1000 Genomes Project data [16, 17] contain a total of 77 million biallelic SNPs, of which 65 million are rare and 52 million are polymorphic in one of the five continental ancestry populations: East Asian (EAS), South Asian (SAS), African (AFR), European (EUR) and American (AMR). It might therefore seem that rare variants are more informative than common ones in distinguishing population structure. However, the efficacy of using rare variants in population stratification analysis remains controversial [18,19,20,21].

