Abstract

The detection of cryptic relatedness in large population-based cohorts is of great importance in genome research. The usual approach for detecting closely related individuals is to plot allele sharing statistics, based on identity-by-state or identity-by-descent, in a two-dimensional scatterplot. This approach ignores that allele sharing data across individuals have, in reality, a higher dimensionality, and it also disregards the compositional nature of the underlying counts of shared genotypes. In this paper we develop biplot methodology based on log-ratio principal component analysis that overcomes these restrictions. This leads to entirely new graphics that are especially useful for exploring relatedness in genetic databases from homogeneous populations. The proposed method can be applied in an iterative manner, acting as a looking glass for more remote relationships that are harder to classify. Datasets from the 1,000 Genomes Project and the Genomes For Life-GCAT Project are used to illustrate the proposed method. The discriminatory power of the log-ratio biplot approach is compared with that of the classical plots in a simulation study. In a non-inbred homogeneous population, the log-ratio principal component approach achieves a higher classification rate than the classical graphics across the whole allele frequency spectrum, using only identity-by-state. In these circumstances, simulations show that with 35,000 independent bi-allelic variants, log-ratio principal component analysis, combined with discriminant analysis, can correctly classify relationships up to and including the fourth degree.
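
The core idea of a log-ratio principal component analysis of shared-genotype counts can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes genotypes coded 0/1/2, six unordered genotype-pair counts per pair of individuals, a centred log-ratio (clr) transform, and a pseudo-count of 0.5 to avoid taking the logarithm of zero. The function names (`sharing_counts`, `clr`, `logratio_pca`) and the simulated unrelated genotypes are hypothetical and serve only as an example. The discriminant analysis step mentioned above is not included.

```python
import itertools
import numpy as np

# The six unordered genotype pairs for a bi-allelic variant coded 0/1/2.
PATTERNS = [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]

def sharing_counts(g1, g2):
    """Count how often each unordered genotype pair occurs across variants."""
    lo = np.minimum(g1, g2)
    hi = np.maximum(g1, g2)
    return np.array([np.sum((lo == a) & (hi == b)) for a, b in PATTERNS],
                    dtype=float)

def clr(counts, pseudo=0.5):
    """Centred log-ratio transform of each row (assumed pseudo-count for zeros)."""
    logs = np.log(counts + pseudo)
    return logs - logs.mean(axis=1, keepdims=True)

def logratio_pca(counts):
    """PCA of clr-transformed compositions via the singular value decomposition."""
    z = clr(counts)
    z = z - z.mean(axis=0)                  # column-centre before the SVD
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u * s                          # principal coordinates of the pairs
    loadings = vt.T                         # one row per genotype-pair part
    explained = s**2 / np.sum(s**2)         # proportion of variance per axis
    return scores, loadings, explained

# Simulated example: 6 unrelated individuals, 500 independent variants.
rng = np.random.default_rng(0)
maf = rng.uniform(0.1, 0.5, size=500)
geno = rng.binomial(2, maf, size=(6, 500))

pairs = list(itertools.combinations(range(6), 2))
counts = np.array([sharing_counts(geno[i], geno[j]) for i, j in pairs])
scores, loadings, explained = logratio_pca(counts)
```

Plotting the first columns of `scores` together with the rows of `loadings` would give a biplot of the pairs and the six parts. Because clr-transformed rows sum to zero, at most five components carry variance.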

Highlights

  • The detection of pairs of related individuals in genomic databases is important in many areas of genetic research

  • We investigated the effect of minor allele frequency (MAF) and number of SNPs on the classification rate of our procedure, using different numbers of principal components for classification of third through sixth degree pairs (100 of each)

  • We report the false positive rates in Table S1. No unrelated pairs (UN) or 6th degree pairs were misclassified as 4th degree or lower, and only 1.8% of the 5th degree pairs were misclassified as 4th degree

Introduction

The detection of pairs of related individuals in genomic databases is important in many areas of genetic research. In population-based gene-disease association studies, the assumption of independent observations, which is usually made in the statistical modeling of the data, may be violated by the presence of related individuals. Cryptic relatedness can lead to an increased false positive rate in association studies, in particular if related individuals are oversampled (Voight and Pritchard, 2005). In conservation genetics, unrelated individuals are carefully selected in breeding programs in order to maximize genetic diversity (Oliehoek et al., 2006). In quality control of genetic variants produced by high-throughput techniques, accidental duplication of samples can be detected by a relatedness analysis (Abecasis et al., 2001).
