Pedigree reconstruction from SNP data: parentage assignment, sibship clustering and beyond.

Jisca Huisman

doi:10.1111/1755-0998.12665

Abstract

Data on hundreds or thousands of single nucleotide polymorphisms (SNPs) provide detailed information about the relationships between individuals, but currently few tools can turn this information into a multigenerational pedigree. I present the R package sequoia, which assigns parents, clusters half‐siblings sharing an unsampled parent and assigns grandparents to half‐sibships. Assignments are made after consideration of the likelihoods of all possible first‐, second‐ and third‐degree relationships between the focal individuals, as well as the traditional alternative of being unrelated. This careful exploration of the local likelihood surface is implemented in a fast, heuristic hill‐climbing algorithm. Distinction between the various categories of second‐degree relatives is possible when likelihoods are calculated conditional on at least one parent of each focal individual. Performance was tested on simulated data sets with realistic genotyping error rate and missingness, based on three different large pedigrees (N = 1000–2000). This included a complex pedigree with overlapping generations, occasional close inbreeding and some unknown birth years. Parentage assignment was highly accurate down to about 100 independent SNPs (error rate <0.1%) and fast (<1 min), as most pairs can be excluded from being parent–offspring based on opposite homozygosity. For full pedigree reconstruction, 40% of parents were assumed nongenotyped. Reconstruction resulted in low error rates (<0.3%) and high assignment rates (>99%) in limited computation time (typically <1 h) when at least 200 independent SNPs were used. In three empirical data sets, relatedness estimated from the inferred pedigree was strongly correlated to genomic relatedness.
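
The exclusion of candidate parent–offspring pairs by opposite homozygosity can be illustrated with a short Python sketch. It is not the code used in sequoia. It assumes genotypes coded as 0, 1 or 2 copies of a reference allele, a missing-value code of -9, and a hypothetical tolerance `max_mismatch` that allows for a few genotyping errors; all function and parameter names are illustrative.

```python
MISSING = -9  # assumed code for a missing genotype call


def opposite_homozygous_count(geno_a, geno_b):
    """Count SNPs at which one individual is 0 and the other is 2.

    Genotypes are assumed to be coded as the number of copies (0, 1, 2)
    of a reference allele; SNPs missing in either individual are skipped.
    """
    count = 0
    for a, b in zip(geno_a, geno_b):
        if a == MISSING or b == MISSING:
            continue
        if (a == 0 and b == 2) or (a == 2 and b == 0):
            count += 1
    return count


def could_be_parent_offspring(geno_a, geno_b, max_mismatch=2):
    """Keep a pair as a parent-offspring candidate only if the number of
    opposite-homozygous SNPs does not exceed the (hypothetical) tolerance."""
    return opposite_homozygous_count(geno_a, geno_b) <= max_mismatch


if __name__ == "__main__":
    ind_1 = [0, 2, 1, 0, MISSING, 2]
    ind_2 = [2, 0, 1, 0, 2, 2]
    print(opposite_homozygous_count(ind_1, ind_2))
    print(could_be_parent_offspring(ind_1, ind_2, max_mismatch=1))
```

A true parent and offspring share one allele at every locus, so without genotyping errors they can never be opposite homozygotes. Counting such mismatches needs only integer comparisons, which is why a filter of this kind can discard most pairs before any likelihood is calculated.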

Highlights

Pedigrees have many uses in a wide variety of fields, ranging from animal breeding and human genealogy to wildlife genetics and ethology
Simulated distributions of ΛPO/∨ showed a clearer divide between true PO pairs and non-PO pairs than did ΛPO/U (Fig. 5, left panels)
As for any software, performance in real data sets will be somewhat lower, but results in three empirical data sets are favourable compared to existing pedigrees and parentage assignment only

Summary

Introduction

Pedigrees have many uses in a wide variety of fields, ranging from animal breeding and human genealogy to wildlife genetics and ethology. A plethora of methods have been developed to reconstruct pedigrees based on a dozen or so multi-allelic microsatellites (see Jones et al. (2010) for an overview). The lower information content per SNP necessitates a large number of markers to obtain the same accuracy as with a dozen microsatellites. This puts a considerable strain on machinery intended to deal with a variable number of alleles per marker, while the binary nature of typical SNPs allows some computational shortcuts to be taken. Dealing with genotyping errors and missing data requires summation of probabilities over all possible actual genotypes.
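
The summation over all possible actual genotypes can be sketched in Python for a single biallelic SNP. The sketch is not the model used in any particular package and rests on simplifying assumptions: genotypes are coded 0, 1 or 2, a missing call is coded -9, an observed genotype differs from the actual one with probability `err` (spread evenly over the two other genotypes), and actual genotypes follow Hardy-Weinberg proportions with allele frequency `q`. All function names are hypothetical.

```python
MISSING = -9  # assumed code for a missing genotype call


def hwe_prior(q):
    """P(actual genotype) for 0, 1, 2 allele copies under Hardy-Weinberg."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]


def obs_given_actual(obs, actual, err):
    """P(observed | actual) under a simple symmetric error model (assumption).
    A missing observation carries no information, so it contributes 1."""
    if obs == MISSING:
        return 1.0
    return 1 - err if obs == actual else err / 2


def offspring_given_parent(g_parent, q):
    """P(actual offspring genotype | actual parent genotype); the other
    allele is drawn from the population with frequency q."""
    t = g_parent / 2  # probability that the parent transmits the allele
    return [(1 - t) * (1 - q), t * (1 - q) + (1 - t) * q, t * q]


def likelihood_parent_offspring(obs_parent, obs_offspring, q, err):
    """P(both observed genotypes | parent-offspring), summed over all
    combinations of actual genotypes."""
    prior = hwe_prior(q)
    total = 0.0
    for gp in range(3):
        trans = offspring_given_parent(gp, q)
        for go in range(3):
            total += (prior[gp] * obs_given_actual(obs_parent, gp, err)
                      * trans[go] * obs_given_actual(obs_offspring, go, err))
    return total


def likelihood_unrelated(obs_a, obs_b, q, err):
    """P(both observed genotypes | unrelated): product of two marginals."""
    prior = hwe_prior(q)
    p_a = sum(prior[g] * obs_given_actual(obs_a, g, err) for g in range(3))
    p_b = sum(prior[g] * obs_given_actual(obs_b, g, err) for g in range(3))
    return p_a * p_b


if __name__ == "__main__":
    # Opposite homozygotes: impossible for a parent-offspring pair without
    # genotyping error, but given a small non-zero likelihood with it.
    print(likelihood_parent_offspring(0, 2, q=0.5, err=0.0))
    print(likelihood_parent_offspring(0, 2, q=0.5, err=0.01))
    print(likelihood_unrelated(0, 2, q=0.5, err=0.01))
```

Because a SNP has only three possible genotypes, each double sum has just nine terms. With multi-allelic markers the number of possible actual genotypes grows quadratically with the number of alleles, so the same summation becomes far more costly.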

Methods

Results

Conclusion