RAFFI: Accurate and fast familial relationship inference in large scale biobank studies using RaPID.

Ardalan Naseri, Xihong Lin, Shaojie Zhang, Junjie Shi, Degui Zhi

doi:10.1371/journal.pgen.1009315

Abstract

Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficients (ϕ) and the genome-wide probability of zero IBD sharing (π0) among all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI leverages the efficient RaPID method to call IBD segments first, then estimates ϕ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admixture events, and varying marker densities, and achieves higher accuracy than KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.
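To make the two quantities concrete, the following is a minimal sketch of how ϕ and π0 can be computed from IBD segments using the textbook relations ϕ = k1/4 + k2/2 and π0 = 1 - k1 - k2, where k1 and k2 are the genome fractions shared IBD1 and IBD2. It is not RAFFI's estimator: it omits the data-driven adjustment for phasing and genotyping quality described above. The segment format, the function names, and the genome length used in the example are assumptions made for illustration; the degree cutoffs are the kinship thresholds commonly used with KING.

```python
from typing import List, Tuple

# Hypothetical segment format: (start_cM, end_cM, ibd_state), with ibd_state 1 or 2.
Segment = Tuple[float, float, int]


def estimate_phi_pi0(segments: List[Segment], genome_length_cm: float) -> Tuple[float, float]:
    """Return (phi, pi0) for one pair from non-overlapping IBD segments."""
    ibd1 = sum(end - start for start, end, state in segments if state == 1)
    ibd2 = sum(end - start for start, end, state in segments if state == 2)
    k1 = ibd1 / genome_length_cm  # fraction of the genome shared IBD1
    k2 = ibd2 / genome_length_cm  # fraction of the genome shared IBD2
    pi0 = 1.0 - k1 - k2           # fraction with zero IBD sharing
    phi = k1 / 4.0 + k2 / 2.0     # kinship coefficient
    return phi, pi0


def degree_from_phi(phi: float) -> str:
    """Map a kinship coefficient to a relationship degree (KING-style cutoffs)."""
    if phi > 0.354:
        return "duplicate/MZ twin"
    if phi > 0.177:
        return "1st degree"
    if phi > 0.0884:
        return "2nd degree"
    if phi > 0.0442:
        return "3rd degree"
    return "unrelated or more distant"


# Idealized full siblings on an assumed 3500 cM genome: k1 = 0.5, k2 = 0.25.
sib_segments = [(0.0, 1750.0, 1), (1750.0, 2625.0, 2)]
phi, pi0 = estimate_phi_pi0(sib_segments, 3500.0)
print(phi, pi0, degree_from_phi(phi))
```

In the idealized full-sibling example, half of the genome is IBD1 and a quarter is IBD2. A parent-child pair has the same expected ϕ but π0 = 0, which is why the two quantities are reported together.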

Highlights

Inference of hidden/cryptic relatedness in large genetic cohorts is often a prerequisite for successful genome-wide association studies (GWAS)
Inferring familial relationships has a wide range of applications
Inferring relationships is essential when familial structures are unknown, and can be used to correct pedigree information that is wrong due to false paternity, sample switches, or unregistered adoption

Summary

Introduction

Inference of hidden/cryptic relatedness in large genetic cohorts is often a prerequisite for successful genome-wide association studies (GWAS). For family-based GWAS, complete and accurate information about relatedness is necessary for proper adjustment of familial random effects. While the genetic relationship matrix (GRM) can be inferred from whole-genome genotypes, identifying close relatedness (e.g., up to 3rd or 4th degree relatives) can provide a sparse GRM for which efficient association algorithms could be derived. Even well-annotated cohorts may still contain incorrect pedigree information due to falsely claimed paternity, sample switches, or unregistered/unknown adoption. In such cases, relatedness inferred from genotype data can be used for checking and correcting pedigree information [4] and for sample quality control [5].
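As an illustration of the sparse GRM idea, the sketch below keeps only pairs whose kinship coefficient meets a cutoff and stores twice the kinship as the off-diagonal entry. The function name, the dictionary-based storage, the default cutoff of 0.0442 (a commonly used lower bound for 3rd-degree relatives), and the unit diagonal (which assumes no inbreeding) are all assumptions made for this example, not details taken from the paper.

```python
from typing import Dict, Tuple

Pair = Tuple[int, int]


def sparse_grm(n_samples: int,
               kinship: Dict[Pair, float],
               min_phi: float = 0.0442) -> Dict[Pair, float]:
    """Build a sparse GRM as {(i, j): value} with i <= j.

    Off-diagonal entries are 2 * phi for pairs with phi >= min_phi;
    all other off-diagonal entries are treated as zero and not stored.
    """
    # Diagonal of 1.0 assumes non-inbred individuals.
    grm: Dict[Pair, float] = {(i, i): 1.0 for i in range(n_samples)}
    for (i, j), phi in kinship.items():
        if i != j and phi >= min_phi:
            grm[(min(i, j), max(i, j))] = 2.0 * phi
    return grm


# Three samples: 0 and 1 are first-degree relatives, 1 and 2 are effectively unrelated.
pairs = {(0, 1): 0.25, (2, 1): 0.01}
grm = sparse_grm(3, pairs)
print(sorted(grm.items()))
```

Only related pairs are stored, so the memory cost grows with the number of related pairs rather than with the square of the cohort size.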

Methods

Results

Conclusion