Rapid detection of identity-by-descent tracts for mega-scale datasets

Christopher R Gignoux, Christy L Avery, Gillian M Belbin, Ruhollah Shemirani, Eimear E Kenny, José Luis Ambite

doi:10.1038/s41467-021-22910-w

Abstract

The ability to identify segments of genomes identical-by-descent (IBD) is a part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly, leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to current leading methods and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for millions of individuals. We apply iLASH to the PAGE dataset of ~52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, identifying IBD segments in ~3 minutes per chromosome compared to over 6 days for a state-of-the-art algorithm. iLASH enables efficient analysis of very large-scale datasets, as we demonstrate by computing IBD across the UK Biobank (~500,000 individuals), detecting 12.9 billion pairwise connections.

Highlights

Background and rationale: iLASH was inspired by a minhash-based realization of the Locality Sensitive Hashing (LSH) algorithm [16,17,22].
Since iLASH identifies the similarity of relatively large regions at a time, it provides significant speed-ups compared to simpler hashing methods.
iLASH uses LSH, which can identify with high probability whether two individuals match over the whole segment of interest.
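
To make the minhash-based LSH idea in the highlights above concrete, the following is a minimal, generic sketch in Python. It is not the iLASH implementation: the function names (`stable_hash`, `shingles`, `minhash_signature`, `candidate_pairs`), the position-tagged shingles, and the parameter values (`k`, `num_hashes`, `bands`) are all assumptions chosen for illustration. Each haplotype window is turned into a set of shingles, summarized by a minhash signature, and the signature is split into bands; two individuals become a candidate match for the window when at least one band agrees exactly.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def stable_hash(seed, item):
    """Deterministic 64-bit hash of (seed, item); stands in for one hash function."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def shingles(haplotype, k=4):
    """Set of position-tagged k-mers of one haplotype window.

    Tagging with the position is a simplification of this sketch, not
    something taken from the paper.
    """
    return {(i, haplotype[i:i + k]) for i in range(len(haplotype) - k + 1)}


def minhash_signature(shingle_set, num_hashes):
    """One minimum per seeded hash function; similar sets share many minima."""
    return tuple(
        min(stable_hash(seed, s) for s in shingle_set)
        for seed in range(num_hashes)
    )


def candidate_pairs(windows, num_hashes=20, bands=5, k=4):
    """Return pairs of sample names that agree on at least one signature band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, haplotype in windows.items():
        sig = minhash_signature(shingles(haplotype, k), num_hashes)
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical toy window: "a" and "b" are identical, "c" differs at every site.
windows = {
    "a": "0110100111010010",
    "b": "0110100111010010",
    "c": "1001011000101101",
}
pairs = candidate_pairs(windows)
```

Identical windows always produce identical signatures, so they always collide in every band. Windows that are only partly similar collide with a probability that rises with their shingle overlap, which is what lets a whole window be accepted or rejected in one step rather than compared site by site.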

Summary

Results

The iLASH algorithm is a modification of the LSH algorithm designed expressly for identity-by-descent detection on genomic data. iLASH accurately recovers at least 96% of the total length of all simulated IBD haplotypes across the simulated dataset of the three populations tested, compared to lower overall accuracies of 94% for Refined IBD and 85% for GERMLINE. These results were consistent for dataset sizes ranging from 1000 to 30,000 individuals for all three algorithms (cf. Supplementary Fig. 1).

Figure caption: IBD computation runtime in seconds for iLASH, Refined IBD, and GERMLINE on synthesized haplotypic data simulating the IBD patterns of all of PAGE and of the Puerto Rican (PR) population: (A) as the number of individuals grows; (B) as the total output (total length of tracts found) grows.

Short IBD segments (3–5 cM) were challenging for RaPID, which generated a larger number of false positives: 22–25% of the results across runs contained discordant genotypes between identified pairs, an issue that never occurs with GERMLINE and iLASH. The sensitivity of methods like iLASH to detect true levels of cryptic relatedness is critical in mixed effects models such as SAIGE [33] and BOLT-LMM [34] for efficient, appropriately calibrated association testing.
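
The accuracy figures in this section are stated as the share of total simulated IBD length that a method recovers. One way such a number could be computed is sketched below in Python. This is an illustrative assumption, not the authors' evaluation code: the function names (`overlap`, `recovered_fraction`), the data layout (a dictionary from sample pair to a list of `(start_cM, end_cM)` intervals), and the toy values are all hypothetical, and called segments for the same pair are assumed not to overlap one another.

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) intervals, 0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def recovered_fraction(true_segments, called_segments):
    """Fraction of total true IBD length covered by called segments.

    Both arguments map a sample pair to a list of (start_cM, end_cM) intervals.
    """
    total = 0.0
    recovered = 0.0
    for pair, truths in true_segments.items():
        calls = called_segments.get(pair, [])
        for truth in truths:
            length = truth[1] - truth[0]
            covered = sum(overlap(truth, call) for call in calls)
            total += length
            # Cap at the true length so an over-long call cannot exceed 100%.
            recovered += min(length, covered)
    return recovered / total if total else 0.0


# Hypothetical toy data: 15 cM of true IBD, of which 8 cM is covered by a call.
truth = {("a", "b"): [(0.0, 10.0)], ("a", "c"): [(5.0, 10.0)]}
calls = {("a", "b"): [(2.0, 12.0)]}
fraction = recovered_fraction(truth, calls)
```

A length-weighted measure of this kind rewards recovering long tracts more than short ones. It says nothing about false positives, which is why the discordant-genotype rate is reported separately.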

Discussion

Methods

Code availability