Abstract

The number of human genomes being genotyped or sequenced is increasing exponentially, and efficient haplotype estimation methods able to handle this amount of data are now required. Here we present a method, SHAPEIT4, which substantially improves upon other methods for processing large genotype and high coverage sequencing datasets. It notably exhibits running times that are sub-linear in sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.

Highlights

  • The number of human genomes being genotyped or sequenced increases exponentially and efficient haplotype estimation methods able to handle this amount of data are required

  • SHAPEIT4 works within overlapping genomic regions and proceeds as follows to update the phase of an individual in a given region: (i) it interrogates the Positional Burrows–Wheeler Transform (PBWT) arrays every eight variants to get the P haplotypes that share the longest prefixes with the current haplotype estimates at that position, (ii) it collapses the haplotypes identified across the entire region into a list of K distinct haplotypes, and (iii) it runs the Li and Stephens model (LSM) conditioning on the K haplotypes (Fig. 1a, b)

  • We present here a method for statistical haplotype estimation, SHAPEIT4, that substantially improves upon existing methods in terms of flexibility and computational efficiency
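
Steps (i) and (ii) of the update described in the second highlight can be illustrated with a minimal Python sketch. It is not the SHAPEIT4 implementation: the function names `pbwt_orders` and `select_conditioning` are hypothetical, the target haplotype is assumed to sit inside the panel, and the P candidates are approximated by taking the nearest neighbours of the target in the PBWT ordering rather than by tracking match lengths with a divergence array. Step (iii), the Li and Stephens model, is not covered.

```python
from typing import List


def pbwt_orders(haps: List[List[int]]) -> List[List[int]]:
    """For each site k, return haplotype indices sorted by their
    reversed prefix ending at site k (the PBWT positional prefix array)."""
    n_sites = len(haps[0])
    order = list(range(len(haps)))
    orders = []
    for k in range(n_sites):
        # Stable partition on the allele at site k: 0-alleles first, then 1-alleles.
        zeros = [h for h in order if haps[h][k] == 0]
        ones = [h for h in order if haps[h][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders


def select_conditioning(haps: List[List[int]], target: int,
                        step: int = 8, p: int = 4) -> List[int]:
    """Query the PBWT ordering every `step` sites, take up to `p` neighbours
    of `target`, and collapse them into a sorted list of distinct haplotypes."""
    orders = pbwt_orders(haps)
    half = max(1, p // 2)
    chosen = set()
    for k in range(0, len(orders), step):
        order = orders[k]
        pos = order.index(target)
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        # Neighbours in the ordering share long matches ending at site k.
        chosen.update(h for h in order[lo:hi] if h != target)
    return sorted(chosen)


# Toy panel of binary haplotypes (assumed data, for illustration only).
panel = [
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
]
conditioning = select_conditioning(panel, target=0, step=8, p=2)
print(conditioning)
```

The returned list plays the role of the K distinct conditioning haplotypes. Because candidates come from a few local queries rather than from the whole panel, K can stay small even when the panel is very large.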


Summary

Introduction

The number of human genomes being genotyped or sequenced is increasing exponentially, and efficient haplotype estimation methods able to handle this amount of data are required. To assemble these blocks of phased variants, usually called phase sets, two types of approaches are explored: experimental solutions based on either Hi-C[15] or strand-seq[16], or computational solutions requiring population-level data[17]. At this point, it becomes clear that haplotype estimation faces two main challenges: computational efficiency, to accurately process large-scale data sets, and data integration, to simultaneously exploit large reference panels of haplotypes and long sequencing reads. We describe and benchmark a method for haplotype estimation, SHAPEIT4, which proposes efficient solutions to these two challenges. It allows processing either SNP array or sequencing data accurately with running times that are sub-linear in sample size, making it well suited for very large-scale data sets. We benchmark it on two gold standard data sets: the UK Biobank (UKB)[2], to evaluate its ability to process large-scale SNP array data sets, and the Genome In A Bottle (GIAB)[23], to assess its ability to leverage long sequencing read information.
