Abstract
As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet computational investigations into polyploid genomes carry great importance, impacting plant, yeast and fish genomics, as well as the studies of the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes. In this paper, we describe a novel maximum-likelihood estimation framework, HapTree, for polyploid haplotype assembly of an individual genome using NGS read datasets. We evaluate the performance of HapTree on simulated polyploid sequencing read data modeled after Illumina sequencing technologies. For triploid and higher ploidy genomes, we demonstrate that HapTree substantially improves haplotype assembly accuracy and efficiency over the state-of-the-art; moreover, HapTree is the first scalable polyplotyping method for higher ploidy. As a proof of concept, we also test our method on real sequencing data from NA12878 (1000 Genomes Project) and evaluate the quality of assembled haplotypes with respect to trio-based diplotype annotation as the ground truth. The results indicate that HapTree significantly improves the switch accuracy within phased haplotype blocks as compared to existing haplotype assembly methods, while producing comparable minimum error correction (MEC) values. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.
Highlights
While human and other eukaryotic genomes typically contain two copies of every chromosome, plants, yeast and fish such as salmon can have strictly more than two copies of each chromosome
Among various formulations suggested for this problem, the most commonly used is an NP-hard minimum error correction (MEC) definition [14,15], which aims to identify the smallest set of nucleotide changes required within
Using simulated polyploid sequencing datasets, we demonstrate that relative likelihood (RL)-score performs significantly better at capturing haplotype assembly quality than MEC-score as ploidy increases
Summary
While human and other eukaryotic genomes typically contain two copies of every chromosome, plants, yeast and fish such as salmon can have strictly more than two copies of each chromosome. In the case of two heterozygous SNP sites, genotype calling tools cannot determine whether “mutant” alleles from different SNP loci are on the same chromosome or on different chromosomes (i.e. compound heterozygote). While the former would be healthy, in many cases the latter can cause loss of function; it is therefore necessary to identify the phase, that is, the copies of a chromosome on which the mutant alleles occur, in addition to the genotype (Figure 1). Among various formulations suggested for this problem, the most commonly used is an NP-hard minimum error correction (MEC) definition [14,15], which aims to identify the smallest set of nucleotide changes required within
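As a rough illustration of the MEC objective mentioned above, the sketch below scores a fixed candidate set of haplotypes against a set of reads. It assumes the common formulation in which each read is charged the number of allele changes needed to make it agree with its best-matching haplotype, and it assumes biallelic SNPs coded as 0/1. The names `read_mismatches` and `mec_score` and the toy triploid data are hypothetical. This is not HapTree's implementation, and it does not search for the minimizing haplotypes, which is the NP-hard part.

```python
from typing import Dict, List, Sequence

# A read is modeled as a mapping from SNP index to observed allele
# (0 = reference, 1 = alternate); this encoding is an assumption.
Read = Dict[int, int]


def read_mismatches(read: Read, haplotype: Sequence[int]) -> int:
    """Count the SNP sites covered by the read where it disagrees with the haplotype."""
    return sum(1 for site, allele in read.items() if haplotype[site] != allele)


def mec_score(reads: List[Read], haplotypes: List[Sequence[int]]) -> int:
    """Sum, over all reads, the mismatches against each read's closest haplotype."""
    return sum(min(read_mismatches(r, h) for h in haplotypes) for r in reads)


# Hypothetical triploid candidate phasing over four SNP sites.
haplotypes = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]

# Hypothetical reads; each covers only a subset of the sites.
reads = [
    {0: 0, 1: 1},              # agrees with the first haplotype
    {1: 0, 2: 0, 3: 1},        # agrees with the second haplotype
    {2: 1, 3: 1},              # needs one change against any haplotype
    {0: 1, 1: 1, 2: 1, 3: 0},  # agrees with the third haplotype
]

print(mec_score(reads, haplotypes))
```

Because the score only counts corrections against the closest haplotype, it works for any ploidy: adding or removing rows in `haplotypes` changes the number of chromosome copies without changing the scoring code.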