Hap10: reconstructing accurate and long polyploid haplotypes using linked reads

Sina Majidian, Mohammad Hossein Kahaei, Dick De Ridder

doi:10.1186/s12859-020-03584-5


Open Access

https://doi.org/10.1186/s12859-020-03584-5


Abstract

Background: Haplotype information is essential for many genetic and genomic analyses, including genotype-phenotype associations in humans, animals and plants. Haplotype assembly is a method for reconstructing haplotypes from DNA sequencing reads. With the advent of new sequencing technologies, new algorithms are needed to ensure long and accurate haplotypes. While a few linked-read haplotype assembly algorithms are available for diploid genomes, to the best of our knowledge, no algorithms have yet been proposed for polyploids that specifically exploit linked reads.

Results: The first haplotyping algorithm designed for linked reads generated from a polyploid genome is presented, built on a typical short-read haplotyping method, SDhaP. Using the input aligned reads and called variants, the haplotype-relevant information is extracted. Next, reads with the same barcodes are combined to produce molecule-specific fragments. These fragments are then clustered into strongly connected components, which serve as input to a haplotype assembly core that estimates accurate and long haplotypes.

Conclusions: Hap10 is a novel algorithm for haplotype assembly of polyploid genomes using linked reads. The performance of the algorithm is evaluated in a number of simulation scenarios and its applicability is demonstrated on a real dataset of sweet potato.

Highlights

Haplotype information is essential for many genetic and genomic analyses, including genotype-phenotype associations in humans, animals and plants.
SDhaP crashes for larger datasets. This indicates that this short-read haplotyping algorithm is currently unable to directly handle linked-read data generated from a polyploid genome.
Hap++ is a fast program to reconstruct haplotypes in polyploids by exploiting linked-read information. It consists of three main steps: 1) extracting haplotype-relevant information from input binary sequence alignment (BAM) and variant call format (VCF) files; 2) extracting molecule-specific fragments; 3) extracting strongly connected components of fragments.
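
Steps 2 and 3 of the pre-processing described above can be illustrated with a minimal sketch. It is not the Hap++ implementation. The input tuple layout, the function names and the `max_gap` threshold are hypothetical, and plain connected components over shared variant positions stand in for the paper's strongly connected components criterion.

```python
from collections import defaultdict


def merge_reads_by_barcode(reads, max_gap=50_000):
    """Combine reads sharing a barcode into molecule-specific fragments.

    reads: iterable of (barcode, start_pos, {snp_index: allele}).
    max_gap is a hypothetical distance threshold: reads with the same
    barcode that lie further apart are assumed to come from different
    molecules. Returns a list of fragments, each {snp_index: allele}.
    """
    by_barcode = defaultdict(list)
    for barcode, pos, alleles in reads:
        by_barcode[barcode].append((pos, alleles))

    fragments = []
    for barcode in sorted(by_barcode):
        current, last_pos = {}, None
        for pos, alleles in sorted(by_barcode[barcode], key=lambda item: item[0]):
            if last_pos is not None and pos - last_pos > max_gap:
                if current:
                    fragments.append(current)
                current = {}
            for snp, allele in alleles.items():
                current.setdefault(snp, allele)  # keep the first observed allele
            last_pos = pos
        if current:
            fragments.append(current)
    return fragments


def connected_components(fragments):
    """Group fragment indices that are linked through shared SNP positions."""
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for idx, fragment in enumerate(fragments):
        for snp in fragment:
            if snp in first_seen:
                parent[find(idx)] = find(first_seen[snp])
            else:
                first_seen[snp] = idx

    groups = defaultdict(list)
    for idx in range(len(fragments)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())
```

Each resulting component could then be passed to the assembly core on its own, which is the motivation the highlights give for the pre-processing.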

Summary

Results

We have developed Hap, a novel pipeline for haplotyping polyploids based on linked-read (SLR) data. An approach in which haplotypes are calculated independently on three equally sized parts of the region of interest supports this: the average block length decreases, but both the reconstruction rate and the vector error rate improve (Supplementary information: Table S3, third row compared to the second row). This suggests that while SDhaP in principle works for haplotype assembly in polyploids, performance may be improved by pre-processing the data. Note that the randomized max-K-cut approach (part of the assembly core) is theoretically guaranteed to converge to near the optimal value.
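
To make the max-K-cut idea concrete, here is a minimal sketch under stated assumptions. It treats fragments as graph nodes, uses edge weights as a hypothetical measure of conflict between two fragments, and takes K as the ploidy. It tries uniformly random partitions and keeps the best one, which is a simple baseline and not the randomized procedure used inside the SDhaP-based assembly core. The function names are illustrative.

```python
import random


def cut_weight(edges, labels):
    """Total weight of edges whose endpoints fall in different groups."""
    return sum(w for u, v, w in edges if labels[u] != labels[v])


def random_max_k_cut(n_nodes, edges, k, trials=200, seed=0):
    """Best of `trials` uniformly random partitions of the nodes into k groups.

    edges: list of (u, v, weight) with 0 <= u, v < n_nodes.
    A single uniform random partition cuts each edge with probability
    1 - 1/k, so repeating it gives a cheap baseline for comparison.
    """
    rng = random.Random(seed)
    best_labels, best_weight = None, float("-inf")
    for _ in range(trials):
        labels = [rng.randrange(k) for _ in range(n_nodes)]
        weight = cut_weight(edges, labels)
        if weight > best_weight:
            best_labels, best_weight = labels, weight
    return best_labels, best_weight
```

For example, `random_max_k_cut(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)], k=3)` searches for a labelling of three mutually conflicting fragments, where placing each fragment in its own group cuts all three edges.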

Conclusions

Background

Methods

$1 - \frac{1}{K} + \frac{2 \ln K}{K^2}$

Conclusion