SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming.

Shreepriya Das,Haris Vikalo

doi:10.1186/s12864-015-1408-5

Abstract

Background: The goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments. Limited lengths of paired-end sequencing reads and inserts render haplotype assembly computationally challenging; in fact, most of the problem formulations are known to be NP-hard. Dimensions (and, therefore, difficulty) of the haplotype assembly problems keep increasing as the sequencing technology advances and the lengths of reads and inserts grow. The computational challenges are even more pronounced in the case of polyploid haplotypes, whose assembly is considerably more difficult than in the case of diploids. Fast, accurate, and scalable methods for haplotype assembly of diploid and polyploid organisms are needed.

Results: We develop a novel framework for diploid/polyploid haplotype assembly from high-throughput sequencing data. The method formulates the haplotype assembly problem as a semi-definite program and exploits its special structure, namely the low rank of the underlying solution, to solve it rapidly and with high accuracy. The developed framework is applicable to both diploid and polyploid species. The code for SDhaP is freely available at https://sourceforge.net/projects/sdhap.

Conclusion: Extensive benchmarking tests on both real and simulated data show that the proposed algorithms outperform several well-known haplotype assembly methods in terms of either accuracy or speed or both. Useful recommendations for coverages needed to achieve near-optimal solutions are also provided.

Highlights

The goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments
The most common type of variation between chromosomes in a homologous pair is the single nucleotide polymorphism (SNP), where a single base differs between the two DNA sequences
We focus on the minimum error correction (MEC) formulation, which attempts to find the smallest number of nucleotides in reads whose flipping to a different value would resolve conflicts among the fragments from the same chromosome
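
To make the MEC criterion concrete, the following is a minimal sketch of how an MEC score could be computed for a candidate set of haplotypes. It is an illustration only, not the SDhaP implementation: the string encoding of reads ("0"/"1" for the two alleles at a SNP site, "-" for a site the read does not cover), the function names `hamming` and `mec_score`, and the toy reads are all assumptions made for this example. Each read is charged the number of covered sites at which it disagrees with its closest haplotype, and the MEC score is the sum of these charges.

```python
from typing import Sequence

GAP = "-"  # assumed marker for a SNP site that a read does not cover


def hamming(read: str, hap: str) -> int:
    """Count covered sites where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != GAP and r != h)


def mec_score(reads: Sequence[str], haplotypes: Sequence[str]) -> int:
    """Sum, over all reads, of the mismatches to the closest haplotype.

    This is the number of read entries that would have to be flipped so
    that every read agrees with one of the candidate haplotypes.
    """
    return sum(min(hamming(read, hap) for hap in haplotypes) for read in reads)


# Toy data (hypothetical): five reads over four SNP sites.
reads = ["01-0", "0100", "10-1", "-011", "1111"]

# Candidate diploid solution: two complementary haplotypes.
haplotypes = ["0100", "1011"]

score = mec_score(reads, haplotypes)
print(score)  # only the last read needs one flip to match "1011"
```

Because `mec_score` accepts any number of candidate haplotypes, the same sketch covers the polyploid case by passing three or more strings. Evaluating a given candidate is cheap; the hard part, and the subject of the paper, is searching for the candidate that minimizes this score.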

Summary

Introduction

The goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments. The most common type of variation between chromosomes in a homologous pair is the single nucleotide polymorphism (SNP), where a single base differs between the two DNA sequences (i.e., the corresponding alleles on the homologous chromosomes are different and the individual is heterozygous at that specific locus). SNP calling is concerned with determining the locations and types of polymorphisms. Once such single variant sites are determined, genotype calling associates a genotype with the individual whose genome is being analyzed. The complete information about DNA variations in an individual genome is provided by haplotypes, the lists of alleles at contiguous sites in a region of a single chromosome. When the corresponding genes on homologous chromosomes contain multiple variants, they often exhibit different gene expression patterns.

Methods

Results

Conclusion