Abstract

Background: Inference of haplotypes, the sequences of alleles along the same chromosome, is a fundamental problem in genetics and a key component of many analyses, including admixture mapping, identification of regions of identity by descent, and imputation. Haplotype phasing based on sequencing reads has attracted considerable attention. Diploid haplotype phasing, where the two haplotypes are complementary, has been studied extensively. In this work, we focused on polyploid haplotype phasing, where we aim to phase more than two haplotypes at the same time from sequencing data. The problem is much more complicated because the search space is much larger and the haplotypes no longer need to be complementary.

Results: We proposed two algorithms: (1) Poly-Harsh, a Gibbs sampling based algorithm that alternately samples haplotypes and read assignments to minimize the mismatches between the reads and the phased haplotypes, and (2) an efficient algorithm to concatenate haplotype blocks into contiguous haplotypes.

Conclusions: Our experiments showed that our method improves the quality of the phased haplotypes over state-of-the-art methods. To our knowledge, our algorithm for haplotype block concatenation is the first that leverages information shared across multiple individuals to construct contiguous haplotypes. Our experiments showed that it is both efficient and effective.
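
The alternating sampling idea described in the abstract can be illustrated with a small sketch. This is not the Poly-Harsh implementation: it is a generic block Gibbs sampler over a distribution proportional to exp(-mismatches / temperature), written under several assumptions that do not come from the source. Sites are assumed biallelic (alleles 0 and 1), uncovered sites are marked with -1, and the names `gibbs_phase`, `temperature`, and `n_iter` are hypothetical.

```python
import math
import random

MISSING = -1  # assumed marker for a site the read does not cover


def mismatches(read, hap):
    """Count covered sites where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != MISSING and r != h)


def total_mismatches(reads, haps, assign):
    """Sum of mismatches between each read and its assigned haplotype."""
    return sum(mismatches(r, haps[k]) for r, k in zip(reads, assign))


def sample_by_cost(costs, temperature, rng):
    """Sample an index with probability proportional to exp(-cost / T)."""
    low = min(costs)  # shift so the largest weight is exactly 1 (avoids underflow)
    weights = [math.exp(-(c - low) / temperature) for c in costs]
    return rng.choices(range(len(costs)), weights=weights, k=1)[0]


def gibbs_phase(reads, ploidy, n_iter=200, temperature=0.5, seed=0):
    """Alternately resample read assignments and haplotype alleles.

    Returns (score, haplotypes, assignment) for the lowest-mismatch
    state visited, including the random starting state.
    """
    rng = random.Random(seed)
    n_sites = len(reads[0])
    haps = [[rng.randint(0, 1) for _ in range(n_sites)] for _ in range(ploidy)]
    assign = [rng.randrange(ploidy) for _ in reads]
    best = (total_mismatches(reads, haps, assign), [h[:] for h in haps], assign[:])

    for _ in range(n_iter):
        # Step 1: resample each read's haplotype given the current haplotypes.
        for i, read in enumerate(reads):
            costs = [mismatches(read, h) for h in haps]
            assign[i] = sample_by_cost(costs, temperature, rng)

        # Step 2: resample each allele given the current read assignments.
        for k in range(ploidy):
            for j in range(n_sites):
                counts = [0, 0]  # counts[a]: assigned reads showing allele a at site j
                for read, a in zip(reads, assign):
                    if a == k and read[j] != MISSING:
                        counts[read[j]] += 1
                # Choosing allele a costs one mismatch per read showing 1 - a.
                haps[k][j] = sample_by_cost([counts[1], counts[0]], temperature, rng)

        score = total_mismatches(reads, haps, assign)
        if score < best[0]:
            best = (score, [h[:] for h in haps], assign[:])
    return best
```

A lower `temperature` makes the sampler greedier, while a higher one lets it escape poor local configurations; the value used here is arbitrary and would need tuning on real data.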

Highlights

  • Inference of haplotypes, or the sequence of alleles along the same chromosomes, is a fundamental problem in genetics and is a key component for many analyses including admixture mapping, identifying regions of identity by descent and imputation

  • Since many reads overlap with each other, most methods infer haplotypes by partitioning the reads into two sets corresponding to chromosomal origin in such a way that the number of conflicts between the reads and the predicted haplotypes is minimized

  • Minimum error correction (MEC): we focus on minimizing the MEC score between the phased haplotypes and the input read matrix, calculated as the total number of mismatches between the reads and their assigned haplotypes
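
The MEC score described in the last highlight can be computed directly from a read matrix. The following is a minimal sketch, assuming biallelic sites coded 0/1 and -1 for sites a read does not cover; the function names and the toy triploid data are illustrative and do not come from the paper.

```python
MISSING = -1  # assumed marker for a site the read does not cover


def mismatches(read, hap):
    """Count covered sites where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != MISSING and r != h)


def best_assignment(reads, haplotypes):
    """Assign each read to the haplotype it mismatches least (ties: lowest index)."""
    return [
        min(range(len(haplotypes)), key=lambda k: mismatches(read, haplotypes[k]))
        for read in reads
    ]


def mec_score(reads, haplotypes, assignment):
    """Total mismatches between the reads and their assigned haplotypes."""
    return sum(
        mismatches(read, haplotypes[k]) for read, k in zip(reads, assignment)
    )


# Toy triploid example: three haplotypes over four sites.
haplotypes = [
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
reads = [
    [0, 0, MISSING, MISSING],   # matches haplotype 0 exactly
    [MISSING, 0, 0, 1],         # matches haplotype 1 exactly
    [1, 1, 0, MISSING],         # one mismatch against haplotype 1 or 2
    [MISSING, MISSING, 1, 0],   # matches haplotype 2 exactly
]

assignment = best_assignment(reads, haplotypes)
print(assignment)                                # [0, 1, 1, 2]
print(mec_score(reads, haplotypes, assignment))  # 1
```

For fixed haplotypes, assigning every read to its closest haplotype gives the lowest MEC for that haplotype set; the difficult part of phasing is searching over the haplotypes themselves.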


Summary

Introduction

A haplotype, the sequence of alleles residing on the same chromosome, is the fundamental unit of genetic variation. Inference of haplotypes is a fundamental problem in genetics and plays an important role in many analyses, including identifying regions of identity-by-descent (IBD) [1, 2, 3], admixture mapping [4], and imputation of uncollected genetic variation [5, 6]. Next-generation sequencing (NGS) technologies have been applied to haplotype phasing because each sequencing read originates from a single chromosome, so the alleles it covers lie on the same haplotype. Many methods have been proposed for the diploid haplotype phasing problem: HASH
