Abstract

Background: Assembling haplotypes given sequence data derived from a single individual is a well studied problem, but only recently has haplotype assembly been considered for population-sampled data. We discuss a software tool called Hapler, which is designed specifically for low-diversity, low-coverage data such as ecological samples derived from natural populations. Because such data may contain error as well as ambiguous haplotype information, we developed methods that increase confidence in these assemblies. Hapler also reconstructs full consensus sequences while minimizing and identifying possible chimeric points.

Results: Experiments on simulated data indicate that Hapler is effective at assembling haplotypes from gene-sized alignments of short reads. Further, in our tests Hapler-generated consensus sequences are less chimeric than the alternative consensus approaches of majority vote and viral quasispecies estimation regardless of error rate, read length, or population haplotype bias.

Conclusions: The analysis of genetically diverse sequence data is increasingly common, particularly in the field of ecoinformatics where transcriptome sequencing of natural populations is a cost effective alternative to genome sequencing. For such studies, it is important to consider and identify haplotype diversity. Hapler provides robust haplotype information and identifies possible phasing errors in consensus sequences, providing valuable information for population studies and downstream usage of resulting assemblies.

Highlights

  • Assembling haplotypes given sequence data derived from a single individual is a well studied problem, but only recently has haplotype assembly been considered for population-sampled data

  • The assembly and analysis of short-read sequence data presents a number of well known challenges including error correction, correct determination of repetitive regions, and accurate identification of genetic variation such as single nucleotide polymorphisms (SNPs) and insertions/deletions

  • When input reads are all sourced from highly inbred individuals, this is easy to ensure: any variation should result from sequencing error, and the popular “majority vote” mechanism will create a correct consensus [1]


Summary

Introduction

Assembling haplotypes given sequence data derived from a single individual is a well studied problem, but only recently has haplotype assembly been considered for population-sampled data. We discuss a software tool called Hapler, which is designed for low-diversity, low-coverage data such as ecological samples derived from natural populations. When reads are sourced from non-inbred individuals, the majority vote mechanism reduces to an uninformed parsimony approach: we assume the existence of a “most frequent” consensus and rely on coverage depth to identify it. This approach disregards the extant sequence diversity and does not identify possible errors in the consensus assembly caused by that diversity. Proper analysis of diverse data should focus on the assembly or reassembly of haplotypes, that is, consensus sequences that match at least one of the diverse set of chromosomes in the sample.
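To make the weakness of majority vote concrete, the following minimal sketch builds a per-column majority consensus from a toy alignment. It is an illustration only and is not Hapler's algorithm; the function name `majority_vote_consensus`, the two haplotypes, and the reads are all hypothetical values chosen for the example. Because coverage in the left half happens to favor one haplotype and coverage in the right half favors the other, the resulting consensus is chimeric and matches neither haplotype in the sample.

```python
from collections import Counter

# Hypothetical haplotypes that differ at two SNP sites (columns 1 and 6).
HAP_A = "ACGTTGCA"
HAP_B = "ATGTTGGA"

# Hypothetical error-free short reads, already aligned; '-' marks uncovered columns.
READS = [
    "ACGT----",  # from haplotype A
    "ACGT----",  # from haplotype A
    "ATGT----",  # from haplotype B
    "----TGCA",  # from haplotype A
    "----TGGA",  # from haplotype B
    "----TGGA",  # from haplotype B
]


def majority_vote_consensus(aligned_reads, gap="-"):
    """Return the most frequent base in each alignment column.

    Columns with no coverage are emitted as the gap character.
    """
    length = len(aligned_reads[0])
    consensus = []
    for col in range(length):
        counts = Counter(
            read[col] for read in aligned_reads if read[col] != gap
        )
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)


consensus = majority_vote_consensus(READS)
# The left SNP follows haplotype A and the right SNP follows haplotype B,
# so the consensus corresponds to no chromosome in the sample.
is_chimeric = consensus not in (HAP_A, HAP_B)
print(consensus, is_chimeric)
```

No read in this toy alignment spans both SNP sites, so the data carry no phasing information between them; a haplotype-aware method would need to report that ambiguity rather than silently joining the two halves.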
