Abstract

Background: Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilobase pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences.

Results: We introduce a novel seed-and-extend glocal (short for global-local) alignment method, OPTIMA (and a sliding-window extension for overlap alignment, OPTIMA-Overlap), which is the first to create indexes for continuous-valued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technology-dependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutation-based tests.

Conclusions: We show that OPTIMA and OPTIMA-Overlap outperform other state-of-the-art approaches (1.6–2 times more sensitive) and are more efficient (170–200 %) and precise in their alignments (nearly 99 % precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.

Electronic supplementary material: The online version of this article (doi:10.1186/s13742-016-0110-0) contains supplementary material, which is available to authorized users.

Highlights

  • Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies

  • In recent years, the availability of commercial platforms for high-throughput genome mapping

  • Several recent genome assembly projects have highlighted their utility for obtaining high-quality assemblies of large eukaryotic genomes

Summary

Introduction

Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). Mapping technologies can generate sparse maps of large DNA fragments (150 kilobase pairs (kbp) to 2 Mbp) and provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences. Alternative approaches with simpler (non-model-based) scoring functions [10] are handicapped by the need to do expensive permutation-based statistical testing to evaluate the significance of alignments; although recent advances have made this testing more efficient [15], it still scales linearly with genome size. While these approaches work well for microbial genomes, they typically do not scale well for larger genomes, where they might have reduced sensitivity. Commercially available solutions for map-to-sequence alignment (for example, Genome-Builder from OpGen) scale better and have been used for the assembly of large eukaryotic genomes [4], but they tend to discard a large fraction of the mapping data (more than 90 %) due to reduced sensitivity and correspondingly lead to increased mapping costs for a project.
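
To make the cost of permutation-based testing concrete, the following is a minimal Python sketch. It is not the method of [10] or [15] and not OPTIMA. It assumes a deliberately simple, hypothetical scoring function (negative sum of squared relative fragment-size differences over a gap-free window) and ignores missing or false cuts. The function names `best_window_score` and `permutation_p_value` are illustrative only. Each permutation rescans the whole reference map, so the total work grows with both the number of permutations and the size of the reference.

```python
import random


def best_window_score(query, reference):
    """Return the best score of `query` against any equal-length window
    of `reference` (both are lists of fragment sizes, e.g. in kbp).

    Hypothetical scoring: negative sum of squared relative size
    differences, so 0 is a perfect match and lower is worse.
    One call scans the reference once.
    """
    m = len(query)
    best = float("-inf")
    for start in range(len(reference) - m + 1):
        window = reference[start:start + m]
        score = -sum(((q - r) / r) ** 2 for q, r in zip(query, window))
        best = max(best, score)
    return best


def permutation_p_value(query, reference, n_perm=200, seed=0):
    """Empirical p-value for the observed best score.

    The null distribution is built by shuffling the query fragments and
    re-aligning each shuffled map, i.e. n_perm additional full scans of
    the reference.
    """
    rng = random.Random(seed)
    observed = best_window_score(query, reference)
    shuffled = list(query)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_window_score(shuffled, reference) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

Calling `permutation_p_value(query, reference)` with a query cut from the reference and a query of unrelated fragment sizes would show the expected contrast in significance. The point of the sketch is only the cost structure: every significance estimate repeats the full alignment scan `n_perm` times.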

