Kohdista: an efficient method to index and query possible Rmap alignments

Martin D Muggli,Christina Boucher,Simon J Puglisi

doi:10.1186/s13015-019-0160-9


Open Access

https://doi.org/10.1186/s13015-019-0160-9


Abstract

Background: Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach using raw optical map data, which are referred to as Rmaps. Due to the high error rate of Rmap data, finding the overlap between Rmaps remains challenging.

Results: We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions.

Conclusion: We demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time.

Highlights

Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes
There is a current resurgence in generating diverse types of data, to be used alone or in concert with short read data, in order to overcome the limitations of short read data
We show the utility of our approach on larger eukaryote genomes by demonstrating that existing published methods require more than 151 h of CPU time to find all pairwise alignments in the plum Rmap data, whereas Kohdista requires 31 h

Summary

Results

We evaluated Kohdista against the other available optical map alignment software. Our experiments measured runtime, peak memory, and alignment quality on simulated E. coli Rmaps and experimentally generated plum Rmaps.

Performance on simulated E. coli Rmap data

To verify the correctness of our method, we simulated a read set from a 4.6 Mbp E. coli reference genome as follows: we started with 1,400 copies of the genome and generated 40 random loci within each. These loci form the ends of molecules that would undergo digestion. Although a molecule would align to itself, such self-alignments are not included in the ground truth set. This method of simulation was based on the E. coli statistics given by Valouev et al. [12] and resulted in a molecule length distribution like that observed in publicly available Rmap data from OpGen, Inc. Most of the other methods were designed for less noisy data but in theory could address all the data error types required.
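The molecule-generation step described above can be illustrated with a minimal Python sketch. It is not the authors' simulator. It assumes the loci are drawn uniformly at random, treats each genome copy as linear, and omits restriction digestion and every error model. The function names `simulate_molecules` and `ground_truth_pairs`, and the `min_overlap` parameter, are hypothetical and chosen only for illustration.

```python
import random


def simulate_molecules(genome_length, copies, loci_per_copy, seed=0):
    """Cut each genome copy at random loci and return (start, end) intervals.

    Assumption: loci are uniform and distinct, and each copy is linear.
    """
    rng = random.Random(seed)
    molecules = []
    for _ in range(copies):
        cuts = sorted(rng.sample(range(1, genome_length), loci_per_copy))
        bounds = [0] + cuts + [genome_length]
        # Consecutive boundaries delimit one molecule each.
        molecules.extend(zip(bounds[:-1], bounds[1:]))
    return molecules


def ground_truth_pairs(molecules, min_overlap=1):
    """Return index pairs (i, j), i < j, whose intervals overlap.

    A molecule is never paired with itself, matching the exclusion of
    self-alignments from the ground truth set.
    """
    order = sorted(range(len(molecules)), key=lambda i: molecules[i][0])
    pairs = set()
    for a in range(len(order)):
        i = order[a]
        _, end_i = molecules[i]
        for b in range(a + 1, len(order)):
            j = order[b]
            start_j, end_j = molecules[j]
            if start_j >= end_i:
                break  # later molecules start even further right
            if min(end_i, end_j) - start_j >= min_overlap:
                pairs.add((min(i, j), max(i, j)))
    return pairs


if __name__ == "__main__":
    # Scaled-down run: 3 copies instead of the 1,400 used in the text.
    mols = simulate_molecules(4_600_000, copies=3, loci_per_copy=40, seed=1)
    truth = ground_truth_pairs(mols, min_overlap=1)
    print(len(mols), "molecules,", len(truth), "overlapping pairs")
```

With 40 loci per copy, each copy yields 41 molecules under these assumptions. A real ground truth would likely also require a minimum overlap length, which `min_overlap` only approximates in base pairs.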

Background

Method

Conclusions