Spaced seed data structures

Inanc Birol, Rene L Warren, Anthony Raymond, Karthika Raghavan, Justin Chu, Benjamin P Vandervalk, Shaun D Jackman, Hamid Mohamadi

doi:10.1109/bibm.2014.6999305

Abstract

Over the past decade, genome sciences have benefited from rapid advances in DNA sequencing technologies, and the development of efficient algorithms for processing short nucleotide sequences played a key role in enabling their uptake in the field. In particular, reassembly of human genomes (de novo or reference-guided) from short DNA sequence reads has had a substantial impact on health research. De novo assembly of a genome is essential when no reference genome sequence is available for a species. It is also gaining traction even when one is available, because the method can resolve ambiguous or rearranged genomic regions with high specificity. With commercial high-throughput sequencing technologies increasing both their throughput and their read lengths, the de Bruijn graph (DBG) paradigm used by many assembly algorithms needs to be revisited. DBG uses a table of k-mers, sequences of length k base pairs derived from the reads, and their k-1 base pair overlaps to assemble sequences. Although longer k-mers unlock longer genomic features for assembly, the associated increases in memory usage and other compute resources are tradeoffs that limit the practicality of DBG relative to other assembly archetypes already designed for longer reads. Here, we introduce three data structure designs for paired k-mers, or spaced seeds, each addressing memory and run time constraints imposed by longer reads. In spaced seeds, a fixed distance separates the two k-mers of each pair, providing increased sequence specificity with increased distance, while keeping memory usage low. Further, we describe a data structure based on Bloom filters that would be suitable to implicitly store spaced seeds, and would be tolerant to sequencing errors. Building on the spaced seeds Bloom filter, we describe a data structure for tracking the frequencies of observed spaced seeds.
We expect the data structure designs we introduce in this study to have broad applications in genomics research, with niche applications in genome, transcriptome and metagenome assemblies, and in read error correction.
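
To make the ideas in the abstract concrete, the following minimal Python sketch shows one way spaced seeds could be extracted from a read, stored implicitly in a Bloom filter, and counted. It is an illustration under stated assumptions, not the authors' implementation: the names spaced_seeds, seed_key, BloomFilter and CountingFilter are hypothetical, the gap parameter is assumed to mean the number of bases between the two k-mers, the BLAKE2b-based hashing is an arbitrary choice, and the counting structure is a generic counting filter rather than the design described in the paper.

```python
import hashlib


def spaced_seeds(read, k, gap):
    """Yield (left, right) k-mer pairs whose k-mers are `gap` bases apart.

    Assumption: `gap` counts the bases skipped between the two k-mers.
    """
    span = 2 * k + gap
    for i in range(len(read) - span + 1):
        yield read[i:i + k], read[i + k + gap:i + span]


def seed_key(left, right):
    """Encode a k-mer pair as one string (hypothetical encoding)."""
    return left + "|" + right


def _positions(item, size, num_hashes):
    """Derive `num_hashes` slot indices for `item` from salted BLAKE2b digests."""
    for seed in range(num_hashes):
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
        ).digest()
        yield int.from_bytes(digest, "little") % size


class BloomFilter:
    """Bit-array set: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def add(self, item):
        for pos in _positions(item, self.num_bits, self.num_hashes):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in _positions(item, self.num_bits, self.num_hashes)
        )


class CountingFilter:
    """Generic counting filter: counts may overestimate, never underestimate."""

    def __init__(self, num_slots, num_hashes):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counts = [0] * num_slots

    def add(self, item):
        # set() avoids incrementing one slot twice when two hashes collide.
        for pos in set(_positions(item, self.num_slots, self.num_hashes)):
            self.counts[pos] += 1

    def count(self, item):
        return min(
            self.counts[pos]
            for pos in _positions(item, self.num_slots, self.num_hashes)
        )


if __name__ == "__main__":
    bloom = BloomFilter(num_bits=1024, num_hashes=3)
    counter = CountingFilter(num_slots=1024, num_hashes=3)
    for left, right in spaced_seeds("ACGTACGTAC", k=3, gap=2):
        key = seed_key(left, right)
        bloom.add(key)
        counter.add(key)
        print(left, right, key in bloom, counter.count(key))
```

In this sketch the memory cost is fixed by the filter size rather than by the distance between the k-mers, which mirrors the abstract's point that specificity can grow with distance while memory usage stays low. The filter sizes and hash counts used here are placeholders and would need to be chosen from the expected number of seeds and the acceptable false-positive rate.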

Full Text