Spaced Seed Data Structures for De Novo Assembly.

Inanç Birol, Shaun D Jackman, Benjamin P Vandervalk, René L Warren, Justin Chu, Anthony Raymond, Karthika Raghavan, Hamid Mohamadi

doi:10.1155/2015/196591

Open Access

Abstract

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Although longer subsequences unlock longer genomic features for assembly, the associated increase in compute resources limits the practicality of DBG relative to other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates the two subsequences of a seed pair, sequence specificity increases with gap length. Further, we note that Bloom filters would be suitable for implicitly storing spaced seeds while remaining tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome, and metagenome assemblies, and in read error correction.
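
To make the idea of spaced seeds as paired subsequences concrete, here is a minimal Python sketch that slides a window over a read and emits two k-mers separated by a fixed gap. The function name `spaced_seeds`, the parameter names `k` and `gap`, and the toy read are assumptions chosen for illustration; they are not taken from the paper.

```python
def spaced_seeds(read, k, gap):
    """Yield (left, right) k-mer pairs separated by `gap` bases.

    Hypothetical helper: each pair spans 2*k + gap bases of the read.
    """
    span = 2 * k + gap
    for i in range(len(read) - span + 1):
        left = read[i:i + k]                 # first k-mer of the pair
        right = read[i + k + gap:i + span]   # second k-mer, after the gap
        yield left, right


# Toy read in which the 3-mer "ACG" occurs twice.
read = "ACGTACGTTGCA"
pairs = list(spaced_seeds(read, k=3, gap=2))
for left, right in pairs:
    print(left, right)
```

In this toy read the 3-mer `ACG` occurs at two positions, but it is paired with a different right-hand 3-mer each time, so the two occurrences yield distinct seeds. This is a small-scale picture of the added specificity that a gapped pair can provide over a single short k-mer.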

Highlights

For nearly a century, progressive discovery of the number and molecular structure of chromosomes and their information content has proven to be useful in the clinical domain [1, 2]
We previously reported on a scalable de novo assembly tool, ABySS, that used short reads from a high-throughput sequencing (HTS) platform to assemble the human genome [24], and we further demonstrated the utility of the approach to analyze transcriptome sequencing (RNA-seq) data (TransABySS) [25, 26]
The data structure reported in this paper offers a design that will be suitable for extending the utility of fast and effective de Bruijn graph (DBG) algorithms, thereby modifying the concept of k-mers through the introduction of spaced seeds

Summary

Introduction

Progressive discovery of the number and molecular structure of chromosomes and their information content has proven to be useful in the clinical domain [1, 2]. We previously reported on a scalable de novo assembly tool, ABySS, that used short reads from an HTS platform to assemble the human genome [24], and we further demonstrated the utility of the approach to analyze transcriptome sequencing (RNA-seq) data (TransABySS) [25, 26]. A de Bruijn graph (DBG) representation of k-mer overlaps (overlaps between sequences of k base pairs in length) was introduced with the Euler algorithm [30] and is the enabling technology behind ABySS and some of the other popular de novo assembly tools, such as Velvet [31]. With increasing read lengths in “short read” platforms like Illumina, and with the growing popularity and development of “long read” platforms like Pacific Biosciences and Oxford Nanopore, DBG-based assembly algorithms need to adapt to retain their advantage. We describe primary and auxiliary data structures based on Bloom filters [33] with potential uses in genome, transcriptome, and metagenome assemblies, and in error correction.
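
As a rough illustration of how a Bloom filter can store seeds implicitly, the sketch below records membership in a bit array without keeping the sequences themselves. It is a generic textbook Bloom filter, not the design described in the paper. The class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, the choice of salted BLAKE2b hashing, and the `left:right` string encoding of a seed pair are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch (illustrative, not the paper's design)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


# Store two hypothetical seed pairs, encoded as "left:right" strings.
seeds = [("ACG", "CGT"), ("CGT", "GTT")]
bf = BloomFilter(num_bits=1024, num_hashes=3)
for left, right in seeds:
    bf.add(left + ":" + right)

print(all(left + ":" + right in bf for left, right in seeds))
```

Every inserted seed is always reported as present. A seed that was never inserted is usually reported as absent, but can occasionally be reported as present, at a false positive rate governed by the filter size, the number of hash functions, and the number of stored items.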

Spaced Seeds

Data Structures

Application Areas

Findings

Conclusions