FastEtch: A Fast Sketch-Based Assembler for Genomes.

Priyanka Ghosh, Ananth Kalyanaraman

doi:10.1109/tcbb.2017.2737999

Abstract

De novo genome assembly is the process of reconstructing an unknown genome from a large collection of short (or long) reads sequenced from that genome. A single run of a Next-Generation Sequencing (NGS) technology can produce billions of short reads, making genome assembly computationally demanding in both memory and time. One of the major computational steps in modern short-read assemblers is the construction and use of a string data structure called the de Bruijn graph. In fact, a majority of short-read assemblers build the complete de Bruijn graph for the set of input reads, then traverse it and prune low-quality edges in order to generate genomic "contigs", the output of assembly. These graph construction and traversal steps account for well over 90 percent of the runtime and memory. In this paper, we present a fast algorithm, FastEtch, that uses sketching to build an approximate version of the de Bruijn graph for the purpose of generating an assembly. The algorithm uses the Count-Min sketch, a probabilistic data structure for streaming data sets. The result is an approximate de Bruijn graph that stores information only for a selected subset of nodes that are most likely to contribute to the contig generation step. In addition, edges are not stored; instead, the fraction of edges that contribute to contig generation is detected on the fly. This approximate approach is intended to significantly improve performance (both execution time and memory footprint), possibly at some cost to output assembly quality. We present two main versions of the assembler: one that generates an assembly in which each contig represents a contiguous genomic region from one strand of the DNA, and another that generates an assembly in which contigs can straddle either of the two strands of the DNA. For further scalability, we have implemented a multi-threaded parallel code.
Experimental results using our algorithm conducted on E. coli, Yeast, C. elegans, and Human (Chr2 and Chr2+3) genomes show that our method yields one of the best time-memory-quality trade-offs, when compared against many state-of-the-art genome assemblers.
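
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names (CountMinSketch, select_nodes, successors), the hash construction, the sketch dimensions, and the simple "estimated count reaches a threshold" selection rule are all assumptions made here for illustration. The sketch streams k-mers through a Count-Min sketch, keeps only those whose estimated count reaches the threshold as graph nodes, and detects outgoing edges on the fly by testing which of the four possible successor k-mers were also kept.

```python
import hashlib
from typing import Iterable, Iterator, List, Set


class CountMinSketch:
    """Fixed-size approximate counter. Estimates can overcount, never undercount."""

    def __init__(self, width: int = 1 << 12, depth: int = 4) -> None:
        # width and depth are illustrative values, not taken from the paper.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item: str) -> Iterator[int]:
        # One salted hash per row, so the rows behave like independent hash functions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.width

    def add(self, item: str) -> None:
        for row, col in enumerate(self._columns(item)):
            self.rows[row][col] += 1

    def estimate(self, item: str) -> int:
        # The minimum over rows is the counter least inflated by collisions.
        return min(self.rows[row][col] for row, col in enumerate(self._columns(item)))


def kmers(read: str, k: int) -> Iterator[str]:
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def select_nodes(reads: Iterable[str], k: int, threshold: int,
                 sketch: CountMinSketch) -> Set[str]:
    """Stream all k-mers through the sketch; keep those whose estimate reaches threshold."""
    nodes: Set[str] = set()
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
            if sketch.estimate(kmer) >= threshold:
                nodes.add(kmer)
    return nodes


def successors(node: str, nodes: Set[str]) -> List[str]:
    """Detect outgoing edges on the fly: no edge list is stored anywhere."""
    suffix = node[1:]
    return [suffix + base for base in "ACGT" if suffix + base in nodes]


# Toy input: one read seen three times, plus one read seen once.
reads = ["ACGTACGGA"] * 3 + ["TTTTTGCA"]
sketch = CountMinSketch()
nodes = select_nodes(reads, k=4, threshold=3, sketch=sketch)
```

In this toy input, successors("TACG", nodes) returns two k-mers, which is the kind of branching a contig-generation step has to resolve. Because a Count-Min sketch can only overestimate a count, a truly frequent k-mer is never dropped by the threshold test, while a rare k-mer may occasionally be kept. Memory for counting is fixed by the width and depth of the sketch rather than by the number of distinct k-mers.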

Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Publication Date: Sep 11, 2017
Citations: 31

Similar Papers

A Fast Sketch-based Assembler for Genomes
Priyanka Ghosh ... Ananth Kalyanaraman
02 Oct 2016

HAssembler: A hybrid de novo genome assembly approach for large genomes
Amit Kairi ... Atmakuri Ramakrishna Rao
The Indian Journal of Agricultural Sciences | VOL. 90
04 Dec 2020

Long-read sequencing in ecology and evolution: Understanding how complex genetic and epigenetic variants shape biodiversity.
Dan G Bock ... Polina Novikova
Molecular Ecology | VOL. 32
01 Mar 2023

Parallelized short read assembly of large genomes using de Bruijn graphs
Yongchao Liu ... Bertil Schmidt
BMC Bioinformatics | VOL. 12
25 Aug 2011
