A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly

Hieu Dinh, Sanguthevar Rajasekaran

doi:10.1093/bioinformatics/btr321

Abstract

Exact-match overlap graphs are widely used in DNA assembly and in the shortest superstring problem, where the number of strings n ranges from thousands to billions. The length ℓ of the strings ranges from 25 to 1000, depending on the DNA sequencing technology. However, many DNA assemblers that use overlap graphs need too much time and space to construct them. It is nearly impossible for these assemblers to handle the huge amount of data produced by next-generation sequencing technologies, where n can be several billion. Storing the overlap graph explicitly would require Ω(n²) memory, which can be prohibitive in practice when n exceeds a hundred million. In this article, we propose a novel data structure that stores the overlap graph compactly. It requires only linear time to construct and linear memory to store.

For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph, and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ov_max(x, y), is the longest string that is both a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω = ℓ - |ov_max(x, y)| if and only if ω ≤ λ, where |ov_max(x, y)| is the length of ov_max(x, y) and λ is a given threshold. In this article, we show that exact-match overlap graphs can be represented by a compact data structure that uses at most (2λ-1)(2⌈log n⌉+⌈log λ⌉)n bits, with a guarantee that the basic operation of accessing an edge takes O(log λ) time.
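The formal definition above can be made concrete with a small, deliberately naive sketch. It is not the compact data structure proposed in the article: it compares every ordered pair of reads and stores every edge explicitly, which is exactly the quadratic behaviour the article sets out to avoid. The function names `ov_max` and `overlap_graph` are illustrative, and the sketch assumes that the reads are distinct, that they all have the same length ℓ, and that self-loops (x, x) are excluded.

```python
from itertools import permutations


def ov_max(x: str, y: str) -> str:
    """Longest string that is both a suffix of x and a prefix of y."""
    # Try candidate overlap lengths from longest to shortest.
    for k in range(min(len(x), len(y)), 0, -1):
        if x[-k:] == y[:k]:
            return x[-k:]
    return ""


def overlap_graph(reads, lam):
    """Map each ordered pair (x, y) of distinct reads to its edge weight.

    The weight is ell - |ov_max(x, y)|; an edge exists only if weight <= lam.
    Assumption: reads are distinct and all have the same length ell.
    """
    ell = len(reads[0])
    if any(len(r) != ell for r in reads):
        raise ValueError("all reads must have the same length")
    edges = {}
    for x, y in permutations(reads, 2):
        weight = ell - len(ov_max(x, y))
        if weight <= lam:
            edges[(x, y)] = weight
    return edges


if __name__ == "__main__":
    # Toy reads of length 5, chosen only for illustration.
    reads = ["ACGTA", "GTACC", "TACCG"]
    for (x, y), w in sorted(overlap_graph(reads, lam=3).items()):
        print(x, "->", y, "weight", w)
```

For the three toy reads with λ = 3, the graph has the edges ACGTA → GTACC (weight 2), ACGTA → TACCG (weight 3) and GTACC → TACCG (weight 1). The pair TACCG → GTACC overlaps only in "G", giving weight 4, so it is not an edge.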
We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λℓn log n) worst-case time and requires O(λ) extra memory. The second one runs in O(λℓn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in both time and memory. Our DNA sequence assembler that incorporates the data structure is freely available on the web at http://www.engr.uconn.edu/~htd06001/assembler/leap.zip
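To get a feel for the storage bound stated in the abstract, (2λ-1)(2⌈log n⌉+⌈log λ⌉)n bits, the short sketch below simply evaluates it. The abstract does not give the base of the logarithm; base 2 is assumed here because the bound counts bits. The function name `compact_bound_bits` and the example values n = 10^8 and λ = 10 are hypothetical choices for illustration, not figures from the article.

```python
def ceil_log2(v: int) -> int:
    """Exact ceil(log2(v)) for a positive integer, without floating point."""
    if v < 1:
        raise ValueError("v must be a positive integer")
    return (v - 1).bit_length()


def compact_bound_bits(n: int, lam: int) -> int:
    """Evaluate (2*lam - 1) * (2*ceil(log2 n) + ceil(log2 lam)) * n.

    Assumption: logarithms are base 2, since the bound is stated in bits.
    """
    return (2 * lam - 1) * (2 * ceil_log2(n) + ceil_log2(lam)) * n


if __name__ == "__main__":
    n, lam = 10**8, 10  # hypothetical example values
    bits = compact_bound_bits(n, lam)
    print(f"bound: {bits} bits, about {bits / 8 / 1e9:.1f} GB")
    # For contrast, an explicit graph may need on the order of n^2 entries.
    print(f"n^2 = {n * n:.1e} potential edges")
```

With these example values the bound evaluates to 110,200,000,000 bits, roughly 13.8 GB, whereas n² is 10^16. The bound grows linearly in n for fixed λ, apart from the ⌈log n⌉ factor needed to address a read.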

Journal: Bioinformatics
Publication Date: Jun 2, 2011
Citations: 16
