External memory BWT and LCP computation for sequence collections with applications

Lavinia Egidi, Guilherme P Telles, Giovanni Manzini, Felipe A Louza

doi:10.1186/s13015-019-0140-0

Abstract

Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory, making the best possible use of time and of the available RAM.

Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections small enough that their BWTs can be computed in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and, in the process, also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs.

Conclusions: We prove that our algorithm performs $\mathcal{O}(n\,\mathsf{maxlcp})$ sequential I/Os, where $n$ is the total length of the collection and $\mathsf{maxlcp}$ is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences, but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
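To make the split-and-merge idea from the abstract concrete, here is a minimal in-memory sketch. It is not the eGap algorithm: the BWTs are built by naive suffix sorting, and the interleave array that drives the merge is also obtained by naive sorting of the whole collection, whereas the paper computes the merge in external memory. The function names, the `$` terminator symbol, and the integer encoding of terminators are assumptions made for illustration only.

```python
def sorted_suffixes(seqs):
    """Encode seqs with one distinct terminator each and sort all suffixes.

    Assumption for this sketch: the terminator of sequence j is the
    integer j, and letters are shifted by len(seqs) so that every
    terminator sorts below every letter.
    """
    k = len(seqs)
    encoded = [[ord(c) + k for c in s] + [j] for j, s in enumerate(seqs)]
    suffixes = [(tuple(e[i:]), j, i)
                for j, e in enumerate(encoded) for i in range(len(e))]
    suffixes.sort()
    return encoded, suffixes


def collection_bwt(seqs):
    """Naive BWT of a collection: the symbol preceding each sorted suffix."""
    k = len(seqs)
    encoded, suffixes = sorted_suffixes(seqs)
    out = []
    for _, j, i in suffixes:
        prev = encoded[j][i - 1]  # i == 0 wraps around to the terminator
        out.append("$" if prev < k else chr(prev - k))
    return "".join(out)


def merge_partial_bwts(seqs, split):
    """Merge the BWTs of seqs[:split] and seqs[split:] by interleaving.

    interleave[p] says which partial BWT supplies position p of the
    merged BWT. Here it is derived from a naive sort of all suffixes,
    purely to show that the merged BWT is an interleaving of the parts.
    """
    left = collection_bwt(seqs[:split])
    right = collection_bwt(seqs[split:])
    _, suffixes = sorted_suffixes(seqs)
    interleave = [0 if j < split else 1 for _, j, _ in suffixes]
    sources = (iter(left), iter(right))
    merged = "".join(next(sources[b]) for b in interleave)
    return merged, interleave
```

For example, `merge_partial_bwts(["ab", "aab"], 1)` reads the partial BWTs `"b$a"` and `"b$aa"` in the order given by the interleave array and yields the same string as `collection_bwt(["ab", "aab"])`.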

Highlights

A fundamental problem in bioinformatics is the ability to efficiently search the billions of DNA sequences produced by NGS studies.
Experiments: we report on an experimental study comparing the eGap algorithm with the other known external memory tools that compute the Burrows–Wheeler Transform (BWT) and longest common prefix (LCP) arrays of sequence collections.
Applications: we show that the eGap algorithm, in addition to the BWT and LCP arrays, can output additional information useful for designing efficient external memory algorithms for three well known problems on sequence collections: (i) the computation of maximal repeats, (ii) the all pairs suffix–prefix overlaps, and (iii) the construction of succinct de Bruijn graphs.
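As an illustration of problem (ii), the following brute-force sketch spells out what an all pairs suffix–prefix overlap is. It is only a definition-level example with quadratic cost per pair, not the scan-based external memory algorithm of the paper; the function names are hypothetical.

```python
def max_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_pairs_overlaps(seqs: list[str]) -> dict[tuple[int, int], int]:
    """Overlap length for every ordered pair (i, j) of distinct sequences."""
    return {
        (i, j): max_overlap(seqs[i], seqs[j])
        for i in range(len(seqs))
        for j in range(len(seqs))
        if i != j
    }
```

With the reads `"GATTA"` and `"TTACA"`, the suffix `"TTA"` of the first is a prefix of the second, so the overlap for the pair (0, 1) is 3, while the pair (1, 0) has overlap 0.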

Summary

Introduction

A fundamental problem in bioinformatics is the ability to efficiently search the billions of DNA sequences produced by next-generation sequencing (NGS) studies. The BWT is often complemented with the longest common prefix (LCP) array [3], since the latter makes it possible to efficiently emulate Suffix Tree algorithms [4, 5]. The construction of such data structures is a challenging problem. In the semi-external model, the main memory can grow linearly with the size of the input, but part of the working data has to reside on disk. We denote by LCP(s1, s2) the length of the longest common prefix between s1 and s2.
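The definition of LCP(s1, s2) can be sketched directly in code. The second function shows one common way the LCP array is derived from it, by comparing lexicographically adjacent suffixes of a single terminated string. Both functions are naive illustrations with hypothetical names; sorting explicit suffix copies is only practical for tiny inputs.

```python
def lcp(s1: str, s2: str) -> int:
    """Length of the longest common prefix of s1 and s2."""
    n = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        n += 1
    return n


def lcp_array(text: str) -> list[int]:
    """LCP between each sorted suffix of text and its predecessor.

    Assumes text ends with a terminator such as '$' that is smaller
    than every other symbol; the first entry is set to 0 by convention.
    """
    suffixes = sorted(text[i:] for i in range(len(text)))
    return [0] + [lcp(suffixes[i - 1], suffixes[i])
                  for i in range(1, len(suffixes))]
```

For instance, `lcp("banana", "bandana")` is 3, and `lcp_array("banana$")` compares the sorted suffixes `$`, `a$`, `ana$`, `anana$`, `banana$`, `na$`, `nana$` pairwise.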

Journal: Algorithms for Molecular Biology
Publication Date: Mar 8, 2019
Citations: 30
License type: open-access

Similar Papers

Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
German Tischler
01 Jan 2015

String Inference from Longest-Common-Prefix Array
...
31 Jan 2018

Computing the multi-string BWT and LCP array in external memory
Paola Bonizzoni ... Raffaella Rizzi
Theoretical Computer Science | VOL. 862
30 Nov 2020

Space-Time Tradeoffs for Longest-Common-Prefix Array Computation
Simon J Puglisi ... Andrew Turpin
01 Jan 2008
