Prefix-free parsing for building big BWTs

Christina Boucher, Travis Gagie, Giovanni Manzini, Alan Kuhnle, Taher Mun, Ben Langmead

doi:10.1186/s13015-019-0148-5

Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
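
The abstract does not spell out how the dictionary and the parse are produced, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that phrase boundaries are chosen wherever a hash of a sliding window of length `w` is 0 modulo `p`, and that consecutive phrases overlap by `w` characters. The names `window_hash`, `prefix_free_parse` and `reconstruct`, the default values of `w` and `p`, and the toy hash are all hypothetical choices made for illustration. The sketch also leaves out the sentinel characters a full implementation would add around T, and it does not build the BWT itself.

```python
def window_hash(window):
    """Toy polynomial hash of a short string (an assumed stand-in for a rolling hash)."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % (2**61 - 1)
    return h


def prefix_free_parse(text, w=4, p=3):
    """Split text into overlapping phrases in one left-to-right pass.

    A phrase ends wherever the last w characters hash to 0 modulo p;
    the next phrase starts with those same w characters.
    Returns (dictionary, parse): the sorted distinct phrases and the
    sequence of dictionary ranks that spells out the text.
    """
    phrases = []
    start = 0
    # The final position is excluded so the last phrase is longer than w.
    for end in range(w + 1, len(text)):
        if window_hash(text[end - w:end]) % p == 0:
            phrases.append(text[start:end])
            start = end - w  # overlap the next phrase by w characters
    phrases.append(text[start:])

    dictionary = sorted(set(phrases))
    rank = {phrase: i for i, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w=4):
    """Rebuild the text by dropping the w-character overlap of each later phrase."""
    phrases = [dictionary[i] for i in parse]
    return phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])


if __name__ == "__main__":
    text = "GATTACAT" * 40 + "GATTTCAT" * 10
    dictionary, parse = prefix_free_parse(text)
    print(len(text), len(dictionary), len(parse))
    print(reconstruct(dictionary, parse) == text)
```

On repetitive input, the same phrases tend to recur, so the dictionary stores each distinct phrase once and the parse records only their ranks. This is the property the abstract relies on when it reports that D and P are much smaller than T in practice.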

Highlights

The money and time needed to sequence a genome have shrunk shockingly quickly and researchers’ ambitions have grown almost as quickly: the Human Genome Project cost billions of dollars and took a decade but we can sequence a genome for about a thousand dollars in about a day.
Since genomic databases are often highly repetitive, we revisit the idea of applying a simple compression scheme and computing the Burrows-Wheeler Transform (BWT) from the resulting encoding in internal memory. This is far from being a novel idea: for example, Ferragina, Gagie and Manzini’s bwtdisk software [7] could already in 2010 take advantage of its input being given compressed, and Policriti and Prezza [8] showed how to compute the BWT from the LZ77 parse of the input using O(n(log r + log z)) time and O(r + z) space, where n is the length of the uncompressed input, r is the number of runs in the BWT and z is the number of phrases in the LZ77 parse. We think, however, that the preprocessing step we describe here, prefix-free parsing, stands out because of its simplicity and flexibility.
In the “Prefix free parsing in practice” section we describe our implementation and report the results of our experiments, which show that in practice the dictionary and parse are often significantly smaller than the text and so may fit in a reasonable internal memory even when the text is very large, and that this often makes the overall BWT computation both faster and smaller.

Summary

Introduction

The money and time needed to sequence a genome have shrunk shockingly quickly and researchers’ ambitions have grown almost as quickly: the Human Genome Project cost billions of dollars and took a decade but we can sequence a genome for about a thousand dollars in about a day. With no compression, 100,000 human genomes occupy roughly 300 terabytes of space, and genomic databases will have grown even more by the time a standard research machine has that much RAM. Other initiatives have begun to study how microbial species behave and thrive in environments. These initiatives are generating public datasets that are larger than the 100,000 Genomes Project. In recent years, there has been an initiative to move toward using whole genome sequencing to accurately identify and track foodborne pathogens (e.g. antibiotic-resistant bacteria).



Journal: Algorithms for molecular biology : AMB
Publication Date: May 24, 2019
Citations: 48
License type: open-access


