Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive—a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.

Highlights

  • The money and time needed to sequence a genome have shrunk shockingly quickly and researchers’ ambitions have grown almost as quickly: the Human Genome Project cost billions of dollars and took a decade but we can sequence a genome for about a thousand dollars in about a day

  • Since genomic databases are often highly repetitive, we revisit the idea of applying a simple compression scheme and computing the Burrows-Wheeler Transform (BWT) from the resulting encoding in internal memory. This is far from being a novel idea—e.g., Ferragina, Gagie and Manzini’s bwtdisk software [7] could already in 2010 take advantage of its input being given compressed, and Policriti and Prezza [8] showed how to compute the BWT from the LZ77 parse of the input using O(n(log r + log z))-time and O(r + z)-space, where n is the length of the uncompressed input, r is the number of runs in the BWT and z is the number of phrases in the LZ77 parse—but we think the preprocessing step we describe here, prefix-free parsing, stands out because of its simplicity and flexibility

  • In “Prefix free parsing in practice” section we describe our implementation and report the results of our experiments showing that in practice the dictionary and parse often are significantly smaller than the text and so may fit in a reasonable internal memory even when the text is very large, and that this often makes the overall BWT computation both faster and smaller

Read more

Summary

Introduction

The money and time needed to sequence a genome have shrunk shockingly quickly and researchers’ ambitions have grown almost as quickly: the Human Genome Project cost billions of dollars and took a decade but we can sequence a genome for about a thousand dollars in about a day. With no compression 100,000 human genomes occupy roughly 300 terabytes of space, and genomic databases will have grown even more by the time a standard research machine has that much RAM. Other initiatives have began to study how microbial species behave and thrive in environments. These initiatives are generating public datasets, which are larger than the 100,000 Genomes Project. In recent years, there has been an initiative to move toward using whole genome sequencing to accurately identify and track foodborne pathogens (e.g. antibiotic-resistant bacteria)

Methods
Findings
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.