Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index.

Prashant Pandey, Michael Ferdman, Michael A Bender, Rob Johnson, Fatemeh Almodaresi, Rob Patro

doi:10.1016/j.cels.2018.05.021

Open Access

https://doi.org/10.1016/j.cels.2018.05.021

Journal: Cell Systems
Publication Date: Jun 20, 2018
Citations: 89
License type: publisher-specific-oa

Affiliation: Stony Brook University, Kitware (United States)

Abstract

Sequence-level searches on large collections of RNA-sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6-108× faster than SSBT and has no false positives or false negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA-sequencing experiments in 82 min; SSBT took close to 4 days.

Full Text