Substring Queries Research Articles

Given an input string, the Burrows-Wheeler Transform (BWT) can be seen as a reversible permutation of it that allows efficient compression and fast substring queries. Due to these properties, it has been widely applied in the analysis of genomic sequence data, enabling important tasks such as read alignment. Mantaci et al. [TCS 2007] extended the notion of the BWT to a collection of strings by defining the extended Burrows-Wheeler Transform (eBWT). This definition requires no modification of the input collection, and has the property that the output is independent of the order of the strings in the collection. However, over the years, the term eBWT has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We also combine our new eBWT construction with a variation of prefix-free parsing (PFP) [WABI 2019] to allow for construction of the eBWT on large collections of genomic sequences. We implement this combined algorithm (pfpebwt) and evaluate it on a collection of human chromosomes 19 from the 1,000 Genomes Project, on a collection of Salmonella genomes from GenomeTrakr, and on a collection of SARS-CoV2 genomes from EBI's COVID-19 data portal. We demonstrate that pfpebwt is the fastest method for all collections, with a maximum speedup of 7.6x over the second-best method. The peak memory is at most 2x larger than that of the second-best method. Compared with methods that, like our algorithm, are able to report suffix array samples, we obtain a 57.1x improvement in peak memory.
The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
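
To make the order-independence property concrete, here is a minimal, quadratic-time sketch of the eBWT as it is usually described: all rotations of all strings are sorted by comparing their infinite repetitions (the omega-order), and the last character of each rotation is emitted. This is not the linear-time algorithm of the paper or the pfpebwt implementation; the function names `omega_compare` and `naive_ebwt` are hypothetical and chosen only for illustration.

```python
from functools import cmp_to_key


def omega_compare(u, v):
    """Compare the infinite repetitions u^omega and v^omega.

    Checking the first len(u) + len(v) characters is enough: if they all
    agree, the two infinite strings are equal.
    """
    for i in range(len(u) + len(v)):
        a, b = u[i % len(u)], v[i % len(v)]
        if a != b:
            return -1 if a < b else 1
    return 0


def naive_ebwt(strings):
    """Naive eBWT: sort every rotation of every string in omega-order
    and concatenate the last characters. No end-of-string symbol is added."""
    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    rotations.sort(key=cmp_to_key(omega_compare))
    return "".join(r[-1] for r in rotations)


# A single string: this is the BWT without a terminator symbol.
print(naive_ebwt(["banana"]))      # nnbaaa
# Reordering the collection does not change the output.
print(naive_ebwt(["ab", "b"]))     # bab
print(naive_ebwt(["b", "ab"]))     # bab
```

Rotations that tie in the omega-order are powers of the same primitive word and therefore end in the same character, so the output does not depend on how ties are broken.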


In a variety of settings from relational databases to LDAP to Web applications, there is an increasing need to quickly and accurately estimate the count of tuples (LDAP entries, Web documents, etc.) matching Boolean substring queries. In providing such selectivity estimates, the correlation between different occurrences of substrings is crucial. Selectivity estimation for generalized Boolean queries has not been studied previously; our own prior work, which is discussed and extended herein, applies to the case of one-dimensional Boolean queries [CKKM00]. Existing methods for the case of multidimensional conjunctive queries approximate selectivities by explicitly storing cross-counts of frequently co-occurring combinations of substrings; estimates are obtained by parsing the query into multidimensional substrings corresponding to stored cross-counts and applying probabilistic formulae. The major problem with these methods is that the number of cross-counts stored by known methods increases exponentially with the number of dimensions (a “space dimensionality explosion”) due to the need to capture the correlation amongst the dimensions. Hence, given a limited amount of space, none of the existing methods can reliably give accurate estimates. Moreover, these methods do not generalize to Boolean queries gracefully. We present a novel approach to selectivity estimation for generalized Boolean substring queries with a focus on the two cases of (1) conjunctive multidimensional and (2) Boolean queries. Our approach does not explicitly store cross-counts, but rather generates them on-the-fly. We employ a Monte Carlo technique called set hashing to succinctly represent the set of tuples containing a given substring as a signature vector of hash values; any combination of set hash signatures gives a cross-count when intersected. Thus, using only linear storage, a large number of cross-counts can be generated including those for complex co-occurrences of substrings. 
The cross-counts generated by our methods are not exact, but they are adequate for selectivity estimation. We present results from an extensive experimental evaluation of our approach on real data sets. For the case of multidimensional conjunctive queries, our approach achieves better accuracy by an order of magnitude, and scales much more gracefully to higher dimensions, than existing methods. Surprisingly, even though our approach involves generating cross-counts on-the-fly, estimation is very fast, taking 200 μs on a data set of size 6 MB. For the case of Boolean queries, our experiments also demonstrate the superiority of this approach over a straightforward independence-based approach wherein correlations are not captured.
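
The idea of generating cross-counts from set hash signatures can be sketched as follows. This is a generic min-hash illustration under stated assumptions, not the authors' implementation: tuples are assumed to be identified by non-negative integer IDs smaller than `PRIME`, the hash family is a simple seeded linear one, and all names (`make_hash_params`, `signature`, `resemblance`, `estimate_cross_count`) are hypothetical. Each substring keeps a fixed-length signature of the set of tuples containing it; the fraction of matching signature components estimates the resemblance |A ∩ B| / |A ∪ B|, from which an approximate cross-count follows.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; assumed larger than any tuple ID


def make_hash_params(k, seed=0):
    """Draw k linear hash functions x -> (a*x + b) mod PRIME (a != 0)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def signature(tuple_ids, params):
    """Signature vector: the minimum hash value of the set under each function."""
    return [min((a * x + b) % PRIME for x in tuple_ids) for a, b in params]


def resemblance(sig_a, sig_b):
    """Fraction of matching components, an estimate of |A & B| / |A | B|."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def estimate_cross_count(sig_a, size_a, sig_b, size_b):
    """Estimate |A & B| from the resemblance rho and the two set sizes,
    using |A & B| = rho * (|A| + |B|) / (1 + rho)."""
    rho = resemblance(sig_a, sig_b)
    return rho * (size_a + size_b) / (1 + rho)


params = make_hash_params(k=64)
tuples_with_ab = set(range(0, 1000))     # tuples containing substring "ab"
tuples_with_cd = set(range(500, 1500))   # tuples containing substring "cd"
sig_ab = signature(tuples_with_ab, params)
sig_cd = signature(tuples_with_cd, params)
# Approximate number of tuples containing both substrings (true value: 500).
print(estimate_cross_count(sig_ab, 1000, sig_cd, 1000))
```

Storage is one fixed-length signature plus one count per substring, which is what keeps the space linear; the estimate is approximate and its accuracy depends on the signature length `k`.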


Articles published on Substring Queries

A novel method for designing indexes to support efficient substring queries on encrypted databases

Cardinality estimation of approximate substring queries using deep learning

Achieve Efficient and Privacy-Preserving Compound Substring Query over Cloud

Computing the original eBWT faster, simpler, and with less memory.

CHOP: haplotype-aware path indexing in population graphs

Privacy-Preserving Substring Search on Multi-Source Encrypted Gene Data

Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU.

Time–space trade-offs for Lempel–Ziv compressed indexing

A simple yet time-optimal and linear-space algorithm for shortest unique substring queries

Generalized substring selectivity estimation
