Subpath Queries on Compressed Graphs: A Survey

Nicola Prezza

doi:10.3390/a14010014

Abstract

Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text T, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in T in time proportional to the query’s length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index can be stored in space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: today, virtually any DNA aligner employs compressed indexes. Recent trends have considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately making it possible to index regular languages and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages.

Highlights

Consider the classic algorithmic problem of finding the occurrences of a particular string Π in a text T.
By subtracting from those numbers the length of the pattern’s prefix, we discover that Π = CAC occurs at positions 11 and 9, crossing a phrase with split CA|C.
In the hypertext indexing problem, the goal is to build an index over G that quickly supports locate queries on the paths of G: given a pattern Π, determine the positions in the graph where an occurrence of Π starts.
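
To make the locate query on a graph concrete, here is a minimal brute-force sketch in Python. It is not an index and not the survey's method: it only illustrates what a correct answer looks like. The representation is an assumption made for this example: `labels` maps each node to a non-empty string, `edges` maps each node to its successors, and a position is a pair (node, offset). The function name `locate_in_hypertext` is hypothetical.

```python
def locate_in_hypertext(labels, edges, pattern):
    """Return all (node, offset) positions where an occurrence of pattern starts.

    Assumptions of this sketch: every node label is a non-empty string, and
    a path spells the concatenation of the labels of the nodes it visits.
    Brute force: the running time can be exponential in the worst case.
    """

    def matches(node, offset, k):
        # k is the number of pattern characters matched so far.
        chunk = labels[node][offset:]   # text still readable inside this node
        rest = pattern[k:]              # pattern characters still to match
        if len(rest) <= len(chunk):
            # The occurrence ends inside this node.
            return chunk.startswith(rest)
        if not rest.startswith(chunk):
            return False
        # The node is consumed entirely: continue on any successor.
        return any(matches(nxt, 0, k + len(chunk)) for nxt in edges.get(node, []))

    return sorted(
        (v, i)
        for v in labels
        for i in range(len(labels[v]))
        if matches(v, i, 0)
    )


# Toy graph (hypothetical data): node 0 -> 1, 0 -> 2, 1 -> 2.
labels = {0: "GA", 1: "CAT", 2: "TAC"}
edges = {0: [1, 2], 1: [2]}
print(locate_in_hypertext(labels, edges, "AC"))
print(locate_in_hypertext(labels, edges, "ATT"))
```

In this toy graph, the pattern AC starts both inside node 2 and at the last character of node 0 (continuing into node 1), which is the kind of occurrence crossing node boundaries that a hypertext index has to report.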

Summary

Introduction

Consider the classic algorithmic problem of finding the occurrences of a particular string Π (a pattern) in a text T. While compressed text indexing has already been covered in excellent books [5,6] and surveys [7,8,9], the generalizations of these advanced techniques to labeled graphs, which date back two decades, lack a single point of reference despite having reached a mature state of the art. We assume the alphabet Σ to be totally ordered by an order we denote with ≤, and write a < b when a ≤ b and a ≠ b. In this survey, we consider two extensions of ≤ to strings (and, later, to labeled graphs). This paper deals with the indexed pattern matching problem: preprocess a text so that, later, all text occurrences of any query pattern Π ∈ Σ^m of length m can be efficiently counted and located. These queries can be generalized to labeled graphs; we postpone the exact definition of this generalization to Section 3.
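
As an illustration of the count and locate queries defined above, the following Python sketch answers both with a plain (uncompressed) suffix array and binary search. It is only a minimal example under stated assumptions: the construction is naive, positions are 0-based, the alphabet order is Python's default character order, and the function names (`build_suffix_array`, `suffix_range`, `count`, `locate`) are hypothetical. The compressed indexes discussed in the survey support the same queries in far less space.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by direct sorting; fine for small examples only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_range(text, sa, pattern):
    """Return (lo, hi) such that sa[lo:hi] lists the suffixes prefixed by pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(text, sa, pattern):
    """Number of occurrences of pattern in text."""
    lo, hi = suffix_range(text, sa, pattern)
    return hi - lo


def locate(text, sa, pattern):
    """Sorted 0-based starting positions of pattern in text."""
    lo, hi = suffix_range(text, sa, pattern)
    return sorted(sa[lo:hi])


text = "banana"
sa = build_suffix_array(text)
print(count(text, sa, "ana"), locate(text, sa, "ana"))
```

Each binary search performs O(log n) comparisons of at most m characters, so counting takes O(m log n) time on a text of length n; locating additionally reads one suffix-array entry per occurrence.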

The Labeled Path Case

The Entropy Model

The Repetitive Model

Indexing Labeled Graphs and Regular Languages

Graph Compression

Conditional Lower Bounds

Hypertext Indexing

Prefix Sorting

Indexing Labeled Trees

The Prefix Array of a Labeled Tree

The XBW Transform

Inverting the XBWT

Subpath Queries

Compression

Further Generalizations

Wheeler Graphs

Sorting and Recognizing Wheeler Graphs

Wheeler Languages

Conclusions and Future Challenges


Journal: Algorithms	Publication Date: Jan 5, 2021
Citations: 3	License type: CC BY 4.0
