A Fast Algorithm for the Largest Area First Parsing of Real Strings

Ivan Katanic, Strahil Ristov, Martin Rosenzweig

doi:10.1109/access.2020.3013676

Abstract

The largest area first parsing of a string often leads to the best results in grammar compression for a variety of input data. However, the fastest existing algorithm has $\Theta(N^{2} \log N)$ time complexity, which makes it impractical for real-life applications. We present a new largest area first parsing method that has $O(N^{3})$ complexity in the improbable worst case but runs in quasilinear time for most practical purposes. This result is based on the fact that, in real data, the sum of all depths of an LCP-interval tree, taken over all positions in the suffix array of an input string, is larger than the size of the input only by a small factor $\alpha$. We present the analysis of the algorithm in terms of $\alpha$, and the experimental results confirm that our method is practical even for genome-sized inputs. We provide the C++11 code for the implementation of our method. Additionally, we show that by combining the previous and new algorithms, the worst-case complexity of the largest area first parsing is improved by a factor of $\sqrt[3]{N}$.
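To make the factor $\alpha$ more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the depth of a suffix array position is the number of non-root LCP intervals that contain it (counting the root as well would add exactly 1 to $\alpha$). The function names are hypothetical, and the suffix array is built by naive sorting, which is suitable only for small inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Naive suffix array: sort suffix start positions by comparing suffixes.
// Illustration only; real inputs need a linear or O(N log N) construction.
std::vector<int> build_suffix_array(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Kasai's algorithm: lcp[i] is the length of the longest common prefix
// of the suffixes at sa[i-1] and sa[i]; lcp[0] is 0.
std::vector<int> build_lcp(const std::string& s, const std::vector<int>& sa) {
    const int n = static_cast<int>(s.size());
    std::vector<int> rank(n), lcp(n, 0);
    for (int i = 0; i < n; ++i) rank[sa[i]] = i;
    int h = 0;
    for (int i = 0; i < n; ++i) {
        if (rank[i] == 0) { h = 0; continue; }
        const int j = sa[rank[i] - 1];
        while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h;
        lcp[rank[i]] = h;
        if (h > 0) --h;
    }
    return lcp;
}

// Enumerates LCP intervals with a stack and adds up their sizes.
// The sum of interval sizes equals the sum, over all suffix array
// positions, of the number of non-root intervals containing the position
// (the assumed definition of "depth" in this sketch).
long long sum_of_lcp_interval_depths(const std::string& s) {
    const int n = static_cast<int>(s.size());
    const std::vector<int> sa = build_suffix_array(s);
    const std::vector<int> lcp = build_lcp(s, sa);
    std::vector<std::pair<int, int>> stack;  // (lcp value, left boundary)
    stack.push_back(std::make_pair(0, 0));   // root interval, never counted
    long long total = 0;
    for (int i = 1; i <= n; ++i) {
        const int cur = (i < n) ? lcp[i] : 0;  // sentinel 0 closes open intervals
        int left = i - 1;
        while (stack.back().first > cur) {
            left = stack.back().second;
            total += i - left;  // closed interval covers positions left..i-1
            stack.pop_back();
        }
        if (stack.back().first < cur) stack.push_back(std::make_pair(cur, left));
    }
    return total;
}

// Ratio of the depth sum to the input size, under the assumption above.
double alpha_of(const std::string& s) {
    assert(!s.empty());
    return static_cast<double>(sum_of_lcp_interval_depths(s)) / s.size();
}
```

For example, `sum_of_lcp_interval_depths("banana")` is 7 for an input of size 6, while for `"aaaa"` it is already 9. This is consistent with the observation that a string made of a single repeated symbol is the unfavorable case.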

Highlights

Finding the repetitions in a string is among the most researched tasks in stringology
Grammar text compression is a compression procedure where repeated substrings in a string are replaced with the production rules, and the string is represented with a context-free grammar (CFG) that has the exact input string as the only product
Favoring the SA approach, we have found that in some dynamic applications, it is possible to use a suffix array without the need for changes in the array itself, and instead, the updates may be performed in the fast auxiliary structures
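The idea of a grammar whose only product is the input string can be illustrated with a small sketch. The `Symbol` and `Grammar` types and the `expand` function below are hypothetical names chosen for illustration, not structures taken from the paper. The sketch assumes an acyclic grammar with rule 0 as the start rule.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A symbol is either a terminal character or a reference to a rule.
struct Symbol {
    bool is_rule;  // true: `value` is an index into Grammar::rules
    int value;     // terminal: character code; otherwise: rule index
};

struct Grammar {
    std::vector<std::vector<Symbol>> rules;  // rules[0] is the start rule
};

// Expands rule `r` into the string it derives.
// Assumes the grammar is acyclic, so the recursion terminates.
std::string expand(const Grammar& g, int r) {
    assert(r >= 0 && r < static_cast<int>(g.rules.size()));
    std::string out;
    for (const Symbol& sym : g.rules[r]) {
        if (sym.is_rule) {
            out += expand(g, sym.value);
        } else {
            out.push_back(static_cast<char>(sym.value));
        }
    }
    return out;
}

// Example grammar: S -> A A c A, A -> a b.
// The repeated substring "ab" is stored once, in rule A.
Grammar example_grammar() {
    Grammar g;
    g.rules.push_back({Symbol{true, 1}, Symbol{true, 1},
                       Symbol{false, 'c'}, Symbol{true, 1}});
    g.rules.push_back({Symbol{false, 'a'}, Symbol{false, 'b'}});
    return g;
}
```

Calling `expand(example_grammar(), 0)` reproduces the original string `ababcab`, which is the only string this grammar derives.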

Summary

INTRODUCTION

Finding the repetitions in a string is among the most researched tasks in stringology. The longest first (LF) method finds the current longest repeated substrings at any point in the algorithm execution and iteratively replaces them with rules [11]. The existence of a worst-case linear time algorithm for this task is an open problem. Our algorithm has $O(N^{3})$ worst-case complexity in the case of an input that consists of a repetition of only one symbol, but it exhibits approximately linear behavior in experiments on a wide range of standard test files. We show that by combining our algorithm with that from [13] into a hybrid method, we can obtain a better worst-case complexity for largest area first (LAF) parsing.
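As a rough illustration of what a single selection step in an area-driven parsing might look like, here is a brute-force sketch. It is not the algorithm of this paper or of [13]. It assumes that the area of a repeated substring is its length multiplied by its number of non-overlapping occurrences; this summary does not state the definition, so treat it as an assumption. All names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// The selected repeat: its text, its assumed area, and its occurrence count.
struct Candidate {
    std::string text;
    long long area;
    int count;
};

// Counts non-overlapping occurrences of `pat` in `s`, scanning left to right.
int count_non_overlapping(const std::string& s, const std::string& pat) {
    int count = 0;
    std::size_t pos = s.find(pat);
    while (pos != std::string::npos) {
        ++count;
        pos = s.find(pat, pos + pat.size());
    }
    return count;
}

// Brute-force search for the repeated substring (length >= 2, at least two
// non-overlapping occurrences) with the largest assumed area.
// Ties keep the first candidate found. Cubic-or-worse time: illustration only.
Candidate largest_area_repeat(const std::string& s) {
    Candidate best{"", 0, 0};
    for (std::size_t i = 0; i < s.size(); ++i) {
        for (std::size_t len = 2; i + len <= s.size(); ++len) {
            const std::string pat = s.substr(i, len);
            const int c = count_non_overlapping(s, pat);
            // An extension of `pat` cannot occur more often than `pat`,
            // so longer substrings starting at i can be skipped.
            if (c < 2) break;
            const long long area = static_cast<long long>(len) * c;
            if (area > best.area) best = Candidate{pat, area, c};
        }
    }
    return best;
}
```

On `"abcabcabc"` this selects `abc`, which occurs three times without overlap and has area 9 under the assumed definition. A complete parser would replace those occurrences with a rule and repeat the selection on the shortened string.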

LARGEST AREA FIRST GRAMMAR TEXT COMPRESSION AND DYNAMIC TEXT INDEXING

THE COMPLEXITY ANALYSIS AND EXPERIMENTAL RESULTS

WORST-CASE PROOF

CONCLUSION

Findings

SOLUTION


Journal: IEEE Access: practical innovations, open solutions
Publication Date: Jan 1, 2020
Citations: 26
License type: CC BY 4.0


