Abstract

We consider the problem of evaluating regular spanners over compressed documents, i.e., we wish to solve evaluation tasks directly on the compressed data, without decompression. As compressed forms of the documents we use straight-line programs (SLPs), a lossless compression scheme for textual data that is widely used in different areas of theoretical computer science and is particularly well suited for algorithmics on compressed data. In data complexity, our results are as follows. For a regular spanner $M$ and an SLP $\mathcal{S}$ of size $\mathbf{s}$ that represents a document $D$, we can solve the tasks of model checking and of checking non-emptiness in time $O(\mathbf{s})$. Computing the set $\llbracket M \rrbracket(D)$ of all span-tuples extracted from $D$ can be done in time $O(\mathbf{s} \cdot |\llbracket M \rrbracket(D)|)$, and enumeration of $\llbracket M \rrbracket(D)$ can be done with linear preprocessing $O(\mathbf{s})$ and a delay of $O(\mathsf{depth}(\mathcal{S}))$, where $\mathsf{depth}(\mathcal{S})$ is the depth of $\mathcal{S}$'s derivation tree. Note that $\mathbf{s}$ can be exponentially smaller than the document's size $|D|$; and, due to known balancing results for SLPs, we can always assume that $\mathsf{depth}(\mathcal{S}) = O(\log |D|)$, independently of $D$'s compressibility. Hence, our enumeration algorithm has a delay logarithmic in the size of the non-compressed data and a preprocessing time that is at best (i.e., in the case of highly compressible documents) also logarithmic, but at worst still linear. Therefore, from a big-data perspective, our enumeration algorithm for SLP-compressed documents may nevertheless beat the known linear-preprocessing, constant-delay algorithms for non-compressed documents.
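
To make the compression model concrete, the following is a minimal illustrative sketch (not the paper's construction): an SLP viewed as a grammar in which every rule either produces a single terminal symbol or the concatenation of two earlier rules, so that the whole grammar derives exactly one document $D$. All names (`Terminal`, `Concat`, `doc_length`) are hypothetical; the sketch only shows that quantities such as $|D|$ can be computed in $O(\mathbf{s})$ time directly on the SLP, and that $\mathbf{s}$ can be exponentially smaller than $|D|$.

```python
# Illustrative sketch of an SLP, assuming a normal form where each rule is
# either A -> a (a single terminal) or A -> B C (concatenation of two
# earlier rules). The last rule is taken to be the start symbol.

from dataclasses import dataclass
from typing import Union

@dataclass
class Terminal:
    symbol: str   # rule of the form A -> a

@dataclass
class Concat:
    left: int     # rule of the form A -> B C,
    right: int    # referring to earlier rules by index

Rule = Union[Terminal, Concat]

def doc_length(rules: list[Rule]) -> int:
    """Length |D| of the represented document, computed bottom-up in O(s)."""
    length = [0] * len(rules)
    for i, rule in enumerate(rules):
        if isinstance(rule, Terminal):
            length[i] = 1
        else:
            length[i] = length[rule.left] + length[rule.right]
    return length[-1]

# Example: n doubling rules represent the document a^(2^n), so the SLP has
# only n + 1 rules while |D| = 2^n -- exponential compression.
n = 40
rules: list[Rule] = [Terminal("a")]
for i in range(n):
    rules.append(Concat(i, i))
print(doc_length(rules))  # 1099511627776 == 2**40, from just 41 rules
```

The same bottom-up pattern (one pass over the rules, storing one value per nonterminal) is the reason an evaluation task can run in time proportional to $\mathbf{s}$ rather than $|D|$, provided the per-rule information can be combined without expanding the derived strings.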
