Abstract

In this paper, we propose a novel approach that combines compact directed acyclic word graphs (CDAWGs) with grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with $$O(\tilde{e}_T \log n)$$ bits of space while supporting $$O(\log n)$$-time random access and $$O(1)$$-time sequential access to edge labels, and $$O(m \log \sigma + \mathit{occ})$$-time pattern matching. Here, $$\tilde{e}_T$$ is the number of all extensions of maximal repeats in $$T$$, $$n$$ and $$m$$ are respectively the lengths of the text $$T$$ and of a given pattern, $$\sigma$$ is the alphabet size, and $$\mathit{occ}$$ is the number of occurrences of the pattern in $$T$$. The repetitiveness measure $$\tilde{e}_T$$ is known to be much smaller than the text length $$n$$ for highly repetitive texts. For constant alphabets, our L-CDAWGs achieve $$O(m + \mathit{occ})$$ pattern matching time with $$O(e_T^r \log n)$$ bits of space, which improves the pattern matching time of Belazzougui et al.'s run-length BWT-CDAWGs by a factor of $$\log \log n$$ with the same space complexity. Here, $$e_T^r$$ is the number of right extensions of maximal repeats in $$T$$. As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size $$O(\tilde{e}_T)$$ for a given text $$T$$ in $$O(n + \tilde{e}_T \log \sigma)$$ time.
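
For readers unfamiliar with straight-line programs, the following minimal sketch illustrates the general idea of random access on an SLP: a grammar in which every rule is either a single character or the concatenation of two earlier rules. It is a generic illustration, not the construction used for L-CDAWGs. The names `build_lengths`, `access`, and `expand`, and the list-based rule encoding, are assumptions made for this example. Access here takes time proportional to the height of the derivation tree, so the $$O(\log n)$$ bound stated above would additionally require a grammar of logarithmic height or the paper's own machinery.

```python
# Hypothetical encoding (an assumption for this sketch): rules[k] is either
# a single character (terminal rule) or a pair (left, right) of indices of
# earlier rules, meaning X_k -> X_left X_right.

def build_lengths(rules):
    """Return the length of the string derived by each rule."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules, lengths, root, i):
    """Return character i (0-based) of the string derived from rule `root`.

    Walks one root-to-leaf path of the derivation tree, so the cost is
    proportional to the tree height, not to the length of the text.
    """
    if not 0 <= i < lengths[root]:
        raise IndexError(i)
    node = root
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if i < lengths[left]:
            node = left            # position lies in the left child
        else:
            i -= lengths[left]     # skip the left child's expansion
            node = right
    return rules[node]


def expand(rules, root):
    """Fully decompress rule `root` (linear in the text length; for checking only)."""
    rule = rules[root]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(rules, left) + expand(rules, right)


# Five rules deriving the repetitive text "abababab".
rules = ["a", "b", (0, 1), (2, 2), (3, 3)]
lengths = build_lengths(rules)
text = expand(rules, 4)
```

In this toy grammar, five rules describe a text of length eight, and `access(rules, lengths, 4, i)` agrees with `text[i]` for every position without decompressing the text.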
